In many quantitative analyses we consider factors such as climate or economic data. These data are usually displayed as tables on web pages and can be collected from there, but copy-pasting them into Excel is extremely inefficient. With R we can read the data directly from the web pages and store them in a uniform format without many cleaning steps. To clarify, most statistical models are not instantly refreshed: "instant" here means that as the data grow over time, the model needs to be (and can be) refreshed at the same time, or shortly afterwards.
Our task is to build a semi-automated web data scraping program, so that whenever a modeling refresh is needed, the raw data are already prepared. To do this we need to accomplish two things: reading HTML tables directly from web pages, and downloading financial time series.
The XML package provides a function that enables R to read tables directly from a website (when the page actually contains tables). Let's take a look at this function first:
readHTMLTable(doc, header = NA,
    colClasses = NULL, skip.rows = integer(), trim = TRUE,
    elFun = xmlValue, as.data.frame = TRUE, which = integer(),
    ...)
readHTMLTable() inherits arguments from read.table() in the base package, so if you are familiar with the latter you can specify more arguments than the ones shown here. doc requires an HTML document: either a file name or a URL. The simplest option is to assign a URL to it directly. which is an integer vector identifying which tables to return from within the document; it applies to the method for the document, not to individual tables. To determine the table numbers, inspect the elements of the page and count the number of <table> tags. We use http://en.tutiempo.net/climate/ to gather the climate data. The data on this site are organized by city: after selecting a city, year and month (e.g. China -> Beijing -> 2015 -> June), you reach a URL like http://en.tutiempo.net/climate/06-2015/ws-545110.html. On this page there are two tables, one with the climate data and one explaining the data fields. If you right-click on the page and select "Inspect element", it is also not hard to find the two <table> tags. To read the first table, specify the table number which = 1:
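A minimal sketch of that call, using the Beijing example URL above (stringsAsFactors = FALSE is an optional assumption here, to keep the columns as character vectors for later cleaning):

```r
library(XML)

# Beijing, June 2015 -- the example page from the text
url <- "http://en.tutiempo.net/climate/06-2015/ws-545110.html"

# which = 1 returns only the first <table> on the page (the climate data)
climate_jun <- readHTMLTable(url, which = 1, stringsAsFactors = FALSE)
head(climate_jun)
```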
Using this function inside nested loops, you can collect the full data set (suppose we assign the result to climate; we will use it later).
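One way the nested loops could look is sketched below. The year range and the single station id (ws-545110, the Beijing station from the example URL) are illustrative assumptions, and the row-binding assumes every monthly table has the same columns:

```r
library(XML)

years   <- 2014:2015              # illustrative range, not from the source
months  <- sprintf("%02d", 1:12)  # "01" .. "12", as used in the URL pattern
station <- "ws-545110"            # Beijing station id from the example URL

climate <- do.call(rbind, lapply(years, function(y) {
  do.call(rbind, lapply(months, function(m) {
    url <- sprintf("http://en.tutiempo.net/climate/%s-%d/%s.html",
                   m, y, station)
    tbl <- readHTMLTable(url, which = 1, stringsAsFactors = FALSE)
    # record which page each row came from
    tbl$year  <- y
    tbl$month <- as.integer(m)
    tbl
  }))
}))
```

Wrapping the readHTMLTable() call in tryCatch() would make the loop more robust against missing months, at the cost of a few extra lines.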
Stock Index Data Collection - quantmod::getSymbols
In the quantmod package, getSymbols() is a well-developed function. Given an index symbol, it automatically downloads the table from Yahoo Finance (other sources can also be configured) and saves it in a variable named after the symbol. You can also specify the start and end dates of the data.
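A minimal sketch of a single download (the date window is an arbitrary example, not from the source):

```r
library(quantmod)

# Download the Nikkei 225 from Yahoo Finance; getSymbols() creates a
# variable named after the symbol (here N225, with the "^" stripped)
getSymbols("^N225", src = "yahoo",
           from = "2015-01-01", to = "2015-06-30")
head(N225)   # an xts time series of daily prices and volume
```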
# the list of major APAC stock index codes
sym_stock <- c(`AUSTRALIA` = "^AORD", # All Ordinaries
`CHINA` = "^SSEC", # Shanghai Composite
`INDIA` = "^BSESN", # Bombay BSE SENSEX / BSE 30
`INDONESIA` = "^JKSE", # Jakarta Composite
`JAPAN` = "^N225", # Tokyo Share Nikkei 225
`KOREA` = "^KS11", # Seoul Composite
`MALAYSIA` = "^KLSE", # FTSE Bursa Malaysia KLCI
`NEW ZEALAND` = "^NZ50", # New Zealand NZX 50 Index