RCurl
Related packages: RCurl, XML, RSPython, RMatlab
Personal homepage: http://anson.ucdavis.edu/~duncan/
(a) What is curl
curl: a file-transfer tool that works with URL syntax on the command line.
The library behind curl is libcurl.
Features: fetching pages, handling authentication, uploading and downloading files, and retrieving information.
(b) The HTTP protocol; the version in current use is HTTP/1.1.
It allows Hypertext Markup Language (HTML) documents to be transmitted from the web server to the client's browser.
(c) Three major functions of RCurl
1. install.packages("RCurl")
2. library(RCurl)
- getURL()
- getForm()
- postForm()
Use getURL() to fetch page content, and url.exists() to check whether a URL is reachable.
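A minimal sketch of how these three functions are typically called (the URLs and form fields below are placeholders for illustration, not taken from the original experiments):

library(RCurl)
# check that a page is reachable, then fetch it
if (url.exists("http://www.csdn.net/tag/")) {
  page <- getURL("http://www.csdn.net/tag/")
}
# getForm() sends form fields with an HTTP GET (appended to the URL)
res_get <- getForm("http://www.baidu.com/s", wd = "RCurl")
# postForm() sends form fields with an HTTP POST (in the request body);
# the login URL and field names here are purely hypothetical
# res_post <- postForm("http://example.com/login", user = "u", pwd = "p")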
(1) Experiment code:
(2)
(3) Experiment:
(4) Constructing a custom HTTP request header:
myHeader <- c(
  "User-Agent" = "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6)",
  "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language" = "en-US",
  "Connection" = "keep-alive",
  "Accept-Charset" = "GB2312,utf-8;q=0.7,*;q=0.7")
(5)
(d) Setting curl parameters
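A hedged sketch of setting a few common curl options on a single request; listCurlOptions() prints the names of all options RCurl understands:

library(RCurl)
page <- getURL("http://www.csdn.net/tag/",
               verbose = TRUE,               # print the request/response exchange
               followlocation = TRUE,        # follow HTTP redirects
               .opts = list(timeout = 10))   # give up after 10 seconds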
(e) XML parsing
Requires installing the XML package; it handles tables and web-page nodes.
- xmlParse(): parsing function for standard XML files
- htmlTreeParse(): parsing function for HTML
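A minimal sketch of both parsers on small inline documents (the strings are placeholders); the resulting trees can be queried with XPath via getNodeSet():

library(XML)
doc_xml  <- xmlParse("<root><item>42</item></root>", asText = TRUE)
doc_html <- htmlTreeParse("<html><body><p>hi</p></body></html>",
                          asText = TRUE, useInternalNodes = TRUE)
getNodeSet(doc_xml, "//item")   # XPath query on the parsed tree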
Downloading tables: readHTMLTable()
(f) Capturing seismic data:
url <- "http://data.earthquake.cn/datashare/datashare_more_quickdata_new.jsp"
wp <- getURL(url)
doc <- htmlParse(wp, asText = TRUE)
tables <- readHTMLTable(doc, header = FALSE)
The which parameter of readHTMLTable() selects which of the page's tables to extract (by position).
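For example, under the assumption that the desired data sit in the second table on the page:

tables <- readHTMLTable(doc, header = FALSE, which = 2)  # keep only the 2nd table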
(g) XPath
Comprehensive example: fetching CSDN data with R
crawler <- function(url, xpath, content = "text") {
  num_url <- length(url)
  result <- data.frame(url = 0, vari = 0)
  i <- 1        # records which URL is being processed
  tmp <- 1      # row counter for result
  for (i_url in url) {
    # read the page and convert it with htmlParse() (use xmlParse() for XML files)
    i_url_parse <- htmlParse(i_url, encoding = "UTF-8")
    # locate the nodes for the target variable via the XPath expression
    node <- getNodeSet(i_url_parse, xpath)
    if (length(node) == 0) {
      # no data scraped: the XPath expression is wrong
      result[tmp, 1] <- i
      result[tmp, 2] <- NA
      print(paste("Note: the variable was not found on page", i, "; writing NA"))
      tmp <- tmp + 1
    } else {
      for (j in 1:length(node)) {
        result[tmp, 1] <- i
        if (content == "text") {
          # scrape the text content of the variable
          result[tmp, 2] <- xmlValue(node[[j]])
        } else {
          # scrape an attribute of the variable
          print(node[[j]])
          result[tmp, 2] <- xmlGetAttr(node[[j]], content)
        }
        tmp <- tmp + 1
      }
    }
    i <- i + 1
  }
  result
}

url1 <- "http://www.csdn.net/tag/"
xpath <- "//div[@class='overflow']/a"
content <- "text"
crawler(url1, xpath, content)
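Under the same assumptions, the third argument can name an attribute instead of "text"; for example, collecting the href attribute of each matched <a> node returns the link targets rather than the link text:

crawler(url1, xpath, content = "href")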
Notes:
1. Capturing seismic data
2. Multi-threaded crawling
1. Managing crawl tasks:
- The list of URLs to crawl needs to be split into tasks, with different tasks handled by different R processes.
- R must update each task's processing status in the task list.
- When a task dies, it must be possible to read the point where the last crawl stopped from the database.
- If the volume is large, Redis can be used.
2. The crawl program needs to be written as an R script and run on a schedule:
- Linux: crontab
- Windows: Task Scheduler
3. Dynamic IP issues when crawling:
- Use a dynamic proxy for the crawler: first obtain an IP pool (search for IP proxy providers), then dynamically switch the HTTP proxy IP in the program (see the sketch after this list).
- Keep an interval of about 4 seconds between requests.
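A minimal sketch of switching proxies and pacing requests; the proxy addresses are hypothetical placeholders, and the scheduling comment only restates the crontab/Task Scheduler point above:

library(RCurl)
proxy_pool <- c("1.2.3.4:8080", "5.6.7.8:3128")   # hypothetical proxy pool
urls <- c("http://www.csdn.net/tag/")              # pages to crawl

for (u in urls) {
  p <- sample(proxy_pool, 1)                       # pick a proxy at random
  page <- getURL(u, proxy = p, followlocation = TRUE)
  # ... parse `page` with htmlParse()/getNodeSet() as in the crawler above ...
  Sys.sleep(4)                                     # keep ~4 seconds between requests
}
# the whole script can then be scheduled, e.g. via crontab (Linux) or Task Scheduler (Windows)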
R Big Data Analytics and Mining (Part 2: the R crawler)