Web crawling in R with RCurl
# RCurl author ##
Duncan Temple Lang
Associate Professor at the University of California, Davis
His work focuses on exploring information technology through statistical integration.
RCurl Overview
The RCurl package is an R-interface to the libcurl library that provides HTTP
facilities. This allows us to download files from Web servers, post forms, use
HTTPS (the secure HTTP), use persistent connections, upload files, use binary
content, handle redirects, password authentication, etc.
In other words, RCurl gives R an interface to libcurl for carrying out HTTP operations: for example, downloading files from a server, keeping connections open, uploading files, reading binary content, following redirects, and password authentication.
What is curl & libcurl?
-curl: an open-source command-line tool that transfers files using URL syntax.
-libcurl: the library behind curl.
Features
-Fetching pages
-Authentication
-Upload and download
-Information search
-......
HTTP protocol
A protocol is the set of rules that two computers must follow when they communicate over a network. Hypertext Transfer Protocol (HTTP) is a communication protocol that allows documents such as Hypertext Markup Language (HTML) pages to be transferred from a Web server to the client's browser.
The version described here is HTTP/1.1.
1. URL details
Basic format: scheme://host[:port#]/path/.../[?query-string][#anchor]
scheme: the protocol to use (for example, http, https, or ftp)
host: the IP address or domain name of the HTTP server
port#: the port of the HTTP server; the default is 80, in which case the port number can be omitted
path: the path to the resource on the server
query-string: data sent to the HTTP server
anchor: the anchor within the page
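The URL format above can be illustrated with a small base R sketch. The helper name `parse_url_parts` and the example URL are assumptions made for illustration; this is not an RCurl function, and the regular expression is a simplification that does not handle cases such as IPv6 hosts or embedded user names.

```r
# Split a URL into the parts named in the basic format:
# scheme://host[:port#]/path/.../[?query-string][#anchor]
# parse_url_parts is a hypothetical helper, not part of RCurl.
parse_url_parts <- function(url) {
  pattern <- paste0(
    "^([A-Za-z][A-Za-z0-9+.-]*)://",  # scheme
    "([^/:?#]+)",                      # host
    "(:([0-9]+))?",                    # optional :port
    "([^?#]*)",                        # path
    "(\\?([^#]*))?",                   # optional ?query-string
    "(#(.*))?$"                        # optional #anchor
  )
  m <- regmatches(url, regexec(pattern, url))[[1]]
  if (length(m) == 0) stop("not a URL of the form scheme://host...: ", url)
  list(
    scheme = m[2],
    host   = m[3],
    # NA means the port was omitted and the scheme's default applies
    port   = if (nzchar(m[5])) as.integer(m[5]) else NA_integer_,
    path   = m[6],
    query  = m[8],
    anchor = m[10]
  )
}

# Example URL, assumed for illustration only.
parts <- parse_url_parts(
  "http://www.example.com:8080/docs/index.html?q=rcurl&page=2#top"
)
str(parts)
```

When the port is left out, the sketch returns `NA` for it rather than guessing, since the default depends on the scheme (80 for http).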
2. Request
A request consists of a request line, request headers, and a message body.
The request line has the form: Method Path-to-resource HTTP/version-number
Method is the request method, such as "GET", "POST", "HEAD", or "PUT".
Path-to-resource is the requested resource.
HTTP/version-number is the HTTP protocol version number.
Request Header
Host: the server address
Accept: the media types the browser can accept, such as text/html
Accept-Encoding: the encodings the browser can accept, usually compression methods
Accept-Language: the languages the browser accepts
User-Agent: tells the server the client's operating system and browser version
Cookie: the most important request header. It carries data (usually encrypted) stored on the user's local terminal to identify the user and track the session
Referer: the page from which the request was reached
Connection: whether the connection between client and server is kept open
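To make the layout of a request concrete, here is a minimal base R sketch that assembles the request line and headers into the text a client would send. The helper `build_request`, the host name, and the header values are all assumptions chosen for illustration.

```r
# Assemble a raw HTTP request: request line, headers, blank line, body.
# build_request is a hypothetical helper, not part of RCurl.
build_request <- function(method, path, headers, body = "", version = "1.1") {
  request_line <- sprintf("%s %s HTTP/%s", method, path, version)
  header_lines <- sprintf("%s: %s", names(headers), unname(headers))
  # Lines are separated by CRLF; an empty line ends the header block.
  paste(c(request_line, header_lines, "", body), collapse = "\r\n")
}

# Illustrative header values, not taken from a real session.
req <- build_request(
  "GET", "/index.html",
  headers = c(
    Host              = "www.example.com",
    Accept            = "text/html",
    "Accept-Encoding" = "gzip",
    "Accept-Language" = "en",
    "User-Agent"      = "R example client",
    Connection        = "keep-alive"
  )
)
cat(req)
```

In practice the request text is not written by hand. With RCurl, headers like these are typically supplied through the `httpheader` option of `getURL()`.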
3. Response
A response consists of a status line, message headers, and a response body.
The status line has the form: HTTP/version-number Status-code message
HTTP/version-number is the HTTP protocol version number.
Status-code and message are the status code and the status text.
Status code
The status code tells the HTTP client whether the HTTP server produced the expected response.
HTTP/1.1 defines five classes of status codes, each made up of three digits. The first digit defines the response class; the other two digits have no classification role.
-1XX informational: the request has been received and processing continues
-2XX success: the request has been successfully received, understood, and accepted
-3XX redirection: further action is required to complete the request
-4XX client error: the request has a syntax error or cannot be fulfilled
-5XX server error: the server failed to fulfil a valid request
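The status line and the five classes above can be sketched in base R as follows. The helper names `parse_status_line` and `status_class` are hypothetical, and the sample status line is made up for illustration.

```r
# Split a status line of the form "HTTP/version-number Status-code message".
# parse_status_line is a hypothetical helper, not part of RCurl.
parse_status_line <- function(line) {
  m <- regmatches(line, regexec("^HTTP/([0-9.]+) ([0-9]{3}) ?(.*)$", line))[[1]]
  if (length(m) == 0) stop("not an HTTP status line: ", line)
  list(version = m[2], code = as.integer(m[3]), message = m[4])
}

# Map status codes to their class using only the first digit.
status_class <- function(code) {
  first_digit <- as.integer(code) %/% 100L
  classes <- c("informational", "success", "redirection",
               "client error", "server error")
  valid <- !is.na(first_digit) & first_digit >= 1L & first_digit <= 5L
  out <- rep(NA_character_, length(first_digit))
  out[valid] <- classes[first_digit[valid]]
  out
}

st <- parse_status_line("HTTP/1.1 404 Not Found")
print(st$code)
print(status_class(st$code))
print(status_class(c(100, 200, 302, 500)))
```

Codes outside the 1XX to 5XX range come back as `NA`, so a crawler can treat them as unexpected rather than silently misclassifying them.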
Message Header
Server: server software information, such as nginx
Date: the date of the response
Last-Modified: the last modification time
Content-Type: the type of the object the server is returning, such as text/html
Connection: whether the connection between server and client is kept open
X-Powered-By: the technology the website is built with, such as PHP
Content-Length: the length in bytes of the returned content
Set-Cookie: the most important response header. It sends a cookie to the browser, and each cookie written generates its own Set-Cookie header.
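As an illustration of the message headers listed above, the following base R sketch splits a header block into name/value pairs and collects the Set-Cookie entries. The helper `parse_response_headers` and the sample response are assumptions for illustration, and the function expects only the status line and headers, not the response body.

```r
# Turn a raw header block into a named character vector.
# parse_response_headers is a hypothetical helper, not part of RCurl.
parse_response_headers <- function(raw) {
  lines <- strsplit(raw, "\r?\n")[[1]]
  lines <- lines[nzchar(lines)]
  header_lines <- lines[-1]                     # drop the status line
  sep <- regexpr(":", header_lines, fixed = TRUE)
  keep <- sep > 0                               # ignore lines without a colon
  header_lines <- header_lines[keep]
  sep <- sep[keep]
  values <- trimws(substring(header_lines, sep + 1))
  names(values) <- trimws(substring(header_lines, 1, sep - 1))
  values
}

# Made-up response headers for illustration.
raw <- paste(
  "HTTP/1.1 200 OK",
  "Server: nginx",
  "Content-Type: text/html",
  "Content-Length: 1024",
  "Set-Cookie: session=abc123; Path=/",
  "Set-Cookie: theme=dark",
  sep = "\r\n"
)

headers <- parse_response_headers(raw)
# Header names are case-insensitive, so compare in lower case.
cookies <- unname(headers[tolower(names(headers)) == "set-cookie"])
print(headers[["Content-Type"]])
print(cookies)
```

Because each cookie arrives in its own Set-Cookie header, the result keeps duplicate names instead of collapsing them into a single entry.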
RCurl Functions
getURL()
getForm()
postForm()
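A minimal sketch of how these three functions might be called is shown below. The wrapper names, the URLs in the usage note, and the form field names are hypothetical. The calls are wrapped in functions so that nothing touches the network until a wrapper is invoked, and they require the RCurl package to be installed.

```r
# Thin wrappers around the three RCurl functions listed above.
# The wrapper names are hypothetical; the RCurl calls are the real ones.

# getURL(): fetch a page and return its content as a character string.
fetch_page <- function(url) {
  RCurl::getURL(
    url,
    followlocation = TRUE,                          # follow 3XX redirects
    httpheader = c("User-Agent" = "R RCurl example")
  )
}

# getForm(): send a GET request; named arguments become the query string.
search_site <- function(url, ...) {
  RCurl::getForm(url, ...)
}

# postForm(): send a POST request; named arguments become form fields.
# style = "post" sends them URL-encoded rather than as multipart data.
submit_form <- function(url, ...) {
  RCurl::postForm(url, ..., style = "post")
}
```

With placeholder URLs, usage could look like `fetch_page("http://www.example.com/")`, `search_site("http://www.example.com/search", q = "rcurl")`, or `submit_form("http://www.example.com/login", user = "alice", password = "secret")`.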