Web Crawling in R with RCurl

Source: Internet
Author: User


RCurl Author
Duncan Temple Lang
Associate Professor at the University of California, Davis
His work focuses on exploring information technology through statistical integration.

RCurl Overview

The RCurl package is an R-interface to the libcurl library that provides HTTP
facilities. This allows us to download files from Web servers, post forms, use
HTTPS (the secure HTTP), use persistent connections, upload files, use binary
content, handle redirects, password authentication, etc.

In other words, RCurl exposes libcurl's HTTP functionality to R: downloading files from a server, keeping connections open, uploading files, reading binary content, following redirects, and authenticating with a password.

What are curl and libcurl?
- curl: an open-source command-line tool that transfers files using URL syntax.
- libcurl: the library that curl is built on.
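
As a small illustration of the relationship between RCurl and libcurl, the sketch below asks RCurl which libcurl build it is linked against. It assumes the RCurl package is installed; the values returned depend on the local installation.

```r
library(RCurl)

# curlVersion() describes the libcurl build that RCurl is linked against.
v <- curlVersion()

v$version     # libcurl version string
v$protocols   # protocols this libcurl build supports (http, https, ftp, ...)
```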

Features
- Fetching pages
- Authentication
- Uploading and downloading
- Searching for information
- ......
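
Two of the features listed above, authentication and downloading, might be sketched as follows. The helper names `fetch_protected` and `download_file` are hypothetical and not part of RCurl; the RCurl calls they wrap (`getURL` with the `userpwd` option, and `getBinaryContent`) are assumed to be available from an installed RCurl package.

```r
library(RCurl)

# Fetch a page that requires a user name and password.
# libcurl expects the credentials as a single "user:password" string.
fetch_protected <- function(url, user, password) {
  getURL(url, userpwd = paste(user, password, sep = ":"))
}

# Download a binary file (image, archive, ...) and write it to disk.
download_file <- function(url, dest) {
  bytes <- getBinaryContent(url)   # raw vector with the response body
  writeBin(bytes, dest)
  invisible(dest)
}

# Example calls (placeholder URLs, not run here):
# page <- fetch_protected("http://www.example.com/private", "user", "secret")
# download_file("http://www.example.com/logo.png", "logo.png")
```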

HTTP protocol

A protocol is the set of rules that two computers must follow to communicate over a network. The Hypertext Transfer Protocol (HTTP) is a communication protocol that allows Hypertext Markup Language (HTML) documents to be transferred from a Web server to the client's browser.

The version described here is HTTP/1.1, which is still widely used.

1. URL details
Basic format: scheme://host[:port#]/path/.../[?query-string][#anchor]
- scheme: the underlying protocol to use (for example, http, https, or ftp)
- host: the IP address or domain name of the HTTP server
- port#: the port of the HTTP server; the default is 80, in which case the port number can be omitted
- path: the path to the resource on the server
- query-string: data sent to the HTTP server
- anchor: an anchor within the page
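
To make the URL format concrete, here is a minimal sketch that splits a URL into the parts listed above using only base R. The function name `parse_url_parts`, the regular expression, and the example URL are illustrative assumptions; the pattern is deliberately simple and does not cover every valid URL (user info and IPv6 hosts, for instance, are ignored).

```r
# Split a URL of the form scheme://host[:port]/path[?query][#anchor]
# into its components with a base R regular expression.
parse_url_parts <- function(url) {
  pattern <- paste0(
    "^([A-Za-z][A-Za-z0-9+.-]*)://",  # scheme
    "([^/:?#]+)",                     # host
    "(:([0-9]+))?",                   # optional port
    "([^?#]*)",                       # path
    "(\\?([^#]*))?",                  # optional query string
    "(#(.*))?$"                       # optional anchor
  )
  m <- regmatches(url, regexec(pattern, url))[[1]]
  if (length(m) == 0) stop("not a URL of the form scheme://host...")
  list(
    scheme = m[2],
    host   = m[3],
    port   = if (nzchar(m[5])) as.integer(m[5]) else NA_integer_,
    path   = m[6],
    query  = m[8],
    anchor = m[10]
  )
}

parts <- parse_url_parts(
  "http://www.example.com:8080/docs/index.html?id=42&lang=en#top"
)
str(parts)
```
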
2. Request
A request consists of a request line, request headers, and a message body.

The request line contains three fields:
- Method: the request method, such as "GET", "POST", "HEAD", or "PUT".
- Path-to-resource: the requested resource.
- HTTP/version-number: the HTTP protocol version number.
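
The structure of a request can be illustrated by assembling the text of one by hand. This is only a sketch of the message layout in base R: `build_request` is a hypothetical helper, the host and path are placeholders, and in practice RCurl and libcurl build this text for you.

```r
# Assemble the text of a minimal HTTP/1.1 request without a body:
# request line, then one "Name: value" line per header, then an empty line.
build_request <- function(method, path, headers) {
  request_line <- sprintf("%s %s HTTP/1.1", method, path)
  header_lines <- paste0(names(headers), ": ", headers)
  # Lines are separated by CRLF; a blank line ends the header section.
  paste(c(request_line, header_lines, "", ""), collapse = "\r\n")
}

req <- build_request(
  "GET", "/index.html",
  c(Host = "www.example.com", Accept = "text/html")
)
cat(req)
```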

Request headers
- Host: the server address
- Accept: the media types the browser accepts, for example text/html
- Accept-Encoding: the encodings the browser accepts, usually compression methods
- Accept-Language: the languages the browser accepts
- User-Agent: tells the server the client's operating system and browser version
- Cookie: the most important part of the request headers; data (usually encrypted) stored on the user's local terminal to identify the user and track the session
- Referer: the page the request came from
- Connection: the state of the connection between client and server
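
With RCurl, request headers like those above can be supplied as curl options. The sketch below assumes an installed RCurl package; `fetch_with_headers` is a hypothetical wrapper, and the header values, cookie, and URLs are placeholders chosen for illustration.

```r
library(RCurl)

# Send a GET request with explicit request headers.
# httpheader takes a named character vector; referer and cookie are
# separate curl options that set the Referer and Cookie headers.
fetch_with_headers <- function(url) {
  getURL(
    url,
    httpheader = c(
      Accept            = "text/html",
      "Accept-Language" = "en-US",
      "User-Agent"      = "R RCurl example"
    ),
    referer = "http://www.example.com/",   # placeholder referring page
    cookie  = "session=abc123"             # placeholder cookie value
  )
}

# Example call (not run here):
# html <- fetch_with_headers("http://www.example.com/")
```
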
3. Response
A response consists of a status line, message headers, and a response body.

The status line contains:
- HTTP/version-number: the HTTP protocol version number.
- Status-code and message: the status code and the status message.

Status code (Status-code)
The status code tells the HTTP client whether the HTTP server produced the expected response.
HTTP/1.1 defines five classes of status codes. Each code has three digits, and the first digit defines the response class:
- 1XX informational: the request has been received and processing continues
- 2XX success: the request has been successfully received, understood, and accepted
- 3XX redirection: further action is required to complete the request
- 4XX client error: the request has a syntax error or cannot be fulfilled
- 5XX server error: the server failed to fulfill a valid request
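
Because the first digit determines the class, a crawler can decide how to react to a response with a simple lookup. The following base R sketch illustrates this; `status_class` is a hypothetical helper name.

```r
# Map HTTP status codes to their response class using the first digit.
status_class <- function(code) {
  classes <- c(
    "1" = "informational",
    "2" = "success",
    "3" = "redirection",
    "4" = "client error",
    "5" = "server error"
  )
  # Integer division by 100 keeps only the first digit; unknown digits give NA.
  unname(classes[as.character(code %/% 100)])
}

status_class(c(200, 301, 404, 503))
# "success" "redirection" "client error" "server error"
```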

Message headers
- Server: server software information, such as nginx
- Date: the date of the response
- Last-Modified: the time the resource was last modified
- Content-Type: tells the browser the type of the returned object, for example text/html
- Connection: whether the connection between server and client is kept open
- X-Powered-By: the technology the website is built with, such as PHP
- Content-Length: the length in bytes of the returned content
- Set-Cookie: the most important response header; it sends a cookie to the browser. Each cookie written produces one Set-Cookie header.
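
To show how the status line and message headers fit together, here is a base R sketch that parses the head of a response. The response text is made up for illustration and `parse_response_head` is a hypothetical helper. Note that Set-Cookie can appear more than once, so the headers are kept as a named vector that allows duplicate names.

```r
# A made-up response head: status line, headers, then an empty line.
raw <- paste(
  "HTTP/1.1 200 OK",
  "Server: nginx",
  "Content-Type: text/html",
  "Content-Length: 1024",
  "Set-Cookie: id=1",
  "Set-Cookie: theme=dark",
  "", "",
  sep = "\r\n"
)

parse_response_head <- function(raw) {
  lines <- strsplit(raw, "\r\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(lines)]                      # drop the blank terminator
  status <- strsplit(lines[1], " ", fixed = TRUE)[[1]]
  header_lines <- lines[-1]
  colon <- regexpr(":", header_lines, fixed = TRUE)  # first colon in each line
  values <- trimws(substring(header_lines, colon + 1))
  names(values) <- substring(header_lines, 1, colon - 1)
  list(
    version = status[1],
    code    = as.integer(status[2]),
    message = paste(status[-(1:2)], collapse = " "),
    headers = values
  )
}

res <- parse_response_head(raw)
res$code
res$headers[names(res$headers) == "Set-Cookie"]
```

When fetching with RCurl itself, the headers can instead be collected with `basicHeaderGatherer()`, passing its `update` function as the `headerfunction` option of `getURL()`.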

RCurl Functions

getURL()
getForm()
postForm()
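
A minimal sketch of how these three functions might be used is shown below. It assumes an installed RCurl package. The wrapper names, URLs, and form field names (`q`, `username`, `password`) are hypothetical placeholders, not taken from any real site.

```r
library(RCurl)

# Placeholder endpoints; replace them with pages you are allowed to crawl.
page_url   <- "http://www.example.com/"
search_url <- "http://www.example.com/search"
login_url  <- "http://www.example.com/login"

# getURL(): fetch a page with a GET request and return it as a string.
fetch_page <- function(url = page_url) {
  getURL(url, followlocation = TRUE)   # follow 3XX redirects
}

# getForm(): submit form fields with GET; they are sent in the query string.
search_site <- function(term, url = search_url) {
  getForm(url, q = term)
}

# postForm(): submit form fields with POST; style = "post" sends them
# URL-encoded in the message body.
submit_login <- function(user, password, url = login_url) {
  postForm(url, username = user, password = password, style = "post")
}

# Example calls (not run here):
# html    <- fetch_page()
# results <- search_site("rcurl")
# reply   <- submit_login("user", "secret")
```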
