Another difficulty in R-language network data capture has finally been cracked!




Author: Du Yu, member of the EasyCharts team and columnist for the R-language Chinese community. Interests: Excel business charts, R-language data visualization, and geographic information visualization. Founder of the personal public account "Data Cube" (ID: datamofang).



For previous case data, go to GitHub:
https://github.com/ljtyduyu/DataWarehouse/tree/master/File

Speaking purely of the logic of data capture (leaving aside the ready-made frameworks each ecosystem offers), I personally think the current request libraries in R, RCurl and httr, can fully match urllib and requests in Python (although the Python side is admittedly more professional in its error handling and parsing frameworks!).

We usually face one of two kinds of network data capture requirements:

  • Forge a browser request

  • Drive a real browser

For forged browser requests: although the HTTP specification defines many request methods, crawlers in practice only ever use GET and POST.

Driving a browser has almost no entry barrier: what you see is what you get. RSelenium/Rwebdriver in R and Selenium in Python can both do the job (though the setup is more troublesome).
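A bare-bones sketch of the driven-browser route, assuming a Selenium server is already running locally on port 4444 (the target URL here is just a placeholder):

library("RSelenium")

# Connect to a Selenium server assumed to be listening on localhost:4444
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444L,
                      browserName = "chrome")
remDr$open()
remDr$navigate("http://study.163.com")      # placeholder target page
pageSource <- remDr$getPageSource()[[1]]    # rendered HTML, ready to parse
remDr$close()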

Shixiseng internship recruitment: web crawler and data visualization

GET request parameters can be written directly into the URL, but when there are many of them, splicing the URL by hand is not elegant, so RCurl and httr both provide dedicated ways to submit GET requests. In RCurl, getURL() usually handles GET requests without parameters (or with the parameters spliced into the URL), while getForm() handles GET requests with parameters (the parameters go into its .params argument).
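A minimal sketch of the two RCurl styles, pointed at the httpbin.org echo service (the parameter names are made up for illustration):

library("RCurl")

# GET with no parameters (or with the query string spliced into the URL)
res1 <- getURL("https://httpbin.org/get?keyword=R&page=1")

# GET with parameters handed over separately via .params
res2 <- getForm("https://httpbin.org/get",
                .params = c(keyword = "R", page = "1"))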

Left-hand R, right-hand Python series: Zhihu Live course capture in practice

R-language crawler practice: crawling Zhihu Live course data

The GET() function in httr likewise performs GET requests, with the query argument as the designated way to submit request parameters (you can still choose to write them into the URL instead).
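The httr counterpart, again as a sketch against the httpbin.org echo service:

library("httr")

# The query list is httr's designated slot for GET parameters
resp <- GET("https://httpbin.org/get",
            query = list(keyword = "R", page = 1))
content(resp)   # parsed response body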

As for POST requests, the most common vehicle for API calls (though some APIs are served via GET), they are usually more complex: the query parameters must be placed in the request body, and an encoding method (the Content-Type field of the request header) must be declared before the parameters are sent.

Generally, there are four encoding methods:

  • application/x-www-form-urlencoded

  • application/json

  • multipart/form-data

  • text/xml

If you want to understand these four methods in depth, refer to the following two articles, or go straight to dedicated material on the HTTP protocol and browser internals.

http://www.cnblogs.com/111testing/p/6079565.html
https://bbs.125.la/thread-13743350-1-1.html


Of these four encodings I have personally used only the first two; the third is for file uploads, which I have not needed yet, and the fourth is rare. In the postForm function of the RCurl package, only the first and the third get explicit declarations through the style argument (style = 'post' and style = 'httppost' respectively); the second and the fourth are not listed among the style options at all. In parameter handling, httr is very friendly: the encoding is declared explicitly, covering each of the four methods, as the sketch below shows.
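A sketch of how httr declares each encoding, again against httpbin.org (the file path in the multipart line is a placeholder):

library("httr")

POST("https://httpbin.org/post", body = list(a = 1), encode = "form")   # application/x-www-form-urlencoded
POST("https://httpbin.org/post", body = list(a = 1), encode = "json")   # application/json
POST("https://httpbin.org/post",
     body = list(f = upload_file("data.csv")),                          # placeholder file
     encode = "multipart")                                              # multipart/form-data
# text/xml has no dedicated encode value: send a raw string and set the header yourself
POST("https://httpbin.org/post", body = "<a>1</a>", content_type("text/xml"))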

Left-hand R, right-hand Python series: simulated login to an educational administration system

R-language crawler practice: crawling section data of NetEase Cloud Classroom data-analysis courses

As you know, APIs that return json payloads are everywhere on the web front-end. This problem plagued me for a long time; I even suspected that the POST machinery of the RCurl package simply did not support json parameters (yet RCurl binds directly to libcurl, the general-purpose C transfer library for this kind of work, and RCurl is what httr uses at the bottom layer, so whatever httr can do, RCurl naturally ought to manage as well).

So the json upload method had to be hidden away by the author, or never encapsulated into a high-level function and left sitting at the bottom layer; nothing else would explain it. Today I skimmed a short piece by an expert on LinkedIn, tried the idea on a whim, and it worked! It also bears out the earlier guess: json was probably not yet mainstream when RCurl first appeared, so passing json parameters was never given a visible slot among the style options of the POST functions. The httr package, by contrast, makes declaring the encoding of all POST parameters effortless (Hadley, bless him, is always one step ahead for the benefit of mankind). http://www.linkedin.com/pulse/web-data-acquisition-structure-rcurl-request-part-2-roberto-palloni
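The trick, boiled down to a minimal sketch against httpbin.org before the full case below: serialize the body yourself and pass it in through .opts:

library("RCurl")
library("jsonlite")

payload <- list(keyword = "R", page = 1)

# The json string rides in via the postfields option; the header
# declares application/json. No style argument involved at all.
res <- postForm(
  "https://httpbin.org/post",
  .opts = list(
    postfields = toJSON(payload, auto_unbox = TRUE),
    httpheader = c("Content-Type" = "application/json;charset=UTF-8")
  )
)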

Now to the topic of this article: a case study, with code, of constructing a POST request with the RCurl package and submitting json string parameters. Compared with httr, the RCurl library leans toward the bottom layer, with many and complex functions, while httr is more agile, lightweight, and concise. Their relationship is like that of urllib and requests in Python.

Header construction and query parameters:

library("RCurl")
library("jsonlite")
library("magrittr")headers<-c(  "User-Agent"="Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",  "Content-Type"="application/json;charset=UTF-8",  "Origin"="http://study.163.com",  "edu-script-token"="3de0ac05e45a4c9693f2a3605fbfaede")Cookies<- 'EDUWEBDEVICE=0e88ed7a9b8b4fc4bfc615e251aa8863; _ntes_nnid=833311d30eccaa9f26425affae2cfef1,1509630974190; _ntes_nuid=833311d30eccaa9f26425affae2cfef1; STUDY_CP_ENTRANCE_CLOSE=1; STUDY_MIND_TELBIND_CLOSE=1; promTips=1; NTESSTUDYSI=3de0ac05e45a4c9693f2a3605fbfaede; STUDY_INFO=oP4xHuM9V1uY_TRID1ZzHOm5mkJY|6|12769628|1510452481337; STUDY_SESS="ryW1kbDYUPDpeexx7xnRK3cGH1nUulhhGJ9D1qIzrgCuSgHl96KRz9xVAwMO6jui9A7L4BJ6hUaaPXVtLf+pWBMHJEw6UtuJtXPjl2yhevq6wVYQtAaJ+N+2maA3kITUrUnNZE+nE0TmYBoQFSEgOKEjUu22o4FQQjD/GeyPrqJsX8TS4muhOF/kd9duGihHABYarJu/gx1XyYYhTqa89g=="; STUDY_PERSIST="8e2epkzTa+8Xy6agY2FPkUzktd9F+8CZ1rZxShzQSSRJ6RbRK2pSzoTqPic7hOB7dYsCtyfpIAD9Ue4S1oRerMBML+fd8iksmANh7THsUTBAXY6WM4kHXJFcNuERKuWuDeHOMilu1y+7T3/a7Jav0QPXrTaWx6YerFKJia2+3rEucY6CQ9waCFR9bhYObYkE6X9kJ71ahCvMYtkr9eXcE6s1rFdKOIgMEtQwxl1Jb8oli9XIBhsosLWHLIUZIfzGwHfmjuVpkfHAfiCIxUJfLiv82sP6EP+Q6n6O/pIeGx0="; STUDY_MIND_TELBIND=1; NETEASE_WDA_UID=12769628#|#1451204764916; NTES_STUDY_YUNXIN_ACCID=s-12769628; NTES_STUDY_YUNXIN_TOKEN=da46d92b7a9504736f2534ed1d366296; STUDY_NOT_SHOW_PROMOTION_WIN=true; utm="eyJjIjoiIiwiY3QiOiIiLCJpIjoiIiwibSI6IiIsInMiOiIiLCJ0IjoiIn0=|aHR0cDovL3N0dWR5LjE2My5jb20vY291cnNlcy1zZWFyY2g/a2V5d29yZD0lRTUlODglOTglRTUlODclQUY="; __utma=129633230.621520113.1509630968.1510452483.1510452504.13; __utmb=129633230.12.9.1510455608393; __utmc=129633230; __utmz=129633230.1509630968.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'headers['Cookie']<-CookiesPayloads=list(
       "pageIndex"=1,
       "pageSize"=50,
       "relativeOffset"=0,
       "frontCategoryId"="400000000158033",       "searchTimeType"=-1,
       "orderType"=0,
       "priceType"=-1,
       "activityId"=0        )

Build an automatic capture function:

GetCourses <-function (url, header = headers, Payload = Payloads) {fullinfo <-data. frame () d <-debugGatherer () handle <-getCurlHandle (debugfunction = d $ update, followlocation = TRUE, cookiefile = "", verbose = TRUE)
For (I in) {Payload [[['pageindex'] = I Payload [['relativeoffset'] = 50 * i-50 tryCatch ({content <-postForm (url ,. opts = list (postfields = toJSON (Payload, auto_unbox = TRUE), httpheader = header ),. encoding = "UTF-8", curl = handle)
### Encode the parameters submitted for post as json strings and encapsulate them in configuration parameter. opts of the postForm function to pass json query parameters !,
###I don't even want to write it in the style, so it's confusing! Note that when toJSON is used for serialization, auto_unbox must be set to TRUE; otherwise, the default value is TRUE, and a single value will be included in the list!Response <-content %> % fromJSON () %> % '[[' (3) %> % '[[' (2) fullinfo <-response %> % rbind (fullinfo ,.) cat (sprintf ("Page [% d] has been crawled! ", I), sep =" \ n ")}, error = function (e) {cat (sprintf (" 【 [% d] Page capture failed! ", I), sep =" \ n ")}) Sys. sleep (runif (1)} cat (" all page is OK !!! ")
Return (fullinfo )}
# Running Functions
Url <-'HTTP: // study.163.com/p/search/studycourse.json'
Myresult <-GetCourses (url)

# Preview Data
DT: datatable (myresult)
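For comparison, here is what the same request would look like in httr. This is only a sketch, not part of the worked example above; it reuses the url, headers, and Payloads objects already defined, and lets encode = "json" do the serialization that we did by hand with toJSON:

library("httr")

# Sketch: the httr counterpart of the postForm call above.
# encode = "json" serializes the body, so no manual toJSON is needed.
resp <- POST(
  url,
  add_headers(.headers = headers),
  body = Payloads,
  encode = "json"
)
response <- content(resp)   # parsed json as an R list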

With that, the two heavyweight request libraries of the R language, RCurl and httr, have each been explored, with worked cases and output, for the mainstream GET and POST requests. Going forward I will keep adding advanced techniques for getting around anti-crawler measures!

Note: the Cookie in the header above is what keeps the request from being rejected by the server, and the edu-script-token parameter is a session token, something you can think of as a key. So if you want to reproduce this article, both values must be extracted from your own Chrome session; running the code verbatim is unlikely to return any data!

Related course recommendations


R-language crawler case studies: NetEase Cloud Classroom, Zhihu Live, Toutiao.com, and Bilibili video


Course content: everything in this course comes from my own hands-on experience and notes accumulated along the way. I hope to take this opportunity to share my crawler-learning experience, make a small contribution to improving the R-language crawler ecosystem and promoting its tooling, and give my own crawler study a staged summary.

