R Language Data Capturing Practice: RCurl + XML Combination and XPath Parsing

Source: Internet
Author: User
Tags: variable scope


Du Yu, EasyCharts team member and column creator at the R language Chinese community, with the following interests: Excel business charts, R language data visualization, and geographical information data visualization. Personal public account: Data Cube (ID: datamofang), founder of "Data Cube".


Friends often ask me what to do about null values, missing values, or nonexistent values encountered while capturing network data with the R language.

In most cases, the data we capture from the Internet is relational, which requires a one-to-one correspondence between fields and records. However, the structure of HTML documents varies widely and the markup is complicated, so it is difficult to guarantee that the extracted data is strictly relational; a large number of missing values and nonexistent elements must be handled with explicit checks.

If the raw data is relational but the captured fields are out of order and records cannot be matched one by one, the data is usually of little value. So today I use a small case (the same as yesterday's) to demonstrate how to set up logical checks inside nested loops during web page traversal, filling in preset values for missing and nonexistent entries, so that your crawler code is more stable and its output more regular.

Load the extension packages:

# Package loading:
library("XML")
library("stringr")
library("RCurl")
library("dplyr")
library("rvest")
# Provide URL/header parameters for the target website
url <- 'https://read.douban.com/search?q=python'
header <- c('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36')

Build a crawling function:

getcontent <- function(url){
  # This data frame is the initial value returned for the final data summary
  myresult <- data.frame()
  # These empty vectors are the initial values for accumulating single-page book records
  title <- author <- category <- subtitle <- eveluate_nums <- rating <- price <- c()
  # Start traversing the web pages
  for (page in seq(0, 3)){
    # Build the link for each page
    link <- paste0(url, '&start=', page * 10)
    # Request the web page and parse it
    content <- getURL(link, httpheader = header) %>% htmlParse()
    # Compute the number of entries on a single page
    length <- content %>% xpathSApply(., "//ol[@class='ebook-list column-list']/li") %>% xmlSize()
    ### Extract titles:
    title <- content %>% xpathSApply(., "//ol/li//div[@class='title']/a | //ol/li//h4/a", xmlValue) %>% c(title, .)
    ### Extract book categories:
    category <- content %>% xpathSApply(., "//span[@class='category']/span[2]/span | //p[@class='category']/span[@class='labled-text'] | //div[@class='category']", xmlValue) %>% c(category, .)
    ### Preset vectors for author/subtitle/comment count/rating/price information:
    author_text <- subtitle_text <- eveluate_nums_text <- rating_text <- price_text <- rep('', length)
    for (i in 1:length){
      ### Extract the author
      author_text[i] <- content %>% xpathSApply(., sprintf("//li[%d]//p[@class]//span/following-sibling::span/a | //li[%d]//div[@class='author']/a", i, i), xmlValue) %>% paste(., collapse = '/')
      ### Check whether the subtitle exists
      if (content %>% xpathSApply(., sprintf("//ol/li[%d]//p[@class='subtitle']", i), xmlValue) %>% length != 0){
        subtitle_text[i] <- content %>% xpathSApply(., sprintf("//ol/li[%d]//p[@class='subtitle']", i), xmlValue)
      }
      ### Check whether the evaluation count exists
      if (content %>% xpathSApply(., sprintf("//ol/li[%d]//a[@class='ratings-link']/span", i), xmlValue) %>% length != 0){
        eveluate_nums_text[i] <- content %>% xpathSApply(., sprintf("//ol/li[%d]//a[@class='ratings-link']/span", i), xmlValue)
      }
      ### Check whether the rating exists
      if (content %>% xpathSApply(., sprintf("//ol/li[%d]//div[@class='rating list-rating']/span[2]", i), xmlValue) %>% length != 0){
        rating_text[i] <- content %>% xpathSApply(., sprintf("//ol/li[%d]//div[@class='rating list-rating']/span[2]", i), xmlValue)
      }
      ### Check whether the price exists
      if (content %>% xpathSApply(., sprintf("//ol/li[%d]//span[@class='price-tag']", i), xmlValue) %>% length != 0){
        price_text[i] <- content %>% xpathSApply(., sprintf("//ol/li[%d]//span[@class='price-tag']", i), xmlValue)
      }
    }
    # Concatenate the single-page records onto the cumulative vectors
    author <- c(author, author_text)
    subtitle <- c(subtitle, subtitle_text)
    eveluate_nums <- c(eveluate_nums, eveluate_nums_text)
    rating <- c(rating, rating_text)
    price <- c(price, price_text)
    # Print the single-page task status
    print(sprintf("page %d is over!!!", page))
  }
  # Construct the data frame
  myresult <- data.frame(title, subtitle, author, category, price, rating, eveluate_nums)
  # Print the overall task status
  print("everything is OK")
  # Return the final summary data frame
  return(myresult)
}

Provide the URL and run the crawler function we built:

myresult=getcontent(url)

[1] "page 0 is over !!! "

[1] "page 1 is over !!! "

[1] "page 2 is over !!! "

[1] "page 3 is over !!! "

[1] "everything is OK"


View the data structure:

str(myresult)

Standardize the variable types:

myresult$price <- myresult$price %>% sub("元|免费", "", .) %>% as.numeric()
myresult$rating <- as.numeric(myresult$rating)
myresult$eveluate_nums <- as.numeric(myresult$eveluate_nums)

Preview Data:

DT::datatable(myresult)

When constructing an automatic crawling function, the challenge is not only handling missing and nonexistent values but also setting variable scope correctly. The function above uses two nested for loops, with four if statements inside the inner loop. Because the path to an individual field is not unique, I use the XPath union operator "|" to combine multiple paths for the sake of consistent output.
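The union operator can be tried out on a small inline snippet. The HTML below is a made-up miniature of the two title layouts the crawler handles; it only illustrates how a single "|" expression covers both paths at once:

```r
library(XML)

# Hypothetical snippet: one item marks its title with <div class="title">,
# the other with <h4>, mimicking the two layouts on the search page.
html <- '<ol>
  <li><div class="title"><a>Book A</a></div></li>
  <li><h4><a>Book B</a></h4></li>
</ol>'
doc <- htmlParse(html, asText = TRUE)

# One union expression with "|" extracts the titles from both layouts,
# returned in document order:
titles <- xpathSApply(doc, "//ol/li//div[@class='title']/a | //ol/li//h4/a", xmlValue)
print(titles)  # "Book A" "Book B"
```

Without the union, you would need two separate extraction passes and then have to re-interleave the results into record order by hand.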

The general idea for detecting missing values (and filling in nonexistent ones) is to evaluate the XPath path of each record on each page and check the length of the result: if the length is 0, the corresponding element does not exist for that record.

By presetting a vector of default values whose length equals the number of records on the page, you only need to insert the records that do exist (those with nonzero length) into the corresponding positions by subscript; the if statement then only needs a true branch, since the false case simply keeps the preset default.
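That preset-vector pattern can be isolated into a minimal sketch. The snippet below is invented for illustration: only the first of two records carries a subtitle, and the length-0 check leaves the default empty string in place for the second, keeping fields and records aligned:

```r
library(XML)

# Hypothetical snippet: two records, but only the first has a subtitle.
# A blind extraction would return one value for two records and misalign fields.
html <- '<ol>
  <li><p class="subtitle">An intro</p></li>
  <li><p class="other">no subtitle here</p></li>
</ol>'
doc <- htmlParse(html, asText = TRUE)

n <- length(xpathApply(doc, "//ol/li"))
subtitle_text <- rep('', n)  # preset vector: every record defaults to ""
for (i in 1:n){
  node <- xpathSApply(doc, sprintf("//ol/li[%d]/p[@class='subtitle']", i), xmlValue)
  # length 0 means the node does not exist for record i; keep the preset value
  if (length(node) != 0) subtitle_text[i] <- node
}
print(subtitle_text)  # "An intro" ""
```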

For the dazzling variety of XPath expressions, refer to the article below, or go to W3School to view the full reference.

R on the Left Hand, Python on the Right, Series 16: XPath and Webpage Parsing Libraries

For online courses, click the link at the end of the article:
For previous case data, go to GitHub:
https://github.com/ljtyduyu/DataWarehouse/tree/master/File

Related course recommendations


R language crawler case studies: NetEase Cloud Classroom, Zhihu Live, toutiao.com, and Bilibili video


Shared content: all the content and cases in this course come from my experiences and notes while learning crawling. I hope to take this opportunity to share my crawler learning experience with you, make a small contribution to improving the R language crawler ecosystem and promoting its tools, and give a staged summary of my own crawler learning.

Click "Read the original article" to join the course now.
