R Web Scraping in Practice 1 (Learning Notes): Based on the rvest Package

Source: Internet
Author: User

This post uses the rvest package, developed by Hadley Wickham. Once again, I bow to the master who keeps building so many great packages for the R community.

The material I read:

    1. rvest's GitHub repository
    2. rvest's own help documentation
    3. "rvest + CSS Selector: the best choice for web data scraping" by Daishen: the page mentions a quick way to find the HTML selectors you need. After reading it, my earlier habit of staring at the page source for half a day seemed really funny. I tested it: in the browser, right-clicking and choosing Inspect Element gives similar results. Daishen's blog has a number of related articles, and most Chinese-language rvest material is based on his blog. Grateful!

Anyway, I practiced on a few sites. I scraped a bit of Lagou (lagou.com), tried a foreign Yellow Pages site, scraped eBay user feedback to analyze a seller's main price bands (for the seller I checked, items at 8.99 and 39.99 sold the most, and they were shoes), did a bit of text mining, and scraped some stock data, fund purchase data, and so on.

Why use Lagou as the example? Because everyone is more familiar with it, I suppose; the others are a bit niche =_=. I have no plans to move myself, but the start of the year is peak job-hopping season. Also, as people have said before, one way to follow what a company is doing is to look at the positions it advertises: you can tell which lines of business it has been expanding recently, which ones people are leaving, and what technical skills it needs.

rvest basic syntax:

    library(rvest)

    lagou <- "http://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?kd=%E6%95%B0%E6%8d%ae%e5%88%86%e6%9e%90&spc=2&pl=&gj=&xl=&yx=&gx=&st=&labelwords=&lc=&workaddress=&city=%e6%b7%b1%e5%9c%b3&requestid=&pn=3"

    # Read the page, specifying the encoding.
    # (Older rvest versions used html(); it was replaced by read_html().)
    web <- read_html(lagou, encoding = "UTF-8")

    # I used to search the HTML source by keyword to work out which selectors
    # html_nodes() needed, but many browsers have developer tools that show
    # the element hierarchy directly.
    position <- web %>% html_nodes("li div.hot_pos_l a") %>% html_text()
    # The line above reads the data directly and returns the position titles.

    # On other sites I later found that the information is sometimes stored in
    # elements of the same kind (a div without a class, for example). My advice
    # is to find the large enclosing block first, get the list of records, and
    # then extract the fields from it.
    list_lagou <- web %>% html_nodes("li.clearfix")
    # Finding the right dividing point matters here. Some items are
    # <li class="odd clearfix">, and li.clearfix matches those too. For a class
    # attribute containing a space, pick one of the two classes, such as
    # "li.odd" or "li.clearfix".

    # The company/position fields that follow are selected from this list.
    # Because the list is fixed in advance, you know how many items each
    # extraction should return.
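The selector behaviour described above can be tried offline. The following is a minimal sketch that assumes a made-up HTML fragment (it is not Lagou's real markup, and the job titles and paths are hypothetical); it only uses `read_html()`, `html_nodes()`, `html_text()` and `html_attr()`.

```r
library(rvest)

# Hypothetical HTML standing in for a listing page; not Lagou's real markup.
page <- read_html('<ul>
  <li class="odd clearfix">
    <div class="hot_pos_l"><a href="/jobs/1?source=search">Data Analyst</a></div>
  </li>
  <li class="clearfix">
    <div class="hot_pos_l"><a href="/jobs/2?source=search">BI Engineer</a></div>
  </li>
</ul>')

# "li.clearfix" matches any <li> whose class list contains "clearfix",
# so the "odd clearfix" item is included as well.
listings <- page %>% html_nodes("li.clearfix")
n_listings <- length(listings)

# "li.odd" matches only the item that also carries the "odd" class.
n_odd <- length(page %>% html_nodes("li.odd"))

# Extract the link text and the href attribute inside each listing.
links <- listings %>% html_nodes("div.hot_pos_l a")
position <- links %>% html_text()
href <- links %>% html_attr("href")

# Strip the tracking suffix from each link.
link <- gsub("?source=search", "", href, fixed = TRUE)
```

Selecting the enclosing `li` first and then querying inside it keeps every later extraction tied to the same set of listings.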

With the principle covered, let's try writing the code.

This involves a lot of data selection, so to avoid creating too many variables I wrapped everything in a function that outputs a data frame.

Function section

    # First write a function, getdata, that returns a data frame.
    getdata <- function(page, urlwithoutpage) {
      # urlwithoutpage is the URL without the page number
      url <- paste0(urlwithoutpage, page)
      # Read the page, specifying the encoding
      # (older rvest versions used html() instead of read_html())
      web <- read_html(url, encoding = "UTF-8")
      # Get the list of listings: 15 positions per page
      list_lagou <- web %>% html_nodes("li.clearfix")

      title <- list_lagou %>% html_nodes("div.hot_pos_l div.mb10 a") %>% html_text()
      company <- list_lagou %>% html_nodes("div.hot_pos_r div.mb10 a") %>% html_text()
      link <- gsub("\\?source\\=search", "",
                   list_lagou %>% html_nodes("div.hot_pos_l div.mb10 a") %>% html_attr("href"))

      # The remaining fields all sit in spans with no clean way to tell them
      # apart, so fetching them is more involved. I studied the layout: take
      # the spans of all 15 listings (15 x 6 = 90), then use seq() to pick out
      # each field. I would like to know if there is a better way.
      # If the page had a table, data.table could take the values directly
      # and faster...
      temp <- list_lagou %>% html_nodes("div.hot_pos_l span")
      # Note: the label strings below appear here in English translation;
      # match whatever text the live page actually uses.
      city <- temp[seq(1, 90, by = 6)] %>% html_text()
      salary <- gsub("Salary:", "", temp[seq(2, 90, by = 6)] %>% html_text())
      year <- gsub("Experience:", "", temp[seq(3, 90, by = 6)] %>% html_text())
      degree <- gsub("Minimum Education:", "", temp[seq(4, 90, by = 6)] %>% html_text())
      benefit <- gsub("Position Temptation:", "", temp[seq(5, 90, by = 6)] %>% html_text())
      time <- temp[seq(6, 90, by = 6)] %>% html_text()

      data.frame(title, company, city, salary, year, degree, benefit, time, link)
    }
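On the question of a better way than the `seq()` stride: one option, sketched here under assumptions, is to reshape the span texts into a matrix with one row per listing. The HTML is made up, with three spans per listing instead of six, and the `"Salary: "` and `"Experience: "` labels are illustrative. Like the stride approach, this only works if every listing has the same number of spans, so the sketch checks that first.

```r
library(rvest)

# Hypothetical markup: 2 listings, each with 3 unlabeled spans.
page <- read_html('<ul>
  <li class="clearfix"><div class="hot_pos_l">
    <span>Shenzhen</span><span>Salary: 10k-15k</span><span>Experience: 1-3 years</span>
  </div></li>
  <li class="clearfix"><div class="hot_pos_l">
    <span>Shenzhen</span><span>Salary: 8k-12k</span><span>Experience: 3-5 years</span>
  </div></li>
</ul>')

spans <- page %>% html_nodes("li.clearfix div.hot_pos_l span") %>% html_text()

fields_per_listing <- 3
# Both approaches assume a fixed number of spans per listing.
stopifnot(length(spans) %% fields_per_listing == 0)

# Stride indexing, as in getdata(): every 3rd element starting at the 2nd.
salary_by_seq <- spans[seq(2, length(spans), by = fields_per_listing)]

# Equivalent reshaping: one row per listing, one column per field.
m <- matrix(spans, ncol = fields_per_listing, byrow = TRUE)
colnames(m) <- c("city", "salary", "year")
jobs <- as.data.frame(m, stringsAsFactors = FALSE)

# Remove the field labels.
jobs$salary <- gsub("Salary: ", "", jobs$salary, fixed = TRUE)
jobs$year <- gsub("Experience: ", "", jobs$year, fixed = TRUE)
```

Using `length(spans)` instead of a hard-coded 90 also keeps the indexing correct on a final page that holds fewer than 15 listings.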
With the function in hand, crawl one page first!

    url <- "http://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?kd=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&spc=2&pl=&gj=&xl=&yx=&gx=&st=&labelwords=&lc=&workaddress=&city=%e6%b7%b1%e5%9c%b3&requestid=&pn="
    final <- data.frame()
    # Set which pages to fetch; `i in 3` runs once, for page 3 only.
    # Each pass appends the data frame returned by getdata() above.
    for (i in 3) {
      final <- rbind(final, getdata(i, url))
    }
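To fetch several pages instead of one, the same merge can be written with `lapply()` and a single `rbind` at the end. The sketch below runs offline: `fake_getdata` is a hypothetical stand-in for `getdata()` that makes no network request, and the base URL is a placeholder.

```r
# Hypothetical stand-in for getdata(); it builds a row per page so the
# sketch runs without any network access.
fake_getdata <- function(page, urlwithoutpage) {
  data.frame(
    page = page,
    url = paste0(urlwithoutpage, page),
    stringsAsFactors = FALSE
  )
}

base_url <- "http://example.com/jobs?pn="  # placeholder, not the real site
pages <- 1:3

# One data frame per page, bound together once at the end.
results <- lapply(pages, fake_getdata, urlwithoutpage = base_url)
final <- do.call(rbind, results)
```

Binding once at the end avoids growing `final` inside the loop, which matters little for three pages but adds up over many.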
View Crawl Results
Analyze data
What is this data good for? Put simply, we can use it to see how many positions are being advertised online, each company's share of the postings, and the salary levels, and do a bit of basic data analysis.

Although I'm not changing jobs right now, it's still good to understand the market. Judging from the relationship between average salary and years of experience in the current online postings, a data analysis role sees salary growth for at least the first five years: fast at first and slower later, but averaging roughly 13% growth? Also, few senior positions are posted online at the moment (very few postings ask for 5-10 years of experience), and some companies misclassify their postings, filing a bunch of data entry jobs under data analysis.
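A comparison of average salary by years of experience could be computed along these lines. This is only a sketch: the rows are made up for illustration (they are not the crawl results), and both the `"10k-15k"` salary format and the helper `salary_mid` are assumptions.

```r
# Made-up rows in the shape getdata() returns; not real crawl results.
# The "10k-15k" salary format is an assumption.
jobs <- data.frame(
  year = c("1-3 years", "1-3 years", "3-5 years", "3-5 years"),
  salary = c("8k-12k", "10k-14k", "15k-20k", "15k-25k"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: midpoint of a "10k-15k" style range, in thousands.
salary_mid <- function(x) {
  parts <- strsplit(gsub("k", "", x, fixed = TRUE), "-", fixed = TRUE)
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

jobs$mid <- salary_mid(jobs$salary)

# Average midpoint salary for each experience band.
avg_by_year <- aggregate(mid ~ year, data = jobs, FUN = mean)
```

Taking the midpoint of the advertised range is a simplification; the lower or upper bound could be used instead, depending on the question being asked.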

(To be continued... The code still needs polishing: it should wait between requests, otherwise you will get blocked!!!)
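One way to add that interval, sketched under assumptions: a small wrapper that sleeps between requests and skips pages that fail. `crawl_politely` and `demo_fetch` are hypothetical names; in real use `fetch` would be something like `function(p) getdata(p, url)`, and a suitable delay depends on the site.

```r
# Hypothetical wrapper: space out requests and skip pages that raise an error.
crawl_politely <- function(pages, fetch, delay = 1) {
  results <- vector("list", length(pages))
  for (k in seq_along(pages)) {
    if (k > 1) Sys.sleep(delay)  # wait between requests, not before the first
    # A failed page yields NULL and is dropped below.
    results[k] <- list(tryCatch(fetch(pages[k]), error = function(e) NULL))
  }
  results <- Filter(Negate(is.null), results)
  do.call(rbind, results)
}

# Offline demonstration: a fake fetcher that fails on page 2.
demo_fetch <- function(p) {
  if (p == 2) stop("simulated failure")
  data.frame(page = p)
}
out <- crawl_politely(1:3, demo_fetch, delay = 0.1)
```

A fixed delay is the simplest choice; a randomized delay or retries for failed pages would be natural extensions.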
