A first attempt at an R language crawler, based on learning the rvest package
Thursday, February
After finishing Coursera's Getting and Cleaning Data course, I continued learning how to scrape the web with R. The main tool is the rvest package developed by Hadley Wickham. Once again I bow down to the great god who keeps producing so many good R packages for the community.
The references I read:
- rvest's GitHub repository
- rvest's built-in help documentation
- rvest + CSS Selector: the best choice for scraping web data (Dai Shen): one part of the post explains how to quickly locate the HTML fragment you need. After reading it, I found it funny that I had previously spent half a day staring at page source to find the right segment. After testing, right-clicking in the browser and choosing "Inspect element" gives similar results. Dai Shen's blog has a number of related articles, and most of the Chinese material on rvest traces back to it. Grateful!
Anyway, I practiced on a few pages: I scraped some listings from Lagou (a Chinese tech job board), tried a foreign yellow-pages site, crawled eBay user reviews to see which price bands a seller mainly sells in (for the seller I checked, $8.99 and $39.99 items, mostly shoes, were the most common), did a bit of text mining, and also scraped some stock data, fund purchase records, and so on.
I use Lagou as the example because it is the site most people here are familiar with (the others are a bit... =_=), and although I am not looking to move, the start of the year is peak job-hopping season for many people. Besides, as people have pointed out before, one way to understand a company's direction is to look at the positions it posts: you can see which lines of business are expanding, which are winding down, and what technical skills are in demand.
rvest basic syntax:
```r
library(rvest)
lagou <- "http://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?kd=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&spc=2&pl=&gj=&xl=&yx=&gx=&st=&labelwords=&lc=&workaddress=&city=%e6%b7%b1%e5%9c%b3&requestid=&pn=3"
web <- html(lagou, encoding = "UTF-8")  # read the page, specifying the encoding

# I found the attributes for html_nodes() by keyword-searching the raw HTML,
# but most browsers have developer tools that show the element hierarchy directly.
position <- web %>% html_nodes("li div.hot_pos_l a") %>% html_text()
# The line above reads the data directly and extracts the position titles.

# On other sites I later found that sometimes the fields cannot be told apart
# (e.g. a div with no class), so it is better to find a larger container first,
# grab that block of nodes, and then extract the fields from it.
list_lagou <- web %>% html_nodes("li.clearfix")
# Finding the right dividing point matters here. The elements are
# <li class="odd clearfix">, and li.clearfix works just as well: when an element
# has two classes, you can select it by either one, e.g. "li.odd" or "li.clearfix".
# Once the list is split per job in advance, each field comes out with the
# right number of elements.
```
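(Side note: the post uses rvest's old `html()` function; in current versions of rvest it has been replaced by `read_html()`, and `html_elements()` is the newer name for `html_nodes()`. A minimal sketch of the same steps with the newer names, assuming the 2015-era Lagou markup, which has almost certainly changed since:)

```r
# Same extraction with the current rvest API; selectors assume the old page layout.
library(rvest)

lagou <- "http://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90"
web <- read_html(lagou, encoding = "UTF-8")          # read and parse the page

position <- web %>%
  html_elements("li div.hot_pos_l a") %>%            # same CSS selector as above
  html_text()

list_lagou <- web %>% html_elements("li.clearfix")   # one node per job listing
```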
With the principles covered, let's try writing the actual code.
Because there is a lot of selecting and cleaning involved, and to avoid ending up with too many loose variables, I wrapped everything in a function that returns a data frame.
The function:
```r
# Start writing the code: first a function getdata() that returns a data frame.
getdata <- function(page, urlwithoutpage) {
  url <- paste0(urlwithoutpage, page)          # pass in the Lagou URL without the page number
  web <- html(url, encoding = "UTF-8")         # read the page, specifying the encoding
  list_lagou <- web %>% html_nodes("li.clearfix")   # the list of 15 job postings

  title   <- list_lagou %>% html_nodes("div.hot_pos_l div.mb10 a") %>% html_text()
  company <- list_lagou %>% html_nodes("div.hot_pos_r div.mb10 a") %>% html_text()
  link    <- gsub("\\?source\\=search", "",
                  list_lagou %>% html_nodes("div.hot_pos_l div.mb10 a") %>% html_attr("href"))

  # The remaining fields all sit in <span> tags with no distinguishing class,
  # so extraction is a bit clumsy: grab all the spans for the 15 postings
  # and index them with seq(). There may well be a better way to study later;
  # if the page had a real table, data.table would make this much faster.
  temp    <- list_lagou %>% html_nodes("div.hot_pos_l span")
  city    <- temp[seq(1, 90, by = 6)] %>% html_text()
  salary  <- gsub("Monthly Salary:",      "", temp[seq(2, 90, by = 6)] %>% html_text())
  year    <- gsub("Experience:",          "", temp[seq(3, 90, by = 6)] %>% html_text())
  degree  <- gsub("Minimum Education:",   "", temp[seq(4, 90, by = 6)] %>% html_text())
  benefit <- gsub("Position Temptation:", "", temp[seq(5, 90, by = 6)] %>% html_text())
  time    <- temp[seq(6, 90, by = 6)] %>% html_text()

  data.frame(title, company, city, salary, year, degree, benefit, time, link)
}
```
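As the comments say, indexing the spans with seq(1, 90, by = 6) is fragile: if one listing has a missing field, every later column shifts. One possible alternative, sketched under the same selector assumptions (getdata2 is a hypothetical name, not part of the original post), is to extract the six spans per listing node:

```r
# A sketch of a per-listing alternative to the seq() indexing: take the six
# <span> fields inside each job node separately, so a malformed listing is
# skipped instead of silently shifting every later column.
getdata2 <- function(page, urlwithoutpage) {
  url  <- paste0(urlwithoutpage, page)
  web  <- html(url, encoding = "UTF-8")
  jobs <- web %>% html_nodes("li.clearfix")

  rows <- lapply(jobs, function(job) {
    spans <- job %>% html_nodes("div.hot_pos_l span") %>% html_text()
    if (length(spans) != 6) return(NULL)      # skip listings that do not have all six fields
    data.frame(
      title   = job %>% html_node("div.hot_pos_l div.mb10 a") %>% html_text(),
      company = job %>% html_node("div.hot_pos_r div.mb10 a") %>% html_text(),
      city    = spans[1],
      salary  = gsub("Monthly Salary:",      "", spans[2]),
      year    = gsub("Experience:",          "", spans[3]),
      degree  = gsub("Minimum Education:",   "", spans[4]),
      benefit = gsub("Position Temptation:", "", spans[5]),
      time    = spans[6],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)   # rbind() drops the NULL entries automatically
}
```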
Then use this function; here I crawl a few pages.
```r
# Use the function
url <- "http://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?kd=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&spc=2&pl=&gj=&xl=&yx=&gx=&st=&labelwords=&lc=&workaddress=&city=%e6%b7%b1%e5%9c%b3&requestid=&pn="
final <- data.frame()
for (i in 3:5) {                          # set the page numbers to crawl here
  final <- rbind(final, getdata(i, url))  # stack the data frames returned by getdata()
}
head(final)
```
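A small variant of the loop, assuming the same getdata() and url: pausing between requests keeps the scraper a little friendlier to the site, and do.call(rbind, lapply(...)) avoids growing final inside the loop.

```r
# Same idea as the for loop above, with a pause between pages.
pages <- 3:5
final <- do.call(rbind, lapply(pages, function(i) {
  Sys.sleep(2)        # be polite: wait a couple of seconds between requests
  getdata(i, url)
}))
head(final)
```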
The first list is done. This is what the scraped result looks like.
What is this data good for? Simply put, we can use it to see how many Internet companies are hiring, which companies account for the most postings, and what salary levels look like, as a bit of basic data analysis.
Although I am not job-hopping right now, it is still useful to understand the market. Looking at the current postings, and at average salary against years of experience, a data-analysis position has at least five years of salary growth ahead of it: the early rise is fast, the later rise slower, but the average growth looks like roughly 13% a year? Also, there are currently almost no senior positions posted online (postings asking for 5-10 years of experience are very rare), and some companies have misclassified things, dumping a pile of data-entry jobs into the data-analysis category...
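To put a number on the salary-versus-experience relation, one rough approach (a sketch only: it assumes the cleaned salary strings look like "8k-15k", which may not match the real values, and parse_salary is a hypothetical helper) is to parse the range into a midpoint and average it per experience bracket:

```r
# Turn salary ranges like "8k-15k" into a numeric midpoint (in thousands)
# and average it per experience bracket.
parse_salary <- function(s) {
  nums <- regmatches(s, gregexpr("[0-9]+", s))[[1]]   # pull the numbers out of the range
  if (length(nums) == 0) return(NA_real_)
  mean(as.numeric(nums))                              # midpoint of the range
}

final$salary_k <- sapply(as.character(final$salary), parse_salary)
aggregate(salary_k ~ year, data = final, FUN = mean)  # average salary per experience bracket
```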
It is worth noting that the "data analysis" category mixes very different roles: data entry gets lumped in here, and so do some very highly paid jobs, so the numbers should not be taken entirely at face value. But this little exercise made me appreciate how useful crawlers are. Fun! Practical! It can be used at work :) and you can also size up the talent market like a headhunter ~~ becoming a data analyst with feelings ~~
We can also walk through the JDs (job descriptions) themselves to see which technologies are most in demand lately: R, Python, SQL, SAS, and so on. Below is a crawl of one JD I picked at random; you can pull the relevant text directly.
```r
final[1, 9]
## [1] http://www.lagou.com/jobs/378361.html
## Levels: http://www.lagou.com/jobs/113293.html ...

url <- as.character(final[1, 9])
w <- html(url, encoding = "UTF-8")
d <- w %>% html_nodes("dd.job_bt p") %>% html_text()
d
## [1] "1. Finance, computer science, accounting, or economics-related majors;"
## [2] "2. A securities practitioner qualification certificate is preferred;"
## [3] "3. Willing to do clerical work, familiar with Office software;"
## [4] "4. Fresh graduates who have obtained their diplomas are acceptable."
## [5] "<U+00A0>"
```
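To actually count which tools show up most across the JDs, a rough sketch along the same lines (assuming final$link holds the JD URLs and the dd.job_bt p selector still applies; get_jd is a hypothetical helper) could look like this:

```r
# Rough keyword count over a sample of job descriptions.
keywords <- c("R", "Python", "SQL", "SAS", "Excel")

get_jd <- function(u) {
  w <- html(u, encoding = "UTF-8")
  paste(w %>% html_nodes("dd.job_bt p") %>% html_text(), collapse = " ")
}

jd_text <- sapply(as.character(final$link[1:10]), get_jd)  # only the first 10 postings

counts <- sapply(keywords, function(k)
  sum(grepl(paste0("\\b", k, "\\b"), jd_text, ignore.case = TRUE)))
sort(counts, decreasing = TRUE)   # how many JDs mention each keyword
```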
Precautions:
For obfuscated data (for example, on the foreign yellow-pages site local.ch, the email addresses are stored encoded), you need to decode them, e.g. with a decodeURIComponent-style function.
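If the obfuscation is plain percent-encoding, base R's URLdecode() does the same job as JavaScript's decodeURIComponent; a tiny example with a made-up address:

```r
# Decode a percent-encoded string (hypothetical address, for illustration only).
URLdecode("info%40example.ch")
## [1] "info@example.ch"
```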
XPath expressions also work with html_nodes(), but they seem to be evaluated against the whole document: if you use div[1]//span[4] to extract data, you just get the global result (a small sketch follows the list below).
- For example, to extract data you can use either li.da or li.daew; the two selectors are equivalent (just as with li.odd / li.clearfix above).
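A minimal sketch of the XPath usage mentioned above, assuming the web object read earlier; since the expression is matched against the whole document, index the result afterwards rather than relying on positional predicates inside the expression:

```r
# html_nodes() also accepts an xpath argument instead of a CSS selector.
spans <- web %>% html_nodes(xpath = "//div[contains(@class, 'hot_pos_l')]//span")
head(spans %>% html_text())
```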
Regular expressions are incredibly useful!! Especially for web data where the markup is sloppy, or where the page authors clearly had no sympathy for us crawler writers: scraping such pages with rvest returns a pile of garbage characters. A few days of practice have left me feeling the full malice of the web.
For Chinese pages, html(data, encoding = 'UTF-8') together with iconv(data, 'utf-8', 'gbk') avoids most garbled characters, but R's support for Chinese is honestly poor.
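A small sketch of those two encoding tricks, plus a regex cleanup for the stray <U+00A0> seen in the JD output above (the URL is hypothetical, and which iconv() direction you need depends on the page and your locale):

```r
# 1) tell the parser the page encoding up front,
# 2) convert any leftover strings with iconv(),
# 3) strip non-breaking spaces and trailing whitespace with a regex.
w   <- html("http://example.com/some-gbk-page", encoding = "GBK")  # hypothetical URL
txt <- w %>% html_nodes("p") %>% html_text()
txt <- iconv(txt, from = "GBK", to = "UTF-8")
txt <- gsub("\u00a0|\\s+$", "", txt)
```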
rvest is convenient for static scraping! However, for pages that load their data via scripts, you also need the RCurl package, which I still have to learn. Some references:
- JavaScript data extraction with the RCurl package (Dai Shen): notes on parsing scripts to capture data
- Using RCurl to extract and summarize forum data (medo)
Once I have learned that, I will write another summary.
And finally, the focus of my next bit of study will probably be quantitative finance? Teacher Zhang Dan's post on the dual moving-average strategy in R was hugely inspiring! Learning R well really matters. Playing with crawlers has been so much fun that I have been skipping the JHU classes....
Later I may try to build a simple stock-selection model based on the way my dad and I look at stocks ~~
Oh, and I have seen a very capable developer who uses Python to crawl programmer job postings from all the major sites: the site is called Codejob, for anyone interested.
My blog