Batch extraction of postgraduate adjustment (调剂) information from Muchong (Little Wood Worm) web pages with R

Tags: xpath

1. Read a URL and return the HTML tree

1.1 The RCurl package

The RCurl package makes it easy to send requests to a server and fetch URIs, including GET and POST forms. It offers a higher level of interaction than R's socket connections and supports ftp/ftps/tftp, ssl/https, telnet and cookies. The functions used in this post are basicTextGatherer and getURL. To learn more about the package, see the link in the Resources section.

R command:

h <- basicTextGatherer()   # collects the header information returned by the server
txt <- getURL(url, headerfunction = h$update, .encoding = "UTF-8", ...)   # returns the HTML of the page as a string

The parameter url is the URL to be accessed. headerfunction = h$update passes the returned header information to the gatherer created by the previous command, and .encoding specifies that the page is encoded as "UTF-8".

Web pages can be encoded in many ways, most commonly UTF-8; many Chinese pages are encoded as "GBK". You can check the encoding in the browser's view of the page source, or by inspecting the string returned by getURL.

Viewing the page source of a Muchong page (screenshot in the original post) shows that the Muchong pages are encoded as GBK.
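As a quick sketch (not part of the original post), one way to confirm the declared encoding from R is to fetch the raw page and look for the charset declared in its meta tag:

library(RCurl)

raw <- getURL("http://muchong.com/html/f430.html", .encoding = "GBK")          # fetch the page as one string
regmatches(raw, regexpr("charset=[A-Za-z0-9-]+", raw, ignore.case = TRUE))     # e.g. "charset=gbk"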

1.2 The XML package

The R XML package can read and create XML (and HTML) documents; it supports local files as well as HTTP and FTP, and it provides XPath (XML Path Language) parsing. Here the function htmlParse parses a file into an XML or HTML tree, which makes further data extraction and editing easier.

R command:

htmlParse(file, asText = TRUE, encoding = "UTF-8", ...)   # file is an XML/HTML file name or text; asText = TRUE specifies that file is text; encoding specifies the page encoding

Here we need to read the page and get its HTML tree content.

Custom function download: the input strURL is a URL, and the function returns the HTML tree content.

download <- function(strURL) {
  h <- basicTextGatherer()   # collects the header information returned by the server
  txt <- getURL(strURL, headerfunction = h$update, .encoding = "GBK")   # HTML as a string
  htmlParse(txt, asText = TRUE, encoding = "GBK")   # parse the page with GBK encoding
}

2. Get all the URLs of a page

Sometimes we need to follow the sub-links on each page to collect data for analysis; in that case we can use the getHTMLLinks function from the XML package.

R command:

getHTMLLinks(doc, xpQuery = "//a/@href", ...)   # doc is the parsed HTML tree (or a file/URL); xpQuery specifies the XPath expression to match (more on XPath basics below)

Here we need to get the links to all the topics under the "Mentor Admissions" board of Muchong.

2.1 First we need the URLs of the first page, the second page, the third page, and so on up to the last page of the Mentor Admissions board.

Screenshots in the original post show the Mentor Admissions home page and pages two and three.

The home-page URL is http://muchong.com/html/f430.html, and the remaining pages follow the pattern http://muchong.com/html/f430_ + page + .html.

So we can construct the URLs ourselves:

strURLs <- "http://muchong.com/html/f430.html"

n <- 50

strURLs <- c(strURLs, paste("http://muchong.com/html/f430_", 2:n, ".html", sep = ""))

strURLs now contains the URLs of pages 1 to 50 of the Mentor Admissions board.
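As a quick sanity check (just an illustration), print the first few entries to confirm the vector was built as expected:

head(strURLs, 3)
# "http://muchong.com/html/f430.html"  "http://muchong.com/html/f430_2.html"  "http://muchong.com/html/f430_3.html"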

2.2 Get the links to the topics on each page of the Mentor Admissions board

There are many topics under each Mentor Admissions page, and we need to get the link to each of them.

Use the getHTMLLinks function to list all the URLs on a Mentor Admissions page, then compare them with the topic URLs.

http://muchong.com/html/201702/11075436.html

We find that the topic URLs are all of the form http://muchong.com/ followed by a part like html/201702/11075436.html.

So my strategy is to first extract all the URLs from a Mentor Admissions page, match the html/*.html part with a regular expression, and then prepend http://muchong.com/.

The custom greg function performs a regular-expression match and returns the matched substring.

greg <- function(pattern, istring) {
  gregout <- gregexpr(pattern, istring)   # pattern is the matching pattern; istring is the string to be matched
  substr(istring, gregout[[1]], gregout[[1]] + attr(gregout[[1]], "match.length") - 1)
}
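For example (a made-up string, not from the site), greg returns the first substring of istring that matches pattern:

greg("[0-9]+", "office 0571-8888")   # returns "0571", the first run of digits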

The custom extradress function extracts the topic URLs from the strURL page and returns the links to the individual topic pages.

extradress <- function(strURL) {
  prefix <- "http://muchong.com/"
  pattern <- "html/[0-9/]+.html"
  links <- getHTMLLinks(strURL)
  needlinks <- gregexpr(pattern, links)
  needlinkslist <- list()
  for (i in which(unlist(needlinks) > 0)) {
    preadress <- substr(links[i], needlinks[[i]], needlinks[[i]] + attr(needlinks[[i]], "match.length") - 1)
    needlinkslist <- c(needlinkslist, list(preadress))
    adresses <- lapply(needlinkslist, function(x) paste(prefix, x, sep = ""))
  }
  return(adresses)
}
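A minimal usage sketch, assuming the board layout has not changed: pass one list-page URL and get back the topic links found on that page.

adresses <- extradress("http://muchong.com/html/f430.html")
# a list of topic URLs such as "http://muchong.com/html/201702/11075436.html"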

3. Get the data we want from the HTML tree

3.1 Basic knowledge of XML documents

The following is part of the HTML of a Muchong page (shown as a screenshot in the original post):

html is the root element; head and body are children of html; div is a child of body; div has the attributes id and style, each followed by its value. The line beginning "Little wood worm ---" is the text content of the p element.
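The original post shows this as a screenshot; as a stand-in, a minimal fragment with the same structure (the id, style and text values below are only illustrative) can be built and parsed directly in R:

library(XML)

frag <- '<html>
  <head><title>example</title></head>
  <body>
    <div id="mainbox" style="width:100%">
      <p>Little wood worm --- an academic research community</p>
    </div>
  </body>
</html>'

doc <- htmlParse(frag, asText = TRUE, encoding = "UTF-8")

getNodeSet(doc, "/html/body/div")   # layer by layer from the root
getNodeSet(doc, "//div")            # locate the element directly, at any depth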

3.2 Getting the content of an element

The getNodeSet function in the XML package is used here.

R command:

getNodeSet(doc, path, ...)   # doc is the HTML tree object and path is the element path. The path can be specified layer by layer from the root with /, or an element can be located directly at any depth with //

For example, to locate the div under body under html, the path is /html/body/div; //body/div starts the search directly from body. If the path matches multiple elements, a list of all of them is returned. This time we locate the topic content on the page:

Here we locate the p elements directly and then pick what we need from the resulting list.

First enter the command:

getNodeSet(doc, '//p')

getNodeSet(doc, '//p')[[2]] is what we need.

But the returned result is a node object; to turn it into a string we use the function xmlValue to get the element's value:

xmlValue(x, ...)   # x is a node object obtained from getNodeSet

Here, xmlValue(getNodeSet(doc, '//p')[[2]]) gets what we want.
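As a side note, the XML package also provides xpathSApply, which combines the node lookup and the value extraction in one call; a sketch equivalent to the two steps above:

xpathSApply(doc, "//p", xmlValue)[2]   # text of every <p> node; keep the second one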





At this point we have the content of each topic and can extract the useful information from it: whether the group is recruiting, the university name, the supervisor's name, the research direction, the contact person, the e-mail address, the telephone number, and so on.
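As a small preview of the extraction step in the next section, the greg helper defined above can pull an e-mail address out of a topic string with a pattern like the one used below (the sample text here is invented):

topic <- "Contact: Prof. Wang, e-mail wang_lab@abc.edu.cn, plant biology lab"
greg("[-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+[.A-Za-z]*", topic)   # "wang_lab@abc.edu.cn"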

4. Example: extracting adjustment information from Muchong

My younger sister, a biology major, needs an adjustment (transfer) to another program, so she needs to extract the information other people post on the site and turn it into a table that is easy to filter, review, and use for sending e-mails.

The complete code follows.

library(RCurl)
library(XML)

download <- function(strURL) {
  h <- basicTextGatherer()   # collects the header information returned by the server
  txt <- getURL(strURL, headerfunction = h$update, .encoding = "GBK")   # HTML as a string
  htmlParse(txt, asText = TRUE, encoding = "GBK")   # parse the page with GBK encoding
}

extradress <- function(strURL) {
  prefix <- "http://muchong.com/"
  pattern <- "html/[0-9/]+.html"
  links <- getHTMLLinks(strURL)
  needlinks <- gregexpr(pattern, links)
  needlinkslist <- list()
  for (i in which(unlist(needlinks) > 0)) {
    preadress <- substr(links[i], needlinks[[i]], needlinks[[i]] + attr(needlinks[[i]], "match.length") - 1)
    needlinkslist <- c(needlinkslist, list(preadress))
    adresses <- lapply(needlinkslist, function(x) paste(prefix, x, sep = ""))
  }
  return(adresses)
}

gettopic <- function(doc) {
  xmlValue(getNodeSet(doc, '//p')[[2]])
}

greg <- function(pattern, istring) {
  gregout <- gregexpr(pattern, istring)
  substr(istring, gregout[[1]], gregout[[1]] + attr(gregout[[1]], "match.length") - 1)
}

getinf <- function(topic) {
  # Note: the Chinese keywords of the original patterns appear here in English translation.
  pattern1  <- "recruit[\u4e00-\u9fa5]+[0-9-]*[\u4e00-\u9fa5]*[:,;,,;]*[\u4e00-\u9fa5]*[:,;,,;]*[\u4e00-\u9fa5]*[:,;,,;]*[\u4e00-\u9fa5]*[:,;,,;]*[\u4e00-\u9fa5]*(postgraduate)|(adjustment)"
  pattern2  <- "([\u4e00-\u9fa5]*research group|[\u4e00-\u9fa5]*team)"
  pattern21 <- "[\u4e00-\u9fa5]*[:,;,,;]*(professor|phd)"
  pattern3  <- "[\u4e00-\u9fa5]*[:,;,,;]*[-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+[.A-Za-z]*"
  # matches mailboxes of the @163.com type as well as the @abc.edu.cn type
  pattern4  <- "[\u4e00-\u9fa5]+teacher"   # matches "teacher so-and-so", i.e. a contact name
  pattern5  <- "[\u4e00-\u9fa5]*[::]*1[358][0-9]{9}|0[0-9]{2,3}-[0-9]{7,8}(-[0-9]{1,4})?"   # matches the contact person and phone number
  pattern6  <- "(mainly|engaged in)*[\u4e00-\u9fa5]*(research|direction)[:,;,,;]*[\u4e00-\u9fa5]*"
  pattern7  <- "[\u4e00-\u9fa5]+(university|academy|research institute)"
  pattern8  <- "[-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+[.A-Za-z]*"   # matches the e-mail address exactly

  cate  <- greg(pattern1, topic)
  proj  <- greg(pattern2, topic)
  pi    <- greg(pattern21, topic)
  email <- greg(pattern3, topic)
  man   <- greg(pattern4, topic)
  phone <- greg(pattern5, topic)
  direc <- greg(pattern6, topic)
  univ  <- greg(pattern7, topic)
  print(cate)

  if (greg("(molecule|biology|plant|cell|medicine|animal|water)+", topic) != "") {
    if (man == "" && proj != "") {
      man <- unlist(strsplit(proj, "research group"))[1]
    }

    if (email != "") {
      email <- greg(pattern8, email)   # keep only the bare address
    }

    data.frame("category" = cate, "university" = univ, "subject" = proj, "PI" = pi,
               "contact" = man, "mailbox" = email, "direction" = direc, "phone" = phone)
  }
  else {
    return("")
  }
}

strURLs <- "http://muchong.com/html/f430.html"
n <- 50
dat <- data.frame("url" = "url", "category" = "category", "university" = "university", "subject" = "subject",
                  "PI" = "PI", "contact" = "contact", "mailbox" = "mailbox", "direction" = "direction", "phone" = "phone")
strURLs <- c(strURLs, paste("http://muchong.com/html/f430_", 2:n, ".html", sep = ""))
output1 <- "A2017.2.21.txt"   # raw data, kept for further processing
output2 <- "B2017.2.21.txt"   # the further filtered data, for viewing

for (strURL in strURLs) {
  adresses <- extradress(strURL)
  for (adress in adresses) {
    message(adress)
    doc <- download(adress)
    topic <- gettopic(doc)
    inf <- getinf(topic)
    if (!identical(inf, "")) {
      url <- data.frame("url" = adress)
      inf <- cbind(url, inf)
      dat <- rbind(dat, inf)
    }
  }
}

write.table(dat, file = output1, row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")   # tab-delimited file
message("Done!")

dat <- read.table(output1, sep = "\t", header = TRUE)
dat <- dat[dat$mailbox != "", ]               # drop rows without an e-mail address
dat <- dat[!duplicated(dat$mailbox), ]        # drop rows with duplicate e-mail addresses
dat$index <- as.numeric(rownames(dat))
dat <- dat[order(dat$index, decreasing = FALSE), ]   # re-sort the shuffled rows by index
dat$index <- NULL
write.table(dat, file = output2, row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")   # tab-delimited file
message("Done!")

Finally, I wish every postgraduate adjustment candidate a successful admission to the school they want!



Resources:

RCurl package: https://cran.r-project.org/web/packages/RCurl/RCurl.pdf

XML package: https://cran.r-project.org/web/packages/XML/XML.pdf

Basic knowledge of XML: http://www.cnblogs.com/thinkers-dym/p/4090840.html
