Python Crawler Learning (II): A Targeted Crawler Example--Using BeautifulSoup to Crawl the "Ruanke China Best University Rankings--Source Quality Ranking 2018" and Write the Results to a TXT File


Before writing the crawler proper, run a quick test to see what type of object BeautifulSoup returns when crawling, and how it can be converted to a list:

Write an HTML document, x.html:

<html>
<head><title>This is a python demo page</title></head>
<body>
<p class="title">
    <a>The demo python introduces several python courses.</a>
    <a href="http://www.icourse163.org/course/BIT-133" class="py1" id="link1">Basic Python</a>
</p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
    <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>
</p>
</body>
</html>
# coding: utf-8
from bs4 import BeautifulSoup
import requests
import bs4

soup = BeautifulSoup(open('d:/x.html', encoding='utf-8'), 'html.parser')
print(soup.find('body').children)  # .children returns an iterator, not a list, so traverse its contents with a for loop

for t in soup.find('body').children:  # iterate over the child nodes of the <body> tag
    if isinstance(t, bs4.element.Tag):  # check whether the child node is a Tag object (the children also include text nodes such as newlines)
        print('The content of this child tag of body is:', t)  # view the object held by t; the child tags of body are <p> tags, and each <p></p> pair is one object
        print('The type of t is:', type(t))  # check the type of t

You can see that each t object has type bs4.element.Tag, i.e. it is a tag object.

So what if you want to pull the <a> tags out of each t object and save all of them in a list?

You can use:

a_list = t('a')  # t('a') returns a bs4.element.ResultSet, which is effectively a list of tags
for t in soup.find('body').children:
    if isinstance(t, bs4.element.Tag):  # check whether the child node is a Tag object (the children also include text nodes such as newlines)
        # print('The content of this child tag of body is:', t)  # view the object held by t
        # print('The type of t is:', type(t))  # check the type of t
        a_list = t('a')  # collect all the <a> tags in each t object and save them to a list
        print(a_list)
        print(type(a_list))
        print('The content of the first <a> tag of each <p> tag:', a_list[0].string)  # once the tags are in a list, list indexing picks out each <a> tag object, and .string gets the tag's string
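Incidentally, calling a tag like a function, t('a'), is shorthand for t.find_all('a'); both return a bs4.element.ResultSet. A minimal sketch against the x.html file above:

from bs4 import BeautifulSoup
import bs4

soup = BeautifulSoup(open('d:/x.html', encoding='utf-8'), 'html.parser')
for t in soup.find('body').children:
    if isinstance(t, bs4.element.Tag):
        print(t.find_all('a'))             # same result as t('a')
        print([a.string for a in t('a')])  # a plain Python list of each tag's string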

Now we can write the crawler proper:

Analyze the web page source code

In the source you can see the information we need, such as each university's ranking, name, address, and score. The information is laid out as follows: all of the university data sits inside the <tbody> tag; each university has its own <tr> tag; and within each <tr>, the university's ranking, name, address, and other fields are each wrapped in their own <td> tag.
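Schematically, the relevant part of the page looks something like this (a sketch reconstructed from the description above; the real page has more columns and attributes, and the sample values are hypothetical):

<tbody>
    <tr>                       <!-- one university per row -->
        <td>1</td>             <!-- ranking -->
        <td>清华大学</td>       <!-- name -->
        <td>北京</td>           <!-- address -->
        <td>100.0</td>         <!-- score -->
    </tr>
    ...
</tbody>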

Here's the idea: find the <tbody> tag and then pull out the content of all the <tr> tags inside it. (Why not just call find_all() on the whole document to find every <tr>? Because <tr> tags are not used only for the university information we need; other parts of the page also use <tr> tags to wrap content.) A scoped alternative is sketched just below.
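For comparison, here is a minimal sketch (assuming the page structure described above, and that every row has at least four <td> cells) that scopes the search by calling find_all('tr') on the <tbody> tag itself, so unrelated <tr> tags elsewhere in the page are never matched; find_all() also returns only Tag objects, so no isinstance() check is needed:

from bs4 import BeautifulSoup
import requests

url = 'http://www.zuihaodaxue.com/shengyuanzhiliangpaiming2018.html'
r = requests.get(url)
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, 'html.parser')

for tr in soup.find('tbody').find_all('tr'):  # search for <tr> only inside <tbody>
    td = tr('td')
    print([td[0].string, td[1].string, td[2].string, td[3].string])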

The goal: take each school's ranking, name, address, and score, put each group of values in its own list, and then append each of those lists in turn to one big list.

(1) Direct processing of data

from bs4 import BeautifulSoup
import requests
import bs4

url = 'http://www.zuihaodaxue.com/shengyuanzhiliangpaiming2018.html'
r = requests.get(url)
r.encoding = r.apparent_encoding  # fix the encoding, otherwise the Chinese text is garbled; r.encoding = 'utf-8' also works
html = r.text
soup = BeautifulSoup(html, 'html.parser')  # the BeautifulSoup object for the crawled page

for tr in soup.find('tbody').children:
    if isinstance(tr, bs4.element.Tag):
        td = tr('td')
        print(td)
        t = [td[0].string, td[1].string, td[2].string, td[3].string]  # put each school's parsed data into a list
        print(t)

Printing td shows the list of <td> tags for each row.

Printing t shows each school's extracted data; the ranking information can already be seen in it.

Next, write each university's information to a text file in turn:

from bs4 import BeautifulSoup
import requests
import bs4

url = 'http://www.zuihaodaxue.com/shengyuanzhiliangpaiming2018.html'
r = requests.get(url)
r.encoding = r.apparent_encoding  # fix the encoding, otherwise the Chinese text is garbled; r.encoding = 'utf-8' also works
html = r.text
soup = BeautifulSoup(html, 'html.parser')  # the BeautifulSoup object for the crawled page

for tr in soup.find('tbody').children:
    if isinstance(tr, bs4.element.Tag):
        td = tr('td')
        t = [td[0].string, td[1].string, td[2].string, td[3].string]  # put each school's data into a list
        print(t)
        with open('d:/test.txt', 'a') as data:  # open the file in 'a' mode to append without overwriting the existing content
            print(t, file=data)
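One caveat the original code glosses over (an environment assumption on my part, not something stated in the source): open() without an explicit encoding uses the platform default, which on Chinese Windows is often GBK, so writing these fields can garble or raise errors when read back elsewhere; passing encoding='utf-8' is the safer sketch:

t = ['1', '清华大学', '北京', '100.0']  # hypothetical sample row, standing in for one loop iteration
with open('d:/test.txt', 'a', encoding='utf-8') as data:  # write as UTF-8 regardless of the platform default
    print(t, file=data)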

(2) Encapsulate the code into functions

# coding: utf-8
import requests
import bs4
from bs4 import BeautifulSoup


def get_html(url):
    """Get the page source."""
    try:
        r = requests.get(url, timeout=20)
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return None


def get_data(html, ulist):
    """Extract the data from the page source and process it."""
    soup = BeautifulSoup(html, 'html.parser')
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            td = tr('td')
            t = [td[0].string, td[1].string, td[2].string, td[3].string]  # put each school's parsed data into a list
            ulist.append(t)  # append each school's list to the big list, so it can be written to the file later
            # return ulist  # can't add a return here: returning inside the loop exits on the
            #               # first iteration, so only the first group of data would be taken


def write_data(ulist, num):  # the num parameter controls how many groups of data are written to the file
    """Write the data to the file."""
    for i in range(num):
        u = ulist[i]
        with open('d:/test.txt', 'a') as data:
            print(u, file=data)


if __name__ == '__main__':
    ulist = []  # I originally put this inside get_data()'s for loop, which emptied the list
                # on every iteration before appending, leaving only the last group of data
    url = 'http://www.zuihaodaxue.com/shengyuanzhiliangpaiming2018.html'
    html = get_html(url)
    get_data(html, ulist)
    write_data(ulist, 20)
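As a design note, a variant (my sketch, not the author's code) builds the list inside get_data() and returns it after the loop completes, which sidesteps both pitfalls mentioned in the comments above:

import bs4
from bs4 import BeautifulSoup


def get_data(html):
    """Return-based variant: build the list locally and return it once the loop has finished."""
    soup = BeautifulSoup(html, 'html.parser')
    ulist = []
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            td = tr('td')
            ulist.append([td[0].string, td[1].string, td[2].string, td[3].string])
    return ulist  # returned exactly once, after all rows have been collected

# usage: ulist = get_data(html)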
