As a first exercise in learning crawler technology, I decided to write a crawler for the jokes on Qiushibaike (the "Embarrassing Encyclopedia").
Goals: 1. crawl the jokes from Qiushibaike;
2. crawl one page at a time, pressing Enter to load the next page.
Implementation: written in Python, using the requests library, the re library, and the BeautifulSoup class from the bs4 library.
Main content: first we sort out the crawling workflow and build the main frame. Step one: write a method that fetches the web page with the requests library. Step two: parse the fetched page with bs4's BeautifulSoup and pull out the relevant information with regular expressions. Step three: print the information we obtained. All of these methods are driven by a main function.
First, import the relevant libraries:
import requests
from bs4 import BeautifulSoup
import bs4
import re
Second, fetch the page's HTML:
def gethtmltext(url):
    try:
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
Third, pass the returned HTML to BeautifulSoup for parsing:
soup = BeautifulSoup(html, "html.parser")
What we need are the joke content and its publisher. Viewing the page's source code, we can see that the joke content sits in:

'p', attrs={'class': 'content'}

and the publisher sits in:

'p', attrs={'class': 'author clearfix'}
So we use bs4's find_all method to extract the contents of these two kinds of tags.
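To see how find_all with an attrs filter behaves on this kind of markup, here is a minimal, self-contained sketch. The HTML string below is an invented stand-in for the page structure described above, not the real qiushibaike.com markup:

```python
from bs4 import BeautifulSoup

# Invented miniature of the page structure described above.
html = '''
<div class="article">
  <p class="author clearfix"><h2>SomeUser</h2></p>
  <p class="content"><span>A short joke.</span></p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
# attrs={'class': 'author clearfix'} matches tags whose class attribute
# is exactly the string "author clearfix".
contents = soup.find_all('p', attrs={'class': 'content'})
authors = soup.find_all('p', attrs={'class': 'author clearfix'})

print(contents[0].span.string)  # A short joke.
print(len(authors))             # 1
```

Note that find_all always returns a list of matching tags, even when only one tag matches.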
def fillunivlist(lis, li, html, count):
    soup = BeautifulSoup(html, "html.parser")
    try:
        a = soup.find_all('p', attrs={'class': 'content'})
        ll = soup.find_all('p', attrs={'class': 'author clearfix'})
Then we pull out the exact text with regular expressions:
        for sp in a:
            patten = re.compile(r'<span>(.*?)</span>', re.S)
            info = re.findall(patten, str(sp))
            lis.append(info)
            count = count + 1
        for mc in ll:
            # The pattern below is an assumption: the original text was cut
            # off after "namepatten = re.compile(r'". The author name sits
            # in an <h2> tag, so we match that.
            namepatten = re.compile(r'<h2>(.*?)</h2>', re.S)
            li.append(re.findall(namepatten, str(mc)))
        return count
    except:
        return ""
Note that both bs4's find_all and re's findall return lists, and because our regular expressions simply capture whatever sits between the tags, the newline characters left over from the markup are not removed.
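This behavior is easy to see in isolation. The fragment below is an invented stand-in for what str(sp) looks like for one matched tag:

```python
import re

# Invented stand-in for str(sp) on one matched tag, newlines included.
fragment = '<p class="content"><span>\nfirst line\nsecond line\n</span></p>'

# re.S lets '.' match newline characters too.
patten = re.compile(r'<span>(.*?)</span>', re.S)
info = re.findall(patten, fragment)

print(type(info))        # <class 'list'> -- findall always returns a list
print(repr(info[0]))     # '\nfirst line\nsecond line\n' -- newlines survive
print(repr(info[0].strip()))  # 'first line\nsecond line' -- edges stripped
```

Calling .strip() on each extracted string is one simple way to clean the output if the stray newlines bother you.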
Next we combine the contents of the two lists and print them:
def printunivlist(lis, li, count):
    for i in range(count):
        a = li[i][0]
        b = lis[i][0]
        print("%s:" % a + "%s" % b)
Then I write an input-control function: entering Q returns False and the program exits; just pressing Enter returns True and the next page is loaded.
def input_enter():
    input1 = input()
    if input1 == 'Q':
        return False
    else:
        return True
The main function wires up this input control: if the control function returns False, the output stops; if it returns True, the output continues. A for loop loads the next page.
def main():
    passage = 0
    for i in range(20):  # page limit; the original value was lost in translation
        mc = input_enter()
        if mc == True:
            lit = []
            li = []
            count = 0
            passage = passage + 1
            qbpassage = passage
            print(qbpassage)
            url = 'http://www.qiushibaike.com/8hr/page/' + str(qbpassage) + '/?s=4966318'
            a = gethtmltext(url)
            # call fillunivlist once; calling it twice would append every
            # entry to the lists twice
            number = fillunivlist(lit, li, a, count)
            printunivlist(lit, li, number)
        else:
            break
Here we need to note that each iteration of the for loop re-creates lis[] and li[], so that each page's content prints correctly.
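Why this matters can be shown with a tiny sketch (hypothetical data, not the real crawler): if one list is shared across iterations, every earlier page's results pile up and get printed again.

```python
def fill(lis, page):
    # Hypothetical stand-in for fillunivlist: pretend each page yields one joke.
    lis.append('joke from page %d' % page)

# Wrong: one list shared across iterations -- old pages accumulate.
shared = []
for page in range(1, 3):
    fill(shared, page)
print(shared)  # ['joke from page 1', 'joke from page 2']

# Right: a fresh list per iteration, as in main() above.
for page in range(1, 3):
    fresh = []
    fill(fresh, page)
    print(fresh)  # only the current page's joke
```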
The full source code:
import requests
from bs4 import BeautifulSoup
import bs4
import re


def gethtmltext(url):
    try:
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""


def fillunivlist(lis, li, html, count):
    soup = BeautifulSoup(html, "html.parser")
    try:
        a = soup.find_all('p', attrs={'class': 'content'})
        ll = soup.find_all('p', attrs={'class': 'author clearfix'})
        for sp in a:
            patten = re.compile(r'<span>(.*?)</span>', re.S)
            info = re.findall(patten, str(sp))
            lis.append(info)
            count = count + 1
        for mc in ll:
            # assumed pattern; the original listing was cut off at this line
            namepatten = re.compile(r'<h2>(.*?)</h2>', re.S)
            li.append(re.findall(namepatten, str(mc)))
        return count
    except:
        return ""


def printunivlist(lis, li, count):
    for i in range(count):
        a = li[i][0]
        b = lis[i][0]
        print("%s:" % a + "%s" % b)


def input_enter():
    input1 = input()
    if input1 == 'Q':
        return False
    else:
        return True


def main():
    passage = 0
    for i in range(20):  # page limit; the original value was lost
        mc = input_enter()
        if mc == True:
            lit = []
            li = []
            count = 0
            passage = passage + 1
            qbpassage = passage
            print(qbpassage)
            url = 'http://www.qiushibaike.com/8hr/page/' + str(qbpassage) + '/?s=4966318'
            a = gethtmltext(url)
            number = fillunivlist(lit, li, a, count)
            printunivlist(lit, li, number)
        else:
            break


main()