How to Crawl Jokes from Qiushibaike (the "Embarrassing Things Encyclopedia") with Python

Source: Internet
Author: User
When I first started learning crawler techniques, I read about how to crawl the jokes on Qiushibaike, so I decided to write a crawler of my own.

Goals: 1. Crawl the jokes from Qiushibaike.

2. Crawl one page at a time: each press of ENTER fetches the next page.

Implementation: written in Python, using the requests library, the re library, and the BeautifulSoup class from the bs4 library.

Main content: First we need to lay out the crawling plan and build the main framework. Step one is to write a function that fetches the web page with the requests library. Step two is to parse the fetched page with bs4's BeautifulSoup and match the relevant information with regular expressions. Step three is to print out the information we obtained. All of these functions are driven by a main function.
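
As a rough sketch of that frame (the function names follow the rest of the article; the bodies are filled in step by step below):

def gethtmltext(url):                       # step 1: fetch the page HTML
    ...

def fillunivlist(lis, li, html, count):     # step 2: parse the HTML and extract jokes and authors
    ...

def printunivlist(lis, li, count):          # step 3: print the collected results
    ...

def main():                                 # drives the three steps above
    ...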

First, import the relevant libraries

import requests
from bs4 import BeautifulSoup
import bs4
import re

Second, fetch the first page's information

def gethtmltext(url):
    try:
        # send a browser-style User-Agent header along with the request
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
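
A quick usage sketch, assuming the site is reachable (the page number in the URL is only an example):

html = gethtmltext('http://www.qiushibaike.com/8hr/page/1/')
if html:
    print(html[:200])   # peek at the first 200 characters of the fetched page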

Third, take the text returned in r and parse it:

soup = BeautifulSoup(html, "html.parser")

What we need are the joke content and its publisher. Looking at the page's source (View Source in the browser), the joke content lives in:

'div', attrs={'class': 'content'}

and the publisher lives in:

'div', attrs={'class': 'author clearfix'}

So we use bs4's find_all method to extract the contents of these two tags.

def fillunivlist(lis, li, html, count):
    soup = BeautifulSoup(html, "html.parser")
    try:
        a = soup.find_all('div', attrs={'class': 'content'})
        ll = soup.find_all('div', attrs={'class': 'author clearfix'})

Then we pull out the exact information with regular expressions:

        for sp in a:
            patten = re.compile(r'<span>(.*?)</span>', re.S)
            info = re.findall(patten, str(sp))
            lis.append(info)
            count = count + 1
        for mc in ll:
            # the publisher name sits in an <h2> tag (assumed from the page markup)
            namepatten = re.compile(r'<h2>(.*?)</h2>', re.S)
            li.append(re.findall(namepatten, str(mc)))
        return count
    except:
        return count

Note that bs4's find_all and re's findall both return lists, and that the regular expression only extracts the raw text: it does not strip the newline characters that surround the text inside the tag.
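
A minimal, self-contained sketch of this behaviour; the HTML string below is a simplified stand-in for the real page markup:

from bs4 import BeautifulSoup
import re

html = ('<div class="author clearfix"><h2>\nsome_user\n</h2></div>'
        '<div class="content"><span>\nA short joke.\n</span></div>')
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all('div', attrs={'class': 'content'})       # a list of Tag objects
info = re.findall(r'<span>(.*?)</span>', str(tags[0]), re.S)  # also a list
print(info)   # ['\nA short joke.\n'] -- the newlines are still attached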

The next thing to do is combine the contents of the two lists and print them.

def printunivlist(lis, li, count):
    for i in range(count):
        a = li[i][0]
        b = lis[i][0]
        print("%s:" % a + "%s" % b)
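
For example, with one extracted joke (sample data, not real site output):

lis = [['\nA short joke.\n']]
li = [['some_user']]
printunivlist(lis, li, 1)   # prints "some_user:" followed by the joke text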

Then I wrote an input-control function: entering Q returns False and the program exits, while pressing ENTER returns True and the next page is loaded.

def input_enter():
    input1 = input()
    if input1 == 'Q':
        return False
    else:
        return True

We drive the input control from the main function: if the control function returns False, printing stops; if it returns True, printing continues. We load the next page with a for loop.

def main():
    passage = 0
    for i in range(20):   # upper bound on how many pages to crawl
        mc = input_enter()
        if mc == True:
            lit = []
            li = []
            count = 0
            passage = passage + 1
            qbpassage = passage
            print(qbpassage)
            url = 'http://www.qiushibaike.com/8hr/page/' + str(qbpassage) + '/?s=4966318'
            a = gethtmltext(url)
            number = fillunivlist(lit, li, a, count)
            printunivlist(lit, li, number)
        else:
            break

Here we need to note that each pass of the for loop re-initializes lit[] and li[], so that each page's content is printed correctly.
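
A small sketch of why that reset matters, with hypothetical data; without it the lists accumulate and earlier pages get reprinted:

results = []
for page in (1, 2):
    # results = []   # <- the per-page reset performed inside main()
    results.append('joke from page %d' % page)
    print(results)   # grows every pass: page 1's joke is printed again alongside page 2's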

Finally, the full source code:

import requests
from bs4 import BeautifulSoup
import bs4
import re

def gethtmltext(url):
    try:
        # send a browser-style User-Agent header along with the request
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillunivlist(lis, li, html, count):
    soup = BeautifulSoup(html, "html.parser")
    try:
        a = soup.find_all('div', attrs={'class': 'content'})
        ll = soup.find_all('div', attrs={'class': 'author clearfix'})
        for sp in a:
            patten = re.compile(r'<span>(.*?)</span>', re.S)
            info = re.findall(patten, str(sp))
            lis.append(info)
            count = count + 1
        for mc in ll:
            namepatten = re.compile(r'<h2>(.*?)</h2>', re.S)
            li.append(re.findall(namepatten, str(mc)))
        return count
    except:
        return count

def printunivlist(lis, li, count):
    for i in range(count):
        a = li[i][0]
        b = lis[i][0]
        print("%s:" % a + "%s" % b)

def input_enter():
    input1 = input()
    if input1 == 'Q':
        return False
    else:
        return True

def main():
    passage = 0
    for i in range(20):   # upper bound on how many pages to crawl
        mc = input_enter()
        if mc == True:
            lit = []
            li = []
            count = 0
            passage = passage + 1
            qbpassage = passage
            print(qbpassage)
            url = 'http://www.qiushibaike.com/8hr/page/' + str(qbpassage) + '/?s=4966318'
            a = gethtmltext(url)
            number = fillunivlist(lit, li, a, count)
            printunivlist(lit, li, number)
        else:
            break

main()
