Python Crawler for Beginners: Scraping Jokes


I recently started learning Python crawlers by following along with a blog series. The blogger uses Python 2.7 while I use 3.5, so quite a few things are incompatible, but that's no big deal: you just fix them up as you go.

A crawler lets us filter a website's content and keep only the parts we're interested in; for example, you might crawl some site XX and filter out just the naughty pictures to pack away. This post is just a simple implementation, taking the plain-text jokes on Budejie (budejie.com) as an example. We want to implement the following features:

  • Download several pages in bulk to a local file

  • Press any key to read the next joke

1. Get the page source

First import urllib and its related libraries, which in Python 3 should be written as:

    import urllib.request
    import urllib.parse
    import re
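
If you're following the Python 2.7 original mentioned above, this is one of the incompatibilities: in Python 2 the same functionality lives in the urllib2 module (urllib2.Request and urllib2.urlopen), while Python 3 splits it into urllib.request and urllib.parse.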

The re library provides regular expressions, which we'll use for the matching step.

Budejie's joke page has the URL http://www.budejie.com/text/1, where the trailing number 1 marks it as page 1. The following code returns the web page's content.

    url = 'http://www.budejie.com/text/1'
    req = urllib.request.Request(url)
    # Add headers so the request looks like it comes from a browser
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36')
    response = urllib.request.urlopen(req)
    # Read the page content; note that decode() must be used to turn the bytes into a string
    html = response.read().decode('utf-8')
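
Since the trailing number selects the page, URLs for several pages can be generated in a loop and fetched the same way; a minimal sketch (the count of 5 pages is an arbitrary choice, not from the original):

    base = 'http://www.budejie.com/text/'
    for page in range(1, 6):           # pages 1 through 5
        page_url = base + str(page)    # e.g. http://www.budejie.com/text/3
        # fetch page_url exactly as shown above

The full program at the end of the post does exactly this kind of concatenation in its get_content() function.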

Print it with print(html); the following is what it looks like:

Can you see that? Where's the joke? What about the jokes we want?!

Oh, yes, headers look like this.

Press F12, then ... Look at the picture.

2. Extract the jokes with regular expressions

To keep only content that an ordinary human can read (HTML tags don't qualify), that is, to extract the jokes, we need an established pattern to match against the entire page content and return the pieces that match it. We use powerful regular expressions for the matching; the related syntax can be seen here.

For the page content in this example, let's look at how the part we want appears in the page source.

You can see that the jokes are wrapped in tags like <div class="j-r-list-c-desc">(the content we want)</div>, so we just need to specify a rule to extract them! You can also see that there is a lot of whitespace before and after the text, which needs to be matched as well.

    pattern = re.compile(r'<div class="j-r-list-c-desc">\s+(.*)\s+</div>')
    result = re.findall(pattern, html)

The rule is built with the re library's compile() function.

  • \s+ matches one or more whitespace characters (spaces, newlines, and so on)

  • . matches any character except the newline \n; a quick check of the whole pattern follows below
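
As a quick sanity check, here is a minimal sketch that runs the pattern against a made-up snippet mimicking the page structure (the sample text is hypothetical, not real site data):

    import re

    # Hypothetical sample shaped like the page source above
    sample = '''<div class="j-r-list-c-desc">
                    First joke text
                </div>'''

    pattern = re.compile(r'<div class="j-r-list-c-desc">\s+(.*)\s+</div>')
    print(re.findall(pattern, sample))  # prints ['First joke text']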

Now that we've got the results of the match, let's see.

Bingo! It's extracted, isn't it?!

But we find something nasty mixed in: <br /> tags. No matter, a few more lines of code fix that. The cleaned-up output is no longer shown here; picture it yourself, haha.

    for each in content:
        # If the joke contains a <br/>, replace it with a newline and print
        if '<br/>' in each:
            new_each = re.sub(r'<br/>', '\n', each)
            print(new_each)
        # Otherwise print it as-is
        else:
            print(each)

Here content is the list returned by our re.findall() call.
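
As a small design note (my observation, not from the original): since '<br/>' is a fixed string rather than a pattern, each.replace('<br/>', '\n') would work just as well; re.sub() is a natural choice here only because re is already imported for the matching step.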

At this point, we've successfully got the jokes we wanted to read! What if you want to download them locally?

3. Download the jokes to a local file

By defining a save() function with a customizable num parameter, you can download even 100 pages of content without problems! A few variables used here haven't been introduced yet; the full source code is given at the end.

    # num is the specified number of pages
    def save(num):
        # Open a text file for writing and store the captured joke list in it
        with open('a.txt', 'w', encoding='utf-8') as f:
            text = get_content(num)
            # Remove <br/> the same way as above
            for each in text:
                if '<br/>' in each:
                    new_each = re.sub(r'<br/>', '\n', each)
                    f.write(new_each)
                else:
                    f.write(str(each) + '\n')
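
A usage note (my observation, not from the original): mode 'w' truncates a.txt if it already exists, so each call such as save(5) overwrites the previous download; open the file with mode 'a' instead if you want to append. The file lands in the script's current working directory.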

The downloaded local file looks like this:

4. Read the jokes

There are too many jokes, a dazzling array, but we want to read them one at a time. Pressing any key switches to the next joke, until the program reaches the last one and ends; alternatively, an exit key can be set so you can quit at any time, here the q key. Below is the complete code.

    import urllib.request
    import urllib.parse
    import re

    pattern = re.compile(r'<div class="j-r-list-c-desc">\s+(.*)\s+</div>')

    # Return the contents of the specified web page
    def open_url(url):
        req = urllib.request.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36')
        response = urllib.request.urlopen(req)
        html = response.read().decode('utf-8')
        return html

    # num is user-specified; returns a list of the jokes from all pages
    def get_content(num):
        # List that stores the jokes
        text_list = []
        for page in range(1, int(num)):
            address = 'http://www.budejie.com/text/' + str(page)
            html = open_url(address)
            result = re.findall(pattern, html)
            # Each page's result is a list; add its items to text_list
            for each in result:
                text_list.append(each)
        return text_list

    # num is the specified number of pages
    def save(num):
        # Open a text file for writing and store the captured joke list in it
        with open('a.txt', 'w', encoding='utf-8') as f:
            text = get_content(num)
            # Remove <br/> the same way as above
            for each in text:
                if '<br/>' in each:
                    new_each = re.sub(r'<br/>', '\n', each)
                    f.write(new_each)
                else:
                    f.write(str(each) + '\n')

    if __name__ == '__main__':
        print('Press q at any time while reading to quit')
        number = int(input('How many pages do you want to read: '))
        content = get_content(number + 1)
        for each in content:
            if '<br/>' in each:
                new_each = re.sub(r'<br/>', '\n', each)
                print(new_each)
            else:
                print(each)
            # Wait for a key press after each joke
            user_input = input()
            # q is case-insensitive; typing it exits
            if user_input == 'q' or user_input == 'Q':
                break
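
One detail worth spelling out: range(1, int(num)) stops before num, so get_content(num) actually fetches pages 1 through num - 1. That is why the main program calls get_content(number + 1): it makes the program fetch exactly the number of pages the user asked for.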

A quick demo; the effect looks like this.

Although the functionality is very modest, as a beginner I'm quite satisfied with it; if you're interested, dig deeper! Crawlers are far more than this, and more advanced features are left for later.

by @sunhaiyu

2016.8.15
