Python web crawler: crawling an ancient poetry site to search for verses

Last Update:2018-08-08 Source: Internet

Author: User

Tags python web crawler


I wrote this as a Python programming exercise, to put what I have learned into practice, and looked up a lot of material along the way. The result is a simple crawler of no more than 60 lines of code. It crawls an ancient poetry site that has no access restrictions and a very regular page layout, with nothing unusual about it, which makes it a good target for an entry-level crawler.

Preparation: the target site

The Python version is: 3.4.3.

The crawl target is: Ancient Poetry Net (www.xzslx.net)

Open any poem page and look at its address: the URL of a poem has the form "www.xzslx.net/shi/" + ID + ".html".

The pagination at the bottom of the site's classical poetry overview page shows that a total of 29,830 ancient poems can be crawled, which gives the range of IDs that can follow "/shi/".

HTML page parsing library: BeautifulSoup

Installation method: $ pip3 install beautifulsoup4

Main reference: "Python network data collection", chapters 1 and 2

Code walkthrough:

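#!/usr/bin/python3
# -*- coding:utf-8 -*-
import re
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getPage(url):
    # Return a BeautifulSoup object for the page, or None if it cannot be fetched or parsed.
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html, 'html.parser')
    except AttributeError as e:
        return None
    return bsObj


def getUrl(pg):
    # join() takes a single iterable, so the three parts are passed as one tuple.
    return ''.join(('http://www.xzslx.net/shi/', str(pg), '.html'))


f = open('./result.txt', 'wt', encoding='utf-8')
for pg in range(0, 49149):
    html = getPage(getUrl(pg))
    if html is None:
        # The page could not be fetched; skip this ID instead of crashing.
        continue
    cont = html.findAll('div', {'class': 'son2'})
    if cont is not None and len(cont) > 1:
        # The second son2 div holds the poem; the first is the search box.
        cont = cont[1].get_text()
        # '原文：' ("original text:") is presumably the label on the Chinese page:
        # 3 characters plus a line break gives the offset of 4.
        poem = cont[cont.find('原文：') + 4:]
        # Split the poem into sentences ending in 。 ！ or ？
        sentList = re.findall(r'(.*?[。！？])', poem)
        for sentc in sentList:
            # '月' is the character rendered as "month" in the translated text.
            if '月' in sentc:
                print(sentc, '\t---<', html.find('h1').get_text(), '>', file=f)
    print('--- page', pg, 'dealed ---')
f.close()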

The getPage(url) function is based mainly on the code on page 9 of "Python network data collection". It uses try...except so that an error while fetching a page does not terminate the crawler.
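As a minimal sketch of that error handling, the function below returns None instead of raising when a request fails. The names fetch_page, always_404 and canned_page, and the opener parameter, are hypothetical additions for illustration so the example runs without network access. Catching URLError as well as HTTPError is also an addition of this sketch, not something the article's code does.

```python
import io
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_page(url, opener=urlopen):
    """Return the raw page bytes, or None if the request fails.

    `opener` is a hypothetical hook added for this sketch so the function
    can be exercised without network access; it defaults to urlopen.
    """
    try:
        response = opener(url)
    except (HTTPError, URLError):
        # HTTPError: the server answered with an error status such as 404.
        # URLError: the server could not be reached at all.
        return None
    return response.read()


def always_404(url):
    """Stand-in opener that fails the way a missing page would."""
    raise HTTPError(url, 404, 'Not Found', None, None)


def canned_page(url):
    """Stand-in opener that returns a fixed, made-up page body."""
    return io.BytesIO(b'<html><h1>demo</h1></html>')


missing = fetch_page('http://www.xzslx.net/shi/0.html', opener=always_404)
present = fetch_page('http://www.xzslx.net/shi/1.html', opener=canned_page)
```

Because a failed fetch yields None, the calling loop has to check for None before parsing, otherwise the crawler would still stop on the first missing page.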

The getUrl(pg) function simply assembles the URL for a given ID. join() is basic Python, so it needs no further explanation.
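For readers new to join(), here is a short sketch of the URL assembly. The names BASE_URL and poem_url are illustrative assumptions. Note that str.join() accepts a single iterable, so the parts are passed as one tuple.

```python
BASE_URL = 'http://www.xzslx.net/shi/'


def poem_url(poem_id):
    """Build the address of one poem page from its numeric ID."""
    # str.join() takes one iterable of strings, hence the inner tuple.
    return ''.join((BASE_URL, str(poem_id), '.html'))


# Addresses for the IDs 1, 2 and 3.
first_three = [poem_url(i) for i in range(1, 4)]
```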

The open() function opens the output file; here I open result.txt to hold the results of the crawl.

The variable html holds the BeautifulSoup object returned by getPage(). Inspecting the original page shows that the poem content is stored in a div with the attribute class='son2', and that it is the second such tag in the HTML document (the first is a search box).
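The "second son2 div" rule can be tried out on a small fragment. The sketch below deliberately swaps BeautifulSoup for the standard library's HTMLParser so that it runs without extra packages; the article's own code uses BeautifulSoup for this step. The class Son2Collector and the SAMPLE markup are made up for illustration and only mimic the layout described above.

```python
from html.parser import HTMLParser


class Son2Collector(HTMLParser):
    """Collect the text of every <div class="son2"> in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # text of each completed son2 div
        self._depth = 0    # div nesting depth inside the current son2 div
        self._parts = []   # text pieces of the div being collected

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self._depth:
            # A nested div inside a son2 div: track it so we close correctly.
            self._depth += 1
        elif 'son2' in (dict(attrs).get('class') or '').split():
            self._depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if tag == 'div' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append(''.join(self._parts))

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


# Made-up page fragment: a search box first, then the poem block.
SAMPLE = (
    '<div class="son2">search box</div>'
    '<h1>静夜思</h1>'
    '<div class="son2">原文：\n床前明月光，疑是地上霜。</div>'
)

collector = Son2Collector()
collector.feed(SAMPLE)
collector.close()

# Index 1 is the second son2 div; guard against pages that have fewer.
poem_block = collector.blocks[1] if len(collector.blocks) > 1 else None
```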

The get_text() function returns the text content of <div class='son2'>. The whole poem follows the "original text:" label (three characters on the Chinese page, "原文："), so the code finds the position of that label in the text and skips 4 characters, the 3 characters of the label plus a line break, to reach the poem itself.
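The offset arithmetic can be checked on a short string. This is a sketch under the assumption that the label is "原文：" followed by one line break; LABEL and strip_label are hypothetical names. Unlike a bare find() plus slice, it returns None when the label is absent, because find() returns -1 in that case and the slice would otherwise start at the wrong place.

```python
LABEL = '原文：'  # assumed label text: 3 characters


def strip_label(block_text):
    """Return the text after the label and its line break, or None."""
    start = block_text.find(LABEL)
    if start == -1:
        # Label not found: slicing from -1 + 4 would return garbage.
        return None
    # Skip the label itself plus the single line break that follows it.
    return block_text[start + len(LABEL) + 1:]


# Made-up block text: an author line, the label, then the poem.
text = '作者：李白\n原文：\n床前明月光，疑是地上霜。'
poem = strip_label(text)
no_label = strip_label('床前明月光，疑是地上霜。')
```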

A sentence in a poem ends with "。", "！" or "？", so the regular expression that splits the poem into sentences is '(.*?[。！？])'. In Python regular expressions, ".*?" is a non-greedy match, [] matches any one of the characters listed inside it, and () captures the matched text so that it is returned as a result.
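The difference between the non-greedy and greedy forms is easiest to see side by side. In this small sketch the variable names are illustrative and the sample text is the poem 静夜思.

```python
import re

poem = '床前明月光，疑是地上霜。举头望明月，低头思故乡。'

# Non-greedy: each match stops at the first sentence-ending mark it reaches.
sentences = re.findall(r'(.*?[。！？])', poem)

# Greedy, for contrast: the match runs on to the last mark in the line.
greedy = re.findall(r'(.*[。！？])', poem)
```

Here sentences holds two items, one per sentence, while greedy holds a single item containing the whole line.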

For each sentence, the code simply checks whether it contains the character "月" (moon). If it does, the sentence is written to result.txt; if not, the code moves on to the next sentence.

print('--- page', pg, 'dealed ---') prints the crawl status on the command line, which makes it easy to follow the progress of the crawl.
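The filtering and output step can be sketched in isolation. The function write_matches is a hypothetical name, and an in-memory io.StringIO buffer stands in for result.txt so that nothing is written to disk. The sample sentences are from the poem 月下独酌.

```python
import io


def write_matches(sentences, title, keyword, out):
    """Write each sentence containing `keyword` to `out`; return the count."""
    count = 0
    for sentence in sentences:
        if keyword in sentence:
            # Same layout as the crawler: sentence, tab, then the poem title.
            print(sentence, '\t---<', title, '>', file=out)
            count += 1
    return count


buffer = io.StringIO()  # stands in for the result.txt file object
found = write_matches(
    ['花间一壶酒，独酌无相亲。', '举杯邀明月，对影成三人。'],
    '月下独酌',
    '月',
    buffer,
)
output = buffer.getvalue()
```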

The final result is:


This article is an English version of an article originally published in Chinese on aliyun.com.
