Python web crawler: crawling ancient poems to search for lines containing a given character

Source: Internet
Author: User
Tags: python web crawler

I am practicing Python, and to put what I have learned to use I looked up a lot of material and wrote a simple crawler; the code does not exceed 60 lines. The ancient poetry site it crawls has no access restrictions and a very regular page layout, with nothing special about it, which makes it suitable for an entry-level crawler.


Preparing to crawl the target site


The Python version is: 3.4.3.


The crawl target is: Ancient poetry net (www.xzslx.net)


Open any ancient poem page on the site and look at its web address: the address of a poem is basically composed as "www.xzslx.net/shi/ + ID + .html".



Then go to the site's Classical Poetry Overview page and look at the pagination at the bottom of the page.


It shows a total of 29,830 ancient poems that can be crawled, which helps determine the range of ID numbers after "/shi/".


HTML parsing library: BeautifulSoup

Installation: $ pip3 install beautifulsoup4


Main reference: "Python network Data Acquisition", chapters 1 and 2

Code walkthrough:

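#!/usr/bin/python3
# -*- coding:utf-8 -*-
import re
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getPage(url):
    # Fetch the page; return None instead of raising if the request fails.
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    # Parse the response into a BeautifulSoup object.
    try:
        bsObj = BeautifulSoup(html, 'html.parser')
    except AttributeError as e:
        return None
    return bsObj


def getUrl(pg):
    # join() takes a single iterable of strings.
    return ''.join(['http://www.xzslx.net/shi/', str(pg), '.html'])


f = open('./result.txt', 'wt', encoding='utf-8')
for pg in range(0, 49149):
    html = getPage(getUrl(pg))
    # getPage() returns None for a failed page, so skip it.
    cont = html.findAll('div', {'class': 'son2'}) if html is not None else None
    if cont is not None and len(cont) > 1:
        # The second <div class="son2"> holds the poem.
        cont = cont[1].get_text()
        # Skip '原文：' ("original text:") plus the line break after it.
        poem = cont[cont.find('原文：') + 4:]
        # Split the poem into sentences ending in 。 ！ or ？
        sentList = re.findall(r'(.*?[。！？])', poem)
        for sentc in sentList:
            # Keep only sentences containing 月 ("moon").
            if '月' in sentc:
                print(sentc, '\t--- <', html.find('h1').get_text(), '>', file=f)
    print('--- page', pg, ' dealed ---')
f.close()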

The getPage(url) function is mainly based on the code on page 9 of "Python network Data Acquisition". It uses try...except so that an abnormal page does not raise an error and terminate the crawler.
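To show the error-handling idea on its own, here is a minimal sketch that leaves out BeautifulSoup and only covers the request step. The names fetch, opener and failing_opener are made up for this illustration and are not part of the crawler; the opener parameter exists only so the failure path can be tried without touching the network. HTTPError is a subclass of URLError, so catching URLError covers both.

```python
from urllib.request import urlopen
from urllib.error import URLError


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with opener(url) as resp:
            return resp.read()
    except URLError:
        # Covers HTTPError too (e.g. a 404 for an ID that does not exist).
        return None


def failing_opener(url):
    # Hypothetical stand-in for urlopen that always fails.
    raise URLError('simulated network failure')


result = fetch('http://www.xzslx.net/shi/1.html', opener=failing_opener)
print(result)
```

Because a failed page comes back as None instead of raising, the calling loop has to check for None before using the result.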

The getUrl(pg) function just assembles the URL for convenience. The join() function is basic Python and very simple, so I won't elaborate.
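For readers who have not used join() before, a small sketch of the URL assembly follows. The names BASE and build_url are chosen for this example only. One detail worth knowing is that str.join() takes a single iterable of strings, so the pieces have to be wrapped in a tuple or list.

```python
BASE = 'http://www.xzslx.net/shi/'


def build_url(poem_id):
    # join() concatenates the strings in one iterable, with '' between them.
    return ''.join((BASE, str(poem_id), '.html'))


# Build the addresses for the first few IDs.
urls = [build_url(i) for i in range(1, 4)]
for u in urls:
    print(u)
```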

The open() function opens a file; here I open result.txt to hold the results of the crawl.

The variable named html holds the BeautifulSoup object returned by the getPage() function. Inspecting the original page shows that the poem content is stored in a div with the attribute class="son2", and that it is the second such tag in the HTML document (the first such tag is the search box).
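To look at the "second div" rule in isolation, here is a rough sketch that uses the standard library's html.parser in place of BeautifulSoup, so it runs without installing anything. It is not how the crawler does it. The class name Son2Collector and the sample HTML are invented for illustration; the real page is only assumed to have this general shape, based on the description above.

```python
from html.parser import HTMLParser


class Son2Collector(HTMLParser):
    """Collect the text of every <div class="son2">, in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # one text string per son2 div
        self._depth = 0    # div nesting depth inside the current son2 div

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self._depth > 0:
            self._depth += 1           # nested div inside a son2 div
        elif 'son2' in (dict(attrs).get('class') or '').split():
            self._depth = 1            # entering a new son2 div
            self.blocks.append('')

    def handle_endtag(self, tag):
        if tag == 'div' and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.blocks[-1] += data


# Made-up sample page: a search box first, then the poem.
sample = (
    '<div class="son2"><input name="q"></div>'
    '<div class="son2"><p>原文：</p><div>床前明月光，疑是地上霜。</div></div>'
)
collector = Son2Collector()
collector.feed(sample)
poem_block = collector.blocks[1] if len(collector.blocks) > 1 else None
print(poem_block)
```

Checking that there are at least two blocks before indexing mirrors the len(cont) > 1 test in the crawler.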

The get_text() function gets the text content of <div class="son2">. The whole poem is stored after "原文：" ("original text:"), so the code finds the position of "原文：" in that text and skips 4 characters in total (the 3 characters of the marker plus a line break) to get the original poem content.
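The offset arithmetic can be checked with a short sketch. The function name extract_poem and the sample text are assumptions made for this example. Unlike a bare find() plus slice, it returns None when the marker is missing, because find() returns -1 in that case and the slice would silently start at the wrong place.

```python
MARKER = '原文：'  # "original text:", 3 characters long


def extract_poem(text, marker=MARKER):
    start = text.find(marker)
    if start == -1:
        return None  # marker not found on this page
    # Skip the marker itself plus the single line break that follows it.
    return text[start + len(marker) + 1:]


# Made-up sample of what get_text() might return.
sample_text = '静夜思\n原文：\n床前明月光，疑是地上霜。'
poem = extract_poem(sample_text)
print(poem)
```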

A sentence of poetry ends with "。", "！" or "？", so the regular expression that splits the poem into sentences is '(.*?[。！？])'. In Python regular expressions ".*?" is a non-greedy match, [] matches any one of the characters inside it, and () captures the matched result and stores it.
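Here is a small sketch of that regular expression at work, using a sample poem as input. The variable names are chosen for this example. Without the "?" the pattern would be greedy and swallow everything up to the last punctuation mark as one match.

```python
import re

# Non-greedy: stop at the first sentence-ending mark each time.
SENTENCE = re.compile(r'(.*?[。！？])')

poem = '床前明月光，疑是地上霜。举头望明月，低头思故乡。'
sentences = SENTENCE.findall(poem)
for s in sentences:
    print(s)

# For comparison, the greedy version returns the whole poem as one match.
greedy = re.findall(r'(.*[。！？])', poem)
```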

Once the sentences are split out, a simple check tells whether the character "月" (moon) is in each one. If it is, the sentence is written to the output file; if not, the code moves on to the next sentence.
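The check-and-write step might be sketched like this. The function name write_matches is made up for this example, and io.StringIO stands in for the real output file so nothing is written to disk; the sentences and title are sample input.

```python
import io


def write_matches(sentences, title, out, char='月'):
    """Write each sentence containing char to out; return how many matched."""
    count = 0
    for s in sentences:
        if char in s:
            # print() accepts any file-like object through file=.
            print(s, '\t--- <', title, '>', file=out)
            count += 1
    return count


buf = io.StringIO()
samples = ['床前明月光，疑是地上霜。', '春眠不觉晓，处处闻啼鸟。']
matched = write_matches(samples, '静夜思', buf)
print(buf.getvalue(), end='')
```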

print('--- page', pg, ' dealed ---') prints the crawl status on the command line, which makes it easy to follow the crawl progress.


The final result is:
