This started as a Python practice exercise: to put what I had learned to use, I looked up a lot of material and ended up writing a simple crawler of no more than 60 lines. It crawls an ancient-poetry site that has no anti-crawling restrictions and a very regular page layout, so there is nothing special about it; it is well suited as an entry-level crawler.
Preparing to crawl the target site
Python version: 3.4.3
Crawl target: the ancient poetry site (www.xzslx.net)
Open any poem page on the site and look at its address: every poem's URL is composed as "www.xzslx.net/shi/" + id + ".html".
The pager at the bottom of the site's classical-poetry overview page shows the total:
There are 29,830 ancient poems in all, which determines the range of the number that follows "/shi/".
HTML parsing library: BeautifulSoup
Installation: $ pip3 install beautifulsoup4
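To confirm the install worked, a quick sanity check (a minimal sketch; the sample HTML is made up to mimic the site's poem container):

```python
# Minimal check that beautifulsoup4 imports and parses HTML as expected.
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='son2'>床前明月光，疑是地上霜。</div>", 'html.parser')
print(soup.find('div', {'class': 'son2'}).get_text())
# -> 床前明月光，疑是地上霜。
```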
Main reference: "Web Scraping with Python" (published in Chinese as 《Python网络数据采集》), chapters 1 and 2.
The full code, followed by a walkthrough:
```python
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getPage(url):
    # Fetch a page and wrap it in a BeautifulSoup object;
    # return None on any error so the crawler keeps running.
    try:
        html = urlopen(url)
    except HTTPError:
        return None
    try:
        bsObj = BeautifulSoup(html)
    except AttributeError:
        return None
    return bsObj


def getUrl(pg):
    # Build a poem's URL from its numeric id.
    return ''.join(['http://www.xzslx.net/shi/', str(pg), '.html'])


f = open('./result.txt', 'wt')
for pg in range(0, 49149):
    html = getPage(getUrl(pg))
    if html is not None:
        cont = html.findAll('div', {'class': 'son2'})
        if cont is not None and len(cont) > 1:
            # The second <div class='son2'> holds the poem (the first is a search box).
            cont = cont[1].get_text()
            # The poem proper starts after the marker '原文:' plus a newline (4 chars).
            poem = cont[cont.find('原文:') + 4:]
            # Split the poem into sentences ending in 。, ！ or ？.
            sentList = re.findall(r'(.*?[。！？])', poem)
            for sentc in sentList:
                if '月' in sentc:
                    # Record the verse together with the poem's title (<h1>).
                    print(sentc, '\t--- <', html.find('h1').get_text(), '>', file=f)
    print('--- page', pg, 'dealed ---')
```
getPage(url)
This function mainly follows the code on page 9 of "Web Scraping with Python". It uses try...except so that an exception while fetching or parsing a page does not terminate the crawler; the function returns None instead and the loop simply moves on.
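For reference, a slightly more defensive variant of getPage (my own sketch, not the book's code): it additionally catches URLError, e.g. for DNS failures, and names an explicit parser so newer BeautifulSoup versions do not emit a warning.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getPageSafe(url):
    # Return a BeautifulSoup object, or None if the page
    # cannot be fetched or parsed.
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        return None
    try:
        return BeautifulSoup(html, 'html.parser')
    except AttributeError:
        return None
```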
getUrl(pg)
This function just assembles the URL for a given poem id. join() is basic Python, so it needs no elaboration.
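As a quick illustration (with a made-up id of 42):

```python
# ''.join concatenates the pieces of the address into a single string.
parts = ['http://www.xzslx.net/shi/', str(42), '.html']
print(''.join(parts))  # -> http://www.xzslx.net/shi/42.html
```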
open()
This opens the output document; here I open result.txt, which stores the crawl results.
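An alternative (not the original code) is to open the file with a context manager, which closes it automatically even if the crawl is interrupted:

```python
# 'with' guarantees the file is closed when the block exits.
with open('./result.txt', 'wt') as f:
    print('--- crawl started ---', file=f)  # print(..., file=f) writes a line to the file
```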
The variable html is the BeautifulSoup object returned by getPage(). Inspecting the original page shows that the poem text is stored in a div with the attribute class='son2', and that it is the second such tag in the HTML document (the first one is a search box).
get_text() extracts the text content of <div class='son2'>. The whole poem is stored after the marker '原文:' ("original text:"), so after locating '原文:' in the extracted content we skip 4 characters in total, the 3 characters of the marker plus a newline, and what remains is the poem itself.
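A worked example of that slice (the sample text is made up):

```python
# Locate the marker, then skip its 3 characters plus the newline (4 in total).
cont = '搜索\n原文:\n床前明月光，疑是地上霜。'
poem = cont[cont.find('原文:') + 4:]
print(poem)  # -> 床前明月光，疑是地上霜。
```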
A verse ends with '。', '！' or '？', so the regular expression that splits the poem into single sentences is r'(.*?[。！？])'. In a Python regex, '.*?' is the non-greedy form of '.*', the character class [...] matches any one of the characters inside it, and the parentheses (...) capture the matched result so that findall returns it. Each extracted sentence is then simply checked for the character '月' (moon): if present, the sentence is written to the output file; if not, the crawler moves on to the next sentence. A combined example follows below.
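Putting the split and the filter together on a made-up example:

```python
import re

poem = '床前明月光，疑是地上霜。举头望明月，低头思故乡。'
# Non-greedy match up to the first sentence-ending punctuation mark.
sentences = re.findall(r'(.*?[。！？])', poem)
print(sentences)
# -> ['床前明月光，疑是地上霜。', '举头望明月，低头思故乡。']
for s in sentences:
    if '月' in s:  # keep only verses that mention the moon
        print(s)
```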
print('--- page', pg, 'dealed ---')
This prints the crawl status to the command line so the progress is easy to see at a glance.
The final result is: