This article illustrates the basic functions of a crawler by crawling the pictures in the travel section of the National Geographic China site. The initial address is:
National Geographic China: http://www.ngchina.com.cn/travel/
Get and analyze the Web page content
a. Analyze the Web page structure to determine the desired content
Open the Web page and right-click "View page source" to inspect the structure of the page; below is the section I intercepted.
You will find that the image data sits in the src="" attribute of <img> tags. We just need to find these tags and extract the links from them to get what we want.
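Since the intercepted snippet is not reproduced above, here is a hypothetical fragment of such page source (the URL is invented for illustration) and how the link in its src attribute can be pulled out:

```python
import re

# A hypothetical fragment of page source; the tag is real HTML,
# but the URL is made up for illustration.
html = '<li><img src="http://image.example.com/2018/0428/pic1.jpg" alt="travel"></li>'

# Extract the value of the src attribute of the <img> tag
links = re.findall(r'<img[^>]*src="([^"]+)"', html)
print(links)  # ['http://image.example.com/2018/0428/pic1.jpg']
```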
b. Get the content of the Web page
To extract the content, we first send a request to the server and fetch the page, then parse it, extract the image information, and organize the data for saving.
The author uses Python 3.6. There are two common ways to fetch Web page content: requests and urllib (in Python 3, the urllib and urllib2 of Python 2 were merged into urllib). For more on fetching page content, see: Crawler Basics: Getting Web page content with Python.
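As a sketch of the urllib alternative (standard library only, no third-party install), the same fetch could be written as follows; the function name is my own:

```python
from urllib.request import Request, urlopen

def crawl_urllib(url, headers):
    # Build a request carrying the same headers, then read and decode the body
    request = Request(url, headers=headers)
    with urlopen(request) as response:
        return response.read().decode()
```

requests tends to be more convenient (sessions, automatic encoding handling), while urllib avoids an extra dependency; either works for a simple crawler like this one.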
Now, we define a method crawl() to fetch the Web page:
import requests

def crawl(url, headers):
    with requests.get(url=url, headers=headers) as response:
        # Read the body of the response and decode it
        data = response.content.decode()
    return data
Call this method to get the contents of the Web page:
# Get the content of the specified page
url = 'http://www.ngchina.com.cn/travel/'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36'}
content = crawl(url, headers)
print(content)
c. Write regular expressions that match the picture content
import re

# Write a regular expression that matches the picture links
pattern = r'src="(.*?\.jpg)"'
compile_re = re.compile(pattern)
# Match all content that fits the regular expression
imglist = compile_re.findall(content)
# The matched content may need further processing and filtering to get the data
# we finally need; for example, new links can be put back into the URL queue
# to wait for the next round of crawling.
# In short, handle it according to the requirements.
# De-duplicate
imglist = list(set(imglist))
# Remove pictures that do not qualify
imglist = [img for img in imglist if img.startswith('http')]
# Output
for img, i in zip(imglist, range(len(imglist))):
    print('{}:{}'.format(i, img))
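Regular expressions work for simple pages, but the standard library's html.parser is more robust against attribute order and whitespace. A minimal sketch of the same extraction; this parser class is my own, not from the article:

```python
from html.parser import HTMLParser

class ImgParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                # Keep only absolute http(s) links, like the filter above
                if name == 'src' and value and value.startswith('http'):
                    self.links.append(value)

parser = ImgParser()
parser.feed('<div><img src="http://example.com/a.jpg"><img src="/rel/b.jpg"></div>')
print(parser.links)  # only the absolute http link is kept
```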
'''
0:http://image.ngchina.com.cn/2018/0428/20180428110510703.jpg
1:http://image.ngchina.com.cn/2018/0130/20180130032001381.jpg
2:http://image.ngchina.com.cn/2018/0424/20180424010923371.jpg
...
37:http://image.ngchina.com.cn/2018/0419/20180419014117124.jpg
38:http://image.nationalgeographic.com.cn/2017/1127/20171127121516360.jpg
'''
So we have grabbed the picture links from the given address. Let us pick one:
http://image.ngchina.com.cn/2018/0428/20180428110510703.jpg
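Once a link is extracted, saving the picture is just a matter of fetching the bytes and writing them to a file. A minimal sketch, with a hypothetical function name; the local filename is taken from the last segment of the URL path:

```python
import os
from urllib.request import urlopen

def save_image(url, folder='.'):
    # Use the last segment of the URL path as the local filename
    filename = url.rsplit('/', 1)[-1]
    path = os.path.join(folder, filename)
    # Fetch the raw bytes and write them out in binary mode
    with urlopen(url) as response, open(path, 'wb') as f:
        f.write(response.read())
    return path
```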
d. Storage and the next round of crawling
After we crawl the specified content, we can store it in a database. If it is a link-type crawl, we can create a URL queue, add newly found links to it, and then traverse the queue round by round, handling each queued URL with whatever strategy the specific requirements call for. More crawler information can be found in: A First Look at Crawlers.
Addendum:
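The queue-driven rounds of crawling described above can be sketched with collections.deque; the fetch and link-extraction functions are placeholders to be filled in with the code from the earlier sections:

```python
from collections import deque

def crawl_all(start_url, fetch, extract_links, max_pages=10):
    """Breadth-first crawl: fetch a page, queue its new links, repeat."""
    queue = deque([start_url])
    seen = {start_url}     # URLs already queued, to avoid repeats
    pages = {}             # url -> fetched content
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        content = fetch(url)
        pages[url] = content
        for link in extract_links(content):
            if link not in seen:   # de-duplicate before queueing
                seen.add(link)
                queue.append(link)
    return pages
```

A quick way to exercise it is with an in-memory fake site, passing a lambda for fetch and a dictionary lookup for extract_links.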
When writing regular expressions, we can use an online regular expression tool to quickly check the match results, such as the Runoob regex tool. That site also provides a set of commonly used, ready-made regular expressions (phone numbers, QQ numbers, URLs, email addresses, and so on), which is very handy.
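Besides online tools, a pattern can be tested just as quickly in the Python REPL. For example, a simple email pattern like those such tools provide (deliberately simplified; real email validation needs a more careful expression):

```python
import re

# A simplified email pattern for illustration only
email_pattern = re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+')

text = 'Contact us at editor@example.com or support@example.org for help.'
print(email_pattern.findall(text))  # ['editor@example.com', 'support@example.org']
```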