python3.x crawler

Source: Internet
Author: User
Tags: xpath

Compared with someone else's article, the method below was tested today and found workable, so a copy is recorded here.

1 Background

These two days have been rather busy with all kinds of loose ends to tie up, so this article was squeezed in at the end. The previous post, "python3.x crawler combat (first climb up hi)", introduced the basics of Python 3 crawlers and finished with a small, not particularly rigorous crawler program to show off their charm. Some readers said that post read like a hard sell of Python; yes, a justified hard sell, and unapologetically so, because for me it has genuinely boosted the efficiency of many small tools. Others asked how I first came into contact with Python: it was back when working on the Android 4.1 Framework, where the differential-package build processing that Google officially handles with Python scripts forced me to learn it for the job. I did not pick up much horizontal breadth of Python then, but as the scope of what I used it for expanded, it gradually won me over.

Back to the crawler topic. In the previous post we summarised a crawler workflow with two core steps: the static downloader (my own term, as opposed to dynamic web page download handling, which later articles in this series will cover) and the parser. Naturally, the core of this post is to explore the choices for these two major steps.

"Craftsman Joshui Http://blog.csdn.net/yanbober without permission to reprint, please respect the author's labor results." Private Messages Contact Me "

2 Python 3 Crawler Static Downloader

When the scheduler takes a URL out of the URL manager, the first thing to do is hand that link to the downloader to fetch. For an ordinary HTTP page the download usually completes quickly (imagine how it feels when a page in the browser refuses to open for ages), but network failures, invalid links, server-side errors at the WEB site and the like cannot be ruled out, so a reasonably robust downloader has to take quite a few issues into account; the detailed logic and robustness you can refine gradually on your own. Below is a brief technical note on the downloader (the detailed usage of these Python 3 modules can be studied separately in more depth).

[Full source of the example: click to view]

'''
The following is a slightly more robust downloader than the previous one, implemented with Python 3 built-in modules.
Built on urllib, it supports header settings, proxy settings and cookie handling, a simple retry mechanism for HTTP 5xx codes, and GET/POST.
(A real project would need more careful handling and encapsulation than this.)
'''
from http import cookiejar
from urllib import request, error
from urllib.parse import urlparse


class HtmlDownloader(object):
    def download(self, url, retry_count=3, headers=None, proxy=None, data=None):
        if url is None:
            return None
        try:
            req = request.Request(url, headers=headers or {}, data=data)
            cookie = cookiejar.CookieJar()
            cookie_process = request.HTTPCookieProcessor(cookie)
            opener = request.build_opener(cookie_process)
            if proxy:
                proxies = {urlparse(url).scheme: proxy}
                opener.add_handler(request.ProxyHandler(proxies))
            content = opener.open(req).read()
        except error.URLError as e:
            print('HtmlDownloader download error:', e.reason)
            content = None
            if retry_count > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # An HTTPError with a 5xx code means a server error, so try the download again
                    return self.download(url, retry_count - 1, headers, proxy, data)
        return content
[Full source of the example: click to view]

'''
The following is a downloader implemented with the Python 3 third-party module requests.
It supports header settings, proxy settings, sessions, and a simple retry mechanism.
(A real project would need more careful handling and encapsulation than this; install the module with: pip install requests)
'''
import requests
from requests.exceptions import ConnectionError, Timeout
'''
http://docs.python-requests.org/en/master/
'''


class Downloader(object):
    def __init__(self):
        self.request_session = requests.Session()
        self.request_session.proxies  # session-level proxies can also be configured here

    def download(self, url, retry_count=3, headers=None, proxies=None, data=None):
        '''
        :param url: the URL to download
        :param retry_count: number of retries if the download fails
        :param headers: HTTP header={'X': 'x', 'X': 'x'}
        :param proxies: proxy settings proxies={"https": "http://12.112.122.12:3212"}
        :param data: urlencode(post_data) required for POST
        :return: web content or None
        '''
        if headers:
            self.request_session.headers.update(headers)
        try:
            if data:
                content = self.request_session.post(url, data, proxies=proxies).content
            else:
                content = self.request_session.get(url, proxies=proxies).content
        except (ConnectionError, Timeout) as e:
            print('Downloader download ConnectionError or Timeout: ' + str(e))
            content = None
            if retry_count > 0:
                content = self.download(url, retry_count - 1, headers, proxies, data)
        except Exception as e:
            print('Downloader download Exception: ' + str(e))
            content = None
        return content
As the two downloader implementations above show, a typical Python 3 network request (downloader) uses either the built-in urllib module or the external requests module; the effect is the same, the difference is only how much is wrapped up for convenience. Of course, if you like neither, you can pick another open source HTTP request module; as long as it gets the request made, it serves the purpose. A quick usage sketch follows.
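A minimal usage sketch, assuming the HtmlDownloader and Downloader classes defined above; the URL and header values here are just placeholders.

# urllib-based downloader
html_dl = HtmlDownloader()
page = html_dl.download('http://example.com/', headers={'User-Agent': 'Mozilla/5.0'})

# requests-based downloader (same call shape, with session reuse inside)
req_dl = Downloader()
page = req_dl.download('http://example.com/', headers={'User-Agent': 'Mozilla/5.0'})

if page:
    print(page[:200])  # first bytes of the fetched content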

As you can see, the static downloader only fetches the static content behind a URL (some pages are static, some are dynamic). For static pages, the data we get through the downloader is enough for the crawler; dynamic pages will be analysed again in later articles. With that in place, the next step is to hand the page content fetched by the static downloader to the parser.

"Craftsman Joshui Http://blog.csdn.net/yanbober without permission to reprint, please respect the author's labor results." Private Messages Contact Me "

3 Python 3 Crawler Static Parser

With the page content downloaded by the static downloader from the previous section, the next thing to do is parse it, that is, extract the valuable data from those pages according to our own rules, which is the parser's job. For Python crawlers the commonly used parsing approaches are direct regex matching, BeautifulSoup and lxml (there are others, but these are the mainstream ones), and we go through each of them below.

3-1 Regular Expression Parser

As the name implies, this means searching and filtering with regular-expression matching. If you are not familiar with regular expressions, it is recommended to read my earlier "Regular Expression Basics" article first, then come back to the Python 3 regex parser; in essence it is just Python string regex matching, in plainer words the Python re module. When using re in a crawler, pay attention to the following points:

When using the Python re module, it is recommended to always write regex strings with the r prefix, to avoid the pits that escaping brings; regular expressions are very flexible, and complex ones quickly become obscure.

re.compile(exp_str) compiles the exp_str regular expression internally and then matches with the compiled pattern. A crawler typically matches the same expression against hundreds of thousands of pages, so for efficiency cache the result of re.compile(exp_str) as far as possible; in short, avoid compiling the same pattern repeatedly.

As explained in the "Regular Expression Basics" article, try to write non-greedy patterns; matching is greedy by default.

Be careful with the group(x) method when outputting grouped matches: group(0) is the whole matched string, while group(1), group(2), ... are the 1st, 2nd, ... capture-group substrings. Remember this convention.

Pay attention to the meaning of the second parameter of compile(pattern, flags=0); for instance, if you want '.' to also match '\n' (DOTALL mode), remember to set flags to re.S, and so on.

If you have read the "Regular Expression Basics" article and understand regular expressions but not the Python re module, it is recommended to read a Python regular expression guide on the web. The short sketch below pulls these points together.
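To make the tips above concrete, here is a tiny, self-contained sketch; the sample HTML string is made up purely for illustration.

import re

# A made-up snippet just to demonstrate the points above.
sample = "<li><a href='/a/1'>First\npost</a></li><li><a href='/a/2'>Second</a></li>"

# The r prefix avoids escaping surprises; compile once and reuse the pattern;
# .*? is non-greedy; re.S lets '.' also match '\n'.
ITEM_RE = re.compile(r"<a href='(.*?)'>(.*?)</a>", re.S)

for m in ITEM_RE.finditer(sample):
    # group(0) is the whole match, group(1)/group(2) are the capture groups.
    print(m.group(1), repr(m.group(2)))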

Enough talk; let's look at an example that feeds downloaded static page content to a regex parser. The code below crawls the comment-management list of my CSDN blog and parses each item's article name (article), article link (url), commenter name (commentator), comment time (time) and comment content (content), then builds a list of dictionaries to hold the parsed data. The page content to be parsed looks like this:

[Screenshot: the CSDN comment-management list page to be parsed]

The parser code is as follows [full source of the example: click to view]:

def get_page_feedback_dict(self, page_index=1):
    '''
    Get the comment list of my posts from the comment management page of my CSDN blog (fetched page by page of comments).
    :return: {'maxPage': 'xx', 'dict': [{'article': 'xxx', 'url': 'xxx', 'commentator': 'xxx', 'time': 'xxx', 'content': 'xxx'}]}
    '''
    # requires: import re (self.opener and self.url_feedback come from the full source linked above)
    content = self.opener.open(self.url_feedback + str(page_index)).read().decode("utf-8")
    print(content)
    max_page = re.search(re.compile(r'<div class="page_nav"><span>.*?(\d+)页</span>'), content).group(1)
    reg_main = re.compile(
        r"<tr class='altitem'>.*?<a href='(.*?)'.*?>(.*?)</a></td><td><a.*?class='user_name' target=_blank>"
        r"(.*?)</a></td><td>(.*?)</td>.*?<div class='recon'>(.*?)</div></td></tr>", re.S)
    main_items = re.findall(reg_main, content)
    dict_list = list()
    for item in main_items:
        dict_list.append({
            'url': item[0],
            'article': item[1],
            'commentator': item[2],
            'time': item[3],
            'content': item[4]
        })
    print(str(dict_list))
    return {'maxPage': max_page, 'dict': dict_list}
The dict_list obtained after parsing looks like this:

[
    {
        'url': 'http://blog.csdn.net/yanbober/article/details/73162298#comments',
        'article': 'python3.x crawler combat (first climb up hi)',
        'commentator': 'yanbober',
        'time': '2017-06-14 14:24',
        'content': '[reply]qq_39168495[/reply]<br>robot'
    },
    {
        'url': 'http://blog.csdn.net/yanbober/article/details/73162298#comments',
        'article': 'python3.x crawler combat (first climb up hi)',
        'commentator': 'yanbober',
        'time': '2017-06-14 14:24',
        'content': 'xxxxxxxxxxxx'
    },
    ......
]
The above is a crawler parser written with Python re regular expressions. Of course it is not robust enough on its own, and the parsed data still needs cleaning before real use, which we will not go into here; but you can already see that parsing code based purely on regex matching is rather obscure, so apart from small crawlers it is not recommended.

3-2 BeautifulSoup4 Parser

With the regex parser talked through, we can breathe a sigh of relief; the Qing dynasty has fallen after all, so let us also abandon the Stone Age of parsers and embrace the 21st-century BeautifulSoup4 parser. You can learn about this external module from its official site or the official Chinese documentation.

Install the external module directly from the command line: pip install beautifulsoup4

BeautifulSoup4 is a toolkit that parses a document and hands us the data we want to crawl. It automatically converts the input document to Unicode and encodes its output as UTF-8, so we do not have to worry about messy text encodings, unless the document does not declare an encoding at all; in that case BeautifulSoup4 cannot detect the encoding automatically and we have to state the page's original encoding ourselves.
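A minimal sketch of that last case, assuming a page known to be GBK-encoded whose bytes carry no charset declaration; the markup and encoding here are invented for illustration.

from bs4 import BeautifulSoup

# Hypothetical page bytes with no charset declared anywhere in them.
raw_bytes = '<p>编码示例</p>'.encode('gbk')

# Tell BeautifulSoup the original encoding explicitly when it cannot detect it.
soup = BeautifulSoup(raw_bytes, 'lxml', from_encoding='gbk')
print(soup.p.get_text())       # decoded correctly to Unicode
print(soup.original_encoding)  # 'gbk'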

Besides the HTML parser in the Python standard library, BeautifulSoup4 also supports some third-party parsers such as lxml and html5lib (note: different parsers may produce differently structured results for malformed pages). To use these third-party parsers you must install them first; the install commands are:

pip install lxml
pip install html5lib
That said, it is still recommended to use lxml as BeautifulSoup4's parser (it parses efficiently). The table below, taken from the official documentation, lists the pros and cons of the main parsers:
[Table from the official documentation: advantages and disadvantages of the main parsers]
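As a small illustration of the parser argument (the broken markup below is invented; well-formed pages usually parse the same way, the parsers mainly differ on malformed markup and speed):

from bs4 import BeautifulSoup

broken = "<ul><li>one<li>two"  # deliberately unclosed tags

for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = BeautifulSoup(broken, parser)
    # Each parser repairs the broken markup in its own way.
    print(parser, [li.get_text() for li in soup.find_all('li')])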

Talk is cheap, so here is a practical example: parse the login form's _xsrf value and the captcha link from the Zhihu login page, both needed later for logging in. The downloaded login page to be parsed looks like this:

[Screenshot: the Zhihu login page markup to be parsed]

The parsing code is as follows [full source of the example: click to view]:

def get_login_xsrf_and_captcha(self):
    # requires: import time; from bs4 import BeautifulSoup (self.request_session comes from the full source linked above)
    try:
        url_login = "https://www.zhihu.com/#signin"
        url_captcha = 'http://www.zhihu.com/captcha.gif?r=%d&type=login&lang=cn' % (time.time() * 1000)
        login_content = self.request_session.get(url_login).content
        soup = BeautifulSoup(login_content, 'lxml')
        # The second argument of find can also be a compiled Python regular expression,
        # e.g. soup.find_all("a", href=re.compile(r"/item/\w+"))
        xsrf = soup.find('input', attrs={'name': '_xsrf'})['value']
        captcha_content = self.request_session.get(url_captcha).content
        return {'xsrf': xsrf, 'captcha_content': captcha_content}
    except Exception as e:
        print('get login xsrf and captcha failed! ' + str(e))
        return dict()
See how much more readable this is than regex matching: far less obscure, harder to trip yourself up with, and more efficient than hand-rolled regular expressions; it really feels like jumping from the Stone Age straight into the intelligent era. It does not matter if you are unfamiliar with the functions BeautifulSoup4 provides; just consult its official Chinese documentation often, and count yourself lucky, because the docs are very concise.
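As a hedged pocket reference (the markup below is invented), the BeautifulSoup4 calls that come up most often in crawler parsing look like this:

from bs4 import BeautifulSoup

html = "<div class='post'><a href='/p/1'>Title</a><span class='who'>bob</span></div>"
soup = BeautifulSoup(html, 'lxml')

link = soup.find('a')                             # first matching tag
print(link['href'], link.get_text())              # attribute access and text
print(soup.find('span', attrs={'class': 'who'}).get_text())
print([a['href'] for a in soup.find_all('a')])    # all matching tags
print(soup.select('div.post > a')[0].get_text())  # CSS selector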

3-3 lxml Parser

Having entered the intelligent era, there is an even more impressive parser: lxml, a genuine heavyweight. You can learn about it from its official documentation; it is written in C and parses faster than BeautifulSoup. Above we already used lxml as the parser behind BeautifulSoup; here we use lxml directly, with its XPath selector and built-in methods, to show how flexible this parser is. The underlying basics are outside the scope of this series; see the official documentation.
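A minimal, hedged warm-up with etree and XPath before the site-specific code below; the markup here is invented.

from lxml import etree

html = etree.HTML("<ul class='img'><li><a href='/m/1.html'>A</a></li>"
                  "<li><a href='/m/2.html'>B</a></li></ul>")

# XPath returns lists: element nodes, attribute values or text, depending on the expression.
print(html.xpath('//ul[@class="img"]/li/a/@href'))   # ['/m/1.html', '/m/2.html']
print(html.xpath('//ul[@class="img"]/li/a/text()'))  # ['A', 'B']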

We take https://www.meitulu.com (a photo catalogue site) as the example. The first thing to parse is the list of recommended models on the home page and the level-two links they point to (the parse_main_subjects function below, i.e. the href of the a tag under each li of the ul with class="img"), as shown here:
[Screenshot: the home page markup, the ul with class="img" holding the recommended model list]

Then we move into the level-two page (the model's photo list page; its first page is ddd.html and the remaining pages follow the ddd_index.html pattern), parse the model's name and the total number of photos, and then page by page parse the download links of the high-definition images; a small sketch of building page URLs from that naming rule follows.
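A hedged sketch of generating those page URLs from the rule just described; the '/item/' path segment and the exact numbering are assumptions for illustration, not verified against the site.

def build_gallery_page_urls(gallery_id, page_count):
    """Yield page URLs following the rule above: <id>.html for page 1,
    <id>_<n>.html for later pages. The '/item/' segment is assumed."""
    base = 'https://www.meitulu.com/item'
    for n in range(1, page_count + 1):
        if n == 1:
            yield '{}/{}.html'.format(base, gallery_id)
        else:
            yield '{}/{}_{}.html'.format(base, gallery_id, n)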

The parsing code is as follows [full source of the example: click to view]:

class HtmlParser(object):
    # requires: import re; from lxml import etree

    def parse_main_subjects(self, content):
        '''
        Parse the links to the model category pages from the meitulu home page.
        :param content: home page content
        :return: ['big picture page of a model', 'big picture page of a model', ...]
        '''
        html = etree.HTML(content.lower())
        subject = html.xpath('//ul[@class="img"]/li')
        subject_urls = list()
        for sub in subject:
            a_href = sub[0].get('href')
            subject_urls.append(a_href)
        return subject_urls

    def parse_subject_mj_info(self, content):
        '''
        Get the model info at the top of a model's big picture page.
        :param content: content of one model category page
        :return: {'count': total number of pictures of the model, 'mj_name': model name}
        '''
        html = etree.HTML(content.lower())
        div_cl = html.xpath('//div[@class="c_l"]')
        pic_count = re.search(re.compile(r'.*?(\d+).*?'), div_cl[0][2].text).group(1)
        return {'count': pic_count, 'mj_name': div_cl[0][4].text}

    def parse_page_pics(self, content):
        '''
        Get the big picture download links on one page of a model's gallery.
        :param content: content of one model category page
        :return: ['big picture link', 'big picture link', ...]
        '''
        html = etree.HTML(content.lower())
        return html.xpath('//div[@class="content"]/center/img/@src')
[Full source of the example: click to view.] The parser uses lxml and XPath syntax throughout; it automatically crawls from the site's home page through each recommended model's level-two pages and downloads only the high-definition large images. The log looks like this:
[Screenshot: crawl log output]
The crawled resources look like this (saved in a directory per model name; these have already been crawled and downloaded):
[Screenshot: downloaded images organised into per-model directories]
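For orientation, a minimal sketch of how the pieces from this article might be wired together, assuming the Downloader and HtmlParser classes above; it only handles the first page of each gallery, assumes the parsed hrefs are absolute URLs, and uses a naive file-naming scheme.

# A minimal driving loop, assuming the Downloader and HtmlParser classes above.
downloader = Downloader()
parser = HtmlParser()

home = downloader.download('https://www.meitulu.com/')
for subject_url in parser.parse_main_subjects(home):
    first_page = downloader.download(subject_url)
    info = parser.parse_subject_mj_info(first_page)
    print('model:', info['mj_name'], 'pictures:', info['count'])
    for pic_url in parser.parse_page_pics(first_page):
        pic_data = downloader.download(pic_url)
        if pic_data:
            with open(pic_url.split('/')[-1], 'wb') as f:
                f.write(pic_data)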

If the example above still does not make lxml click, it is recommended to first read an online Python lxml tutorial and then go through the official documentation to deepen your understanding; but as always, it comes down to practice, and after a few real attempts you will get it in no time.

"Craftsman Joshui Http://blog.csdn.net/yanbober without permission to reprint, please respect the author's labor results." Private Messages Contact Me "

4 Summary

This article continues directly from "python3.x crawler combat (first climb up hi)", focusing on the common patterns for the static page downloader and parser used when crawling. It is mainly meant to build an understanding of the crawler workflow and of your own small crawler programs; for large crawlers these introductions are far from robust, and in practice a third-party crawler framework is usually adopted. Frameworks and dynamic page crawling will be introduced later in this series.
