* Original author: arkteam/xhj. This article is part of the FreeBuf original reward program and may not be reproduced without permission.
One. Related Background
A web crawler (web spider), also known as a web robot, is a program used to automatically collect website data. If the Internet is likened to a spider web, then the crawler is a spider climbing around on that web. A crawler can not only gather information for search engines, but can also act as a targeted information collector, harvesting specific data from particular sites, for example: ticket prices, job postings, rental listings, or Weibo comments.
Two. Application Scenarios
Figure 1 Application Scenario
Crawler technology is useful in scientific research, web security, product development, public opinion monitoring, and many other fields. For example: in research areas such as data mining, machine learning, and image processing, missing data can be crawled from the Internet; in web security, a crawler can be used to check (and exploit) a known vulnerability across many sites in batch; in product development, a crawler can collect commodity prices from different online stores so that users can be offered the lowest market price; in public opinion monitoring, Sina Weibo data can be crawled and analyzed to determine whether a given account is a paid poster.
Three. Purpose of This Article
This article briefly describes the basic knowledge and techniques needed for targeted information gathering, along with the related Python libraries. It also provides wrappers around the data-fetching libraries, with the goal of reducing unnecessary configuration and making them easier to use; at present only urllib2, requests, and mechanize are wrapped. Repository: https://github.com/xinhaojing/Crawler
Four. Running Process
For targeted information crawling, a crawler mainly performs data fetching, data parsing, and data storage. Specifically:
(1) Data fetching: send the constructed HTTP request and obtain the HTTP response containing the required data;
(2) Data parsing: analyze and clean the raw content of the HTTP response to extract the required data;
(3) Data storage: save the extracted data to a database (or text file) to build a knowledge base.
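To make the three steps concrete, here is a minimal sketch using requests, lxml, and sqlite3 (not the wrapper from the repository above); the URL, the XPath expression, and the table layout are hypothetical placeholders.

# Minimal sketch of the three steps: fetch, parse, store.
# The URL and the XPath below are placeholders, not a real target.
import sqlite3
import requests
from lxml import etree

# (1) Data fetching: send the constructed HTTP request
resp = requests.get('http://example.com/list', timeout=10)

# (2) Data parsing: extract the required fields from the raw response
tree = etree.HTML(resp.text)
titles = tree.xpath('//div[@class="item"]/a/text()')   # hypothetical structure

# (3) Data storage: save the extracted data to a database
conn = sqlite3.connect('crawl.db')
conn.execute('CREATE TABLE IF NOT EXISTS items (title TEXT)')
conn.executemany('INSERT INTO items (title) VALUES (?)', [(t,) for t in titles])
conn.commit()
conn.close()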
Figure 2.1 Basic running Process
Figure 2.2 Detailed running process
Five. Related Technologies
The techniques involved in crawling include:
(1) Data fetching: understand the meaning of the fields in HTTP requests and responses, and know the relevant network analysis tools used to inspect network traffic, such as Burp Suite; in most cases the browser's developer tools are enough;
(2) Data parsing: understand HTML structure, the JSON and XML data formats, CSS selectors, XPath expressions, regular expressions, and so on, in order to extract the required data from the response;
(3) Data storage: know databases such as MySQL, SQLite, and Redis for storing the crawled data;
Figure 3 Related Technologies
The above are the basic requirements for learning to write crawlers. In real applications you should also consider how to use multithreading to improve efficiency, how to schedule tasks, how to deal with anti-crawler measures, how to build a distributed crawler, and so on. This article's coverage is relatively limited and is for reference only.
Six. Python-Related Libraries
Besides the Scrapy framework, Python has a number of libraries that can be used to implement a crawler. For data fetching these include urllib2 (urllib3), requests, mechanize, selenium, and splinter; for data parsing they include lxml, BeautifulSoup4, re, and pyquery.
For data fetching, the main task is to simulate a browser sending a constructed HTTP request to the server; the common request types are GET and POST. Among the libraries above, urllib2 (urllib3), requests, and mechanize retrieve the raw response content of a URL, whereas selenium and splinter load a browser driver and return the content after the browser has rendered it, so they simulate a real browser more faithfully.
Which library to choose should depend on actual needs, such as efficiency requirements and the target site's anti-crawler measures. In general, prefer urllib2 (urllib3), requests, or mechanize and avoid selenium and splinter, because the latter are less efficient: they have to load a full browser. The sketch below contrasts the two approaches.
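As a rough illustration of the trade-off, this sketch fetches the same placeholder URL once with requests (raw server response) and once with selenium (content after browser rendering); it assumes a working Firefox driver is installed.

# requests returns the raw HTML the server sends; selenium returns the DOM
# after the browser has executed JavaScript, at the cost of starting a browser.
import requests
from selenium import webdriver

url = 'http://example.com/'

# Raw response content (fast, no browser needed)
raw_html = requests.get(url, timeout=10).text

# Rendered content (slower: a real browser is started via its driver)
driver = webdriver.Firefox()          # assumes the Firefox driver is installed
driver.get(url)
rendered_html = driver.page_source
driver.quit()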
For data parsing, the task is to extract the required data from the response page. Common methods are XPath expressions, CSS selectors, and regular expressions. XPath expressions and CSS selectors are mainly used to extract structured data, while regular expressions are mainly used to extract unstructured data. The corresponding libraries are lxml, BeautifulSoup4, re, and pyquery, as illustrated in the short sketch below.
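A small sketch of the three extraction methods applied to the same HTML fragment, using lxml, BeautifulSoup4, and re; the fragment itself is made up for illustration.

import re
from lxml import etree
from bs4 import BeautifulSoup

html = '<div class="price">Price: <span id="p">128</span> yuan</div>'

# XPath expression (lxml): structured extraction
print(etree.HTML(html).xpath('//span[@id="p"]/text()')[0])

# CSS selector (BeautifulSoup4): structured extraction
print(BeautifulSoup(html, 'html.parser').select('span#p')[0].get_text())

# Regular expression (re): unstructured extraction
print(re.search(r'Price:\s*<span[^>]*>(\d+)</span>', html).group(1))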
Table 1 Related library documents
|
Class library |
document |
Data crawl |
urllib2 |
https://docs.python.org/2/library/urllib2.html |
requests |
http://cn.python-requests.org/zh_CN/latest |
mechanize |
https://mechanize.readthedocs.io/en/latest/ |
splinter |
Http://splinter.rea dthedocs.io/en/latest/ |
Selenium |
https://selenium-python.readthedocs.io/ |
Data resolution |
lxml |
http://lxml.de/ |
Beautifulsou P4 |
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html http://cuiqingcai.com/1319.html |
re |
http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html |
Pyquery |
https://pythonhosted.org/pyquery/ |
Seven. Introduction to the Libraries
1. Data Fetching
(1) urllib2
urllib2 is one of Python's built-in libraries for accessing web pages and local files, and it is often used together with urllib: urllib provides the urlencode method for encoding the data to be sent, which urllib2 lacks.
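A minimal Python 2 sketch of the usual urllib/urllib2 combination (separate from the wrapper described below); the URL and form fields are placeholders.

# Python 2: urllib supplies urlencode for the request body, urllib2 sends the request.
import urllib
import urllib2

# GET request
resp = urllib2.urlopen('http://example.com/', timeout=10)
print(resp.getcode())        # status code
html = resp.read()

# POST request: the data must be urlencoded with urllib first
data = urllib.urlencode({'user': 'test', 'page': '1'})
req = urllib2.Request('http://example.com/login', data=data,
                      headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req, timeout=10).read()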
The following describes a simple wrapper around urllib2 that gathers the relevant features into one class, avoiding some tedious configuration work.
Figure 4 urllib2 wrapper description
(2) requests and mechanize
requests is a third-party Python library based on urllib, but with a more convenient interface. On the request side it supports custom request headers, proxies, redirects, session persistence [requests.Session()], timeouts, and automatic urlencoding of POST data. On the response side, detailed data can be read directly from the response object without manual work, including the status code, the automatically decoded response body, and individual response header fields; a JSON decoder is also built in. A short sketch follows.
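This sketch exercises the requests features listed above; the URL, credentials, and proxy address are placeholders.

import requests

session = requests.Session()                            # keeps cookies across requests
session.headers.update({'User-Agent': 'Mozilla/5.0'})   # custom request header

# POST data is urlencoded automatically; proxies, timeout and redirects are options
resp = session.post('http://example.com/login',
                    data={'user': 'test', 'passwd': 'secret'},
                    proxies={'http': 'http://127.0.0.1:8080'},
                    timeout=10,
                    allow_redirects=True)

print(resp.status_code)                    # status code
print(resp.encoding)                       # encoding used to decode resp.text
print(resp.headers.get('Content-Type'))    # individual response header fields
# print(resp.json())                       # built-in JSON decoder, if the body is JSON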
mechanize is a drop-in replacement for part of urllib2's functionality that simulates browser behavior more completely and offers comprehensive control over web access. Its features include cookie handling, proxy settings, redirect settings, simple form filling, browser history and reload, optional addition of the Referer header, automatic compliance with robots.txt, automatic handling of http-equiv and refresh, and so on. A minimal usage sketch follows.
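A minimal mechanize sketch illustrating some of these features; the URL and the form field names are hypothetical.

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)                 # robots.txt is obeyed by default
br.set_handle_redirect(True)                # follow redirects
br.addheaders = [('User-Agent', 'Mozilla/5.0')]

br.open('http://example.com/login')
br.select_form(nr=0)                        # pick the first form on the page
br['username'] = 'test'                     # simple form filling
br['password'] = 'secret'
resp = br.submit()
print(resp.read()[:200])

br.back()                                   # browser history navigation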
The simple wrappers around requests and mechanize expose the same interface as the urllib2 wrapper, gathering the features into a single class, so they are not repeated here; refer to the linked code.
(3) splinter and selenium
selenium (Python bindings) and splinter can simulate browser behavior very well; both work by loading a browser driver. When collecting information this reduces the work of analyzing network requests: generally you only need to know the URL of the page that carries the data. Their efficiency is relatively low, because a browser has to be loaded.
By default both prefer the Firefox browser. The download addresses of the Chrome and PhantomJS (headless browser) drivers are listed here for convenience; a short usage sketch follows the list.
Chrome and PhantomJS driver addresses:
Chrome: http://chromedriver.storage.googleapis.com/index.html?path=2.9/
PhantomJS: http://phantomjs.org/download.html
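A sketch of the two browser-driven libraries; it assumes the PhantomJS binary and the Firefox driver are installed, and the URL is a placeholder.

from selenium import webdriver
from splinter import Browser

# selenium with PhantomJS (headless, so no window opens; supported by the
# selenium versions of that era)
driver = webdriver.PhantomJS()
driver.get('http://example.com/')
print(driver.page_source[:200])
driver.quit()

# splinter with Firefox (its default driver)
with Browser('firefox') as browser:
    browser.visit('http://example.com/')
    print(browser.html[:200])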
2. Data Parsing
For data parsing, the available libraries are lxml, BeautifulSoup4, re, and pyquery, of which BeautifulSoup4 is the most commonly used. Besides knowing these libraries, it also helps to understand XPath expressions, CSS selectors, and regular expression syntax in order to extract data from web pages conveniently. The Chrome browser can generate the XPath of an element by itself.
Figure 5 Viewing an element's XPath in Chrome
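As a sketch, an XPath expression copied from Chrome DevTools ("Copy XPath") can be passed straight to lxml; the URL and the copied expression below are hypothetical.

import requests
from lxml import etree

resp = requests.get('http://example.com/item/123', timeout=10)
tree = etree.HTML(resp.text)

# Expression as copied from Chrome for the target element (placeholder)
xpath_from_chrome = '//*[@id="content"]/div[2]/span[1]'
nodes = tree.xpath(xpath_from_chrome)
if nodes:
    print(nodes[0].text)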
If network analysis has already told you where the desired data can be fetched, then the work of extracting it from the page is relatively straightforward. For specific usage, refer to the documentation; it is not covered in detail here.
Eight. Anti-Crawler Measures
1. Basic anti-crawler measures mainly check fields in the request header, for example User-Agent and Referer. In this case, simply include the corresponding fields in the request. Ideally the fields of the constructed HTTP request are exactly the same as the ones the browser sends, but this is not strictly necessary.
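A minimal sketch of carrying the checked header fields with requests; the header values and URLs are examples only.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'http://example.com/list',
    'Accept': 'text/html,application/xhtml+xml',
}
resp = requests.get('http://example.com/detail/1', headers=headers, timeout=10)
print(resp.status_code)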
2. Behavior-based anti-crawler measures mainly count requests per IP (or per User-Agent) on the server side and block the client once a set threshold is exceeded. This can be addressed with proxy servers: every few requests, switch the proxy IP in use (or keep a User-Agent list and pick one at random each time). Note that this kind of anti-crawler measure may also hurt legitimate users.
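A sketch of switching the proxy IP and User-Agent on each request with requests; the proxy addresses, User-Agent strings, and URL are placeholders.

import random
import requests

proxies_pool = ['http://10.0.0.1:8080', 'http://10.0.0.2:8080']
ua_pool = ['Mozilla/5.0 (Windows NT 10.0)', 'Mozilla/5.0 (X11; Linux x86_64)']

for page in range(1, 11):
    proxy = random.choice(proxies_pool)                  # rotate proxy IP
    headers = {'User-Agent': random.choice(ua_pool)}     # rotate User-Agent
    resp = requests.get('http://example.com/list?page=%d' % page,
                        headers=headers,
                        proxies={'http': proxy},
                        timeout=10)
    print('%d %d' % (page, resp.status_code))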
3. If the desired data is loaded through AJAX requests, and network analysis can locate those requests and work out their specific parameters, then you can simply simulate the corresponding HTTP request and obtain the data from its response. This is no different from an ordinary request.
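A sketch of calling a discovered AJAX endpoint directly with requests; the endpoint and its parameters are hypothetical stand-ins for what network analysis would reveal.

import requests

api = 'http://example.com/api/comments'
params = {'item_id': '123', 'page': 1}
headers = {'X-Requested-With': 'XMLHttpRequest',   # often sent with AJAX calls
           'User-Agent': 'Mozilla/5.0'}

resp = requests.get(api, params=params, headers=headers, timeout=10)
data = resp.json()              # AJAX responses are commonly JSON
print(data)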
4. JavaScript-based anti-crawler measures: instead of the data page, the server first returns a page containing JavaScript code, using the visitor's ability to execute JavaScript to decide whether a real browser is being used.
Normally, when this JS code runs it sends a request carrying a key parameter, and the backend decides, based on the key's value, whether to respond with the real page or with a fake or wrong page. Because the key parameter is generated dynamically and differs every time, it is hard to work out how it is produced, so the corresponding HTTP request cannot be constructed directly.
For example, the website http://www.kuaidaili.com/ uses this approach; for details, see https://www.v2ex.com/t/269337.
On the first visit to the site, the JS in the response sends a request carrying a Yundun parameter, and the Yundun parameter is different each time.
Fig. 6 Dynamic parameter Yundun
At the time of testing, once the JavaScript code has executed, the subsequent request no longer carries the Yundun parameter; instead a cookie is generated dynamically, which plays a role similar to the Yundun parameter in later requests.
Figure 7 Dynamic Cookies
Against this kind of measure, the crawler needs to be able to parse and execute JavaScript, which can be achieved by loading a browser through selenium or splinter, as in the sketch below.
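A hedged sketch of this idea: load the page in a browser so the JavaScript runs, then copy the resulting cookies into a requests session for subsequent fetches. It assumes PhantomJS (or Firefox) is available; the follow-up request is illustrative.

import requests
from selenium import webdriver

driver = webdriver.PhantomJS()              # or webdriver.Firefox()
driver.get('http://www.kuaidaili.com/')     # the JS executes and sets the cookie

session = requests.Session()
for c in driver.get_cookies():              # copy browser cookies into requests
    session.cookies.set(c['name'], c['value'])
driver.quit()

# Later requests reuse the dynamically generated cookie
resp = session.get('http://www.kuaidaili.com/', timeout=10)
print(resp.status_code)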
More detailed anti-crawler techniques and countermeasures can be found at:
1.https://zhuanlan.zhihu.com/p/20520370
2.https://segmentfault.com/a/1190000005840672
3.http://v.qq.com/page/j/o/t/j0308hykvot.html
Nine. References
[1] http://www.test404.com/post-802.html
[2] http://blog.csdn.net/shanzhizi/article/details/50903748
[3] http://blog.chinaunix.net/uid-28930384-id-3745403.html
[4] http://blog.csdn.net/cnmilan/article/details/9199181
[5] https://zhuanlan.zhihu.com/p/20520370
[6] https://segmentfault.com/a/1190000005840672
[7] https://www.v2ex.com/t/269337