Crawler Summary (1): Crawler Basics & Python Implementation


Crawlers come up often in day-to-day work, but I have never summarized them systematically, and they actually touch on a lot of knowledge points. This series covers those points without trying to be exhaustive; the goal is simply to build up a working framework of crawler knowledge. This post explains the basic concepts and walks through an entry-level crawler (crawling NetEase News as the example).

Crawler Basics

What is a crawler?

A crawler is essentially a process for acquiring resources. Building one breaks down into three steps: crawl, extract, save. First fetch the entire content of a page, then extract the parts that are useful to you, and finally save the useful parts.

Crawler Types

Web crawler

A web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. Web crawlers are a very important part of search engine systems: the pages they fetch are used to build the index that supports the search engine. The crawler determines how rich and how fresh the engine's content is, and its performance directly affects the quality of the search results.

Traditional crawler

A traditional crawler starts from the URLs of one or more seed pages and, while fetching those pages, continually extracts new URLs from the current page and puts them into a queue, until some stop condition of the system is met.

Working principle: a page-analysis algorithm filters out links unrelated to the topic, keeps the useful links, and puts them into the queue of URLs waiting to be crawled; a search policy then selects the next URL to crawl from the queue, and the process repeats until a specified condition ends the crawl. All crawled pages are analyzed, filtered, and indexed for later queries and retrieval.

Crawl strategies

Breadth-first

Complete the search of the current level before moving on to the next level. This is the strategy in general use, usually implemented with a queue.

Best-first

An evaluation algorithm scores pages, and pages judged useful are crawled first.

Depth-first

Few practical applications; it can cause the crawler to get trapped. Implemented with a stack.
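To make the queue-versus-stack distinction concrete, here is a minimal breadth-first crawl sketch (not from the original article): extract_links is a naive, hypothetical stand-in for real HTML parsing, and the requests library used here also appears later in this post. Popping from the other end of the container would turn it into a depth-first crawl.

import re
from collections import deque

import requests

def extract_links(html):
    # naive href extraction, for illustration only
    return re.findall(r'href="(http[^"]+)"', html)

def bfs_crawl(seed_url, max_pages=10):
    queue = deque([seed_url])      # FIFO queue -> breadth-first
    seen = set([seed_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()      # popping a stack instead would give depth-first
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        for link in extract_links(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen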

URL (Uniform Resource Locator)

Every resource on the Internet has its own unique address, composed of three parts: the scheme (protocol), the host (IP address or domain name plus port number), and the path of the resource on that host. Example: http://www.example.com/index.html
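As a quick illustration of those three parts, the standard library can split a URL for us (a small sketch; the Python 2 import is shown to match the urllib2 examples later in the article):

from urlparse import urlparse   # on Python 3: from urllib.parse import urlparse

parts = urlparse('http://www.example.com:80/index.html')
print(parts.scheme)    # 'http'                -> protocol
print(parts.netloc)    # 'www.example.com:80'  -> host and port
print(parts.path)      # '/index.html'         -> path of the resource on the host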

Web Server / Socket: How to establish connections and transmit data

How a web server works is much like making a phone call (buy a phone -> register a number -> listen -> accept the call -> read and write -> hang up), together with the classic three-way handshake ("Anyone there?" "I'm here, what about you?" "I'm here too, still listening.").

The crawler (client) side needs a socket, uses it to send a connect() request to the server, communicates with the server once the connection is established, and closes the socket when it is done. The server side is more involved: it also needs a socket, and that socket must be bound to an address (bind()), which is like having a fixed phone number so that others can dial it to reach the server. After binding, the server's socket starts listening (listen()) for user requests; when one arrives, it accepts the request (accept()), establishes a connection with the user, and then the two sides can communicate.
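A minimal sketch of those calls, using an arbitrary loopback address and port (this is generic socket code, not the article's server):

import socket

def run_server():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(('127.0.0.1', 8000))    # bind(): claim a fixed "phone number"
    server.listen(1)                    # listen(): wait for incoming calls
    conn, addr = server.accept()        # accept(): pick up when a client connects
    data = conn.recv(1024)              # read ...
    conn.sendall(data)                  # ... and write
    conn.close()
    server.close()

def run_client():
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(('127.0.0.1', 8000)) # connect(): dial the server
    client.sendall(b'hello')
    reply = client.recv(1024)
    client.close()
    return reply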

HTML DOM

The DOM defines a standard for accessing and manipulating HTML documents by representing the document as a tree structure.

Cookies

Cookies are generated by the server and sent to the user agent (typically the browser); the browser saves the cookie's key/value pairs to a text file in some directory and sends the cookie back to the server the next time it visits the same website.

HTTP GET and POST

GET is accessed directly as a link, with all the parameters contained in the URL; POST puts the submitted data into the body of the HTTP request.
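A small sketch of the GET case, in the same Python 2 style as the POST example below (the URL and parameters here are made up for illustration):

import urllib
import urllib2

params = urllib.urlencode({'q': 'crawler', 'page': 1})
url = 'http://www.example.com/search?' + params   # all parameters live in the link itself
response = urllib2.urlopen(url)
page = response.read()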

For example, a POST request:

import urllib
import urllib2

url = 'http://www.zhihu.com/#signin'
user_agent = 'Mozilla/5.0'
values = {'username': '252618408@qq.com', 'password': 'xxx'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)                # urlencode is a method urllib has and urllib2 lacks
request = urllib2.Request(url, data, headers)  # write the letter
response = urllib2.urlopen(request)            # send the letter and get the reply
page = response.read()                         # read the reply

urllib can only accept a URL, which means you cannot disguise your User-Agent string and so on, but urllib provides the urlencode method for generating GET query strings, which urllib2 does not.
So urllib and urllib2 are often used together.

Header settings

User-Agent: some servers or proxies check this value to decide whether the request was made by a real browser.
Content-Type: when using a REST interface, the server checks this value to decide how to parse the content of the HTTP body. application/xml is used in XML-RPC, such as RESTful/SOAP calls; application/json is used in JSON-RPC calls; application/x-www-form-urlencoded is used when the browser submits a web form.

Crawler Difficulties

A crawler has two parts. The first is downloading web pages, and there are many issues to consider: how to make the most of local bandwidth, how to schedule requests to different sites so as to reduce the load on the other servers, and so on. In a high-performance crawler system, DNS queries can also become a bottleneck worth optimizing, and some "rules of the road" need to be followed (for example, robots.txt).
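As a sketch, the standard library can check robots.txt before fetching (the Python 2 module name is used here; on Python 3 it is urllib.robotparser):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://news.163.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://news.163.com/rank'))   # True if this crawl is allowed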
The analysis that follows, once a page has been fetched, is also very complex. The Internet is full of strange things, with all kinds of error-riddled HTML pages, and parsing every one of them correctly is nearly impossible. With the popularity of AJAX, how to get content dynamically generated by JavaScript is a big problem. There are also all sorts of intentional or unintentional spider traps; if you blindly follow hyperlinks you can end up stuck in a trap, such as the site that reportedly announced a second trillion unique URLs after Google had earlier claimed the number of unique URLs on the Internet had reached one trillion.

The Simplest Crawler

The requests library

import requests
url = "http://shuang0420.github.io/"
r = requests.get(url)
The urllib2 library
import urllib2

# request the source file
url = "http://shuang0420.github.io/"
request = urllib2.Request(url)        # write a letter
response = urllib2.urlopen(request)   # send the "letter" and get the reply
page = response.read()                # read the reply

# save the source file
webFile = open('webpage.html', 'wb')
webFile.write(page)
webFile.close()

This is a simple crawler; opening webpage.html shows the page rendered without any CSS.
Example: Crawl NetEase news

Crawl NetEase News [code example]
– Fetch pages with the urllib2 / requests packages
– Parse the initial page with regular expressions and BS4; parse the level-two pages with XPath
– Save the resulting titles and links to a local file

Analyze the initial page

Our initial page is http://news.163.com/rank.

View Source code

What we want are the category titles and URLs, which requires parsing the DOM document tree; here we use BeautifulSoup.

 
 
import re
from bs4 import BeautifulSoup

def nav_info(myPage):
    # level-two navigation titles and pages
    pageInfo = re.findall(r'<div class="subNav">.*?<div class="area areabg1">', myPage, re.S)[0] \
        .replace('<div class="subNav">', '').replace('<div class="area areabg1">', '')
    soup = BeautifulSoup(pageInfo, "lxml")
    tags = soup('a')
    topics = []
    for tag in tags:
        # keep only technology, finance and sports news
        # if tag.string == 'technology' or tag.string == 'finance' or tag.string == 'sports':
        topics.append((tag.string, tag.get('href', None)))
    return topics

Note that BeautifulSoup never parses a document faster than the parser it relies on, so if run time matters, or if machine time is worth more than programmer time, you should use lxml directly. In other words, one way to make BeautifulSoup faster is to use lxml as its parser: BeautifulSoup with lxml is faster than with html5lib or Python's built-in parser. BS4's default parser is html.parser; to use lxml, the code is as follows:

BeautifulSoup(markup, "lxml")
Analyze the level-two pages

View Source code

What we want to crawl are the news headlines and links, which again requires parsing the document tree. This can be done with the following code, which uses the lxml parser directly for better efficiency.

from lxml import etree

def news_info(newPage):
    # XPath uses path expressions to select nodes or node-sets in a document
    dom = etree.HTML(newPage)
    news_titles = dom.xpath('//tr/td/a/text()')
    news_urls = dom.xpath('//tr/td/a/@href')
    return zip(news_titles, news_urls)
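As a rough sketch of how the two helpers above could be wired together (the output file name and the loop are assumptions, not the article's full code, and character-encoding corner cases are ignored):

import urllib2

seedPage = urllib2.urlopen('http://news.163.com/rank').read()

out = open('news.txt', 'w')
for topic, topic_url in nav_info(seedPage):
    newPage = urllib2.urlopen(topic_url).read()
    for title, link in news_info(newPage):
        out.write((u'%s\t%s\n' % (title, link)).encode('utf-8'))
out.close()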

Full Code

Potential Problems

Suppose our task is to crawl 10,000 pages. Following the program above this is time-consuming, so we can consider opening multiple threads (a thread pool) to crawl together, or using a distributed architecture to fetch pages concurrently.

Both the seed URLs and the URLs still to be parsed are kept in a plain list; we should design a more reasonable data structure for the URLs waiting to be crawled, such as a queue or a priority queue.

We currently treat every URL equally, but we should actually treat them differently: the principle of giving good sites priority should be considered.

Each request we make is based on a URL, which involves DNS resolution to translate the URL into an IP address. A website is usually made up of thousands of URLs, so we can consider caching the IP addresses of these sites' domain names to avoid issuing a DNS request every time, which is time-consuming and wasteful.

After parsing the URLs in a page, we did nothing to de-duplicate them and put them all into the list to crawl. In reality many links are duplicated, so we end up doing a lot of repeated work.

Optimization Schemes (including the problem of the crawler being banned)

Parallel crawl problem

For parallel crawling, the first thing that comes to mind is multithreading or a thread pool, that is, one crawler program opening multiple threads. The same machine can also run several crawler programs, so that we have N crawl threads working at the same time, greatly improving efficiency.
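A minimal thread-pool sketch (multiprocessing.dummy provides a thread-backed Pool in the standard library; the URLs and pool size are arbitrary examples):

from multiprocessing.dummy import Pool

import requests

def fetch(url):
    try:
        return url, requests.get(url, timeout=5).status_code
    except requests.RequestException:
        return url, None

urls = ['http://news.163.com/rank', 'http://shuang0420.github.io/']
pool = Pool(4)                   # 4 worker threads
results = pool.map(fetch, urls)  # fetch the URLs concurrently
pool.close()
pool.join()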

Of course, if the crawl task is very large, one machine and one process is certainly not enough, and we have to consider a distributed crawler. A distributed architecture raises many issues: we need a scheduler to assign tasks and ordering; the crawlers need to communicate and cooperate to finish the job without fetching the same page repeatedly; and we need to consider load balancing so that tasks are assigned fairly (for example, by hashing on the site's domain name).
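A toy sketch of assigning URLs to machines by hashing the domain name, so all pages of the same site go to the same worker (the machine count is an arbitrary example; crc32 is used because Python's built-in hash() is not stable across machines):

import zlib
from urlparse import urlparse   # on Python 3: from urllib.parse import urlparse

N_MACHINES = 3

def assign(url):
    domain = urlparse(url).netloc
    return zlib.crc32(domain.encode('utf-8')) % N_MACHINES

print(assign('http://news.163.com/rank'))       # every news.163.com URL maps to the same machine
print(assign('http://shuang0420.github.io/'))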

Load balancing does not end once tasks are assigned: what if a machine goes down? Who takes over the tasks that were assigned to it? And if new machines are added every day, how should the tasks be redistributed? So we need a task table to record the state of each task.

The queue of pages to crawl
How to handle the queue of pages to crawl is a scenario much like how an operating system schedules processes.
Different sites have different levels of importance, so we can design a priority queue to store the links of the pages to be crawled; then, on each round, we crawl the important pages first.
We can also borrow a process-scheduling strategy from operating systems, such as the multilevel feedback queue scheduling algorithm.
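A minimal priority-queue sketch with heapq, where a lower number means a more important site (the scores are arbitrary examples):

import heapq

frontier = []
heapq.heappush(frontier, (0, 'http://news.163.com/rank'))     # important site
heapq.heappush(frontier, (5, 'http://www.example.com/page'))  # unimportant site
heapq.heappush(frontier, (1, 'http://shuang0420.github.io/'))

while frontier:
    priority, url = heapq.heappop(frontier)   # the most important page comes out first
    print('%d %s' % (priority, url))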

DNS Cache
To avoid issuing a DNS query on every request, we can cache DNS results. The DNS cache can of course be designed as a hash table that stores the domain names we have already resolved and their IPs.
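A small sketch of such a cache, using a plain dict keyed by host name:

import socket

dns_cache = {}

def resolve(host):
    if host not in dns_cache:                       # only hit DNS on a cache miss
        dns_cache[host] = socket.gethostbyname(host)
    return dns_cache[host]

print(resolve('news.163.com'))   # first call issues a DNS query
print(resolve('news.163.com'))   # second call is answered from the cache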

Page de-duplication
When it comes to de-duplicating pages, the first thing that comes to mind is spam filtering, and a classic solution for spam filtering is the Bloom filter. Its principle, briefly: create a large bit array, hash the URL with several hash functions to get several numbers, and set the bits at those positions in the array to 1. The next time a URL arrives, hash it with the same hash functions to get the same set of positions; if the bits at all of those positions are already 1, the URL has appeared before. This takes care of URL de-duplication. The method does make mistakes, but as long as the error stays within our tolerance (say, out of 10,000 pages I only crawl 9,999) it has no real impact.
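A toy Bloom-filter sketch of the idea above (the bit-array size and the number of hash functions are arbitrary choices; production code would use a tuned library instead):

import hashlib

class BloomFilter(object):
    def __init__(self, size=2 ** 20, k=3):
        self.size = size                    # number of bits
        self.k = k                          # number of hash functions
        self.bits = bytearray(size // 8)

    def _positions(self, url):
        # derive k bit positions from k salted hashes of the URL
        for i in range(self.k):
            digest = hashlib.md5((str(i) + url).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def seen(self, url):
        # True only if every one of the URL's bits is already set
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

bf = BloomFilter()
bf.add('http://news.163.com/rank')
print(bf.seen('http://news.163.com/rank'))   # True: probably seen before
print(bf.seen('http://www.example.com/'))    # False: definitely not seen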
Another very good approach starts from URL similarity; here is a brief introduction.
Considering the structure of a URL itself, computing its similarity can be abstracted into computing the similarity of its key features. For example, the site can be abstracted as one feature, the directory depth as another, and the first-level directory, second-level directory and tail page name can each be abstracted as a feature as well. Take the following two URLs:
URL1: http://www.spongeliu.com/go/happy/1234.html
URL2: http://www.spongeliu.com/snoopy/tree/abcd.html

The features:
– Site feature: if the two URLs' sites are the same, the feature is 1, otherwise 0.
– Directory-depth feature: 1 if the directory depths of the two URLs match, otherwise 0.
– First-level-directory feature: many methods are possible in this dimension, for example 1 if the first-level directory names are the same and 0 otherwise, or a value based on the edit distance between the directory names, or on the pattern of the names (digits, letters, or digits and letters interleaved). It depends on the specific need; the example here simply uses 1 or 0 according to whether the directory names are the same.
– Tail-page feature: handled like the directory features; you can check whether the suffixes are the same, whether the page name is numeric, whether it looks like a machine-generated random string, or use an edit-distance value, depending on the need. The example here simply checks whether the last path components follow the same pattern (for example, whether they are made up of digits, whether they contain letters, and so on).

In this way we get four feature values for these two URLs: 1 1 0 0. With this feature vector we can judge similarity according to the specific requirements: we decide how important each feature is and write a formula:

similarity = feature1 * x1 + feature2 * x2 + feature3 * x3 + feature4 * x4

Here each x represents the importance of the corresponding feature. For example, if I think the site and the directories are unimportant and the tail-page feature is the most important, then x1, x2 and x3 can be 0 and x4 can be 1, and the similarity follows from that. Or, if the site accounts for 10% of the importance, the directory depth for 50% and the tail page for 40%, the coefficients can be 0.1, 0.5, 0 and 0.4 respectively.
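A sketch of computing those four features and the weighted similarity for the two example URLs (the exact feature rules here are simplified guesses that reproduce the 1 1 0 0 vector above; the weights are the 0.1/0.5/0/0.4 example):

from urlparse import urlparse   # on Python 3: from urllib.parse import urlparse

def url_features(url1, url2):
    p1, p2 = urlparse(url1), urlparse(url2)
    d1 = p1.path.strip('/').split('/')
    d2 = p2.path.strip('/').split('/')
    site = 1 if p1.netloc == p2.netloc else 0       # same site?
    depth = 1 if len(d1) == len(d2) else 0          # same directory depth?
    first_dir = 1 if d1[0] == d2[0] else 0          # same first-level directory?
    tail = 1 if d1[-1].split('.')[0].isdigit() == d2[-1].split('.')[0].isdigit() else 0  # same tail-page pattern?
    return [site, depth, first_dir, tail]

weights = [0.1, 0.5, 0.0, 0.4]
features = url_features('http://www.spongeliu.com/go/happy/1234.html',
                        'http://www.spongeliu.com/snoopy/tree/abcd.html')
similarity = sum(f * w for f, w in zip(features, weights))
print('%s %.1f' % (features, similarity))   # [1, 1, 0, 0] 0.6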

In fact, the problem of choosing these features can be reduced to a machine learning problem: we only need to manually label a number of URL pairs as similar or not and train an SVM, so the judgment can then be made by the machine.
Besides judging similarity between pairs of URLs as above, we can also abstract each URL into a set of features, compute a score for each URL, and set a threshold on the score difference; this lets us find similar URLs within a large set.

Problems with data storage
Data storage is also a question with a lot of technical depth: whether to use a relational database or NoSQL, or to design a specific file format for storage, is well worth thinking about.

Inter-process communication
A distributed crawler cannot do without inter-process communication. We can exchange data in an agreed-upon format to accomplish the communication between processes.

The anti-crawler problem
To deal with anti-crawler mechanisms, we can rotate IP addresses, rotate cookies, modify the User-Agent, limit the request rate, and avoid repetitive crawling patterns.
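A small sketch of two of those countermeasures with requests: rotating the User-Agent and limiting the request rate (the agent strings and delay range are arbitrary examples; IP rotation would additionally need a pool of proxies):

import random
import time

import requests

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)',
]

def polite_get(url):
    headers = {'User-Agent': random.choice(USER_AGENTS)}   # rotate the User-Agent
    time.sleep(random.uniform(1, 3))                       # limit speed, avoid a fixed rhythm
    return requests.get(url, headers=headers, timeout=5)

r = polite_get('http://news.163.com/rank')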

Reference Links:

Basic Principles of Web Crawlers (1)
http://www.chinahadoop.cn/course/596/learn#lesson/11986
https://www.bittiger.io/blog/post/5pDTFcDwkmCvvmKys
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html


Original address: http://www.shuang0420.com/2016/06/11/%E7%88%AC%E8%99%AB%E6%80%BB%E7%BB%93%EF%BC%88%E4%B8%80%EF%BC%89/
