An Analysis of Crawler Technology


http://drops.wooyun.org/tips/3915

0x00 Preface

A web crawler (or spider) is a program that browses the web automatically, in other words a web robot. Crawlers are widely used by Internet search engines and similar sites to fetch or update site content and build their indexes. They can automatically capture every page they are able to reach and hand the content over to the next stage of processing.

In the Web 2.0 era, dynamic pages are everywhere, so a crawler should be able to follow the links that JavaScript generates inside a page. Of course, dynamic page parsing is only one technical aspect of a crawler. Below I share some personal experience with the following topics (the programming language is Python):

1. Crawler architecture
2. Page download and parsing
3. URL deduplication
4. URL similarity
5. Concurrency
6. Data storage
7. Dynamic crawler source code
8. Reference articles

0x01 Crawler Architecture

When it comes to crawler architecture, Scrapy's is the one that has to be mentioned. Scrapy is a fast, high-level crawler framework written in Python for crawling web sites and extracting structured data from pages. It can be used in a wide range of applications such as data mining, monitoring and automated testing. The attraction of Scrapy is that it is a framework anyone can easily modify as needed, and it provides base classes for several types of crawlers, such as BaseSpider and the sitemap crawler.

In the Scrapy architecture diagram, the green lines show the data flow: starting from the initial URLs, the Scheduler hands them to the Downloader; downloaded pages are passed to the Spider for analysis; data that needs to be saved is sent to the Item Pipeline, which handles post-processing. Various middleware can also be installed along the data-flow channels to do whatever processing is needed. So when developing a crawler it is best to plan the modules first; my approach is to plan the download module, the extraction module, the scheduling module and the data-storage module separately.

0x02 Page Download and Parsing

Page download

Pages can be downloaded in two ways: statically or dynamically.

Traditional crawlers use static downloading. Its advantage is speed, but the page is just the raw HTML, so link extraction is limited to the href attribute of <a> tags, unless you are skilled enough to analyse the JavaScript, forms and other tags yourself to dig out additional links. In Python, the urllib2 or requests module covers this. Dynamic crawlers have a clear advantage in the Web 2.0 era, because pages are processed with JavaScript and content is fetched asynchronously via Ajax, so a dynamic crawler has to analyse the page after the JavaScript has run and the Ajax content has arrived. At present the simple solution is to drive a WebKit-based module directly; the PyQt4, splinter and selenium modules can all do this. A crawler does not need a browser window, so a headless browser is very cost-effective; HtmlUnit and PhantomJS are the available headless browsers.
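A rough sketch of that comparison (not the original snippet, which is not reproduced here); it assumes the requests and selenium packages, a PhantomJS binary on the PATH, and an older selenium release that still ships the PhantomJS driver:

    import re
    import requests
    from selenium import webdriver

    url = 'http://www.sina.com.cn'
    link_pattern = re.compile(r'href="([^"]*)"')

    # Static download: only the raw HTML is fetched.
    static_html = requests.get(url).text
    print('static length: %d, links: %d'
          % (len(static_html), len(link_pattern.findall(static_html))))

    # Dynamic download: a headless WebKit browser runs the JavaScript first.
    browser = webdriver.PhantomJS()
    browser.get(url)
    dynamic_html = browser.page_source
    browser.quit()
    print('dynamic length: %d, links: %d'
          % (len(dynamic_html), len(link_pattern.findall(dynamic_html))))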

Running such a comparison against the Sina home page contrasts the length of the statically downloaded page with the dynamically rendered one, and the number of links extracted from each.

With static crawling, the page length is 563,838 bytes and only 166 links are extracted. With dynamic crawling, the page length grows to 695,991 bytes and the number of links reaches 1,422, a nearly tenfold increase.

Link-extraction expressions

Regex: re.compile(r'href="([^"]*)"')

XPath: xpath('//*[@href]')

Page parsing

Page parsing is the module that extracts the links inside a page and captures specific data. Page parsing is mostly string processing, and HTML is a special kind of string, so Python offers plenty of modules for the job: re, BeautifulSoup, HTMLParser, lxml and so on. For links, the main targets are the href attribute of <a> tags and the src attribute of some other tags.
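A minimal sketch of such a parser using lxml (BeautifulSoup or HTMLParser would do just as well):

    from lxml import html

    def extract_links(page_source):
        tree = html.fromstring(page_source)
        hrefs = tree.xpath('//a/@href')   # href attributes of <a> tags
        srcs = tree.xpath('//*/@src')     # src attributes of script/img/iframe tags
        return hrefs + srcs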

0x03 URL Deduplication

URL deduplication is a key step in running a crawler. A running crawler spends most of its time blocked on network interaction, so avoiding repeated network requests is essential. A crawler generally puts the URLs to be crawled into a queue and extracts new URLs from the downloaded pages; before putting them into the queue it must make sure they have not been crawled already, and a URL that has been seen before is not queued again.

Hash table

Using a hash table for deduplication is usually the first idea that comes to mind, because hash-table lookup is O(1) and, as long as the table is large enough, the probability of a hash collision is tiny, so the duplicate check is very accurate. A hash table is therefore the simplest solution, but an ordinary one has an obvious flaw: when memory is tight, a huge hash table is not appropriate. In Python, the dict (or set) data structure can be used.
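A minimal sketch with the built-in set type (a dict works the same way):

    crawled = set()                  # the hash table of URLs already seen

    def is_new(url):
        if url in crawled:           # O(1) average-case lookup
            return False
        crawled.add(url)
        return True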

URL compression

If each node of the hash table stores the full URL as a str, memory consumption is very high; if the URL is compressed into an int, the memory footprint shrinks by a factor of three or more. Python's hashlib module can be used for this compression. The idea: make the hash-table node a set and store the compressed URLs in it.
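A sketch of that idea; taking 64 bits of the md5 digest is an arbitrary choice made here for illustration:

    import hashlib

    crawled = set()

    def is_new(url):
        # 16 hex characters = 64 bits, stored as an int instead of the full URL
        fingerprint = int(hashlib.md5(url.encode('utf-8')).hexdigest()[:16], 16)
        if fingerprint in crawled:
            return False
        crawled.add(fingerprint)
        return True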

Bloom Filter

A Bloom filter trades a very small error rate for a large saving in storage space. It uses k independent hash functions to map n input keys onto an m-bit data container.

The advantage of the Bloom filter is clear: with a container of controllable length, an element is judged present only when every hash function maps it to a bit that is already set to 1. Python's hashlib ships a variety of hash functions, including md5, sha1, sha224, sha256, sha384 and sha512, and salting can be added to the code very conveniently. Bloom filters do produce false positives; see the reference articles at the end.

In Python you can use the BloomFilter interface provided by Jaybaird's package, or build your own wheel.
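As a build-your-own-wheel illustration, here is a minimal hand-rolled sketch that maps k salted md5 digests onto an m-bit array; the bit size and salts below are arbitrary values chosen for illustration:

    import hashlib

    class SimpleBloomFilter(object):
        """k salted md5 hashes mapped onto an m-bit array."""

        def __init__(self, bit_size=1 << 27, salts=('s1', 's2', 's3', 's4')):
            self.bit_size = bit_size              # m bits
            self.salts = salts                    # one salt per hash function (k)
            self.bits = bytearray(bit_size // 8)

        def _positions(self, key):
            for salt in self.salts:
                digest = hashlib.md5((salt + key).encode('utf-8')).hexdigest()
                yield int(digest, 16) % self.bit_size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

Usage is simply bf.add(url) and url in bf; Jaybaird's package provides a ready-made equivalent.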

Small details

One small but important detail is the choice of container when building the structure. A hash structure that occupies too much space is a very uncomfortable problem, so for crawler deduplication the following comparison solves part of it.

First, compare the run time of building each kind of container.
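A rough reconstruction of that timing test (not the author's original snippet; absolute numbers depend on the machine and on exactly how each container is built, so the figures quoted below are the original post's):

    import time

    N = 100000000                       # 100 million slots

    start = time.time()
    list_container = [0] * N            # list container
    print('list: %.1fs' % (time.time() - start))

    start = time.time()
    str_container = '0' * N             # string container
    print('str:  %.1fs' % (time.time() - start))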

As you can see, building a container of 100 million elements takes 7.2 s of run time when a list is chosen, but only 0.2 s when a string is chosen as the container.

Next, look at memory usage.

A list of 100 million elements occupies 794,660 KB of memory.

A string of 100 million characters occupies 109,720 KB, about 700,000 KB less.

0x04 URL Similarity

A basic algorithm

For URL similarity I have only practised a very simple method.

Even when repeated crawling is already prevented, similar URLs still need to be detected. I used the ideas provided by sponge and ly5066113; the details are in the reference articles.

The following is a group of URLs that should be judged similar:

http://auto.sohu.com/7/0903/70/column213117075.shtml
http://auto.sohu.com/7/0903/95/column212969565.shtml
http://auto.sohu.com/7/0903/96/column212969687.shtml
http://auto.sohu.com/7/1103/61/column216206148.shtml
http://auto.sohu.com/s2007/0155/s254359851/index1.shtml
http://auto.sohu.com/s2007/5730/s249066842/index2.shtml
http://auto.sohu.com/s2007/5730/s249067138/index3.shtml
http://auto.sohu.com/s2007/5730/s249067983/index4.shtml

Ideally, these URLs would be merged into just two representatives:

http://auto.sohu.com/7/0903/70/column213117075.shtml
http://auto.sohu.com/s2007/0155/s254359851/index1.shtml

The idea is as follows; three features need to be extracted:

1. The host string
2. The directory depth (split on '/')
3. A feature of the last path segment

The specific algorithm

The algorithm itself is very crude, but it should be easy enough to follow.
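The author's implementation ships with the MSpider source linked below; as one plausible sketch of the three features just listed, the similarity key below combines the host, the directory depth and the final path segment with its digits stripped, then hashes the result (the digit-stripping rule is my assumption, not necessarily the original one):

    import hashlib
    import re
    try:
        from urlparse import urlparse          # Python 2
    except ImportError:
        from urllib.parse import urlparse      # Python 3

    def similarity_key(url):
        parts = urlparse(url)
        segments = [s for s in parts.path.split('/') if s]
        depth = len(segments)                                # directory depth
        last = re.sub(r'\d+', '', segments[-1]) if segments else ''
        raw = '%s|%d|%s' % (parts.netloc, depth, last)       # host | depth | tail pattern
        return hashlib.md5(raw.encode('utf-8')).hexdigest()

URLs that produce the same key are treated as similar, so only one of them needs to be crawled.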

Actual effect:

For the 8 different URLs above, only 2 distinct values are calculated. In practice, on a hash table with tens of millions of entries, the collision rate is acceptable.

0x05 Concurrent Operations

The main concurrency models in Python are the multithreaded model, the multi-process model and the coroutine model. Elias wrote a dedicated article comparing the performance of the commonly used concurrency schemes. For a crawler, the limiting factor is mainly the response speed of the target servers, so choosing a model that is easy to control is the right call.

Multithreaded model

The multithreaded model is the easiest to get started with. Python's threading module implements the concurrency requirement well, combined with the Queue module for sharing data.
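A minimal thread-pool sketch along those lines (the thread count and the empty download step are placeholders for illustration):

    import threading
    try:
        import Queue as queue      # Python 2
    except ImportError:
        import queue               # Python 3

    url_queue = queue.Queue()      # shared, thread-safe work queue

    def worker():
        while True:
            url = url_queue.get()
            try:
                pass               # download and parse the page here
            finally:
                url_queue.task_done()

    for _ in range(10):            # ten crawler threads
        t = threading.Thread(target=worker)
        t.daemon = True
        t.start()

    # url_queue.put(seed_url) to feed work; url_queue.join() to wait for completion.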

Multi-process model

The multi-process model is similar to the multithreaded one; the multiprocessing module has an analogous Queue for data sharing. On Linux, user-level processes can take advantage of multiple cores, so this model solves the crawler-concurrency problem on multi-core machines.
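The multiprocessing version looks almost identical; a brief sketch (the worker count and the None sentinel for shutdown are illustrative choices):

    import multiprocessing

    def worker(task_queue):
        while True:
            url = task_queue.get()
            if url is None:                    # sentinel value: stop this worker
                break
            # download and parse the page here

    if __name__ == '__main__':
        task_queue = multiprocessing.Queue()   # process-safe queue for sharing work
        workers = [multiprocessing.Process(target=worker, args=(task_queue,))
                   for _ in range(4)]
        for p in workers:
            p.start()
        # task_queue.put(url) to feed work; put one None per worker to shut down.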

Coroutine model

In Elias's article, a greenlet-based program performs second only to Stackless Python, roughly half the speed of Stackless Python and nearly an order of magnitude faster than the other schemes. As a result, concurrent programs based on gevent (which wraps greenlet) have a clear performance advantage.

A few words about gevent (non-blocking asynchronous I/O): "gevent is a coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of the libevent event loop."

In actual programming the coroutine model works very well: the results are noticeably more controllable, and the gevent library's wrapper is very easy to use.
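A small gevent sketch (the URLs are just examples; monkey-patching the socket layer makes the blocking requests calls cooperative):

    import gevent
    from gevent import monkey
    monkey.patch_all()                         # make blocking sockets cooperative

    import requests

    def fetch(url):
        return len(requests.get(url).text)     # greenlet yields during network I/O

    urls = ['http://www.sina.com.cn', 'http://www.sohu.com']
    jobs = [gevent.spawn(fetch, u) for u in urls]
    gevent.joinall(jobs)
    print([job.value for job in jobs])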

0x06 Data Storage

Data storage involves a great deal of technology in its own right; as a novice I only have a few small experiences to share.

Premise: a relational database is used. MySQL was chosen for the tests, and the idea is no different for similar databases such as SQLite or SQL Server.

When storing data, the goal is to reduce interaction with the database, which improves performance. The typical pattern is an endless loop of "read one URL node, write one record". In practice that performs very poorly and storage is very slow.

The improved approach is to cut down the number of round trips with the database: instead of sending 1 node per interaction, send 10 or even 100 nodes at a time. Efficiency improves by a factor of 10 to 100, and in practice the effect is very good.
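A sketch of such batching with any DB-API driver (MySQLdb, pymysql and so on); the table and column names here are invented for illustration:

    buffered_rows = []

    def save_node(cursor, connection, url, content, batch_size=100):
        # Buffer rows and flush them in one executemany() call per batch
        # instead of issuing one INSERT per node.
        buffered_rows.append((url, content))
        if len(buffered_rows) >= batch_size:
            cursor.executemany(
                'INSERT INTO pages (url, content) VALUES (%s, %s)',
                buffered_rows)
            connection.commit()
            del buffered_rows[:]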

0x07 Dynamic Crawler Source Sharing

Crawler model

In the current crawler model, the scheduling module is the core. It is separate from the download, extraction and storage modules and shares three queues with them, while the download module shares one queue with the extraction module; data flows between the modules through these queues.

Crawler source code

The following features are implemented:

Dynamic page download
gevent-based concurrent crawling
BloomFilter deduplication
URL similarity filtering
Keyword filtering
Crawl-depth control

GitHub address: https://github.com/manning23/MSpider

The code as a whole is not very difficult; please go easy on it.

0x08 Reference Articles

Thanks to the authors of the following articles and discussions for sharing:

http://security.tencent.com/index.php/blog/msg/34
http://www.pnigos.com/?p=217
http://security.tencent.com/index.php/blog/msg/12
http://wenku.baidu.com/view/7fa3ad6e58fafab069dc02b8.html
http://wenku.baidu.com/view/67fa6feaaeaad1f346933f28.html
http://www.html5rocks.com/zh/tutorials/internals/howbrowserswork/
http://www.elias.cn/Python/PyConcurrency?from=Develop.PyConcurrency
http://blog.csdn.net/HanTangSongMing/article/details/24454453
http://blog.csdn.net/historyasamirror/article/details/6746217
http://www.spongeliu.com/399.html
http://xlambda.com/gevent-tutorial/
http://simple-is-better.com/news/334
http://blog.csdn.net/jiaomeng/article/details/1495500
http://bbs.chinaunix.net/forum.php?mod=viewthread&tid=1337181
http://www.tuicool.com/articles/nieEVv
http://www.zhihu.com/question/21652316
http://code.rootk.com/entry/crawler
