http://drops.wooyun.org/tips/3915
0x00 Preface
A web crawler (also called a spider or web robot) is a program that browses the network automatically. Crawlers are widely used by Internet search engines and similar sites to obtain or update their content and indexes. They automatically capture every page they can reach so that a program can process it further.
In the Web 2.0 era, dynamic web pages are everywhere, so a crawler needs to be able to extract the links that JavaScript generates inside a page. Of course, dynamic page parsing is only one technical point of a crawler. Below I will share some personal experience on the following topics (the programming language is Python).
1. Crawler architecture
2. Page download and parsing
3. URL deduplication
4. URL similarity algorithm
5. Concurrent operation
6. Data storage
7. Dynamic crawler source sharing
8. Reference articles
0x01 Crawler Architecture
When it comes to crawler architecture, Scrapy's has to be mentioned. Scrapy is a fast, high-level crawler framework written in Python, used to crawl websites and extract structured data from their pages. It can be applied to data mining, monitoring, and automated testing in a wide range of scenarios. The attraction of Scrapy is that it is a framework anyone can easily modify as needed, and it provides several crawler base classes, such as BaseSpider and the sitemap spider.
In Scrapy's architecture diagram, the green lines are the data flow. Starting from the initial URLs, the Scheduler hands them to the Downloader to fetch; the downloaded pages are handed to the Spider for analysis; data that needs to be saved is sent to the Item Pipeline, which handles post-processing of the data. In addition, various middleware can be installed along the data-flow channels to do whatever processing is needed. Therefore, when developing a crawler, it is best to plan out the individual modules first. My approach is to plan the download module, the extraction module, the scheduling module, and the data-storage module separately.
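For reference, a minimal Scrapy spider might look like the sketch below; the spider name and start URL are placeholders, not part of the original article.

```python
# Minimal Scrapy spider sketch; name and start_urls are placeholders.
import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://www.sina.com.cn/"]

    def parse(self, response):
        # Follow every link on the page; Scrapy's Scheduler and Downloader
        # handle the queuing and fetching behind the scenes.
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
```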
0x02 Page Download and Parsing
Page download
Page download falls into two approaches: static and dynamic.
Traditional crawlers use static downloading. Its advantage is speed, but the page is just plain HTML, so link analysis is limited to the href attribute of <a> tags (or, if you are skilled, to parsing JS, form tags, and so on to capture additional links). In Python, the urllib2 module or the requests module provides this functionality. Dynamic crawlers have a special advantage in the Web 2.0 era, because pages are processed with JavaScript and content is fetched asynchronously via Ajax, so a dynamic crawler must analyze the page after the JavaScript has run and the Ajax content has arrived. At present the simple solution is to work directly with WebKit-based modules; the PyQt4, splinter, and selenium modules can all achieve this. Since a crawler does not need a browser interface, a headless browser is very cost-effective; HtmlUnit and PhantomJS are available headless browsers.
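The original comparison code was shown as a screenshot; the sketch below follows the same idea, assuming requests plus selenium with a locally installed PhantomJS binary (older selenium releases still ship the PhantomJS driver).

```python
# Sketch: compare a static download with a dynamic (WebKit) download of
# the same page. Assumes requests, selenium and a PhantomJS binary.
import re
import requests
from selenium import webdriver

url = "http://www.sina.com.cn/"
link_re = re.compile(r'href="([^"]*)"')

# Static download: the raw HTML, no JavaScript executed.
static_html = requests.get(url).text

# Dynamic download: the DOM after WebKit has run the page's JavaScript.
driver = webdriver.PhantomJS()
driver.get(url)
dynamic_html = driver.page_source
driver.quit()

print("static :", len(static_html), "bytes,", len(link_re.findall(static_html)), "links")
print("dynamic:", len(dynamic_html), "bytes,", len(link_re.findall(dynamic_html)), "links")
```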
The code above accesses the Sina homepage and compares the page length and the number of links captured within the page for the static download versus the dynamic download.
In the static crawl, the page length is 563838 and only 166 links are captured within the page; in the dynamic crawl, the page length grows to 695991 and the number of links reaches 1422, a nearly tenfold increase.
Link-grabbing expressions
Regular expression: re.compile(r'href="([^"]*)"')
XPath: xpath('//*[@href]')
Page parsing
Page parsing is the module that extracts links within a page and captures specific data. Page parsing is mainly string processing, and HTML is a special kind of string; Python offers modules such as re, BeautifulSoup, HTMLParser, and lxml to handle it. For links, we mainly grab the href attribute of <a> tags, as well as the src attribute of some other tags.
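As an illustration, the sketch below applies both expressions above to a downloaded page; the function name is my own and lxml is assumed to be installed.

```python
# Sketch: extract links from downloaded HTML with the two expressions
# above; `html` is a page already fetched by the download module.
import re
from lxml import html as lxml_html

def extract_links(html):
    # Regular-expression route: href attributes only.
    by_regex = re.compile(r'href="([^"]*)"').findall(html)

    # XPath route via lxml: every href attribute, plus src attributes of
    # tags such as img, script and iframe.
    tree = lxml_html.fromstring(html)
    by_xpath = tree.xpath("//*[@href]/@href") + tree.xpath("//*[@src]/@src")

    return set(by_regex) | set(by_xpath)
```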
0x03 URL Deduplication
URL deduplication is a key step in a crawler's operation. Since a running crawler is mostly blocked on network interaction, avoiding repeated network interaction is crucial. A crawler generally puts the URLs to be crawled into a queue and extracts new URLs from the pages it crawls; before placing them in the queue, it must first make sure these new URLs have not been crawled already. If a URL has been crawled before, it is not enqueued again.
Hash table
Using a hash table for deduplication is generally the first approach that comes to mind. The query time of a hash table is O(1), and as long as the table is large enough the probability of a hash collision becomes very small, so the accuracy of deciding whether a URL is a repeat is very high. A hash table is a relatively simple solution, but an ordinary hash table also has an obvious flaw: when memory is limited, a huge hash table is inappropriate. In Python, the dict (or set) data structure can serve this purpose.
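A minimal sketch of this, using Python's set (itself a hash table):

```python
# Minimal sketch: set membership tests are O(1) on average.
seen_urls = set()

def should_crawl(url):
    if url in seen_urls:
        return False        # already crawled, skip it
    seen_urls.add(url)
    return True             # new URL, enqueue it
```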
URL compression
If each node of the hash table stores the full URL as a str, memory consumption is very high; if the URL is compressed into an int, the memory footprint shrinks by more than a factor of three. Python's hashlib module can therefore be used for URL compression. The idea: make the hash-table node a set and store the compressed URLs in that set.
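A sketch of the idea; truncating md5 to a 64-bit integer is my assumption, traded against a tiny extra collision risk.

```python
# Sketch: store a fixed-size integer fingerprint of each URL instead of
# the full string.
import hashlib

seen = set()

def mark_seen(url):
    # 64-bit fingerprint derived from md5 (truncation is an assumption).
    fingerprint = int(hashlib.md5(url.encode("utf-8")).hexdigest()[:16], 16)
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True
```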
Bloom Filter
A Bloom filter trades a very small error rate for a significant saving in storage space. It applies a set of k hash functions to the n input keys and maps those n keys onto a data container of m bits.
Clearly, the advantage of the Bloom filter is that, with a container of controllable length, an element is judged to exist when every hash function maps it to a bit that is already set to 1. Python's hashlib provides a variety of hash functions: md5, sha1, sha224, sha256, sha384 and sha512, and adding a salt in the code is also very convenient. A Bloom filter can still produce false positives; see the reference articles at the end.
In Python you can use the BloomFilter interface provided by jaybaird (the pybloom package), or build your own wheel.
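As a "build your own wheel" sketch, the toy Bloom filter below sits on top of hashlib; the bit-array size m and the choice of hash functions are arbitrary assumptions.

```python
# Toy Bloom filter built on hashlib; m and the hash list are arbitrary.
import hashlib

class SimpleBloomFilter(object):
    def __init__(self, m=1 << 24, hash_names=("md5", "sha1", "sha256")):
        self.m = m
        self.hash_names = hash_names
        self.bits = bytearray(m // 8)

    def _positions(self, key):
        # One bit position per hash function.
        for name in self.hash_names:
            digest = hashlib.new(name, key.encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # The element "exists" only if every hash position is already 1;
        # false positives are possible, false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

Usage is simply bf.add(url) to mark a URL and url in bf to test it.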
Small details
Here is a small detail: when building the hash table, it is important to choose the container carefully. A hash table that takes up too much space is a very uncomfortable problem, so for crawler deduplication the following approach can solve part of it.
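The original measurement code was shown as a screenshot; the sketch below follows the same lines (the exact way the containers were built is my assumption).

```python
# Sketch: time how long it takes to build a 100-million-element container
# as a list versus as a string.
import time

start = time.time()
as_list = ['0' for _ in range(100000000)]   # list container, 100 million items
print('list:   %.1fs' % (time.time() - start))

start = time.time()
as_string = '0' * 100000000                 # string container, 100 million chars
print('string: %.1fs' % (time.time() - start))
```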
The code above simply measures how long building each container takes.
As you can see, when building a container of length 100 million, the list version of the program takes 7.2 s to run, while the string version takes only 0.2 s.
Next, look at the memory usage.
Building a list of 100 million elements takes up 794660 KB of memory.
Building a string of 100 million characters takes up 109720 KB, saving roughly 700000 KB of space.
0x04 URL Similarity
A basic algorithm
For URL similarity, I have only practiced a very simple method.
While ensuring that no URL is crawled twice, similar URLs also need to be judged. I used the ideas provided by sponge and ly5066113; the details are in the reference articles.
Below is a group of URLs that can be judged as similar:
http://auto.sohu.com/7/0903/70/column213117075.shtml
http://auto.sohu.com/7/0903/95/column212969565.shtml
http://auto.sohu.com/7/0903/96/column212969687.shtml
http://auto.sohu.com/7/1103/61/column216206148.shtml
http://auto.sohu.com/s2007/0155/s254359851/index1.shtml
http://auto.sohu.com/s2007/5730/s249066842/index2.shtml
http://auto.sohu.com/s2007/5730/s249067138/index3.shtml
http://auto.sohu.com/s2007/5730/s249067983/index4.shtml
As expected, these URLs should be merged into:
http://auto.sohu.com/7/0903/70/column213117075.shtml
http://auto.sohu.com/s2007/0155/s254359851/index1.shtml
The idea is as follows; the following features need to be extracted:
1. Host string
2. Directory depth (split on '/')
3. Features of the last path segment (the file name)
Specific algorithm
The algorithm itself is quite naive; one read-through should be enough to understand it.
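The original implementation was shown as a screenshot; the sketch below is one possible version built from the three features above (host, directory depth, and the pattern of the last path segment). It is my own reconstruction rather than the author's exact code.

```python
# Sketch of a similarity fingerprint: host, directory depth, and the last
# path segment with digit runs collapsed.
import re
from urllib.parse import urlparse

def url_fingerprint(url):
    parsed = urlparse(url.lower())
    parts = [p for p in parsed.path.split("/") if p]
    last = parts[-1] if parts else ""
    # column213117075.shtml and column212969565.shtml both become
    # 'column[d].shtml', so they share a fingerprint; index1.shtml becomes
    # 'index[d].shtml' and falls into a separate group.
    last_pattern = re.sub(r"\d+", "[d]", last)
    return hash((parsed.netloc, len(parts), last_pattern))
```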
Actual effect:
For the 8 different URLs shown above, only 2 fingerprint values are computed. In practice, in a hash table with tens of millions of entries, the level of collisions is acceptable.
0x05 Concurrent Operations
The main concurrency models in Python are the multithreaded model, the multi-process model, and the coroutine model. Elias wrote a dedicated article comparing the performance of several commonly used concurrency schemes. For a crawler itself, the limit on crawl speed mainly comes from the response speed of the target server, so choosing a module that is easy to control is the right choice.
Multithreaded model
The multithreaded model is the easiest to get started with. Python's threading module implements the concurrency requirement well, combined with the Queue module for sharing data.
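A minimal sketch of that combination; the seed URLs and the bare requests.get call are placeholders for the real download and parsing logic.

```python
# Sketch: a fixed pool of worker threads pulling URLs from a shared Queue.
import threading
from queue import Queue

import requests

url_queue = Queue()

def worker():
    while True:
        url = url_queue.get()
        try:
            requests.get(url, timeout=10)   # placeholder for download + parse
        except requests.RequestException:
            pass
        finally:
            url_queue.task_done()

for _ in range(10):
    t = threading.Thread(target=worker)
    t.daemon = True                         # let the program exit once the queue drains
    t.start()

for url in ["http://www.sina.com.cn/", "http://auto.sohu.com/"]:
    url_queue.put(url)
url_queue.join()
```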
Multi-process model
The multi-process model is similar to the multithreaded one; the multiprocessing module provides a similar Queue for data sharing. On Linux, user-space processes can take advantage of multiple cores, so this model can solve the crawler's concurrency problem on multi-core machines.
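The same producer/consumer pattern with processes instead of threads might look like this sketch (again, the URLs and the bare requests.get call are placeholders).

```python
# Sketch: worker processes consuming a shared JoinableQueue, so CPU-heavy
# parsing can use multiple cores.
from multiprocessing import Process, JoinableQueue

import requests

def worker(queue):
    while True:
        url = queue.get()
        try:
            requests.get(url, timeout=10)   # placeholder for download + parse
        except requests.RequestException:
            pass
        finally:
            queue.task_done()

if __name__ == "__main__":
    queue = JoinableQueue()
    for _ in range(4):
        p = Process(target=worker, args=(queue,))
        p.daemon = True
        p.start()
    for url in ["http://www.sina.com.cn/", "http://auto.sohu.com/"]:
        queue.put(url)
    queue.join()
```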
Coroutine model
In Elias's article, the performance of a greenlet-based program is second only to Stackless Python, roughly a factor of two slower than Stackless Python and nearly an order of magnitude faster than the other schemes. Concurrent programs based on gevent (which wraps greenlet) therefore have a good performance advantage.
A word about gevent (non-blocking asynchronous I/O): "gevent is a coroutine-based Python networking library that uses the high-level synchronous API provided by greenlet, on top of the libevent event loop."
Judging by actual programming results, the coroutine model works very well: the behaviour of the running program is clearly more controllable, and gevent's wrapping is very easy to use.
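A minimal gevent sketch (the URLs are placeholders):

```python
# Sketch: gevent-based concurrency. monkey.patch_all() makes the standard
# socket library cooperative, so many downloads can be in flight at once.
from gevent import monkey
monkey.patch_all()

import gevent
import requests

def fetch(url):
    return requests.get(url, timeout=10).text

urls = ["http://www.sina.com.cn/", "http://auto.sohu.com/"]
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs, timeout=30)
pages = [job.value for job in jobs]
```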
0x06 Data Storage
Data storage itself involves a great deal of technology; as a beginner I only have a few small experiences to share.
Premise: a relational database is used. MySQL was chosen for testing; other similar databases such as SQLite or SQL Server follow the same approach.
When storing data, the goal is to reduce interaction with the database, which improves performance. The typical logic is an endless loop of "read one URL node, perform one store". In practice that performs very poorly and storage is very slow.
A better approach is to cut down the number of database interactions: instead of handing over 1 node per interaction, hand over 10 nodes, or even 100 nodes, at a time. Efficiency then improves by a factor of 10 to 100, and the effect in practical use is very good. :D
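The sketch below illustrates the batching idea with MySQLdb and executemany; the connection parameters, table, and column names are placeholders, not the article's actual schema.

```python
# Sketch: buffer parsed records and flush them to MySQL in groups rather
# than issuing one INSERT per URL.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="spider",
                       passwd="secret", db="crawl")
cursor = conn.cursor()

BATCH_SIZE = 100
pending = []

def store(url, title):
    pending.append((url, title))
    if len(pending) >= BATCH_SIZE:
        flush()

def flush():
    # One executemany() call replaces BATCH_SIZE separate INSERT round-trips.
    cursor.executemany("INSERT INTO pages (url, title) VALUES (%s, %s)", pending)
    conn.commit()
    del pending[:]
```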
0x07 Dynamic crawler source sharing
Crawler model
In the current crawler model, the scheduling module is the core. The scheduling module shares three queues with the download module, the extraction module, and the storage module respectively, and the download module shares one queue with the extraction module. Data is passed along in the directions shown in the diagram.
Crawler source
The following features are implemented:
Dynamic downloading
gevent-based processing
Bloom filter filtering
URL similarity filtering
Keyword filtering
Crawl-depth control
GitHub address: https://github.com/manning23/MSpider
The code as a whole is not very difficult; please go easy on me.
0x08 Reference Articles
Thanks to the following articles and discussions for their sharing:
http://security.tencent.com/index.php/blog/msg/34
http://www.pnigos.com/?p=217
http://security.tencent.com/index.php/blog/msg/12
http://wenku.baidu.com/view/7fa3ad6e58fafab069dc02b8.html
http://wenku.baidu.com/view/67fa6feaaeaad1f346933f28.html
http://www.html5rocks.com/zh/tutorials/internals/howbrowserswork/
http://www.elias.cn/Python/PyConcurrency?from=Develop.PyConcurrency
http://blog.csdn.net/HanTangSongMing/article/details/24454453
http://blog.csdn.net/historyasamirror/article/details/6746217
http://www.spongeliu.com/399.html
http://xlambda.com/gevent-tutorial/
http://simple-is-better.com/news/334
http://blog.csdn.net/jiaomeng/article/details/1495500
http://bbs.chinaunix.net/forum.php?mod=viewthread&tid=1337181
http://www.tuicool.com/articles/nieEVv
http://www.zhihu.com/question/21652316
http://code.rootk.com/entry/crawler