Crawler Technology Practice

In the previous article, Crawler Technology Analysis (http://www.bkjia.com/Article/201411/353078.html), I introduced the basic techniques behind crawlers and shared a dynamic crawler demo. This article focuses on putting crawler technology into practice and covers the following topics:

1. Crawler URL focus and filtering

2. URL similarity algorithm

3. Crawling policies in detail

4. Mspider usage and results

5. Mspider brainstorming

0x01 Crawler URL focus and filtering


Why does a crawler need URL focus and filtering? Because we need to control what the crawler actually produces!

For example, if the goal is to crawl the known-vulnerability list of the red/black alliance, a general-purpose crawler will not do. A general crawler typically crawls as follows, and the crawled URLs are a mess.



The URLs crawled by a focused crawler, by contrast, match the expectation, as shown in the figure below.

As you can see, both the focused and the unfocused crawler start from www.bkjia.com. The focused crawler can steer URL crawling according to configured policies as far as possible, while the unfocused crawler cannot meet this specific requirement.

In the crawler's filtering module, we need to understand what filtering is and what focusing is. Simply put: if a filter keyword appears in the URL, False is returned, otherwise True; if a focus keyword appears in the URL, True is returned, otherwise False. For details, see the urlfilter.py file in Mspider.
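
The logic is roughly the following; this is a minimal sketch of the idea, not Mspider's actual code, and the function and parameter names are made up for illustration:

    # Minimal sketch of the focus/filter check described above;
    # names are illustrative, not Mspider's actual API.
    def url_passes(url, focus_keywords=None, filter_keywords=None):
        # Drop the URL as soon as any filter keyword appears in it.
        if filter_keywords and any(k in url for k in filter_keywords):
            return False
        # Keep it only if it contains at least one focus keyword (when set).
        if focus_keywords:
            return any(k in url for k in focus_keywords)
        return True

    # Keep pages about "whitehats", drop anything under "bbs":
    url_passes("http://www.bkjia.com/whitehats/1.html", ["whitehats"], ["bbs"])  # True
    url_passes("http://www.bkjia.com/bbs/thread-1.html", ["whitehats"], ["bbs"])  # False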

0x02 URL similarity algorithm


The importance of the URL similarity algorithm to a crawler is self-evident: it directly determines the crawler's efficiency. I will explain my own algorithm here as a reference.



This algorithm mainly relies on splitting the URL apart and hashing the pieces, and it works well for judging URL similarity. It splits a URL into three dimensions: the first is the netloc, the second is the length of the path, and the third is the sorted list of the query string's parameter names. A data structure combining these three dimensions is then hashed.
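
As a concrete sketch of these three dimensions (my own reconstruction of the idea, not Mspider's exact code; I assume "length of path" means the number of path segments):

    # Reconstruction of the three-dimensional URL signature described above.
    # Assumption: "length of path" means the number of path segments.
    from urllib.parse import urlparse, parse_qs

    def url_signature(url):
        parts = urlparse(url)
        path_len = len([seg for seg in parts.path.split("/") if seg])
        param_names = tuple(sorted(parse_qs(parts.query).keys()))
        # Combine netloc, path length, and sorted parameter names, then hash.
        return hash((parts.netloc, path_len, param_names))

    # These two URLs differ only in parameter values, so their signatures match:
    a = url_signature("http://www.bkjia.com/vuln/list.php?id=1&page=2")
    b = url_signature("http://www.bkjia.com/vuln/list.php?page=9&id=42")
    assert a == b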



Deduplicating these hash signatures with a set data structure greatly reduces the collision problem of common URL similarity algorithms. The actual result is as follows.


By the time 875 links had been crawled, the number of similar pages detected and skipped had reached 269,849, so the crawler's efficiency is greatly improved.

This algorithm comes only from practical experience, but it has achieved good results in practice. I hope it prompts you to think about even better similarity algorithms.

0x03 Crawling policies in detail


Common crawling policies are breadth-first and depth-first. Depth-first search is implemented with a stack, while breadth-first search is implemented with a queue.

The numbers in the figure below show the order in which depth-first search visits the vertices.


The numbers in the figure below show the order in which breadth-first search visits the vertices.
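
In case the figures do not render, the two visit orders can also be reproduced in a few lines; this is a generic illustration of stack-driven versus queue-driven traversal, not Mspider's code:

    # Generic illustration: the same graph walked with a stack (depth-first)
    # and with a queue (breadth-first). Not Mspider's actual implementation.
    from collections import deque

    graph = {1: [2, 3], 2: [4, 5], 3: [6, 7], 4: [], 5: [], 6: [], 7: []}

    def traverse(start, depth_first):
        frontier, seen, order = deque([start]), {start}, []
        while frontier:
            # Stack behaviour pops the newest node; queue behaviour the oldest.
            node = frontier.pop() if depth_first else frontier.popleft()
            order.append(node)
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return order

    traverse(1, depth_first=True)   # [1, 3, 7, 6, 2, 5, 4]
    traverse(1, depth_first=False)  # [1, 2, 3, 4, 5, 6, 7]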


Mspider implements its crawling policy on a single URL queue: all three search methods (depth-first, breadth-first, and random) are obtained by sorting the queue and adjusting the depth parameter of each URL node in it. The crawling policy can then be set per requirement to quickly discover suspicious URL links.
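
A sketch of that single-queue idea (illustrative only, not Mspider's actual code): one priority queue keyed on each node's depth can express all three policies:

    # Illustrative only: one URL queue, three policies, selected by how the
    # queue is ordered on each node's depth. Not Mspider's actual code.
    import heapq, random

    def priority(policy, depth):
        if policy == "depth_first":
            return -depth            # deepest node pops first
        if policy == "breadth_first":
            return depth             # shallowest node pops first
        return random.random()       # random priority policy

    def crawl_order(links, start, policy):
        heap, seen, order = [(priority(policy, 0), 0, start)], {start}, []
        while heap:
            _, depth, url = heapq.heappop(heap)
            order.append(url)
            for nxt in links.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (priority(policy, depth + 1), depth + 1, nxt))
        return order

    site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/b/1"]}
    crawl_order(site, "/", "breadth_first")  # ['/', '/a', '/b', '/a/1', '/b/1']
    crawl_order(site, "/", "depth_first")    # ['/', '/a', '/a/1', '/b', '/b/1']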


When 810 links had been crawled, the crawl had actually reached a depth of 15 levels. In actual testing, deep crawling is better at discovering suspicious links.

0x04 Mspider usage and results


Mspider is a self-developed crawler tool that runs in CLI mode and implements the functions shown below. Stars and forks are welcome, and feel free to email me bug reports at any time.



CLI:



The crawler can be configured in more detail through the config.py file, such as the crawl interval, the list of tags to ignore, the dynamic/static allocation ratio, and the user-agent dictionary.
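
For illustration only, such a config.py might look like the following; the option names are hypothetical and may not match Mspider's real configuration keys:

    # Hypothetical config.py sketch; option names are illustrative and
    # may not match Mspider's real configuration keys.
    CRAWL_INTERVAL = 0.5                          # seconds between two requests
    IGNORE_TAGS = ["script", "style", "iframe"]   # tags skipped during parsing
    DYNAMIC_RATIO = 0.3                           # share of pages fetched dynamically
    USER_AGENTS = [                               # rotated user-agent dictionary
        "Mozilla/5.0 (Windows NT 6.1; WOW64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10)",
    ]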



Here are several test scenarios:

Crawl the white-hat list of the red/black alliance

Command:
python run.py -u "http://www.bkjia.com" -k "2cto.com" --focus "whitehats" --similarity 1 --storage 3

Effect:

Crawl an edu site depth-first with URL similarity filtering, limiting the crawl to 500 links

Command:

python run.py -u "http://www.njtu.edu.cn/" -k "edu.cn" --policy 1 --storage 0 --count 500

Effect:


Crawl links under the Lenovo domain, filter out URLs containing the bbs keyword, crawl pages dynamically, and use the random priority policy

Command:

python run.py -u "lenovo" -k "lenovo" --ignore "bbs" --model 1 --policy 2

Effect:

0x05 Mspider brainstorming

By rewriting some of Mspider's modules, such as the page analysis module, the URL filtering module, and the data storage module, you can obtain deeper and more interesting information.

For example, suppose you want to collect the SQL injection payloads of the SQL injection vulnerabilities reported on the red/black alliance. With the focused crawler, rewrite the data storage module so that every fetched DOM tree is recorded in the database, then mine the stored DOM trees; that is all it takes.
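
A minimal sketch of such a rewritten storage module, assuming a SQLite backend and regex-based mining (the payload markers and names below are illustrative, not Mspider's actual code):

    # Hedged sketch: store each crawled page's DOM, then mine it offline.
    # The SQLite backend and payload markers are assumptions for illustration.
    import re, sqlite3

    conn = sqlite3.connect("pages.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, dom TEXT)")

    def store_page(url, dom_html):
        # Called by the crawler in place of the default storage routine.
        conn.execute("INSERT INTO pages VALUES (?, ?)", (url, dom_html))
        conn.commit()

    def mine_sqli_payloads():
        # Naive pass over the stored DOM trees for common injection markers.
        marker = re.compile(r"union\s+select|' or '1'='1|sleep\(\d+\)", re.I)
        for url, dom in conn.execute("SELECT url, dom FROM pages"):
            for m in marker.finditer(dom):
                print(url, "->", m.group(0))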
