Depth first or breadth first

Source: Internet
Author: User

From: http://www.tgxzs.com/article/news-15-101-%CD%F8%C2%E7%D6%A9%D6%EB%3A%C9%EE%B6%C8%D3%C5%CF%C8%BB%B9%CA%C7%B9%E3%B6%C8%D3%C5%CF%C8.html

"Web spider" is another name for "web crawler". I won't give a general overview of web spiders here. Today I mainly want to talk about how a spider's crawling strategy is designed.

There are two types: one is the depth-first strategy and the other is the breadth-first strategy.

So what is depth first? What is breadth first? What is each one for? The SEO from Shanghai (swj) explains below.

I am a beginner, so I will only use plain words and basic principles to walk through them with you. If you find any errors, please contact me, and please forgive me for them.

Let's analyze the two strategies in turn.

As the name suggests, "depth first" means that a web crawler tries to dig as deep as possible into a site when fetching pages. What it cares about is depth.

In other words, the spider starts from the start page and follows one chain of links to its end, then moves on to the next starting point and continues following links.

As shown in the following figure (a simple model of linked web pages; A is the starting point, that is, where the spider begins indexing):

The crawl is divided into five paths for the crawler to follow. The emphasis is on depth.

(The following figure shows the optimized web page link model, that is, the improved depth-first crawling strategy.)

Based on the two figures above, we can draw the following conclusions:

Figure 1:

Path 1 ==> A --> B --> E --> H

Path 2 ==> A --> B --> E --> I

Path 3 ==> A --> C

Path 4 ==> A --> D --> F --> K --> L

Path 5 ==> A --> D --> G --> K --> L
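To make the five paths concrete, here is a minimal Python sketch of a depth-first walk. The `LINKS` dictionary is my own reconstruction of the link model from the paths listed for Figure 1 (the figure itself is not available), and the function name `depth_first_paths` is hypothetical. It follows each link as deep as it can and records a path every time it reaches a page with no outgoing links.

```python
# Link model assumed from the Figure 1 paths; not taken from the figure itself.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["E"],
    "C": [],
    "D": ["F", "G"],
    "E": ["H", "I"],
    "F": ["K"],
    "G": ["K"],
    "H": [],
    "I": [],
    "K": ["L"],
    "L": [],
}


def depth_first_paths(links, start):
    """Return every path from start to a dead-end page, deepest links first."""
    paths = []

    def walk(page, trail):
        trail = trail + [page]  # copy, so sibling branches do not share state
        children = links.get(page, [])
        if not children:
            paths.append(trail)  # dead end: one complete crawl path
            return
        for child in children:
            walk(child, trail)

    walk(start, [])
    return paths


for number, path in enumerate(depth_first_paths(LINKS, "A"), start=1):
    print(f"Path {number} ==> " + " --> ".join(path))
```

Note that this naive version fetches shared pages such as K and L once per path, which is the repetition the optimized model removes.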

After optimization

Figure 2 (the paths are already marked in the picture):

Path 1 ==> A --> B --> E --> H

Path 2 ==> I

Path 3 ==> C

Path 4 ==> D --> F --> K --> L

Path 5 ==> G
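One way to read the optimized paths is as a depth-first walk that remembers which pages it has already fetched, so each path only lists the pages that are new. The sketch below is my interpretation, not something the figure confirms; the `LINKS` dictionary is reconstructed from the listed paths and the function name `optimized_depth_first` is hypothetical.

```python
# Link model assumed from the listed paths; not taken from the figure itself.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["E"],
    "C": [],
    "D": ["F", "G"],
    "E": ["H", "I"],
    "F": ["K"],
    "G": ["K"],
    "H": [],
    "I": [],
    "K": ["L"],
    "L": [],
}


def optimized_depth_first(links, start):
    """Depth-first crawl that fetches each page once.

    Returns the runs of newly fetched pages between two backtracking steps.
    """
    visited = set()
    segments = []
    current = []

    def walk(page):
        nonlocal current
        if page in visited:
            return  # already fetched on an earlier path, skip it
        visited.add(page)
        current.append(page)
        for child in links.get(page, []):
            walk(child)
        if current:
            # We are about to backtrack: close the run of new pages.
            segments.append(current)
            current = []

    walk(start)
    return segments


for number, segment in enumerate(optimized_depth_first(LINKS, "A"), start=1):
    print(f"Path {number} ==> " + " --> ".join(segment))
```

With the assumed link model, every page is fetched exactly once, and K and L are no longer crawled a second time through G.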

The advantage of depth-first crawling:

The spider program is relatively easy to design. I haven't found any other advantages... though the spider's persistence is worth learning from! ^_^

The disadvantages of depth-first crawling:

There are rather more disadvantages. Every time the spider goes down a layer, it has to query the database back at "spider headquarters" and ask the boss whether it should go down to the next layer, one query per step. If a spider just keeps crawling down regardless, it can easily lose its way and end up on a foreign website when the target is Chinese websites. And if a foreign site is mistaken for a Chinese one because of an IP address problem, it is easy to wander into someone else's "hometown". This not only increases the complexity of the system's data but also increases the load on the servers. I don't think any search company would be willing to put up with that, unless something is wrong with its brain. ^_^
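The per-step question "should I go down another layer?" can be pictured as a small check that runs before each link is followed. The sketch below is only an illustration under my own assumptions: the function name `should_follow`, the depth limit, and the host allow-list are all hypothetical, and the article does not say how a real search engine makes this decision.

```python
from urllib.parse import urlparse


def should_follow(url, depth, allowed_hosts, max_depth):
    """Hypothetical guard: follow a link only if it is shallow enough and on-target."""
    if depth > max_depth:
        return False  # too deep: stop before the spider loses its way
    host = urlparse(url).hostname or ""
    # Accept the allowed host itself or any of its subdomains.
    return any(
        host == allowed or host.endswith("." + allowed)
        for allowed in allowed_hosts
    )


# Example with made-up URLs: only example.cn is a target site.
targets = {"example.cn"}
print(should_follow("http://www.example.cn/news/1.html", 2, targets, 3))
print(should_follow("http://example.com/index.html", 1, targets, 3))
```

A check like this keeps the decision local to the spider; whether a real crawler consults a central database at each step, as described above, is a separate design choice.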

Next, let's look at the more commonly used breadth-first strategy. If you are tired of reading, have a cup of coffee and take a break first. ^_^

Breadth first can be described as layer-by-layer crawling.

What is layer-by-layer crawling?

It means indexing and fetching web pages one layer at a time, following how the pages are distributed across layers. Of course, the search engine (SE) will not send out just one spider in total; it sends one or more spiders to each layer to fetch the content.

(The following figure shows the breadth-first strategy, that is, layer-by-layer crawling.)

Smart readers will understand it at a glance and won't need to read the rest of the article. ^_^

Based on the figure above, we can draw the following conclusions:

Path 1 ==> A

Path 2 ==> B --> C --> D

Path 3 ==> E --> F --> G

Path 4 ==> H --> I --> K

Path 5 ==> L
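The layer-by-layer order can be reproduced with a short breadth-first walk. As before, this is a sketch under assumptions: the `LINKS` dictionary is reconstructed from the paths in this article rather than from the figure, and the function name `breadth_first_layers` is hypothetical. Each "path" in the list above corresponds to one layer, that is, all pages at the same distance from the start page.

```python
# Link model assumed from the listed paths; not taken from the figure itself.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["E"],
    "C": [],
    "D": ["F", "G"],
    "E": ["H", "I"],
    "F": ["K"],
    "G": ["K"],
    "H": [],
    "I": [],
    "K": ["L"],
    "L": [],
}


def breadth_first_layers(links, start):
    """Return the pages grouped by layer (distance in clicks from start)."""
    visited = {start}
    layers = []
    frontier = [start]
    while frontier:
        layers.append(frontier)
        next_frontier = []
        for page in frontier:
            for child in links.get(page, []):
                if child not in visited:  # fetch each page only once
                    visited.add(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return layers


for number, layer in enumerate(breadth_first_layers(LINKS, "A"), start=1):
    print(f"Path {number} ==> " + " --> ".join(layer))
```

Because every layer is a complete, independent batch of pages, it is easy to cap the crawl at a given layer or to hand the pages of one layer to several spiders at once.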

Advantages of breadth-first crawling:

Both the breadth and the depth of the crawl are easier to control. The load on the server is also significantly reduced. Crawlers can work in a distributed way, which significantly improves speed. Other ideas can also be brought in.

The disadvantage of breadth-first crawling:

I haven't observed any shortcomings yet. Doesn't it remind you of Div + CSS style sheets (layer layout)?
