Depth first or breadth first

Last Update:2018-12-05 Source: Internet

Author: User

From: http://www.tgxzs.com/article/news-15-101-%CD%F8%C2%E7%D6%A9%D6%EB%3A%C9%EE%B6%C8%D3%C5%CF%C8%BB%B9%CA%C7%B9%E3%B6%C8%D3%C5%CF%C8.html

A "web spider" is also called a "web crawler". I will not give an overview of web spiders here. Today I mainly want to talk about the crawling strategies a spider can be designed around.

There are two types: one is the depth-first strategy and the other is the breadth-first strategy.

So what is depth first? What is breadth first? What is each one for? An SEO from Shanghai (swj) explains below!

I am a beginner, so I will only use plain words and basic principles to analyze them with you. If you find any errors, please contact me in time, and please forgive me!

We will analyze the two strategies one by one below.

As the name suggests, "depth first" means that a web crawler tries to dig as deep into a site as it can when capturing webpages. What it cares about is depth!

In other words, the web spider starts from the start page and follows one chain of links until that chain is finished, and only then moves on to the next starting page and continues following links.

As shown in the following figure (a simple model of linked webpages; A is the starting point, that is, where the spider begins indexing):

The crawler has five paths to crawl. It's all about depth!

(The following figure shows the optimized webpage link model, that is, the spider's improved depth-first crawling strategy.)

Based on the two figures above, we can draw the following conclusions:

Figure 1:

Path 1 ==> A --> B --> E --> H

Path 2 ==> A --> B --> E --> I

Path 3 ==> A --> C

Path 4 ==> A --> D --> F --> K --> L

Path 5 ==> A --> D --> G --> K --> L
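The five paths above can be reproduced with a small sketch. The LINKS dictionary is my own reconstruction of the Figure 1 link model from the listed paths (page names written in upper case), and depth_first_paths is a hypothetical helper, not code from any real spider. It also assumes the pages contain no link cycles.

```python
# Link model reconstructed from the five paths of Figure 1 (an assumption).
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["E"],
    "C": [],
    "D": ["F", "G"],
    "E": ["H", "I"],
    "F": ["K"],
    "G": ["K"],
    "H": [],
    "I": [],
    "K": ["L"],
    "L": [],
}


def depth_first_paths(graph, start):
    """Return every path from start down to a page with no outgoing links."""
    paths = []

    def follow(page, trail):
        trail = trail + [page]  # copy, so sibling branches do not share state
        children = graph.get(page, [])
        if not children:
            paths.append(trail)  # dead end: one complete path
            return
        for child in children:
            follow(child, trail)  # go deeper before trying the next sibling

    follow(start, [])
    return paths


for number, path in enumerate(depth_first_paths(LINKS, "A"), start=1):
    print(f"Path {number} ==> " + " --> ".join(path))
```

Note that this naive version walks K and L twice (once through F and once through G), which is exactly the repeated work the optimized model avoids.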

After optimization

Figure 2 (the paths have already been marked in the picture for everyone!):

Path 1 ==> A --> B --> E --> H

Path 2 ==> I

Path 3 ==> C

Path 4 ==> D --> F --> K --> L

Path 5 ==> G
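One way to arrive at the shorter paths of Figure 2 is to remember which pages have already been captured and never walk them again. The sketch below assumes that is what the optimization amounts to; the LINKS dictionary is reconstructed from the Figure 1 paths (page names in upper case) and optimized_depth_first is a hypothetical name.

```python
# Link model reconstructed from the paths of Figure 1 (an assumption).
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["E"],
    "C": [],
    "D": ["F", "G"],
    "E": ["H", "I"],
    "F": ["K"],
    "G": ["K"],
    "H": [],
    "I": [],
    "K": ["L"],
    "L": [],
}


def optimized_depth_first(graph, start):
    """Depth-first crawl that records only the newly captured pages per path."""
    visited = set()
    segments = []  # one list of new pages for each path
    current = []   # new pages seen on the path being walked right now

    def follow(page):
        if page in visited:
            return  # already captured on an earlier path, skip it
        visited.add(page)
        current.append(page)
        for child in graph.get(page, []):
            follow(child)
        if current:
            # Nothing new was found below this page, so the path ends here.
            segments.append(current[:])
            current.clear()

    follow(start)
    return segments


for number, segment in enumerate(optimized_depth_first(LINKS, "A"), start=1):
    print(f"Path {number} ==> " + " --> ".join(segment))
```

Each page is now captured exactly once, and G's link to K is skipped because K was already reached through F.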

The advantage of depth-first crawling:

The web spider program is relatively easy to design. I have not found any other advantages... though the spider's persistence is worth learning from! ^_^

The disadvantages of depth-first crawling:

There are a few more of these! Every time the spider crawls down a layer, it has to query the database back at its "home" and ask the boss whether it should climb down to the next layer. That is one query per step. And if a spider just keeps crawling down regardless, it is likely to lose its way and end up on a foreign website when the target is Chinese websites. If a foreign site is mistaken for a Chinese one because of an IP address problem, the spider easily wanders off to someone else's "hometown". This not only increases the complexity of the system's data but also increases the burden on the servers. I don't think any search company would be willing to see that, unless its brain has short-circuited. ^_^

Next we will introduce the commonly used breadth-first strategy. If you are tired of reading, have a cup of coffee and take a rest first. ^_^
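To make the "losing its way" problem concrete, here is a rough sketch of a guard a depth-first spider could check before following a link. This is a common mitigation, not something the original article describes: should_follow, seed_host, and max_depth are all hypothetical names, and a real crawler would need more rules than these two.

```python
from urllib.parse import urlparse


def should_follow(url, seed_host, depth, max_depth=3):
    """Return True if a link is shallow enough and stays on the seed site."""
    if depth > max_depth:
        return False  # stop digging instead of asking "home" at every layer
    host = urlparse(url).hostname or ""
    # Accept the seed host itself and any of its subdomains.
    return host == seed_host or host.endswith("." + seed_host)


print(should_follow("http://www.example.cn/news/1.html", "example.cn", depth=2))
print(should_follow("http://other.example.com/", "example.cn", depth=2))
print(should_follow("http://www.example.cn/a/b/c/d/e.html", "example.cn", depth=4))
```

The first link is accepted; the second is rejected because it leaves the seed site, and the third because it is deeper than the limit.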

Breadth first can be defined as layer-by-layer crawling.

What is layer-by-layer crawling?

The spider indexes and captures webpages one layer at a time, following how the pages are distributed across layers! Of course, a search engine (SE) is not limited to one spider per layer; it may send one or several spiders to capture the content.

(The following figure shows the breadth-first strategy, that is, layer-by-layer crawling.)

Smart readers will understand it at a glance and do not need to read what follows. ^_^

Based on the figure above, we can draw the following conclusions:

Path 1 ==> A

Path 2 ==> B --> C --> D

Path 3 ==> E --> F --> G

Path 4 ==> H --> I --> K

Path 5 ==> L
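The layers above can be reproduced with a short breadth-first sketch. As before, LINKS is my reconstruction of the link model from the depth-first paths (page names in upper case), with A as the starting page, and breadth_first_layers is a hypothetical helper.

```python
# Link model reconstructed from the depth-first paths (an assumption).
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["E"],
    "C": [],
    "D": ["F", "G"],
    "E": ["H", "I"],
    "F": ["K"],
    "G": ["K"],
    "H": [],
    "I": [],
    "K": ["L"],
    "L": [],
}


def breadth_first_layers(graph, start):
    """Group pages by their distance (in links) from the start page."""
    visited = {start}
    layer = [start]
    layers = []
    while layer:
        layers.append(layer)
        next_layer = []
        for page in layer:
            for child in graph.get(page, []):
                if child not in visited:  # capture each page only once
                    visited.add(child)
                    next_layer.append(child)
        layer = next_layer
    return layers


for number, layer in enumerate(breadth_first_layers(LINKS, "A"), start=1):
    print(f"Path {number} ==> " + " --> ".join(layer))
```

Every page in a layer is finished before the spider goes one level deeper, so each "path" here is really one layer of the site.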

Advantages of breadth-first crawling:

Both the breadth and the depth of data capture are easier to control! The load on the server is also significantly reduced! Crawlers can be distributed across a layer, which significantly improves speed! Other ideas can be borrowed too!

The disadvantage of breadth-first crawling:

I haven't observed any shortcomings yet. Doesn't it feel a bit like a Div + CSS (layer layout) style sheet? What do you think?
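Both points, controlling depth and spreading the work, can be sketched in a few lines. The example below is only an illustration under my own assumptions: fetch_links stands in for downloading a page and extracting its links, LINKS is the link model reconstructed from the paths earlier in the article, and crawl_by_layer, max_depth, and workers are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Link model reconstructed from the paths in the article (an assumption).
LINKS = {
    "A": ["B", "C", "D"], "B": ["E"], "C": [], "D": ["F", "G"],
    "E": ["H", "I"], "F": ["K"], "G": ["K"], "H": [], "I": [],
    "K": ["L"], "L": [],
}


def fetch_links(page):
    """Stand-in for downloading a page and extracting the links on it."""
    return LINKS.get(page, [])


def crawl_by_layer(start, max_depth, workers=4):
    """Crawl layer by layer, stopping after max_depth layers below start."""
    visited = {start}
    layer = [start]
    crawled = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for depth in range(max_depth + 1):
            if not layer:
                break
            crawled.extend(layer)
            if depth == max_depth:
                break  # depth limit reached, do not look any deeper
            # Pages in one layer do not depend on each other, so several
            # workers can fetch them at once; map() keeps the input order.
            next_layer = []
            for links in pool.map(fetch_links, layer):
                for child in links:
                    if child not in visited:
                        visited.add(child)
                        next_layer.append(child)
            layer = next_layer
    return crawled


print(crawl_by_layer("A", max_depth=2))
```

With max_depth=2 the spider stops after the layer E, F, G and never touches H, I, K, or L. Raising the number turns the same code into a full crawl.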

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
