Anyone doing proper SEO should understand how search engines crawl web pages. The principle is quite complex and could fill a lengthy essay. What follows is my own study record of how web crawling works. The notes are not complete; they only introduce the important steps.
First, to crawl a page the search engine has to dispatch a spider, that is, a robot. Every search engine has a large number of spiders to send out. At a time when information is updated so quickly, no spider sits idle. The search engine assigns a large number of URLs to each spider and makes sure the work is never duplicated: each URL has one and only one fixed spider responsible for crawling it.
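One way to picture the "one fixed spider per URL" rule is to hash each URL and take the result modulo the number of spiders. This is only a sketch under my own assumptions: real search engines do not publish how they divide the work, and the function name assign_spider and the spider count are made up for illustration.

```python
import hashlib


def assign_spider(url: str, spider_count: int) -> int:
    """Return the index of the single spider responsible for this URL.

    Hypothetical scheme: hash the URL and reduce it modulo the number of
    spiders, so the same URL always maps to the same spider.
    """
    if spider_count <= 0:
        raise ValueError("spider_count must be positive")
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % spider_count


# The same URL is always handed to the same spider, so no work is duplicated.
spider = assign_spider("http://www.ijseo.net/?p=592", 8)
print(spider == assign_spider("http://www.ijseo.net/?p=592", 8))
```

Because the mapping depends only on the URL, two spiders never end up with the same address, which is the property the paragraph above describes.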
When a spider receives its task and arrives at the target site, it can crawl with one of two strategies: depth first or breadth first.
Depth first means following one path all the way to the end: find a link and crawl into it. This is the order of the red numbers in the figure. The spider arrives at the computer page, sees that the first link is desktop computers, and enters the desktop page. The first link on that page is the Dell desktop, so it continues into that page. When the Dell desktop page has no more links, the spider returns to the desktop page and crawls the second link, the Lenovo desktop.
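The depth-first order described above can be sketched in a few lines of Python. The SITE dictionary is my own stand-in for the example site (computer home, three columns, and their product pages), and the page names are hypothetical slugs, not real URLs.

```python
# Hypothetical link structure of the example site: page -> links found on it.
SITE = {
    "computer": ["desktop", "laptop", "tablet"],
    "desktop": ["dell-desktop", "lenovo-desktop"],
    "laptop": ["hp-notebook", "asus-notebook"],
    "tablet": ["ipad"],
}


def crawl_depth_first(start, links):
    """Return pages in the order a depth-first spider would visit them."""
    visited = []
    seen = set()

    def visit(page):
        if page in seen:
            return
        seen.add(page)
        visited.append(page)
        # Follow each link to the end before trying the next one.
        for child in links.get(page, []):
            visit(child)

    visit(start)
    return visited


print(crawl_depth_first("computer", SITE))
```

Note that the Dell and Lenovo pages are visited before the laptop and tablet columns.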
The drawback of crawling this way is that pages are not indexed in order of importance. The desktop, laptop, and tablet column pages are more important and are updated more often than pages such as the Dell desktop. Therefore the column pages need to be crawled first.
So breadth first has become the main strategy search engines use to collect pages. After arriving at the computer home page and discovering three links, the spider puts the links into its task table and starts with the first one, indexing the content of the desktop page. It also keeps the links found on that page, such as the Dell desktop and Lenovo desktop addresses, and saves them to the work plan to be crawled later.
After finishing the desktop page, the spider moves on to the laptop page, indexes its content, and again stores the links it finds in the schedule. Finally it comes to the tablet page, indexes the content, and puts its links into the table as well.
After all the column pages have been indexed, the spider works through the addresses saved earlier in the plan, in turn: the Dell desktop, the Lenovo desktop, the HP notebook, the ASUS notebook, and the iPad. This is the order of the blue numbers in the figure.
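The breadth-first walk, with its task table of saved links, can be sketched with a queue. As before, the SITE dictionary and the page names are hypothetical stand-ins for the example site, and the function name is my own.

```python
from collections import deque

# Hypothetical link structure of the example site: page -> links found on it.
SITE = {
    "computer": ["desktop", "laptop", "tablet"],
    "desktop": ["dell-desktop", "lenovo-desktop"],
    "laptop": ["hp-notebook", "asus-notebook"],
    "tablet": ["ipad"],
}


def crawl_breadth_first(start, links):
    """Return pages in the order a breadth-first spider would visit them."""
    schedule = deque([start])  # the "task table" of pages waiting to be crawled
    seen = {start}
    visited = []
    while schedule:
        page = schedule.popleft()
        visited.append(page)
        # Save newly found links at the end of the schedule for later.
        for child in links.get(page, []):
            if child not in seen:
                seen.add(child)
                schedule.append(child)
    return visited


print(crawl_breadth_first("computer", SITE))
```

Here all three column pages are visited before any product page, which is why this order favors the more important pages.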
Of course, on most sites every page shares the same header and footer, which show the columns and navigation links. The spider ignores the many links that repeat in the header and footer of every page and point to the same addresses; the links in the body text then become the ones it records and crawls.
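A rough way to imitate this is to treat any link that appears on every crawled page as header or footer navigation and keep only the rest. This is my own simplification, not a documented search engine method; the function name body_links and the sample pages are hypothetical.

```python
def body_links(pages):
    """Drop links that appear on every page, keeping the body-only links.

    `pages` maps a page name to the list of links found on it.
    Assumption: a link shared by all pages is header/footer navigation.
    """
    link_sets = [set(links) for links in pages.values()]
    # With fewer than two pages there is nothing to compare against.
    shared = set.intersection(*link_sets) if len(link_sets) > 1 else set()
    return {
        page: [link for link in links if link not in shared]
        for page, links in pages.items()
    }


pages = {
    "desktop": ["home", "about", "dell-desktop", "lenovo-desktop"],
    "laptop": ["home", "about", "hp-notebook", "asus-notebook"],
}
print(body_links(pages))
```

In the sample, "home" and "about" appear on both pages and are discarded, leaving only the product links from the body.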
Of course, what gets indexed is not the complete page. Instead, the content is extracted, meaningless words and the like are filtered out to leave the high-quality content, and the filtered content is then passed to the analysis system to obtain the article's keywords and so on.
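As a toy illustration of filtering out meaningless words and then picking keywords, the sketch below removes a small stop-word list and counts what is left. The stop-word list, the function name extract_keywords, and the use of plain word frequency are all my own assumptions; a real analysis system is far more elaborate.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real systems use much larger ones.
STOP_WORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "is", "in", "for", "on"}
)


def extract_keywords(text, top_n=3):
    """Return the most frequent words after removing stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    meaningful = [word for word in words if word not in STOP_WORDS]
    return [word for word, _ in Counter(meaningful).most_common(top_n)]


text = "The Dell desktop is a desktop for the office and the desktop of Dell"
print(extract_keywords(text, top_n=2))  # ['desktop', 'dell']
```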
If you reprint this article I am flattered. I hope you will keep the original address, http://www.ijseo.net/?p=592, as encouragement. Your comments are also welcome!