Introduction to Nutch, Part 1: A Detailed Description of the Crawling Process (2)
With the basic concepts above in hand, we can now turn to actual operation, because there is still a big gap between understanding the principles and putting them into practice.
Crawling is a cyclical process: the crawler generates a fetchlist from the WebDB; the fetcher downloads page content from the network according to the fetchlist; the crawler then updates the WebDB with the new links the fetcher found, and from it generates a new fetchlist. (Note: the spider is really split into two parts here; this division was debated at a company seminar, and Google does it the same way. Examples will be given later.) In Nutch, this crawl loop is usually called the generate/fetch/update cycle.
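As a first taste, one pass of this cycle can be driven by hand with the low-level tools. This is a minimal sketch assuming the Nutch 0.7-era tool signatures from the official tutorial; treat the exact arguments as assumptions to check against your version's `bin/nutch` usage messages:

```sh
bin/nutch generate db segments       # generate a fetchlist in a new segment
s=`ls -d segments/2* | tail -1`      # pick up the (timestamp-named) segment just created
bin/nutch fetch $s                   # fetch the page content on that fetchlist
bin/nutch updatedb db $s             # update the WebDB with the links just found
```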
Generally, URLs under the same host are grouped into the same fetchlist, so that when multiple spiders crawl in parallel, the same pages are not fetched repeatedly. Nutch follows the Robots Exclusion Protocol, so you can use robots.txt to keep private page data from being crawled.
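For example, a site owner can publish a file like the following at the server root (the paths here are purely illustrative), and a compliant crawler such as Nutch will skip the disallowed paths:

```
User-agent: *
Disallow: /private/
Disallow: /admin/
```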
The crawl tool built from the combination above is Nutch's outermost layer. You can also invoke the lower-level tools directly, combining them in the same execution order to achieve the same result; this is part of Nutch's appeal. The process is described in detail below, with the name of the corresponding low-level tool in parentheses (a command-line sketch of the whole sequence follows the list):
1. Create a new WebDB (admin db -create).
2. Inject the starting (root) URLs into the WebDB (inject).
3. Generate a fetchlist from the WebDB in a new segment (generate).
4. Fetch the page content listed in the fetchlist (fetch).
5. Update the WebDB with the URLs of the fetched pages (updatedb).
6. Repeat steps 3-5 until the specified depth is reached.
7. Update the segments with the computed URL scores (updatesegs).
8. Index the fetched pages (index).
9. Eliminate duplicate content and duplicate URLs from the indexes (dedup).
10. Merge the individual indexes into one large index to serve as the search index library (merge).
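Strung together, the ten steps look roughly like this on the command line. Again a hedged sketch assuming Nutch 0.7-era signatures: the seed-file name `urls`, the loop depth, and the exact arguments to `updatesegs` and `merge` are assumptions, so verify them against your version before copying:

```sh
bin/nutch admin db -create                # 1. create a new WebDB
bin/nutch inject db -urlfile urls         # 2. inject root URLs (seed file "urls" assumed)
for depth in 1 2 3; do                    # 6. repeat steps 3-5 to the desired depth
  bin/nutch generate db segments          # 3. generate a fetchlist in a new segment
  s=`ls -d segments/2* | tail -1`         #    newest (timestamp-named) segment
  bin/nutch fetch $s                      # 4. fetch the pages on the fetchlist
  bin/nutch updatedb db $s                # 5. update the WebDB with newly found links
done
bin/nutch updatesegs db segments          # 7. push scores from the WebDB into segments (args assumed)
for s in segments/2*; do
  bin/nutch index $s                      # 8. index each segment separately
done
bin/nutch dedup segments dedup.tmp        # 9. remove duplicate content and URLs
bin/nutch merge index segments/*/index    # 10. merge into one searchable index (args assumed)
```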
After a new WebDB is created (step 1), the generate/fetch/update cycle (steps 3-6) runs automatically from the root URLs injected in step 2, repeating on a configured period. When the crawl loop ends, the final index is built (steps 7-10).
Note that each segment is indexed separately (step 8) before duplicates are eliminated (step 9); only after step 9 succeeds are the individual indexes merged into one large index database (step 10).
The dedup tool removes duplicate URLs from the segment indexes. It is not needed on fetchlists: the WebDB does not allow duplicate URLs, so a fetchlist can never contain any. However, as mentioned above, the default refetch cycle is 30 days; if old segments are not deleted before new fetches are generated, duplicate URLs will still appear across segments. This situation cannot arise while only a single crawl program is running.
From the introduction above you can see that, in the normal case, you only need to run the top-level program from start to finish and never touch the low-level tools. But search engines are full of "surprises" and demand a great deal of maintenance time, so the low-level tools still need to be mastered. Below I will show how to run the process described above.
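In that normal case, the whole process collapses to a single invocation of the top-level crawl tool. This follows the crawl-tool usage shown in the Nutch 0.7-era tutorial; the seed-file name and the depth/topN values are illustrative assumptions:

```sh
# Seed URLs in "urls", 3 generate/fetch/update rounds, at most 1000
# top-scoring pages fetched per round (all values illustrative).
bin/nutch crawl urls -dir crawl.demo -depth 3 -topN 1000
```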
As I said at the beginning, this article is aimed at medium-sized search engines. For engines that crawl the whole Internet, on the scale of Baidu, you will also need to consult the following resources.
Resource list:
1. The Nutch project page: the home base you must know.
2. Mailing lists: nutch-user and nutch-dev.
3. At the time of writing, MapReduce had been committed to Nutch's SVN repository, but no released version included it yet. As I recall, Doug Cutting went on vacation right after checking in the MapReduce code.
More resources:
There are also good Nutch tutorials. Anyone who has written an Eclipse plug-in knows how powerful the Eclipse plug-in architecture is; Nutch's plug-in system is modeled on Eclipse's (specifically the Eclipse 2.0 plug-in architecture). For details, see PluginCentral.
- Search option
- Building Nutch: Open Source Search
- Nutch: A Flexible and Scalable Open-Source Web Search Engine
- Introduction to Nutch, Part 1: Crawling: http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html