Nutch is a Lucene-based, complete web search engine solution similar to Google's. Its distributed processing model, built on Hadoop, ensures system performance, and a plug-in mechanism similar to Eclipse's makes the system customizable and easy to integrate into your own applications. Nutch 0.8 completely rewrote the backbone code on top of Hadoop and made sensible corrections in many other places, so it is worth upgrading to.

1. Install and run Nutch 0.8
There is plenty of Chinese installation documentation for Nutch 0.7.2. For installing Nutch 0.8, see the official tutorial (0.8), and pay attention to two points. First, the urls parameter of the crawl command has changed from a specified file to a specified directory; that is, the original urls file must become urls/foo. Second, the http.agent.name property in nutch-default.xml is empty by default and must be set in nutch-site.xml, otherwise an error occurs. Note that log4j now writes crawl-time information to the logs/ directory; by default it is no longer printed directly to the screen unless you set fetcher.verbose to true in the configuration file. Luke (http://www.getopt.org/luke) is an indispensable tool for inspecting the index. In addition, Nutch needs to run on Unix; to use it on Windows, install Cygwin first (download the small setup.exe and the online installation finishes quickly). Finally, the recrawl script for Nutch 0.8 is also different.

2. Nutch: what you should know

2.1 Documents
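For example, a minimal nutch-site.xml that overrides http.agent.name might look like the following sketch (the agent name "MyNutchSpider" is just a placeholder; choose your own):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- http.agent.name is empty in nutch-default.xml and must be
       set here, or fetching fails with an error. -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchSpider</value>
  </property>
</configuration>
```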
There is not much Nutch documentation, and most of it covers installation. To go deep into Nutch, you must read the following without missing a word: "Introduction to Nutch, Part 1: Crawling" and "Introduction to Nutch, Part 2: Searching". After that, read the source code. Fortunately, Nutch's source code is uniform and brief, with nothing fancy, so it is easy to understand.

2.2 Three directories
First, understand Nutch's three data directories:

1. crawldb and linkdb: the web link databases, storing the link relationships between URLs. They serve as the basis for crawling and re-crawling; pages expire after 30 days by default.

2. segments: the main directories storing the fetched web pages. Page content is kept both as raw byte[] content and as parsed text. Nutch crawls breadth-first, so each round of crawling generates a new segment directory.

3. index: the Lucene index directory, the complete index produced by merging all the per-segment indexes. Note that the index only indexes the page content and does not store it, so you must go back to the segments directories to obtain the page content.

2.3 Crawling process
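The breadth-first, one-segment-per-round behavior can be sketched with plain Java. This is an in-memory toy, not Nutch's actual implementation; the link graph, seed, and depth limit are invented for illustration:

```java
import java.util.*;

// Toy breadth-first crawl: each round of fetching corresponds to
// one "segment", just as Nutch creates one segment directory per round.
public class BfsSegments {
    public static List<List<String>> crawl(Map<String, List<String>> links,
                                           String seed, int depth) {
        List<List<String>> segments = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        List<String> frontier = List.of(seed);
        seen.add(seed);
        for (int round = 0; round < depth && !frontier.isEmpty(); round++) {
            segments.add(frontier);               // this round's "segment"
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                for (String out : links.getOrDefault(url, List.of())) {
                    if (seen.add(out)) next.add(out);  // only unseen URLs
                }
            }
            frontier = next;
        }
        return segments;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c", "d"));
        System.out.println(crawl(links, "a", 3));
        // three rounds -> three segments: [[a], [b, c], [d]]
    }
}
```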
The crawling process is described in detail in "Introduction to Nutch, Part 1: Crawling"; also read the Crawl class to understand the overall flow. The crawl is bounded by the entry addresses, the URL regular expressions, and the crawl depth. Because Nutch uses Hadoop (covered later), its code is written in the Hadoop style to gain distributed capability, so you should first understand Hadoop and the roles of the Mapper, Reducer, InputFormat, and OutputFormat classes before reading further.

1. The Fetcher class: its run() method starts multiple FetcherThreads, which call the appropriate protocol plug-in (HTTP, FTP, and other protocols are supported) to fetch the content, call the appropriate Parser to parse the content (HTML, PDF, Excel) into text, and put the result into a FetcherOutput object; the FetcherOutputFormat class defines how this output is written to a segment on disk.

2. The Indexer class: it uses Hadoop to traverse all the segments directories, deserializes the parse data files into ParseData objects to obtain the various data, calls the indexing plug-ins, and finally writes the index via the OutputFormat class. Note: if you only want Nutch's web crawler and not its indexing function, you can write your own implementation modeled on Indexer, for example one that moves the segments content directly into a database.

3. The fields of each Nutch index record:

url: a unique key value, generated by the BasicIndexingFilter class.

segment: generated by the Indexer class. The content of pages fetched by Nutch is placed in the segments directories; Lucene only indexes the content and does not store it, so segment and url together act as a foreign key during queries, and the FetchedSegments class uses the segments directory to obtain the content.

boost: the priority score, calculated by plug-ins invoked from the Indexer class.

title: the display title, indexed and stored by the BasicIndexingFilter plug-in.
content: the main searched field, indexed by the BasicIndexingFilter plug-in.

2.4 Search process
A typical piece of search code looks like this:

NutchBean bean = new NutchBean();
Query query = Query.parse(args[0]);
Hits hits = bean.search(query, NUM_HITS, "title", true);
for (int i = 0; i < hits.getLength(); i++) {
    Hit hit = hits.getHit(i);
    HitDetails details = bean.getDetails(hit);
    String title = details.getValue("title");
    String url = details.getValue("url");
    String summary = bean.getSummary(details, query);
}
Here NutchBean does several things for us. First, it sorts the results (here by the title field). Second, it supports distributed query: if servers are configured, it uses Hadoop's IPC mechanism to call the NutchBeans on all the servers and then merges the results. Third, like Google, it shows only the highest-scoring page per site; to see the other results from the same site you need to make a further API call. Fourth, it generates the summary: it fetches the content from the segments directory by segment and url and, like Google, extracts the fragments containing the keywords according to a certain algorithm.

3. Modify the source code or write plug-ins
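The "one result per site" behavior can be sketched as follows. This is a simplified stand-in, not NutchBean's actual code; the Hit record and the sample data are invented for illustration:

```java
import java.util.*;

// Simplified per-site dedup: keep only the highest-scoring hit per site,
// mimicking NutchBean's "one result per site" behavior.
public class SiteDedup {
    record Hit(String site, String url, float score) {}

    public static List<Hit> dedup(List<Hit> hits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            // Keep the existing hit only if it scores at least as high.
            best.merge(h.site(), h, (a, b) -> a.score() >= b.score() ? a : b);
        }
        // Present the survivors sorted by score, highest first.
        List<Hit> out = new ArrayList<>(best.values());
        out.sort((a, b) -> Float.compare(b.score(), a.score()));
        return out;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("a.com", "http://a.com/1", 0.9f),
            new Hit("a.com", "http://a.com/2", 0.5f),
            new Hit("b.com", "http://b.com/1", 0.7f));
        for (Hit h : dedup(hits)) System.out.println(h.url());
        // Only the top hit from a.com survives.
    }
}
```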
The source code of Nutch is easy to modify and recompile; note that the newly compiled classes must be packed back into nutch-0.8.job (actually a jar) for the change to take effect. Nutch's plug-in mechanism is similar in spirit to Eclipse's; see http://wiki.apache.org/nutch/WritingPluginExample. You implement a plug-in interface, then declare the class, extension point, and dependent jars in a plugin.xml, for example:

<plugin id="index-basic"
        version="1.0.0"
        provider-name="nutch.org">
   <runtime>
      <library name="index-basic.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="org.apache.nutch.indexer.basic"
              name="Nutch Basic Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="BasicIndexingFilter"
                      class="org.apache.nutch.indexer.basic.BasicIndexingFilter"/>
   </extension>
</plugin>
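The Java side of such a plug-in is a class implementing the extension-point interface. As a rough sketch of the idea only: the interface below is a simplified stand-in defined locally, not Nutch's actual IndexingFilter signature, which operates on Lucene Documents and Nutch parse objects:

```java
import java.util.*;

// Simplified model of an indexing-filter extension point; a plain Map
// stands in for the Lucene Document that the real filter would fill.
public class FilterSketch {
    interface IndexingFilter {
        Map<String, String> filter(Map<String, String> doc, String url);
    }

    // Loosely analogous to BasicIndexingFilter: adds url and title fields.
    static class BasicFilter implements IndexingFilter {
        public Map<String, String> filter(Map<String, String> doc, String url) {
            doc.put("url", url);                      // unique key field
            doc.putIfAbsent("title", "(untitled)");   // display field
            return doc;
        }
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("content", "hello nutch");
        new BasicFilter().filter(doc, "http://example.com/");
        System.out.println(doc.get("url"));
    }
}
```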
Finally, some gossip: Doug Cutting's piece on the development of search engines has been translated by Comrade dedian. And thanks to the translator who coined the phrase "C++ must know" :)