People who know Nutch generally appreciate this open-source system, at least in China, where many search websites are built as modifications of it. To do it well, though, as an actual commercial search engine, the modification is not an overnight job, nor as simple as patching and trimming. As a general whole-Web search engine architecture, Nutch (with Lucene) really does hand the masses a big cake, greatly lowering the barrier to entering the search industry. So how far is it from commercial search? Here is my view.
I. Overall Functions
A professional Web search engine consists of at least three parts: crawling, processing, and search. Their general functions are as follows:
Crawling: a crawler (spider) crawls a specific network (or the entire Web) and downloads its pages and other required files to local storage. The difficulties are JavaScript analysis and the identity-authentication issues brought on by the popularity of the modern Web.
Processing: programs for classification, information extraction, data mining, and so on analyze the crawled pages: classifying website content, extracting the news from news pages, generating page templates, computing relationships between websites, and the like.
Search: an information-retrieval program fills documents into a database, then finds the documents in that database most relevant to a query string.
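To make the search part of that split concrete, here is a toy, entirely hypothetical sketch in Java: documents are filled into an in-memory store, then ranked against a query by simple term overlap. This is only an illustration of "fill documents, then find the most relevant"; a real engine such as Lucene uses far more sophisticated TF-IDF style scoring.

```java
import java.util.*;

/** Toy illustration of the search step: fill documents in, then rank
 *  them against a query by term overlap (hypothetical scoring only). */
public class ToySearch {
    private final List<String> docs = new ArrayList<>();

    public void add(String text) { docs.add(text); }

    /** Return doc ids sorted by how many query terms each doc contains. */
    public List<Integer> search(String query) {
        Set<String> terms =
            new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) ids.add(i);
        // Sort descending by overlap score (stable, so ties keep doc order).
        ids.sort(Comparator.comparingInt((Integer i) -> -score(docs.get(i), terms)));
        return ids;
    }

    private static int score(String doc, Set<String> terms) {
        int s = 0;
        for (String w : doc.toLowerCase().split("\\s+"))
            if (terms.contains(w)) s++;
        return s;
    }

    public static void main(String[] args) {
        ToySearch idx = new ToySearch();
        idx.add("nutch is an open source crawler");
        idx.add("lucene handles indexing and query");
        idx.add("commercial search needs information processing");
        System.out.println(idx.search("lucene query").get(0)); // prints 1
    }
}
```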
II. Information Crawling
Crawling network information includes crawling pages, text files, and other file types. The common approach uses a basic HTML parser (HtmlParser, NekoHTML, JTidy, and the like) to parse the page and extract its information. There are essentially two steps: fetching and parsing.
For fetching, you need to handle identity authentication and support multiple protocols. Nutch's default plug-in here uses NekoHTML, and the results are acceptable. However, Nutch's HTML analysis concatenates all the text on the page (one switch controls whether anchor text is included) into a single text output, so none of the page's noise is removed.
The other step is parsing the HTML. The strongest HTML parsers are the browsers themselves, such as IE and Firefox, and next to them the processing ability of Nutch's default parser falls far short. With Ajax now prevalent, handling JavaScript is also a major issue, and for now Nutch turns a blind eye to JS.
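To show how crude text extraction can be, here is a minimal, hypothetical tag-stripping extractor in Java. Like Nutch's default output it keeps every piece of visible text, navigation noise included, and like Nutch it is blind to anything JavaScript would generate; a production crawler would use a real parser such as NekoHTML rather than regexes.

```java
import java.util.regex.Pattern;

/** Minimal tag-stripping text extractor (illustration only; a real
 *  crawler would use a proper HTML parser such as NekoHTML). */
public class TextExtractor {
    // Drop <script>/<style> blocks first, then any remaining tags.
    private static final Pattern SCRIPT_STYLE =
        Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static String extract(String html) {
        String noScripts = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        String text = TAG.matcher(noScripts).replaceAll(" ");
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><script>var x=1;</script></head>"
                    + "<body><h1>News</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extract(page)); // prints: News Hello world
    }
}
```

Note how, exactly as with Nutch's default output, the headline and body text come out concatenated with no indication of which parts are noise.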
III. Information Processing
Information processing is the weakest part of Nutch, yet it is also the treasure of this industry: victory or defeat is decided here. It covers classification, information extraction, data mining, and more.
Nutch's default components include a clustering package used to cluster search results; the default implementation applies the suffix-tree algorithm from the open-source carrot2 project to Web text clustering.
There is also ontology, a concept from the field of artificial intelligence. Ontology became a research hotspot in direct connection with the proposal and development of the Semantic Web. With the reasoning rules in an ontology, an application system gains some reasoning capability. The default Nutch likewise includes simple ontology support, using HP's Jena. But for a commercial application, these are only a rough mold.
Nutch prepares only the most basic interfaces for all this; the rest, such as machine learning (ML), natural language processing (NLP), and data analysis (DA), you have to build yourself.
IV. Search
In terms of function, Nutch is really a crawler plus search, with Lucene doing the searching, so the limitations of this part are in fact the limitations of Lucene. Lucene itself splits into two functional parts: indexing and querying. How to return the documents a user most wants has been studied for a long time, and for a search engine speed also matters greatly. Indexes come in two kinds: the forward index and the inverted index. The former records, for each document, all the words it contains; the latter records, for each word, all the documents that contain it. In Lucene, the forward index corresponds to the term-vector files, namely the .tvx, .tvd, and .tvf files. There is little to say about the forward index: it mainly serves as a basis for reconstructing the original data, and its construction is simple and direct. The inverted index corresponds to Lucene's main index files.
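The two index kinds can be sketched with plain in-memory maps. This is a deliberate simplification for illustration; Lucene's real on-disk formats are far more involved.

```java
import java.util.*;

/** Sketch of the two index kinds: forward (doc -> terms) and
 *  inverted (term -> docs). Simplified; not Lucene's actual format. */
public class IndexKinds {
    // Forward index: doc id -> the terms that document contains.
    final Map<Integer, List<String>> forward = new HashMap<>();
    // Inverted index: term -> ids of the documents containing it.
    final Map<String, SortedSet<Integer>> inverted = new HashMap<>();

    public void add(int docId, String text) {
        List<String> terms = Arrays.asList(text.toLowerCase().split("\\s+"));
        forward.put(docId, terms);
        for (String t : terms)
            inverted.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
    }

    /** Postings list: all documents that contain the term. */
    public SortedSet<Integer> postings(String term) {
        return inverted.getOrDefault(term, new TreeSet<>());
    }

    public static void main(String[] args) {
        IndexKinds idx = new IndexKinds();
        idx.add(0, "nutch uses lucene");
        idx.add(1, "lucene builds inverted indexes");
        System.out.println(idx.postings("lucene")); // prints [0, 1]
    }
}
```

The forward map answers "what is in document N" (reconstruction of the original data); the inverted map answers "which documents contain this word", which is the query-time workhorse.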
Lucene divides an index into segments (in effect, small sub-indexes). Intuitively, when a batch of new data arrives, a new segment is usually built for it, because modifying an existing segment is costly (not that it is necessarily expensive, but the file structure Lucene uses simply cannot have new documents appended to it).
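Treating each segment as a small inverted index, merging two of them amounts to unioning the postings lists term by term. The sketch below shows only that idea, in hypothetical form; Lucene's actual merge process also rewrites stored fields, handles deletions, and more.

```java
import java.util.*;

/** Sketch of segment merging: each segment is a small inverted index,
 *  and merging unions the postings lists term by term (a simplification
 *  of what Lucene's merge actually does). */
public class SegmentMerge {
    public static Map<String, SortedSet<Integer>> merge(
            Map<String, SortedSet<Integer>> a,
            Map<String, SortedSet<Integer>> b) {
        Map<String, SortedSet<Integer>> out = new TreeMap<>();
        for (Map<String, SortedSet<Integer>> seg : Arrays.asList(a, b))
            for (Map.Entry<String, SortedSet<Integer>> e : seg.entrySet())
                out.computeIfAbsent(e.getKey(), k -> new TreeSet<>())
                   .addAll(e.getValue());   // union the doc-id lists
        return out;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> s1 = new TreeMap<>();
        s1.put("lucene", new TreeSet<>(Arrays.asList(0, 1)));
        Map<String, SortedSet<Integer>> s2 = new TreeMap<>();
        s2.put("lucene", new TreeSet<>(Arrays.asList(2)));
        s2.put("nutch", new TreeSet<>(Arrays.asList(2)));
        System.out.println(merge(s1, s2)); // {lucene=[0, 1, 2], nutch=[2]}
    }
}
```

This also shows why appending to an old segment is avoided: it is cheaper to write a fresh small segment and pay the merge cost later, in batch.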
When an index accumulates too many segments, query performance suffers badly (each query has to consult every segment), so segments must be merged. On the search side, Nutch adds its own handling outside Lucene. First, searches can be distributed: each node returns only its highest-scoring results, which are then merged. Second, queries are cached, although there is only a single level of cache, an LRU (see the analyses of Nutch's cache policy).
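A single-level LRU query cache of the kind described above can be written in a few lines with the classic Java idiom of `LinkedHashMap` in access order. This is a generic sketch, not Nutch's actual cache implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Single-level LRU cache of the kind a query cache could use
 *  (a generic sketch, not Nutch's actual implementation). */
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);       // accessOrder = true -> LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;     // evict the least-recently-used entry
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("q1", "results1");
        cache.put("q2", "results2");
        cache.get("q1");              // touch q1, so q2 becomes eldest
        cache.put("q3", "results3");  // evicts q2
        System.out.println(cache.keySet()); // prints [q1, q3]
    }
}
```

One level of caching like this helps with repeated hot queries but does nothing for near-duplicate queries or per-segment results, which is part of why a single LRU layer is a limitation.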
V. Conclusion
From the perspective of search engine architecture, the missing piece in Nutch/Lucene is information processing, and information processing is precisely the core technology of a whole search engine. In today's era of real-time and vertical search, this inherent shortcoming of Nutch is already fatal, but it is not irretrievable. Nutch's plug-in architecture, open system logic, and similar features open a window for developers: its fairly general processing logic, combined with the flexible plug-in framework, gives us the wings to customize it. Still, it is only a framework, and any of the details inside it (such as ML and NLP) can give you a headache. The real difficulty lies in exactly those headaches.