Summary of secondary development on Nutch, with conclusions drawn from a query analysis of Nutch

Source: Internet
Author: User

Nutch provides a complete search engine architecture. On this basis you can carry out secondary development to build search engines of different scales: personal search engines, enterprise LAN search engines, search engines for the entire web, and so on. You can also create a search engine for a special purpose. Whatever kind of search engine is built, you usually need to do secondary development on Nutch, that is, modify the source code to change its functions accordingly. Through this experiment, we have summarized some key steps in secondary development. Here we give a rough description of them.
1.1 Information source selection and standardization
        The selection of information sources reflects the business scope of the search engine. If you select a single website or a group of websites, it is an enterprise LAN search engine. If you select the entire web, it is a comprehensive search engine. If you select websites or webpages on one topic, it is a vertical search engine. If you select blogs, it can be called a blog search engine; if you select documents in a certain format, it can be called a file search engine. Therefore, before secondary development, identify the requirements, analyze the main target websites, and select them as the information source.
        Nutch crawls only the URLs allowed by the URL rules you define, which amounts to filtering the information source. By default, these rules are set in the relevant configuration file, which uses regular expressions to constrain the URLs. You can also write your own plug-in to implement the URL rules.
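The rule-based URL filtering described above can be sketched as follows. This is a minimal illustration in Python, not Nutch's own code: it assumes rules written in the "+regex / -regex" style of Nutch's regex URL filter file, with the first matching rule deciding the outcome. The host `example.com` and the function names are hypothetical.

```python
import re

# Hypothetical rules, one per line: "+" accepts, "-" rejects, "#" is a comment.
RULES_TEXT = r"""
# skip common binary and static-resource suffixes
-\.(gif|jpg|png|css|js|zip|exe)$
# accept pages under the (hypothetical) target site
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn the rule text into a list of (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the decision of the first rule that matches; reject if none does."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


RULES = parse_rules(RULES_TEXT)
print(accepts(RULES, "http://www.example.com/index.html"))
print(accepts(RULES, "http://www.example.com/logo.png"))
```

Because the first match wins, the order of the rules matters: the suffix rule is placed before the site rule so that images on the target site are still skipped.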
1.2 Information preprocessing
        Information preprocessing here means converting the content downloaded by the Nutch crawler into text that the Nutch indexer can use. It mainly involves the following:
        (1) Format recognition and text extraction. Most documents downloaded by the Nutch crawler are HTML, but there are many other document types on the web, such as TXT, DOC, PDF, XLS, and RTF, and even multimedia formats. Before indexing, the text must be extracted from the downloaded files, and the extraction method differs from format to format. By default, Nutch can directly process HTML and TXT files; parsers for other formats have been implemented but are not loaded. Many open-source tools can extract text, such as POI for Word documents and PDF-reader for PDF documents. During secondary development, you need to write text extraction tools for the document formats you want to support.
        (2) Information filtering. Information filtering here means removing extracted text that you do not want to keep. This step is not necessarily independent and may overlap with the previous one. For example, if you do not want some areas of a website to be indexed, you can write a plug-in that processes the pages of that website and removes the content of those areas.
        (3) Encoding conversion. The encodings used on the web are varied and not particularly standardized. Usually the encoding can be unified during processing, but some content cannot be converted properly by the default program. In that case, you should extend the program to handle the conversion.
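The three preprocessing steps above can be illustrated together for the HTML case. The following is a minimal sketch using only the Python standard library, not Nutch's parser plug-ins: the candidate encodings, the `sidebar` element id, and all function names are assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements without a closing tag; ignored when tracking nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


def decode_bytes(raw, candidates=("utf-8", "gb18030")):
    """Encoding conversion: try each candidate encoding, then fall back leniently."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


class TextExtractor(HTMLParser):
    """Text extraction plus filtering: drops scripts, styles, and unwanted regions."""

    def __init__(self, skip_ids=()):
        super().__init__()
        self.skip_ids = set(skip_ids)  # ids of page regions to leave out
        self.skip_depth = 0            # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1
        elif tag in ("script", "style") or dict(attrs).get("id") in self.skip_ids:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(raw, skip_ids=()):
    """Decode a downloaded page and return its visible text as one string."""
    parser = TextExtractor(skip_ids)
    parser.feed(decode_bytes(raw))
    parser.close()
    return " ".join(parser.parts)


PAGE = ('<html><body><div id="sidebar"><p>menu</p></div>'
        '<p>信息管理</p><script>var x = 1;</script></body></html>').encode("gbk")
print(extract_text(PAGE, skip_ids={"sidebar"}))
```

The depth counter assumes reasonably well-formed HTML; pages with unclosed tags inside a skipped region would need a more tolerant parser.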
1.3 Localized index construction
        The preprocessed information can be indexed directly by Nutch. There are also many factors to consider in the indexing process. The first is word segmentation for Chinese, which was analyzed and summarized in detail in the previous experiment. The second is further processing of the information, that is, finding the set of words that best expresses the original semantics. Related techniques include stemming, stop words, ontologies, and so on. This step is very important because it directly determines the quality of the query service.
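As an illustration of the term processing mentioned above, the sketch below builds a tiny inverted index with stop-word removal and a very crude suffix-stripping step. It is an assumption-laden toy for English text: the stop-word list, the suffix list, and the sample documents are invented, and the suffix stripping is only a stand-in for a real stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative list only
SUFFIXES = ("ing", "es", "s")                 # crude stand-in for a real stemmer


def crude_stem(word):
    """Strip one known suffix if at least four characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, tokenize, drop stop words, and stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(word) for word in words if word not in STOP_WORDS]


def build_index(docs):
    """Map each index term to the sorted list of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


DOCS = {
    1: "Searching the index",
    2: "The crawler searches pages",
    3: "An index of pages",
}
INDEX = build_index(DOCS)
print(INDEX)
```

The same `index_terms` function should be applied to the query at search time, so that "searching" in a query finds documents indexed under "search".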
1.4 Sorting rule formulation
        The sorting rules not only affect the query results but also run through the entire search engine process. Many factors can affect them, including relevance to user needs and the business needs of the system: for example, word frequency in the document, word frequency across the whole document collection, word position, and even the time of the information. Therefore, in secondary development, you need to formulate sorting rules based on your needs and reflect them in the system.
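Two of the factors named above, word frequency in the document and word frequency across the document collection, are what a TF-IDF score combines. The following sketch is a textbook TF-IDF ranking, not the scoring formula actually used by Nutch; the sample documents and the function name are assumptions.

```python
import math
from collections import Counter


def rank(query_terms, docs):
    """Return ids of matching documents, best score first (ties broken by id)."""
    total = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term_freq[term]:
                tf = term_freq[term] / len(tokens)          # frequency in this document
                idf = math.log(1 + total / doc_freq[term])  # rarity in the collection
                score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


DOCS = {
    "a": ["nutch", "search", "engine"],
    "b": ["nutch", "nutch", "crawler", "index"],
    "c": ["web", "page"],
}
print(rank(["nutch"], DOCS))
```

Other factors from the text, such as word position or document time, could be added as extra weighted terms in `score`; how to weight them is a business decision rather than something this sketch settles.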
1.5 Query system and user interface
        The query system of Nutch is deployed in Tomcat. It provides a query interface similar to Google's and supports multiple languages. In actual secondary development, it does not necessarily need to support multiple languages and can be rewritten for one specific language. In addition, you can modify the query process: change the query method, add pagination, and add summaries. The user interface can simply be rewritten to suit the actual situation.
 
Conclusions from the query analysis of Nutch
From the experimental data, we can draw the following conclusions about the Nutch search engine:
1. Nutch's default index uses single-character segmentation, that is, n-grams with n = 1. This segmentation method is simple, but the index is relatively large, as the index storage sizes show: the Nutch index file is 85.8 MB, while the index built after Paoding word segmentation is only 76.8 MB. Nutch's default segmentation therefore produces a larger index, which is harder to maintain and increases query complexity. The statistics show that for Chinese queries, Nutch returns many results by default, but most of them are irrelevant. For English, the results are relatively good: the search is accurate and the relevance is high. The main reason is that Nutch by default also splits the query into single characters and retrieves the documents that contain those characters, without considering whether the characters are adjacent to each other.
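The effect described in point 1 can be reproduced with a small sketch. It compares single-character (unigram) tokens with a dictionary-based forward maximum matching segmenter, which is used here only as a stand-in for a dictionary segmenter such as Paoding and is not its actual algorithm. The dictionary and the sample strings are invented for illustration.

```python
def unigrams(text):
    """Single-character segmentation (n-grams with n = 1)."""
    return [ch for ch in text if not ch.isspace()]


def max_match(text, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:  # a single character is the fallback
                tokens.append(piece)
                i += size
                break
    return tokens


def matches_all(query_tokens, doc_tokens):
    """AND query that ignores whether the tokens are adjacent in the document."""
    return set(query_tokens) <= set(doc_tokens)


DICTIONARY = {"信息", "管理", "休息", "管道"}  # hypothetical dictionary
QUERY = "信息管理"                             # "information management"
RELEVANT = "信息管理系统"
IRRELEVANT = "写信休息管道理论"                 # contains 信, 息, 管, 理 but not the phrase

for doc in (RELEVANT, IRRELEVANT):
    by_unigram = matches_all(unigrams(QUERY), unigrams(doc))
    by_words = matches_all(max_match(QUERY, DICTIONARY), max_match(doc, DICTIONARY))
    print(doc, by_unigram, by_words)
```

With unigrams, both documents match the query because every character occurs somewhere in each of them; with dictionary words, only the document that really contains "信息" and "管理" matches.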
2. The Paoding index returns fewer query results, but precision is high. The reason is that Paoding builds the index with dictionary-based word segmentation, and at query time the request is segmented in the same way before the query is constructed. However, there is no query result for "Information Management": Paoding word segmentation divides it into "information" + "Management", and the word "interest" may not actually exist as a term in the index, so no record matches. So why does Nutch not use some strategy to retrieve the most relevant information? Here I feel that Nutch's vector model does not seem to play a role.
3. The Nutch default index mode and the Paoding index make little difference for English, and their results do not differ much. In addition, the English search results are better than the Chinese ones.
4. For "Communication Management" and "Information Management", and for "Nokia" written in Chinese and "Nokia" written in English, the results are different. In terms of human understanding, however, the two expressions in each pair generally have the same meaning. Should we consider a mechanism that queries all words with the same meaning? Nutch has an "ontology" plug-in, which indicates that Nutch has already considered such a mechanism but has not refined it. Building the underlying library also requires considerable effort.
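The mechanism asked for in point 4, querying all words with the same meaning, can be sketched as simple synonym expansion at query time. This is not how Nutch's ontology plug-in works; the synonym groups, the toy index, and the function names are all assumptions for illustration.

```python
# Hypothetical synonym groups; a real system would load them from a curated library.
SYNONYM_GROUPS = [
    frozenset({"nokia", "诺基亚"}),
]


def build_lookup(groups):
    """Map every term to the whole group it belongs to."""
    lookup = {}
    for group in groups:
        for term in group:
            lookup[term] = group
    return lookup


def expand(term, lookup):
    """Return the term together with all of its known synonyms."""
    key = term.lower()
    return lookup.get(key, frozenset({key}))


def search(term, index, lookup):
    """Union the postings of every variant of the query term."""
    hits = set()
    for variant in expand(term, lookup):
        hits |= index.get(variant, set())
    return sorted(hits)


LOOKUP = build_lookup(SYNONYM_GROUPS)
INDEX = {"nokia": {1, 2}, "诺基亚": {2, 3}, "nutch": {4}}  # term -> document ids
print(search("Nokia", INDEX, LOOKUP))
print(search("诺基亚", INDEX, LOOKUP))
```

With the expansion in place, both spellings return the same set of documents. The hard part, as the text notes, is building and maintaining the synonym library itself.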
5. In addition, the query requests include "Magic blow", which is the ID of someone on a BBS. The semantics of such a query are not obvious. If such IDs can also be queried, should this be developed independently into a search engine for virtual identities?
Based on this finding: should Nutch use the same analyzer for all webpages, or call different analyzers for different websites so that information is collected better? For example, you could filter out the left column of one website and the header information of another. We also submitted the request "x blow site:bbs.ccnu.edu" to Google. Google's data processing capability far exceeds that of Nutch. However, Google, like Nutch, simply indexes the title and does not filter out content that is repeated in every title, such as "Powered by Discuz!". A comprehensive search engine like Google may not have the energy to perform such analysis for every website. For vertical search engines built on Nutch, however, it is necessary to filter topic data in this way.
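The idea of site-specific processing raised above can be sketched as a table of per-host cleaning rules applied before indexing. The example below only handles titles; the host name `bbs.example.edu`, the rule table, and the function name are hypothetical, and this is not an existing Nutch feature.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-host rules: patterns to strip from page titles before indexing.
TITLE_RULES = {
    "bbs.example.edu": [
        re.compile(r"\s*[-|]?\s*Powered by Discuz!\s*$", re.IGNORECASE),
    ],
}


def clean_title(url, title):
    """Apply the cleaning rules registered for the page's host, if any."""
    host = urlparse(url).hostname or ""
    for pattern in TITLE_RULES.get(host, []):
        title = pattern.sub("", title)
    return title.strip()


print(clean_title("http://bbs.example.edu/thread-1.html",
                  "Campus news - Powered by Discuz!"))
```

The same lookup-by-host approach could select a different analyzer or a different region filter for each website, which is practical for a vertical search engine that covers a limited set of sites.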
