Dyse: Implementation of a Java search engine, Part 1


Imagine writing a search engine yourself: you type keywords into an interface, click Search, and get back exactly the results you want. How cool is that? Perhaps your site needs a search function, or you want full-text search over the Chinese documents on your hard drive. And, not least, plenty of IT companies are looking for people with exactly these skills. If you are intrigued, let's go!

We start with the Java language rather than C++ because Java provides a good foundation for implementing a search engine: it ships with many basic packages and classes for network programming, such as the URL and InetAddress classes, as well as built-in regular expression support. This lets us focus on the implementation of the search engine itself without being distracted by these building blocks.
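To see what this buys us, here is a minimal sketch (my own illustration, not part of Dyse itself) that uses the URL and InetAddress classes mentioned above to resolve a host and download a page:

Listing 1. Fetching a page with the JDK networking classes

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.URL;

public class FetchDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/");
        // Resolve the host name to an IP address.
        InetAddress addr = InetAddress.getByName(url.getHost());
        System.out.println(url.getHost() + " -> " + addr.getHostAddress());
        // Read the page content line by line.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}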

This three-part series shows step by step how to design and implement a search engine. In this first part, you will learn how a search engine works and understand its architecture, and then we will walk through the first module of the search engine, the web crawler, which collects web pages. The second part covers the preprocessing module: how the collected pages are cleaned up, segmented into words, and indexed. The third part introduces the query service: building the query interface, returning query results, and implementing page snapshots.

The overall architecture of Dyse

Before studying the individual modules of the search engine, you need to understand Dyse's overall architecture and how data flows through it. The three parts of the search engine are in fact independent of one another and run separately; the main relationship between them is that each part produces the raw data consumed by the next. Their relationship is shown in the following illustration:

Figure 1. The three-stage workflow of a search engine

Before introducing the overall architecture of the search engine, let us borrow the narrative style of "Computer Networking: A Top-Down Approach Featuring the Internet" and describe the concrete workflow of a search engine from the perspective of an ordinary user.

A top-down description of how a search engine executes a query:

The user submits a query word or phrase P through the browser, and the search engine returns a list L of pages matching the query;

This process raises two questions: how is the user's query matched against pages, and where does the page list come from and in what order is it sorted? The query P is first cut by a word breaker (segmenter) into smaller terms <p1, p2, ..., pn>, and stop words (empty function words such as 的, 了, 啊 in Chinese) are removed. Using an inverted index maintained by the system, we can look up which pages each term pi appears in; the pages matching all of <p1, p2, ..., pn> form the initial result set. Going further, each page in the initial set is scored by computing its relevance to the query terms and its importance ranking, i.e. its PageRank, and the final page list is obtained by sorting on these scores;
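To make the matching step concrete, here is a hedged sketch (class and field names are my own assumptions, not Dyse's actual code) of looking up query terms in an inverted index and intersecting the posting lists:

Listing 2. Matching query terms against an inverted index

import java.util.*;

public class QueryDemo {
    // The inverted index: term -> set of IDs of pages containing the term.
    static Map<String, Set<Integer>> invertedIndex = new HashMap<>();

    // Return the IDs of pages that contain every query term.
    static Set<Integer> search(List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> postings =
                invertedIndex.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new HashSet<>(postings);   // first term: copy its postings
            } else {
                result.retainAll(postings);         // keep pages matching all terms so far
            }
        }
        return result == null ? Collections.emptySet() : result;
    }
}

Scoring by relevance and PageRank would then reorder this candidate set before it is returned to the user.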

Assuming the word breaker and the PageRank formula are given, where do the inverted index and the original page collection come from? From the data flow described earlier, we know that the original pages are crawled from the web by the spider and stored locally. The inverted index, i.e. the mapping from terms to web pages, is built from the forward index; the forward index maps each page to the terms it contains, obtained by analyzing the page content and segmenting it into words. The inverted index is then produced by inverting the forward index;
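The inversion itself is mechanical; a minimal sketch (assuming the forward index maps a page ID to its term list) looks like this:

Listing 3. Inverting a forward index into an inverted index

import java.util.*;

public class Inverter {
    // forward: page ID -> terms on that page (the forward index)
    // returns: term -> IDs of pages containing it (the inverted index)
    static Map<String, Set<Integer>> invert(Map<Integer, List<String>> forward) {
        Map<String, Set<Integer>> inverted = new HashMap<>();
        for (Map.Entry<Integer, List<String>> e : forward.entrySet()) {
            for (String term : e.getValue()) {
                inverted.computeIfAbsent(term, k -> new TreeSet<>())
                        .add(e.getKey());
            }
        }
        return inverted;
    }
}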

What exactly does web page analysis do? The raw pages collected by the crawler contain a great deal of extraneous information, such as HTML markup and spam like advertisements. Page analysis removes this information and extracts the body text as the underlying data for the subsequent steps.
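One simple way to do this, sketched below under the assumption that regular expressions are good enough for a first cut (a production analyzer would use a real HTML parser), is to strip scripts, styles, and tags and collapse the remaining whitespace:

Listing 4. Extracting body text from raw HTML

public class HtmlStripper {
    static String extractText(String html) {
        return html
            .replaceAll("(?is)<script.*?</script>", " ") // drop scripts
            .replaceAll("(?is)<style.*?</style>", " ")   // drop stylesheets
            .replaceAll("(?s)<[^>]+>", " ")              // drop remaining tags
            .replaceAll("\\s+", " ")                     // collapse whitespace
            .trim();
    }
}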

With the above analysis in hand, we arrive at the overall architecture of the search engine, shown below:

Figure 2. The overall architecture of a search engine

The crawler fetches a large number of pages from the Internet and stores them locally as the original pages. The page analyzer then extracts the content of each page and hands it to the word breaker for segmentation; from the results, the indexer builds the forward and inverted indexes, which together form the index database. When a user submits a query, the word breaker cuts the input into query terms, the terms are looked up in the index database, and the results are returned to the user.
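Putting the sketches from Listings 2 and 3 together (again, a hypothetical illustration rather than Dyse's real code), this data flow can be exercised end to end on toy data:

Listing 5. A toy end-to-end run of the pipeline

import java.util.*;

public class PipelineDemo {
    public static void main(String[] args) {
        // 1. Pages have been crawled, analyzed, and segmented (forward index).
        Map<Integer, List<String>> forward = new HashMap<>();
        forward.put(1, Arrays.asList("java", "search", "engine"));
        forward.put(2, Arrays.asList("java", "crawler"));

        // 2. Invert the forward index (Listing 3).
        QueryDemo.invertedIndex = Inverter.invert(forward);

        // 3. Answer a query by intersecting posting lists (Listing 2).
        System.out.println(QueryDemo.search(Arrays.asList("java", "crawler"))); // prints [2]
    }
}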

Regardless of a search engine's scale, its main structure consists of these parts with little variation; the quality of a search engine is determined mainly by how each part is implemented internally.

With this understanding of the search engine as a whole, we now turn to the concrete design and implementation of the crawler module in Dyse.
