I have recently been working on a search engine project, so I have been studying how search engines work. Our goal is to build a lightweight search engine, which is relatively simple compared with commercial search engines.
For a project like a search engine, I think the emphasis falls on quality requirements, while the functional requirements are comparatively light. High concurrency, large storage capacity, and fast queries are the lifeblood of a search engine. On the functional side, the main work is implementing a handful of algorithms. Most of our past projects focused only on implementing features and had low performance requirements. This project forces us to pay attention to performance, which also makes it a good learning experience.
Based on the project's needs, the system is divided into four modules: the capture module, the analysis module, the search module, and the user interface module. The requirements are then split among the modules. Given these requirements and the hardware actually available, we drew up an initial architecture for the search engine, described below.
Capture Module and Analysis Module
The capture module and the analysis module come first. They periodically crawl and analyze webpages on the Internet and store the crawled and analyzed data in the database. The database is divided into four main parts: a chain table holding the link structure, a content table holding webpage content, an index table holding the inverted keyword index, and a bidding table used for bid-based ranking.
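To make the four-part layout concrete, here is a minimal sketch of what such a schema might look like, using SQLite from Python's standard library. The table names, column names, and the choice of SQLite are all assumptions made for illustration; the actual project may use a different database and layout.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative.
SCHEMA = """
CREATE TABLE chain (             -- link structure
    url       TEXT PRIMARY KEY,
    backlinks INTEGER NOT NULL DEFAULT 0,   -- number of reverse links
    crawled   INTEGER NOT NULL DEFAULT 0    -- 0 = still waiting in the queue
);
CREATE TABLE content (           -- webpage content
    url   TEXT PRIMARY KEY REFERENCES chain(url),
    title TEXT,
    body  TEXT
);
CREATE TABLE inverted_index (    -- inverted keyword index
    term      TEXT NOT NULL,
    url       TEXT NOT NULL REFERENCES chain(url),
    frequency INTEGER NOT NULL,
    PRIMARY KEY (term, url)
);
CREATE TABLE bidding (           -- bid-based ranking
    term TEXT NOT NULL,
    url  TEXT NOT NULL,
    bid  REAL NOT NULL,
    PRIMARY KEY (term, url)
);
"""


def create_database(path=":memory:"):
    """Create the four tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Keying the index and bidding tables on (term, url) keeps a lookup by keyword cheap, which is the access pattern the search module needs.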
Before crawling starts, a set of seed links must be pre-stored in the chain table. The crawler fetches pages from the links already in the table and stores newly discovered links back into the database, so links are processed in a queue-like manner. Later, the entries in the chain table should be sorted by their number of reverse links (backlinks). Crawled pages are stored in the content table; the analysis module then analyzes each page, builds an inverted index over its terms, and stores it in the index table.
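The crawl-and-index flow described above can be sketched in memory as follows. This is only an illustration under stated assumptions: `fetch` is a hypothetical caller-supplied function that returns a page's text and its outgoing links, plain dictionaries stand in for the chain, content, and index tables, and a simple regular expression stands in for real word segmentation.

```python
import re
from collections import defaultdict, deque


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed links.

    `fetch(url)` is a hypothetical callback returning (text, outgoing_links).
    Returns the crawled pages and the backlink count of every linked URL.
    """
    queue = deque(seed_urls)          # links waiting to be crawled
    seen = set(seed_urls)             # links already queued or crawled
    backlinks = defaultdict(int)      # reverse-link counts
    pages = {}                        # url -> page text
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        pages[url] = text
        # Count each target once per page, keeping the original link order.
        for link in dict.fromkeys(links):
            backlinks[link] += 1
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages, dict(backlinks)


def build_inverted_index(pages):
    """Map each term to {url: number of occurrences on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term][url] = index[term].get(url, 0) + 1
    return dict(index)
```

For example, with a tiny fake web held in a dictionary:

```python
WEB = {
    "a": ("apple banana", ["b", "c"]),
    "b": ("banana", ["c"]),
    "c": ("cherry apple apple", ["a"]),
}
pages, backlinks = crawl(["a"], lambda url: WEB[url])
index = build_inverted_index(pages)
print(backlinks["c"])    # 2
print(index["apple"])    # {'a': 1, 'c': 2}
```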
In terms of quality requirements, the data in the database must be encrypted, and data must be synchronized across multiple servers. We designed the databases on the servers to be fully redundant copies of one another, kept in sync. After a server crawls and analyzes a webpage, it sends the newly added data to the other servers so that all copies stay synchronized.
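The push-style synchronization can be illustrated with a toy in-memory model. The `Replica` class and its method names are hypothetical, and the sketch deliberately leaves out networking, encryption, failure handling, and conflict resolution, all of which a real deployment would need.

```python
class Replica:
    """Toy stand-in for one server's fully redundant database copy."""

    def __init__(self, name):
        self.name = name
        self.pages = {}    # url -> analyzed record
        self.peers = []    # other replicas to keep in sync

    def store_local(self, url, record):
        """Store newly crawled data, then push it to every peer."""
        self.pages[url] = record
        for peer in self.peers:
            peer.apply_remote(url, record)

    def apply_remote(self, url, record):
        """Accept data pushed by a peer without forwarding it again."""
        self.pages[url] = record


# Two servers that replicate to each other.
s1, s2 = Replica("s1"), Replica("s2")
s1.peers.append(s2)
s2.peers.append(s1)
s1.store_local("http://example.com/", {"title": "Example"})
s2.store_local("http://example.org/", {"title": "Other"})
print(s1.pages == s2.pages)   # True
```

Keeping `apply_remote` separate from `store_local` is what prevents two peers from echoing the same update back and forth.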
Search Module
The main function of the search module is to process user search requests and return the results to the user. Because the databases on the servers are fully redundant, a query only needs to be run against a single server.
The search module receives the user's request and applies word segmentation and synonym processing to it. It then obtains a result set by querying the index table and the content table, sorts and weights the result set using the chain table and the bidding table, and returns the sorted results to the user.
In terms of quality requirements, the module must serve a large number of concurrent users, which calls for multithreading and a good bean container.
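The query pipeline might look roughly like the sketch below. Everything specific in it is an assumption: the regex tokenizer is a placeholder for real word segmentation, `synonyms` is a hypothetical word-to-alternatives mapping, and the weighting (term frequency plus backlink count plus the highest matching bid) is just one simple way to combine the chain table and bidding table, not the project's actual formula.

```python
import re


def search(query, index, backlinks, bids, synonyms=None):
    """Return matching URLs, best first.

    index:     term -> {url: frequency}   (index table)
    backlinks: url -> reverse-link count  (chain table)
    bids:      (term, url) -> bid amount  (bidding table)
    """
    synonyms = synonyms or {}
    # Segment the query and expand each word with its synonyms.
    terms = set()
    for word in re.findall(r"\w+", query.lower()):
        terms.add(word)
        terms.update(synonyms.get(word, ()))
    # Collect the result set from the inverted index.
    scores = {}
    for term in terms:
        for url, freq in index.get(term, {}).items():
            scores[url] = scores.get(url, 0.0) + freq

    def weight(url):
        bid = max((bids.get((t, url), 0.0) for t in terms), default=0.0)
        return scores[url] + backlinks.get(url, 0) + bid

    # Highest weight first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-weight(url), url))
```

A small usage example with hand-built tables:

```python
index = {"apple": {"a": 1, "c": 2}, "banana": {"a": 1, "b": 1}}
backlinks = {"a": 1, "b": 1, "c": 2}
bids = {("banana", "b"): 5.0}
print(search("apple", index, backlinks, bids))    # ['c', 'a']
print(search("banana", index, backlinks, bids))   # ['b', 'a']
```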
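As a rough illustration of serving many queries at once, the sketch below runs a search function over a batch of queries on a thread pool. It uses Python's standard `concurrent.futures` purely as a stand-in; the bean container mentioned above suggests the real system is Java-based, and `handle_requests` and `search_fn` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_requests(queries, search_fn, max_workers=8):
    """Run `search_fn` on each query concurrently.

    Results come back in the same order as `queries`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search_fn, queries))
```

A bounded pool caps the number of simultaneous searches, so a burst of requests queues up instead of exhausting the server.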
User Interface Module
The user interface module runs on a separate web server, which displays webpages and accepts user requests. It first checks the user's input for sensitive words, then sends the processed request to the backend server, obtains the search results from it, and displays them to the user.
The quality requirement here is again multi-user concurrency, which mainly means a good web container and multithreading, plus load balancing when requests are sent to the backend servers.
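One simple way to spread requests over the backend servers is round-robin selection, sketched below. The original design does not say which balancing strategy is used, so round-robin is an assumption, and the class name and server names are hypothetical. The lock makes the selection safe when many request threads call it at once.

```python
import itertools
import threading


class RoundRobinBalancer:
    """Hand out backend servers in rotation, safely across threads."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backends)
        self._lock = threading.Lock()

    def next_backend(self):
        """Return the backend that should receive the next request."""
        with self._lock:
            return next(self._cycle)


balancer = RoundRobinBalancer(["backend-1", "backend-2"])
print([balancer.next_backend() for _ in range(3)])
# ['backend-1', 'backend-2', 'backend-1']
```

Because every backend holds a full copy of the data, any of them can answer any query, which is what makes such a simple rotation workable.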