Search engine
A search engine is a system that uses specific computer programs to collect information from the Internet according to a certain strategy, organizes and processes that information, and provides a retrieval service that displays the relevant results to the user. Types of search engine include full-text indexes, directory indexes, metasearch engines, vertical search engines, collection search engines, portal search engines, and free link lists. Baidu and Google are representative examples.
Working principle
Step One: Crawling
A search engine uses software that follows links from one page to the next in a specific pattern, much as a spider crawls across its web, so the software is called a "spider" or a "robot". The spider does not crawl freely: it follows a set of rules and must obey the directives or files that a site provides, such as a robots.txt file.
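The link-following behaviour described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions, not how any particular search engine works: the pages live in a hypothetical in-memory dictionary (PAGES) so the example runs without network access, the robots.txt rules (ROBOTS_TXT) and the agent name "ExampleSpider" are invented, and the rules are checked with Python's standard urllib.robotparser module.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web" so the sketch needs no network access.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/private/x">X</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": "<p>no links here</p>",
    "http://example.test/private/x": '<a href="/secret">S</a>',
}
# Hypothetical robots.txt content for the site above.
ROBOTS_TXT = ["User-agent: *", "Disallow: /private"]


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, pages, robots_lines, agent="ExampleSpider"):
    """Follow links breadth-first from seed, skipping disallowed URLs."""
    robots = RobotFileParser()
    robots.parse(robots_lines)
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue:
        url = queue.popleft()
        # Skip URLs the site forbids and URLs that do not exist.
        if not robots.can_fetch(agent, url) or url not in pages:
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(pages[url])
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling crawl("http://example.test/", PAGES, ROBOTS_TXT) visits the home page, /a and /b in that order. The page under /private is discovered but never fetched, because the robots rules disallow it.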
Step Two: Crawling and storage
The spider follows links to web pages and stores the data it retrieves in a raw page database. The stored page data is exactly the same as the HTML that a user's browser receives. While crawling, the spider also performs some duplicate-content detection: if a site with very low weight contains a large amount of plagiarized, scraped, or duplicated content, the spider is likely to stop crawling it.
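A minimal sketch of such a raw page store with duplicate detection follows. The text does not say how duplicates are detected, so both techniques here are assumptions chosen for illustration: a SHA-256 hash of the HTML catches exact copies, and Jaccard similarity over word shingles gives a rough measure of near-duplicates. The names RawPageStore, shingles, and jaccard are hypothetical.

```python
import hashlib


class RawPageStore:
    """Stores raw HTML unchanged and flags exact duplicate content."""

    def __init__(self):
        self.pages = {}         # url -> raw HTML exactly as fetched
        self.fingerprints = {}  # content hash -> first URL with that content

    def add(self, url, html):
        """Store a page; return the URL it duplicates, or None if it is new."""
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        original = self.fingerprints.get(digest)
        if original is not None and original != url:
            return original  # same bytes already stored under another URL
        self.fingerprints[digest] = url
        self.pages[url] = html
        return None


def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(text_a, text_b, k=3):
    """Jaccard similarity of two texts' shingle sets, between 0.0 and 1.0."""
    set_a, set_b = shingles(text_a, k), shingles(text_b, k)
    return len(set_a & set_b) / len(set_a | set_b)
```

A crawler built on this sketch might skip a page when add() reports a duplicate, or when jaccard() against an already stored page exceeds some threshold. Any such threshold, and any notion of site weight, would have to be chosen separately.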
Step Three: Preprocessing
The search engine takes the pages that the spider has retrieved and runs them through several preprocessing steps.
1. Extract text
2. Chinese word segmentation
3. Stop-word removal
4. Noise elimination (the search engine must identify and remove noise such as copyright notices, navigation bars, and advertisements)
5. Forward index
6. Inverted index
7. Link relationship calculation
8. Special document processing
In addition to HTML files, search engines can usually crawl and index several text-based file types, such as PDF, Word, WPS, XLS, PPT, and TXT files, and these file types often appear in search results. However, search engines cannot process non-text content such as images, video, and Flash, nor can they execute scripts and programs.
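Four of the preprocessing steps listed above (text extraction, stop-word removal, the forward index, and the inverted index) can be sketched together as follows. This is a simplified illustration under several assumptions: the stop-word list is a tiny invented sample, tokenization uses a plain regular expression suited to English (Chinese text would need a real word segmenter in its place), and the function names are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Extract text: strip the markup and keep only the visible text."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


def tokenize(text):
    """Split text into lowercase terms and drop the stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def build_indexes(docs):
    """Return (forward, inverted) indexes for a dict of doc_id -> HTML."""
    # Forward index: document -> the terms it contains, in order.
    forward = {doc_id: tokenize(extract_text(html)) for doc_id, html in docs.items()}
    # Inverted index: term -> the documents that contain it.
    inverted = defaultdict(set)
    for doc_id, terms in forward.items():
        for term in terms:
            inverted[term].add(doc_id)
    return forward, dict(inverted)
```

The forward index answers "which terms does this document contain?", while the inverted index answers "which documents contain this term?". The second question is the one a query has to answer quickly, which is why the inverted index is built in advance.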
Step Four: Ranking
After the user enters a keyword in the search box, the ranking program queries the index database, computes the ranking, and displays the results to the user. The ranking process interacts directly with the user. However, because a search engine holds such a large volume of data, its rankings do not change all at once: small updates happen every day, but in general the ranking rules are updated in daily, weekly, and monthly cycles of differing magnitude.
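The query-time step can be illustrated with a small scoring function. The text does not name a ranking algorithm, and real search engines combine many signals, so the TF-IDF-style score below is only an assumed stand-in: a document scores higher when it uses a query term often and when that term is rare across the collection. The function name rank and the smoothed IDF formula are choices made for this sketch.

```python
import math
from collections import Counter


def rank(query, docs):
    """Rank documents for a query with a simple TF-IDF-style score.

    docs maps doc_id -> list of terms (a forward index).
    Returns the matching doc ids, best match first.
    """
    total = len(docs)
    counts = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    scores = Counter()
    for term in query.lower().split():
        matching = [doc_id for doc_id, count in counts.items() if term in count]
        if not matching:
            continue
        # Rare terms weigh more: fewer matching documents gives a larger idf.
        idf = math.log(1 + total / len(matching))
        for doc_id in matching:
            tf = counts[doc_id][term] / len(docs[doc_id])
            scores[doc_id] += tf * idf
    # Sort by descending score, breaking ties by doc id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Given a collection in which document "b" uses the word "search" twice as often as document "a" and both have the same length, rank("search", docs) places "b" first. Documents that contain none of the query terms are left out of the result.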