Search engines are used more and more widely and have become an essential tool for Internet users.
In China, the widely used search engines include Baidu, Google, Peking University's Tianwang (Skynet) search, and Sogou, along with a number of specialized search engines, such as the music search at http://www.1234567.com and http://www.pagou.com, built by the founder of Xici Hutong ("West Shrine Alley"). These are all pretty good. This shows that the search engine market is still very large. Baidu's successful stock listing, in particular, has been a great encouragement to the industry.
The major search engines today all follow the same model. The user enters some keywords or a sentence. Either way, the search engine first segments the input into words, which improves the accuracy of the results. This is what sets it apart from an ordinary database search, which simply matches with LIKE '%keyword%'. The search engine then looks in a massive index library for information relevant to the user's input, and the results it returns include a summary of each page.
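To make the contrast with a LIKE query concrete, here is a minimal sketch of an index lookup. It assumes the "index library" is an inverted index (word to set of pages), which is the usual design but is my assumption here. The names build_index, search and simple_tokenize are made up for illustration, and the whitespace tokenizer only stands in for a real Chinese word segmenter.

```python
from collections import defaultdict

def simple_tokenize(text):
    # Assumption: a stand-in for a real word segmenter.
    # It only lowercases the text and splits on whitespace.
    return text.lower().split()

def build_index(docs, tokenize):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query, tokenize):
    """Return the ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "Chinese search engine basics",
    2: "Writing a web spider in PHP",
    3: "Search engine index library",
}
index = build_index(docs, simple_tokenize)
```

With this index, search(index, "engine search", simple_tokenize) still finds documents 1 and 3, whereas a substring match on "engine search" would find nothing, because the words appear in a different order in the text.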
The technologies behind a Chinese search engine include: web spiders, Chinese word segmentation, the index library, page summary extraction, page similarity, and information classification.
Web spiders are programs that crawl information from the vast network. They are usually multithreaded and crawl day and night, but they must also avoid crawling any one site too fast, which would overload the information provider's server.
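One way to keep a multithreaded spider from hammering a single site is to enforce a minimum delay per host. The sketch below is only an illustration under that assumption. The class name HostThrottle and its parameters are made up, and a real spider would usually also honor robots.txt.

```python
import threading
import time
from urllib.parse import urlparse

class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock          # injectable so the class can be tested
        self._sleep = sleep
        self._next_allowed = {}      # host -> earliest time of the next request
        self._lock = threading.Lock()

    def wait(self, url):
        """Block until the host of `url` may be requested again.

        Returns the number of seconds waited.
        """
        host = urlparse(url).netloc
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed.get(host, now))
            # Reserve this slot so other threads queue up behind it.
            self._next_allowed[host] = start + self.min_delay
        delay = start - now
        if delay > 0:
            self._sleep(delay)       # sleep outside the lock
        return delay
```

Each worker thread would call throttle.wait(url) just before fetching. Requests to different hosts do not delay each other, so the threads stay busy while each individual site sees a steady, slow rate.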
The basic principle of a web spider: start crawling from a start page (I suggest the Yahoo Chinese directory or the DMOZ Chinese directory), fetch that page's content and summary, then extract all the links on the page. The spider then crawls those links in turn, and keeps going like this. These are only the basic principles, and real-world use is much more complicated. You can try writing a spider yourself. I once wrote one in PHP (PHP cannot do multithreading, which is a drawback).
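The loop described above can be sketched as a breadth-first crawl. This is a simplified, single-threaded illustration, not the PHP spider mentioned in the text. The names crawl and LinkExtractor are made up, and the fetch function is passed in by the caller (an assumption that keeps the example independent of any particular HTTP library).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at `seed`.

    `fetch(url)` is supplied by the caller and returns the page HTML,
    or None on failure. Returns a dict of {url: html}.
    """
    queue = deque([seed])
    seen = {seed}                    # never queue the same URL twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _ = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

In practice fetch would wrap an HTTP request with a timeout and error handling. The seen set is what stops the spider from going around in circles when pages link back to each other.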
Chinese word segmentation has always been the crux of a Chinese search engine. Chinese differs from English: in English every word is separated by a space, while a Chinese sentence usually contains several words with no delimiter between them. People can easily read the meaning of such a sentence, but it is hard for a computer to understand.
Almost all the Chinese word segmentation methods I know of have their own Chinese dictionary (it is said that there are also methods that need no dictionary). They segment text by matching it against the dictionary, so the quality of the segmentation depends heavily on the dictionary. You can see my previous article, which covers a Chinese word segmentation method written in PHP.
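As an illustration of dictionary matching, here is a minimal sketch of forward maximum matching, one common dictionary-based technique. The text does not say which matching strategy is used, so the choice of algorithm is my assumption, and the function name segment and the tiny dictionary are made up for the example.

```python
def segment(text, dictionary, max_len=None):
    """Forward maximum matching.

    At each position, take the longest dictionary word that starts there.
    A character that starts no dictionary word becomes a one-character token.
    """
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    max_len = max(1, max_len)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until one matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

# Hypothetical toy dictionary; a real one holds many thousands of words.
dictionary = {"中文", "搜索", "引擎", "搜索引擎", "技术"}
print(segment("中文搜索引擎技术", dictionary))
# ['中文', '搜索引擎', '技术']
```

Note that "搜索引擎" wins over "搜索" followed by "引擎" only because the longer word is in the dictionary, which shows how directly the dictionary determines the result.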