Research on Chinese Search Engines

Source: Internet
Author: User


Search engines are used more and more widely and have become a necessary tool for Internet users.





The widely used search engines in China are Baidu, Google, Peking University's Skynet search, and Sogou, along with some specialized search sites such as the music search at http://www.1234567.com and http://www.pagou.com, which are all pretty good. This shows that the search engine market is still very large. Baidu's successful listing in particular has greatly encouraged the industry.





Today's major search engines all follow the same model: the user enters some keywords or a sentence. Either way, the search engine first segments the input into words to improve the accuracy of the results. This is different from an ordinary database search, which simply matches with LIKE '%keyword%'. The engine then looks through its huge index library for information related to the user's input and displays the results, each with a relevant summary of the page.








The technologies behind a Chinese search engine include web spiders, Chinese word segmentation, the index library, web page summary extraction, web page similarity, and automatic classification of information.





1. Web Spider


A web spider is a program that crawls information from across the network. Spiders are usually multithreaded and crawl day and night, but they must also avoid crawling any one site too fast, which would overload the information provider's server.





The basic principle of a web spider: start crawling from a start page (I suggest the Yahoo Chinese directory or the dmoz Chinese directory), fetch that page's content and summary, then extract all the links on the page. The spider then crawls those links in turn, and so on in a continuous stream. These are only the basic principles; real applications are much more complex. You can try writing a spider yourself. I once wrote one in PHP (PHP cannot run multiple threads, which is a drawback).
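The crawl loop described above can be sketched in a few lines of Python. This is only a minimal, single-threaded illustration, not the author's PHP spider: the `crawl` function, the `LinkExtractor` class, and the caller-supplied `fetch(url)` function are all names assumed for this example, and a real spider would add multithreading, per-site rate limits, and error handling.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100, delay=0.0):
    """Breadth-first crawl starting from start_url.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    that returns the page's HTML as a string, or None on failure.
    delay is a pause in seconds before every request, a crude stand-in
    for the per-site politeness a real spider needs.
    Returns a dict mapping each crawled URL to its HTML.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if delay > 0:
            time.sleep(delay)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in from outside keeps the loop easy to try out: a dictionary's `.get` method can stand in for the network while experimenting, and a function built on `urllib.request.urlopen` can replace it later.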





2. Chinese Word Segmentation


Chinese word segmentation has always been the key problem for Chinese search engines. Chinese is different from English: in English every word is separated by a space, while a Chinese sentence is a run of several words with no separator between them. People can easily read the meaning of such a sentence, but it is hard for a computer to understand.





As far as I know, almost all current Chinese word segmentation methods (there is said to be one written by a foreigner) have their own Chinese dictionary and segment text by matching it against that dictionary. The quality of the segmentation depends heavily on the dictionary. You can see my previous article, which describes a Chinese word segmentation method written in PHP.
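One common form of dictionary matching is forward maximum matching: at each position, take the longest dictionary word that starts there. The sketch below illustrates that idea only; it is not the author's PHP method, and the `segment` function, the `max_word_len` parameter, and the toy dictionary are assumptions made for this example.

```python
def segment(text, dictionary, max_word_len=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real one holds far more entries.
toy_dictionary = {"中文", "搜索", "引擎", "搜索引擎", "研究"}
print(segment("中文搜索引擎研究", toy_dictionary))
# prints ['中文', '搜索引擎', '研究']
```

Because the longest match wins, "搜索引擎" is kept as one word rather than split into "搜索" and "引擎", which shows how much the result depends on what the dictionary contains.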





Many master's theses in linguistics are currently being written on this topic.




Baidu uses a word segmentation method it developed itself; Google uses a third-party one.





There is a very good Chinese word segmentation product, but it is commercial.





The Rabbit-Hunting Chinese word segmentation method is also good, but ... so it cannot be studied.





3. Index Library


Search engines do not use off-the-shelf database systems; they develop their own storage with database-like functions.


A search engine needs to store a large amount of web page information, snapshots, and keyword indexes (I suggest also saving a screenshot of each page, which I am currently studying), so the volume of data is especially large.
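The article does not say how the keyword index is laid out. A common structure for this job is the inverted index, which maps each word to the pages that contain it. The following is a minimal in-memory sketch under that assumption; `build_index` and `search` are names invented for this example, and a real engine would keep the index on disk in a custom format.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of document ids that contain it.

    documents: dict of doc_id -> list of terms (already segmented).
    """
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search(index, query_terms):
    """Return ids of documents containing every query term (an AND query)."""
    if not query_terms:
        return set()
    postings = [index.get(term, set()) for term in query_terms]
    return set.intersection(*postings)


# Hypothetical, already-segmented pages.
docs = {
    1: ["中文", "搜索引擎"],
    2: ["搜索引擎", "研究"],
    3: ["音乐", "搜索"],
}
index = build_index(docs)
print(search(index, ["搜索引擎", "研究"]))
# prints {2}
```

Looking a word up in the index is a single dictionary access, which is why this scales better than scanning every page with LIKE '%keyword%'.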





4. Web Page Summary Extraction




A web page summary is a digest of the information on a page. (In junior middle school Chinese class, the teacher often asked us to summarize the main idea of an article; this is the same thing. I used to dread being asked. Summarizing is hard enough for people, and now we want a computer to do it. Good grief.) In search results, a short description usually appears under each page title, making it easy for searchers to judge whether the page has the information they want.
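A true summary of the main idea is hard, as noted above. A much simpler approach that still yields a usable description under the title is to cut a window of text around the first occurrence of the search keyword. The sketch below shows only that simplification; the `make_snippet` function and its `width` parameter are assumptions for this example, not something the article specifies.

```python
def make_snippet(text, keyword, width=20):
    """Return a window of text around the first occurrence of keyword."""
    pos = text.find(keyword)
    if pos == -1:
        # Keyword absent: fall back to the start of the page text.
        snippet = text[:2 * width]
        return snippet + ("..." if len(text) > 2 * width else "")
    start = max(0, pos - width)
    end = min(len(text), pos + len(keyword) + width)
    # Mark truncation on either side with an ellipsis.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

For example, `make_snippet(page_text, "搜索引擎")` would return up to 20 characters on each side of the first "搜索引擎" in a hypothetical `page_text` string.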





5. Web page Similarity





The same content often appears on many websites. For example, the major portal sites all publish the same news story with identical text. There are also personal sites that copy other people's sites wholesale and look exactly the same (I have done this myself, as an aside). Such sites are meaningless; a search engine automatically detects them and lowers their weight (Baidu is the harshest: it bans the site outright, as I found out when I tried).





So far I have studied the following methods of computing web page similarity:


1) By page summary: compute the MD5 value of each page's summary. If several pages have the same MD5 value, those pages are very similar.


2) By keywords: sort the keywords that appear on the page by word frequency, take the n most frequent, and compute their MD5 value. If the values are the same, those pages are very similar.
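Both methods can be sketched with Python's standard `hashlib` module. The function names, the default `n=5`, and the choice to join keywords with a space before hashing are assumptions made for this example; the article does not specify those details.

```python
import hashlib
from collections import Counter


def summary_fingerprint(summary):
    """Method 1: MD5 of the page summary text."""
    return hashlib.md5(summary.encode("utf-8")).hexdigest()


def keyword_fingerprint(words, n=5):
    """Method 2: MD5 of the n most frequent words, ordered by frequency."""
    counts = Counter(words)
    # Sort by descending frequency, then alphabetically so ties are stable.
    top = sorted(counts, key=lambda w: (-counts[w], w))[:n]
    return hashlib.md5(" ".join(top).encode("utf-8")).hexdigest()


# Hypothetical segmented pages: page_b is page_a with the words reordered.
page_a = ["news", "news", "news", "china", "china", "search"]
page_b = ["search", "china", "news", "news", "china", "news"]
page_c = ["music", "music", "search"]

print(keyword_fingerprint(page_a, 3) == keyword_fingerprint(page_b, 3))  # prints True
print(keyword_fingerprint(page_a, 3) == keyword_fingerprint(page_c, 3))  # prints False
```

Note that an MD5 comparison is all or nothing: two pages match only when their summaries or top keywords are identical, so this catches copies rather than pages that are merely alike.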





The Google and Baidu news services are applications of this technology.





Many graduate theses in data mining at colleges and universities are currently being written on this topic.





6. Automatic classification of information




The amount of information on the network is enormous, and how to classify it is a hard problem that search engines face. To get a computer to classify data automatically, you have to train the program, and I am currently working on this for Pagou.
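The article does not say which training method is used. One well-known option for text is a Naive Bayes classifier, which learns word frequencies per category from labeled examples. The sketch below shows that technique purely as an illustration; the class name, its methods, and the sample categories are all assumptions for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one smoothing over segmented words."""

    def __init__(self):
        self.class_doc_counts = Counter()        # label -> number of training docs
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocabulary = set()

    def train(self, words, label):
        """Record one labeled training document."""
        self.class_doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def classify(self, words):
        """Return the most probable label, or None if nothing was trained."""
        total_docs = sum(self.class_doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(doc_count / total_docs)
            for word in words:
                if word not in self.vocabulary:
                    continue  # words never seen in training carry no evidence
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, already segmented into words.
classifier = NaiveBayesClassifier()
classifier.train(["football", "match", "goal"], "sports")
classifier.train(["team", "goal", "win"], "sports")
classifier.train(["stock", "market", "bank"], "finance")
classifier.train(["bank", "loan", "market"], "finance")
print(classifier.classify(["goal", "team"]))
# prints sports
```

The quality of the result depends on the training examples in the same way that segmentation depends on the dictionary: the classifier can only recognize categories and words it has been shown.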





The above is my personal view of search engines after studying them, and it is original work by Liu Zhijiang. My understanding is inevitably incomplete or wrong in places, so I ask you to correct me (just don't hit me)!





Next I will show you how to build a simple search engine. Stay tuned.

