Several tools developed by Wendao Software Studio overlap heavily with search engine technology. For example, the upcoming projspider.com is essentially a simple vertical search engine, and the web crawler module used across several of our projects is itself an important component of search engine technology.
Although none of our engineers have worked on a large-scale search engine, the topic interests us. Drawing on experience from similar projects and on publicly available information, this article offers a shallow introduction to the technologies behind search engines.
1. Crawler (Spider) -- Data Source
As the source of a search engine's massive data, the crawler is a key component of search engine technology. Wendao Software Studio has developed its own crawlers, so we are quite familiar with this area.
The English term is "spider," and the metaphor is apt: the countless links between websites form an enormous web, and the search engine's content-acquisition program is like an industrious spider crawling across it, recording each interesting node it encounters and leaving it for other programs to process.
Implementing a crawler is not difficult: the author's prototype in C++ is only about 500 lines of code, and a Python version takes fewer than 100.
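A crawler of the kind described above really does fit in well under 100 lines of Python. The sketch below is a minimal illustration using only the standard library; the regex-based link extraction is a deliberate simplification (a real crawler would use an HTML parser and respect robots.txt):

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def extract_links(base_url, html):
    """Pull absolute http(s) links out of raw HTML (naive regex version)."""
    links = []
    for href in re.findall(r'href="([^"#]+)"', html):
        link = urljoin(base_url, href)
        if link.startswith("http"):
            links.append(link)
    return links

def crawl(seed, max_pages=10):
    """Breadth-first crawl: fetch a page, record it, enqueue unseen links."""
    visited, queue = set(), deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # skip unreachable or non-text pages
        visited.add(url)
        for link in extract_links(url, html):
            if link not in visited:
                queue.append(link)
    return visited
```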
However, any program that processes massive data becomes far harder to develop, and its development cycle far longer. A simple example: deciding whether a link has already been crawled, a check the crawler must make for every link it extracts. If you hold only thousands or tens of thousands of links in memory, even a linear scan basically meets the requirement. But what about hundreds of thousands, millions, tens of millions, or hundreds of millions? Structures such as red-black trees can barely cope. Billions, tens of billions, trillions? Only an index will do.
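One widely used structure for the "have I seen this link?" problem at that scale is the Bloom filter (a substitution for illustration; the author says only "index"): it answers membership queries in constant time and constant memory per element, at the cost of a small, tunable false-positive rate. A minimal sketch:

```python
import hashlib

class BloomFilter:
    """Space-efficient approximate set membership for e.g. visited URLs."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False negatives are impossible; false positives are rare.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A crawler checks `url in bloom` before fetching; an occasional false positive merely skips one page, which is an acceptable trade for fitting billions of URLs in memory.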
Chen Shangyi, chairman of Baidu's technical committee, has revealed that "Baidu processes nearly 100 PB of data every day; 1 PB equals roughly one million GB, so this is equivalent to the information held by 5,000 National Libraries."
Data on that scale speaks for itself about the strength of Baidu's engineering.
Search engines are far from the only application of crawler technology: emerging systems for public-opinion analysis, data mining, and the like all rely on it.
As more and more enterprises recognize the importance of data, the crawler, as a key data source, will surely be applied in still more areas.
2. Chinese Word Segmentation -- Data Preprocessing
Chinese word segmentation is another important search engine technology: segmentation accuracy directly determines whether query results match the searcher's intent.
Segmenting Chinese is much harder than segmenting English, because English has natural separators: spaces already divide the text into words, each carrying its own meaning. "Wendao Software Studio," for instance, splits trivially into "Wendao," "Software," and "Studio." The corresponding Chinese, 闻道软件工作室, could be cut as 闻/道/软/件/工/作/室, as 闻道/软/件/工作/室, as 闻道/软件/工作室, or in many other ways.
Chinese word segmentation is a field that rewards very deep study; fortunately, some fairly good Chinese segmentation libraries exist and greatly simplify the developer's work.
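As an illustration of one classic dictionary-based technique, here is forward maximum matching: scan left to right, greedily taking the longest dictionary word at each position. The tiny dictionary below is made up for the example; real segmenters combine large lexicons with statistical models.

```python
def fmm_segment(text, dictionary, max_word_len=4):
    """Forward maximum matching: greedily take the longest dictionary word,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words
```

On the example from the previous paragraph, a dictionary containing 闻道, 软件, and 工作室 yields the intended segmentation 闻道/软件/工作室, because the greedy longest match rules out the character-by-character splits.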
3. Full-Text Indexing -- Data Preprocessing
Indexing is indispensable for querying large amounts of data: once data is indexed, matching items can be found in a massive collection in a very short time.
An intuitive analogy is a book's table of contents: it lets us jump straight to the part we care about in moments, without leafing through every page.
Full-text indexing takes place after Chinese word segmentation: an article is split into keywords, and an index is built over those keywords, so that a search can reach into the content of the article itself.
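The structure this describes is an inverted index: a map from each term to the documents containing it. A toy version (document IDs and whitespace tokenization are simplifying assumptions; a real pipeline would run the segmenter first):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, term):
    """Return the sorted IDs of documents containing the term."""
    return sorted(index.get(term.lower(), set()))
```

Building the index is done once, offline, during preprocessing; at query time a term lookup is a single dictionary access, independent of the corpus size.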
4. Ranking -- Data Preprocessing
Ranking is a critical part of a search engine. Unreasonable ordering badly hurts the user experience, and because many webmasters resort to all kinds of cheating to raise their rankings, ranking algorithms are all the harder to develop.
A search engine can extract many parameters from a page, and however the ranking algorithm changes, it is ultimately adjusting the weights of those parameters. Two important ones are listed below.
a) Content
Search engines now place great weight on user experience, so content quality is the most important of all the parameters affecting rankings.
How is a site's content quality judged? Originality is one important standard. A common originality check is based on the cosine similarity of vector-space representations of documents, built from keyword frequencies and weights; webmasters who churn out "fake original" content would do well to study it.
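The cosine measure mentioned above treats each document as a term-frequency vector; identical or near-duplicate pages score close to 1.0, unrelated ones near 0. A minimal sketch (whitespace tokenization and raw counts are simplifications; production systems typically weight terms, e.g. with TF-IDF):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

A "fake original" page that merely reshuffles sentences keeps nearly the same term frequencies, so its cosine score against the source stays high, which is exactly why this check catches it.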
b) Backlinks
Backlinks remain an important criterion by which search engines evaluate a site's quality; we won't repeat the details here.
5. Query -- Data Presentation
Many people assume that for Baidu, Google, and the like to find results in so much data in such a short time, the query algorithm must be very complex. In fact the opposite is true: this is the simplest part of search engine technology. They are fast because the preceding steps have already prepared the data for your query.
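One concrete reason multi-term queries stay fast: each term's posting list in the inverted index is kept sorted by document ID, so an AND query is a linear-time merge of short lists rather than a scan of the corpus. A sketch of that merge:

```python
def intersect_postings(list_a, list_b):
    """Merge two sorted posting lists, keeping doc IDs present in both."""
    result, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])  # document contains both terms
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result
```

The work is proportional to the lengths of the two posting lists, not to the number of documents in the index, which is why even web-scale engines can answer conjunctive queries in milliseconds.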
Original: http://www.wendaoruanjian.com/?p=38