Analysis of Baidu's search engine-related technologies

Source: Internet
Author: User

1. Crawler - Data Source
 
Crawlers supply the massive data that search engines work with, so they are an important part of search engine technology. The Wentao software studio has built its own crawler and is very familiar with this technology.
 
A crawler is also called a spider, a name that makes the idea easier to picture. The links between countless websites form a huge web, and a search engine's content collection program is like a hardworking spider crawling across it. Whenever it reaches a node of interest, it records that node for other programs to process.
 
Implementing a crawler is actually not difficult. I wrote one in C++ in only about 500 lines of code, and a Python version takes fewer than 100 lines.
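To give a sense of how small a basic crawler can be, here is a minimal sketch in Python using only the standard library. It is not the author's crawler: the names `LinkParser`, `fetch_html`, and `crawl`, and the `fetch` and `max_pages` parameters, are assumptions made for illustration. It ignores robots.txt, politeness delays, and encoding detection, all of which a real crawler would need.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download a page over HTTP and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=100):
    """Breadth-first crawl; returns {url: html} for every page fetched."""
    seen = {start_url}            # links already queued or visited
    queue = deque([start_url])    # links waiting to be fetched
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue              # skip pages that cannot be fetched
        pages[url] = html         # "record the node" for later processing
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The `fetch` parameter lets the crawl logic be exercised against an in-memory dictionary of pages instead of the network, which is convenient for experimenting.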
 
However, once a program has to handle massive data, both the difficulty and the length of development grow enormously. Take a simple example: deciding whether a link has already been crawled, which a crawler must do for every link it extracts. If only a few thousand or tens of thousands of links are held in memory, even comparing against each entry one by one is basically good enough. What about 100,000 links, or millions? A red-black tree can barely cope. What about billions, hundreds of billions, or trillions of links? At that scale the only option is to build an index.
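The article does not say what kind of index is used for the "already crawled" check. One technique commonly used for this kind of membership test at large scale is a Bloom filter, sketched below as an assumption rather than as Baidu's or the author's method. The class name, the default sizes, and the use of SHA-256 as the hash are all illustrative choices. A Bloom filter never reports a seen link as unseen, but it can occasionally report an unseen link as seen, so a small fraction of pages may be skipped.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array that answers 'possibly seen' or 'definitely not seen'."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("http://example.test/page1")
print("http://example.test/page1" in seen)  # a link that was added is always found
```

Memory use depends only on `num_bits`, not on the length of the URLs, which is why this kind of structure stays practical when the number of links becomes very large.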
 
Chen Shangyi, chairman of the Baidu technical committee, has said: "Baidu processes nearly 100 PB of data every day. 1 PB equals 1 million GB, so this is equivalent to the total information held in 5,000 national libraries."
 
The sheer volume of data speaks to Baidu's technical strength.
 
Crawlers are used in many scenarios besides search engines, such as the emerging public opinion analysis systems and data mining systems.
 
More and more enterprises are aware of the importance of data. As an important data source, crawlers will certainly be applied in more fields in the future.
 
2. Chinese Word Segmentation - Data Preprocessing
 
Chinese word segmentation is another important technology in search engines. Segmentation accuracy directly determines whether the query results match what the searcher intended.
 
Chinese word segmentation is much harder than English word segmentation, because English has a natural separator: words are delimited by spaces, and each word carries a meaning. For example, "Wendao Software Studio" can be split on spaces into three words: "Wendao", "Software", and "Studio". The corresponding Chinese name has no such separators, so it can be segmented as "Wen/Dao/soft/piece/work/room", "Wen Dao/soft/piece/work/room", "Wen Dao/software/Studio", and in many other ways.
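The article does not name a segmentation algorithm. One of the simplest dictionary-based approaches is forward maximum matching, sketched below purely as an illustration: at each position it takes the longest dictionary word that matches and falls back to a single character when nothing matches. The function name, the tiny dictionary, and the sample sentence are made up for this example and do not come from the article.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Segment text greedily, preferring the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical dictionary; a real one would hold hundreds of thousands of words.
dictionary = {"搜索", "引擎", "搜索引擎", "软件", "工作", "工作室"}
print("/".join(forward_max_match("搜索引擎软件工作室", dictionary)))
# prints: 搜索引擎/软件/工作室
```

Greedy matching like this picks one segmentation out of the many possible ones and can choose wrongly when a sentence is ambiguous, which is part of why segmentation remains a research topic.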
 
Chinese word segmentation is a field that calls for in-depth research. Fortunately, some good Chinese word segmentation components already exist, which greatly simplifies developers' work.
 
3. Full-Text Retrieval - Data Preprocessing
 
Building an index is an essential technique for querying large amounts of data. With an index in place, we can find matching records in massive data within a short time.
 
For ease of understanding, think of an index as the table of contents of a book. With a table of contents, we can quickly find the content we are interested in instead of turning the pages one by one.
 
Full-text retrieval can only be done after Chinese word segmentation. An article has to be split into keywords, and an index created for each keyword, before the content of the article can be searched.
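A keyword-to-article mapping of this kind is usually called an inverted index. The sketch below assumes the articles have already been segmented into keywords; the function names and the sample documents are hypothetical and only meant to show the idea.

```python
from collections import defaultdict


def build_index(documents):
    """Map each keyword to the set of ids of the documents that contain it.

    documents: {doc_id: [keyword, ...]}, i.e. already-segmented articles.
    """
    index = defaultdict(set)
    for doc_id, keywords in documents.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    return index


def search(index, query_keywords):
    """Return the ids of documents that contain every query keyword."""
    postings = [index.get(keyword, set()) for keyword in query_keywords]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical, already-segmented documents.
documents = {
    1: ["search", "engine", "crawler"],
    2: ["crawler", "python"],
    3: ["search", "index"],
}
index = build_index(documents)
print(sorted(search(index, ["crawler"])))           # prints: [1, 2]
print(sorted(search(index, ["search", "engine"])))  # prints: [1]
```

The expensive work happens once, in `build_index`; each query is then a handful of dictionary lookups and a set intersection.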
 
4. Sorting - Data Preprocessing
 
Sorting is a very important part of a search engine. Unreasonable sorting can badly damage the user experience. Many webmasters use cheating methods to improve their rankings, which makes sorting algorithms harder to develop.
 
A search engine has a limited set of parameters available to it. However the sorting algorithm changes, it is only adjusting the weights of these parameters. Two important parameters are listed below.
 
A) Content
 
Search engines today attach great importance to user experience, so content is the most important parameter affecting rankings.
 
How can the content quality of a website be judged? Originality is an important criterion. A common algorithm for judging the degree of originality is cosine similarity in the vector space model, which is computed from the frequency and weight of keywords. For the many webmasters who produce pseudo-original content, this is worth studying.
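As a rough illustration of the cosine approach, the sketch below turns two already-segmented texts into keyword frequency vectors and computes the cosine of the angle between them. It uses raw keyword counts with no extra weighting, which is a simplifying assumption; the function name and sample word lists are hypothetical. A value near 1 means the two texts use almost the same keywords in the same proportions, which suggests one may be a copy of the other.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two keyword frequency vectors (0.0 to 1.0)."""
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    # Only keywords present in both texts contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


original = ["search", "engine", "crawler", "index"]
reworded = ["search", "engine", "spider", "index"]
print(cosine_similarity(original, reworded))
```

Because the measure depends only on keyword frequencies, reordering sentences or paragraphs does not change the score.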
 
B) External links
 
External links are still an important criterion for the search engine to evaluate the quality of a website. I will not go into details here.
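The idea that a sorting algorithm only adjusts the weights of a fixed set of parameters can be sketched as a weighted sum. Everything in the example below is hypothetical: the signal names `content` and `links`, the scores, and the weights are invented for illustration and do not reflect any real search engine's formula.

```python
def rank(pages, weights):
    """Sort pages by a weighted sum of their signals, highest score first."""

    def score(page):
        return sum(weights[name] * page["signals"][name] for name in weights)

    return sorted(pages, key=score, reverse=True)


# Hypothetical pages with made-up signal values between 0 and 1.
pages = [
    {"url": "http://example.test/original", "signals": {"content": 0.9, "links": 0.2}},
    {"url": "http://example.test/popular", "signals": {"content": 0.3, "links": 0.9}},
]

# Favouring content puts the original page first ...
print([p["url"] for p in rank(pages, {"content": 0.8, "links": 0.2})])
# ... while favouring external links puts the popular page first.
print([p["url"] for p in rank(pages, {"content": 0.2, "links": 0.8})])
```

The same pages and the same signals produce different rankings once the weights change, which is the point the section makes about algorithm updates.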
 
5. Query - Data Display
 
Many people assume that because Baidu, Google, and other search engines can find results in massive data so quickly, the query algorithm must be difficult. On the contrary, this is the simplest part of search engine technology. Queries are fast because the earlier steps have already prepared the data for them.

This article is an original work by the webmaster of Zhangpu Forum (http://www.dazhangpu.com). Please respect the author's intellectual property rights and keep this notice when reproducing it. Thank you!
