Baidu Word Segmentation technology for websites

Source: Internet
Author: User

Baidu bills itself as the world's largest Chinese search engine and claims to understand the search habits of Chinese internet users. Because it is the leading Chinese search engine, many grassroots webmasters study its search technology and ranking algorithm, and many of them depend on its traffic for their living. That is why grassroots webmasters both love and hate it: Baidu can bring a site a great deal of traffic, but it can also mercilessly de-index a site (webmasters call this being "K'd") or demote it.

Now let's talk about Baidu's word segmentation technology. I may have some of this wrong; criticism is welcome, and I hope we can all improve together. Before studying Baidu's word segmentation, I did not understand why Baidu is said to be stronger than Google. Word segmentation means just what the name says: cutting a string into words. Does Baidu simply take any Chinese string and chop it up? Of course not. So what conditions trigger a cut?

A simple experiment gives some clues. If the query contains three or fewer Chinese characters, it is left unchanged. For example, "tractor" is a complete word, so Baidu does not split it. When the query is longer than four Chinese characters, Baidu's word segmentation program starts working. Take "tractor price": the text marked in red in the returned results shows that the query is cut into two words, "tractor" and "price".

Try another query, "Hengshui tractor". In the results Baidu returns, the keywords marked in red are "tractor" and "Hengshui", so we can see that the query has been divided into two words. Searching for "Hengshui Yikang Medical Devices" returns no page whose title matches, although the phrase "Hengshui Yikang medical devices" does appear fairly completely in the body of a page; perhaps the website in question has not been indexed yet. This long-tail phrase is cut into "Hengshui" / "Kang Yikang medical devices".
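The experiment above depends on reading the red-highlighted text in a results page. The following is a minimal sketch of how that reading could be automated on a saved copy of the page. It assumes that highlighted terms are wrapped in `<em>` tags, which is an assumption about the page markup and should be checked against the actual HTML; the class `HighlightCollector`, the function `highlighted_terms`, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class HighlightCollector(HTMLParser):
    """Collects the text found inside a given highlight tag."""

    def __init__(self, tag="em"):
        super().__init__()
        self.tag = tag      # ASSUMPTION: the tag used to mark matched terms
        self.depth = 0      # > 0 while we are inside a highlight tag
        self.buffer = []    # text pieces of the current highlighted term
        self.terms = []     # every highlighted term, in page order

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                term = "".join(self.buffer).strip()
                if term:
                    self.terms.append(term)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def highlighted_terms(html, tag="em"):
    """Return the distinct highlighted terms in first-seen order."""
    parser = HighlightCollector(tag)
    parser.feed(html)
    parser.close()
    return list(dict.fromkeys(parser.terms))


# Hand-written sample, not real search output.
sample = (
    "<h3><em>Hengshui</em> agricultural <em>tractor</em> dealers</h3>"
    "<p>Buy a <em>tractor</em> in <em>Hengshui</em> today</p>"
)
print(highlighted_terms(sample))
```

If the query was cut into several words, each word shows up as its own entry in the returned list; a query that was kept whole shows up as a single entry.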

Now let's look at Baidu's word segmentation algorithm. After several years of development it has become quite mature, and SEO practitioners deserve some of the credit for that. One veteran SEO put it this way: "Baidu's algorithms are both simple and complex. They include forward maximum matching, reverse maximum matching, bidirectional maximum matching, the language model method, and the shortest path algorithm. If you are interested, you can search Google to learn more."

To sum up, with some personal guesswork mixed in: Baidu has a powerful dictionary that contains the words we use, including names of people, place names, and product names. For example, "Wang yuanzhe", "Hengshui", and "tractor" are all in the dictionary. When a query is segmented, proper names are extracted first; what is left over, such as "Kang", is not in the dictionary. The remaining text is then segmented with bidirectional maximum matching. If the forward and reverse matches give the same segmentation, that result is output directly; if they differ, a forward match is performed, followed by a reverse match.
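To make the matching step concrete, here is a minimal sketch of dictionary-based forward, reverse, and bidirectional maximum matching. It is a textbook illustration, not Baidu's actual implementation. The dictionaries are tiny and made up, the Chinese strings are my own back-translations of the article's English examples, and the tie-break used when the two directions disagree (fewer words, then fewer single-character words, preferring the reverse result on a tie) is a common heuristic that I am assuming; the article does not say how the disagreement is resolved.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word each time."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan advances.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word each time."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.insert(0, piece)
                j -= size
                break
    return words


def bidirectional_max_match(text, dictionary, max_len=4):
    """Run both scans; output directly if they agree, else apply a tie-break."""
    fwd = forward_max_match(text, dictionary, max_len)
    rev = reverse_max_match(text, dictionary, max_len)
    if fwd == rev:
        return fwd

    # ASSUMED tie-break (not from the article): fewer words wins, then fewer
    # single-character words; the reverse result wins if still tied.
    def cost(words):
        return (len(words), sum(1 for w in words if len(w) == 1))

    return rev if cost(rev) <= cost(fwd) else fwd


# "Hengshui tractor": both directions agree.
places_and_products = {"衡水", "拖拉机", "价格"}
print(bidirectional_max_match("衡水拖拉机", places_and_products))

# A classic ambiguous string where the two directions disagree.
ambiguous = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", ambiguous))
print(reverse_max_match("研究生命起源", ambiguous))
print(bidirectional_max_match("研究生命起源", ambiguous))
```

The `max_len` parameter caps the longest word tried at each position; a real dictionary would set it to the length of its longest entry.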

Baidu has always claimed to be the world's largest Chinese search engine, yet its word segmentation technology itself is nothing special. Its advantage probably comes from a very large specialized dictionary that records product names (such as "filter press"), manufacturer names (such as "filter press manufacturer"), some place names (such as "Fucheng"), and websites (such as http://www.xiaoshuo8.net/). This dictionary keeps growing as new words are added and indexed, and that is where Baidu is stronger than Google. In word segmentation Google is clearly behind Baidu, and this is exactly the area Google needs to strengthen.
