Lan Shang: A method of weight design used in search engine development

Source: Internet
Author: User
Keywords: search engines, words


Last time we discussed the vector space model of information retrieval. Today we look at how term weights are designed, focusing on the traditional methods of weight calculation.

When designing term weights, the first thing that usually comes to mind is term frequency. Within a web document, the more often a word appears, the more important it tends to be to that document. We denote this frequency TF (term frequency), and the weight is proportional to it. Next we look at the whole document collection and count how many documents contain the word, which is the document frequency. The more documents contain a word, the less distinctive it is; a word that appears in only a few documents separates those documents clearly from the rest, so it is usually more important. Document frequency and term weight are therefore inversely related, and the usual algorithms compute the weight using the reciprocal of the document frequency, known as the inverse document frequency (IDF).
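As a rough illustration of the two factors described above, here is a minimal Python sketch. The function name tf_idf and the sample documents are made up for this example. The article only says that the weight uses the reciprocal of the document frequency; taking the logarithm of N/df is a common variant and is an assumption here, not necessarily the exact formula the author had in mind.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (illustrative only)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency inside this document
        # Weight rises with TF and falls as more documents contain the term.
        # log(N / df) is an assumed, commonly used form of inverse document frequency.
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical sample collection of three tokenized documents.
docs = [
    ["search", "engine", "weight", "weight"],
    ["search", "engine", "index"],
    ["search", "ranking"],
]
w = tf_idf(docs)
print(w[0])
```

In this sample, "search" appears in every document, so its weight is zero, while "weight" appears twice in a single document and receives the highest weight in the first document.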

Term frequency is also affected by the number of words in the document itself, so the values are usually normalized before the weights are calculated. There are many ways to do this. The one I remember seeing described in detail is called cosine normalization, and it is widely used. It divides the weight of each term in the vector by the Euclidean length of the document vector. To be clear, the term weight mentioned here is the weight assigned to each word when the document content is analyzed in the vector space model, and the result of the calculation is the final weight of the word.

The Euclidean length, also called the Euclidean norm, of a vector is the square root of the sum of the squares of the absolute values of its elements. If x is the n-dimensional vector (x1, x2, ..., xn), its Euclidean norm is ||x|| = sqrt(x1^2 + x2^2 + ... + xn^2). If this is hard to follow, you can simply note the formula; more on the underlying mathematics is easy to find by searching. Other normalization methods include maximum-frequency normalization and logarithmic-frequency normalization. If you have studied other normalization methods, I hope you will share some material so we can all learn from it.
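The normalization described above can be sketched in a few lines of Python. The function names euclidean_norm and cosine_normalize and the sample vector are made up for this example; it only shows the arithmetic of dividing each weight by the Euclidean length of the vector.

```python
import math

def euclidean_norm(vec):
    """Square root of the sum of the squared elements."""
    return math.sqrt(sum(x * x for x in vec))

def cosine_normalize(vec):
    """Divide every weight by the Euclidean length of the vector."""
    norm = euclidean_norm(vec)
    if norm == 0:
        # An all-zero vector has no direction; return it unchanged.
        return list(vec)
    return [x / norm for x in vec]

# Hypothetical raw term weights for a document with two terms.
raw = [3.0, 4.0]
print(euclidean_norm(raw))    # 5.0
print(cosine_normalize(raw))  # [0.6, 0.8]
```

After normalization every document vector has length 1, so a long document no longer gets larger weights simply because it contains more words.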

Beyond the methods above, current search engines use many other improved ways of calculating term weights, such as term frequency collection, length term collection and information entropy. Interested readers can search for the relevant material.

Term weight design involves many other factors, and actual search engine development has to take still more factors and algorithms into account. This article presents only a simple design method, to give some insight into how search engines handle words. I hope it helps you understand search engines more deeply.

Author: Lan Shang SEO Team - Tony

Original source: http://www.lention.com.cn/blog/

Copyright notice: This is an original work. Reprinting is allowed, but the reprint must include a hyperlink to the original source, the author information and this notice. Otherwise, legal action will be pursued.
