[Natural Language Processing] Text Vectorization Techniques

Source: Internet
Author: User
Tags: idf

Preparation

Text vectorization presupposes that the article has already been segmented into words; see the previous article for how to do that. The words are then converted to numbers so that the computer can process the text. Common text vectorization techniques include word frequency statistics and TF-IDF.

Word Frequency Statistics

Word frequency statistics is very intuitive. After the text has been segmented, each word serves as one dimension (key) of the vector: a position is 1 if the corresponding word is present and 0 otherwise, so the vector length equals the dictionary size. Each dimension is then weighted by the word's frequency. By default, the technique assumes that the more often a word occurs, the greater its weight.

To illustrate:

Original:

Sentence A: I like watching TV and don't like watching movies.

Sentence B: I don't like watching TV, and I don't like watching movies.

Segmentation result:

Sentence A: I / like / watch / TV, not / like / watch / movie.

Sentence B: I / not / like / watch / TV, also / not / like / watch / movie.

List dimensions: I, like, watch, TV, movie, not, also.

Word counts:

Sentence A: I 1, like 2, watch 2, TV 1, movie 1, not 1, also 0.

Sentence B: I 1, like 2, watch 2, TV 1, movie 1, not 2, also 1.

Convert to vectors:

Sentence A: [1, 2, 2, 1, 1, 1, 0]

Sentence B: [1, 2, 2, 1, 1, 2, 1]
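
The counting step above can be sketched in a few lines of Python. This is only a minimal illustration, not the article's own implementation: the token spellings (for example "not" and "watch") are normalized English glosses of the segmented words, and the function name count_vector is made up for this example.

```python
from collections import Counter

# Segmented sentences from the example (tokens are normalized English glosses).
sentence_a = ["I", "like", "watch", "TV", "not", "like", "watch", "movie"]
sentence_b = ["I", "not", "like", "watch", "TV",
              "also", "not", "like", "watch", "movie"]

# Dimension order as listed in the example.
vocabulary = ["I", "like", "watch", "TV", "movie", "not", "also"]


def count_vector(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)  # words missing from tokens count as 0
    return [counts[word] for word in vocabulary]


vector_a = count_vector(sentence_a, vocabulary)
vector_b = count_vector(sentence_b, vocabulary)
print(vector_a)  # [1, 2, 2, 1, 1, 1, 0]
print(vector_b)  # [1, 2, 2, 1, 1, 2, 1]
```

The two vectors differ only in the "not" and "also" dimensions, which is exactly where the two sentences differ in meaning.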

As can be seen, word frequency statistics is intuitive and simple. But it has an obvious flaw: some words in Chinese, such as "I" and "the", occur very frequently and therefore receive high weights, even though they carry little meaning. To use word frequency statistics, it is therefore necessary to introduce stop words to filter out these meaningless words.
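
Stop-word filtering can be sketched as a simple set lookup applied before counting. The stop-word list below is a hypothetical three-word set chosen for illustration; a real list would be much longer and language-specific.

```python
# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"I", "the", "also"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


tokens = ["I", "like", "watch", "TV", "not", "like", "watch", "movie"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['like', 'watch', 'TV', 'not', 'like', 'watch', 'movie']
```

Note that "not" is deliberately left out of the stop-word set here, since removing it would erase the difference in meaning between the two example sentences.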

TF-IDF Technology

TF-IDF is designed to overcome the shortcomings of word frequency statistics. It introduces the concept of "inverse document frequency", which measures how common a term is. The TF-IDF assumption is: if a word or phrase appears frequently in one article but rarely in other articles, it is likely to reflect the nature of that article, so its weight is increased.
TF-IDF needs to maintain a corpus, or set of documents, which is used to calculate how often each word occurs; the more frequently a word occurs in the corpus, the smaller its inverse document frequency. A corpus can be the collection of all regulations for the entire railway, or it can be the full text of a single regulation. In practice, TF-IDF also gives better results when obvious stop words are removed during word segmentation.

For example, in the case of railway regulations, the word "train" is bound to occur very frequently in a given text, but its frequency in the corpus is also very high, so its weight decreases.
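
The "train" effect can be reproduced with a small sketch. The article does not give a formula, so this assumes one common variant: tf is the word's count divided by the document length, and idf is the natural log of (number of documents / number of documents containing the word). The three-document corpus is made up to stand in for tokenized regulations, and the function name tf_idf is likewise an illustrative choice.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    Assumed formulas (not given in the article):
      tf(word, doc) = count of word in doc / number of tokens in doc
      idf(word)     = log(number of docs / number of docs containing word)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Made-up miniature corpus standing in for three tokenized regulations.
corpus = [
    ["train", "signal", "train", "brake"],
    ["train", "ticket", "refund"],
    ["train", "track", "inspection", "train"],
]

weights = tf_idf(corpus)
print(weights[0]["train"])   # 0.0, because "train" occurs in every document
print(weights[0]["signal"])  # positive, because only one document contains "signal"
```

Under this unsmoothed variant, a word that appears in every document gets a weight of exactly zero. Library implementations often add smoothing terms to the idf, so their numbers will differ from this sketch.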
