Preparation
Text vectorization presupposes that the article has already been segmented into words; see the previous article for how to do this. The words are then converted into numbers so that a computer can process the text. Common text vectorization techniques include word frequency statistics and TF-IDF.
Word Frequency Statistics
Word frequency statistics is very intuitive. After the text has been segmented, each word becomes one dimension (key) of the vector: a position is 1 if the corresponding word is present and 0 otherwise, so the vector length equals the dictionary size. The frequency of the word is then used as the weight of its dimension. By default, the technique assumes that the more frequent a word is, the greater its weight.
To illustrate:
Original:
Sentence A: I like watching TV and don't like watching movies.
Sentence B: I don't like watching TV, and I don't like watching movies.
Segmentation result:
Sentence A: I / like / watch / TV, not / like / watch / movie.
Sentence B: I / not / like / watch / TV, also / not / like / watch / movie.
List of dimensions: I, like, watch, TV, movie, not, also.
Word frequencies:
Sentence A: I 1, like 2, watch 2, TV 1, movie 1, not 1, also 0.
Sentence B: I 1, like 2, watch 2, TV 1, movie 1, not 2, also 1.
Convert to Vector:
Sentence A:[1, 2, 2, 1, 1, 1, 0]
Sentence B:[1, 2, 2, 1, 1, 2, 1]
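The conversion above can be sketched in a few lines of Python. This is only a minimal illustration: the token lists are English glosses of the segmented sentences, and the names count_vector, sentence_a, sentence_b and vocabulary are chosen here for the example and do not come from the original article.

```python
from collections import Counter

# English glosses of the segmented sentences (assumed normalization of the example).
sentence_a = ["I", "like", "watch", "TV", "not", "like", "watch", "movie"]
sentence_b = ["I", "not", "like", "watch", "TV", "also", "not", "like", "watch", "movie"]

# Fixed dimension order, matching the list of dimensions in the example.
vocabulary = ["I", "like", "watch", "TV", "movie", "not", "also"]


def count_vector(tokens, vocabulary):
    """Return the word-frequency vector of `tokens` over `vocabulary`."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur, so absent words map to 0.
    return [counts[word] for word in vocabulary]


vector_a = count_vector(sentence_a, vocabulary)
vector_b = count_vector(sentence_b, vocabulary)
print(vector_a)  # [1, 2, 2, 1, 1, 1, 0]
print(vector_b)  # [1, 2, 2, 1, 1, 2, 1]
```

Keeping the vocabulary as an ordered list, rather than a set, is what guarantees that the same position means the same word in every vector.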
As the example shows, word frequency statistics is intuitive and simple, but it has an obvious flaw: some words in Chinese, such as "I" and "the", occur very frequently and therefore receive high weights, even though they carry little meaning. To use word frequency statistics, it is therefore necessary to introduce stop words to filter out these meaningless words.
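A rough sketch of stop-word filtering, applied before counting, might look like the following. The stop-word set here is a tiny hypothetical one invented for illustration; real stop-word lists are far longer and depend on the language and the domain.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"I", "the", "also"}


def count_vector_without_stop_words(tokens, vocabulary, stop_words=STOP_WORDS):
    """Drop stop words from both the vocabulary and the tokens, then count."""
    kept_vocabulary = [word for word in vocabulary if word not in stop_words]
    counts = Counter(token for token in tokens if token not in stop_words)
    return kept_vocabulary, [counts[word] for word in kept_vocabulary]


tokens = ["I", "not", "like", "watch", "TV", "also", "not", "like", "watch", "movie"]
vocabulary = ["I", "like", "watch", "TV", "movie", "not", "also"]

kept, vector = count_vector_without_stop_words(tokens, vocabulary)
print(kept)    # ['like', 'watch', 'TV', 'movie', 'not']
print(vector)  # [2, 2, 1, 1, 2]
```

Whether a word such as "not" belongs in the stop-word list is a judgment call: removing it here would make the two example sentences indistinguishable.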
TF-IDF
TF-IDF is designed to overcome the shortcomings of word frequency statistics. It introduces the concept of "inverse document frequency" (IDF), which measures how common a term is across documents. The TF-IDF hypothesis is that if a word or phrase appears frequently in one article but rarely in other articles, it is likely to reflect the nature of that article, so its weight is increased.
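The article does not commit to a specific formula, and several variants exist. One common formulation, given here as an assumption, uses $n_{t,d}$ for the number of times term $t$ occurs in document $d$ and $N$ for the number of documents in the corpus:

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\lvert \{ d : t \in d \} \rvert}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

A term that occurs in every document has $\mathrm{idf}(t) = \log 1 = 0$, so its weight vanishes no matter how often it appears in a single document.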
TF-IDF needs a corpus, or set of documents, from which to calculate how widely each word occurs: the more documents a word appears in, the smaller its inverse document frequency. A corpus can be the collection of all railway regulations, or it can be the full text of a single regulation. Practice has shown that TF-IDF also works better when the obvious stop words are removed after segmentation.
For example, in railway regulations the word "train" is bound to occur very frequently in any given text, but it also occurs very frequently across the corpus, so its weight is reduced.
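The "train" example can be reproduced with a small sketch. It assumes one common variant of the weighting (term frequency as count divided by document length, inverse document frequency as log(N / df)); the function tf_idf and the toy corpus are hypothetical and stand in for real, segmented railway regulations.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of pre-segmented documents.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)

    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical toy corpus; each inner list is one segmented document.
corpus = [
    ["train", "brake", "train", "inspection"],
    ["train", "signal", "train", "dispatch"],
    ["train", "track", "maintenance", "train"],
]

weights = tf_idf(corpus)
print(weights[0]["train"])  # 0.0, because "train" appears in every document
print(weights[0]["brake"])  # positive, because "brake" appears in only one document
```

In this variant a word found in every document gets a weight of exactly zero; library implementations often smooth the IDF term so that such words keep a small non-zero weight.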
[Natural Language Processing] text vectorization technology