I will not explain what HAL is here; for details, see: http://www.zhan5zhan.com/post/6.html
1. What is short text
Forum posts, blog posts, microblogs, chat records, and Q&A entries can all be considered short text. Blogs and forums do contain long texts, but these are a minority.
2. Difficulties in short text
1) Non-standard, colloquial language, for example abbreviations and typos.
2) Lack of context. In specialized forums it is hard to understand all the domain terms. For example, "Monk" in Diablo 3 refers to a character class, and "Mami Love" is a medicine for infants.
3. Solution: Supplement context and background knowledge
Abbreviations, typos, special characters, and isolated words can only be understood when placed in a complete context. The key problem is how to construct a bag of words that supplements the short text.
4. HAL method
HAL builds a word co-occurrence matrix and supplements each word with the words that co-occur with it most often. The link above gives an example.
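The counting step can be sketched roughly as follows. This is a simplified illustration, not the method from the linked post: the function name `cooccurrence_counts`, the `window` parameter, and the toy documents are all assumptions, and the sketch counts symmetrically and ignores any distance weighting.

```python
from collections import defaultdict


def cooccurrence_counts(docs, window=5):
    """Count how often two words appear within `window` tokens of each other.

    `docs` is a list of already tokenized texts (lists of strings).
    The counts are symmetric: counts[a][b] == counts[b][a].
    """
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in docs:
        for i, word in enumerate(tokens):
            # Look only forward and update both directions,
            # so each pair of positions is counted once.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                other = tokens[j]
                counts[word][other] += 1
                counts[other][word] += 1
    return counts


# Toy corpus (hypothetical), assumed to be tokenized beforehand.
docs = [
    ["Nameen", "home", "textile", "sale"],
    ["Nameen", "home", "textile", "store"],
]
counts = cooccurrence_counts(docs, window=2)

# Words that co-occur most often with "Nameen" come first.
top = sorted(counts["Nameen"].items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

The most frequent neighbors of a word are then the candidates for supplementing a short text that contains it.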
5. PHAL method
PHAL supplements HAL by judging whether a co-occurring word can serve as an "explanation" of the original word. Two factors matter: the closer a word is to the original word, the closer the relationship; and the more often the two appear together, the stronger the relationship.
Therefore, compared with HAL, PHAL adds the co-occurrence probability and the co-occurrence distance.
S(w_i | w) = P(w_i | w) / L(w_i | w)
This is the co-occurrence formula, where P is the co-occurrence probability and L is the co-occurrence distance. The larger the probability and the shorter the distance, the closer the relationship between the two words.
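A minimal sketch of this score is given below. The text does not say exactly how P and L are estimated, so the definitions here are assumptions: P is taken as the fraction of occurrences of the original word that have the other word inside a window, and L as the average token distance to the nearest such occurrence (adjacent words have distance 1). The function name `phal_scores` and the toy corpus are hypothetical.

```python
from collections import defaultdict


def phal_scores(docs, target, window=5):
    """Return {word: (S, P, L)} for words that co-occur with `target`.

    Assumed definitions (not confirmed by the source):
      P = share of `target` occurrences with `word` inside the window
      L = average distance from `target` to the nearest `word`
      S = P / L
    """
    target_count = 0
    hits = defaultdict(int)        # target occurrences that have `word` nearby
    dist_sum = defaultdict(float)  # sum of nearest distances for `word`

    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            target_count += 1
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            nearest = {}
            for j in range(lo, hi):
                other = tokens[j]
                if j == i or other == target:
                    continue
                d = abs(j - i)
                if other not in nearest or d < nearest[other]:
                    nearest[other] = d
            for other, d in nearest.items():
                hits[other] += 1
                dist_sum[other] += d

    scores = {}
    for other, n in hits.items():
        p = n / target_count
        avg_dist = dist_sum[other] / n
        scores[other] = (p / avg_dist, p, avg_dist)
    return scores


# Toy corpus (hypothetical), already tokenized.
docs = [
    ["Nameen", "home", "textile"],
    ["Nameen", "sale", "home"],
]
scores = phal_scores(docs, "Nameen", window=3)
for word, (s, p, l) in sorted(scores.items(), key=lambda kv: -kv[1][0]):
    print(f"{word} | {s:.6f} | {p:.2f} | {l:.1f}")
```

Dividing by the distance means a word that always appears right next to the original word outranks one that appears equally often but farther away.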
6. Some interesting examples
Each entry gives the original word and its number of occurrences, followed by one co-occurring word per line in the form: word | S | P | L.

- Xiangyue 12
hotel | 0.149390 | 0.30 | 2.0
Beijing | 0.041757 | 0.13 | 3.1
Beijing | 0.027999 | 0.09 | 3.3
 | 0.014967 | 0.05 | 3.7
 | 0.014967 | 0.05 | 3.7
floor | 0.010671 | 0.02 | 2.0
reservation | 0.010540 | 0.03 | 3.2
Price | 0.006499 | 0.03 | 4.2
accommodation | 0.006499 | 0.03 | 4.2
price | 0.006499 | 0.03 | 4.2
Hotel | 0.003430 | 0.02 | 5.3
 | 0.002217 | 0.01 | 5.5

- Blood ridge sniper 7
plot | 0.071429 | 0.14 | 2.0
 | 0.047619 | 0.14 | 3.0
 | 0.047619 | 0.14 | 3.0
 | 0.047619 | 0.14 | 3.0
 | 0.035714 | 0.14 | 4.0
starring | 0.028571 | 0.14 | 5.0
Shi tailong | 0.023810 | 0.14 | 6.0

- Da Lian 2
Doctor | 0.250000 | 0.50 | 2.0
Suzhou | 0.250000 | 0.50 | 2.0

- Journal of forest diseases and insects of China | 0.250000 | 0.50 | 2.0
 | 0.166667 | 0.50 | 3.0

- Nameen 1
home textile | 0.500000 | 1.00 | 2.0

- Haidian road 3
Chen Shufen | 0.111111 | 0.33 | 3.0
Chinese Medicine Clinic | 0.083333 | 0.33 | 4.0
migrated | 0.066667 | 0.33 | 5.0
For example, if you are not familiar with Nameen, you would not know what it is, but the co-occurring "home textile" shows that it is a home textile brand. Likewise, if you do not know Blood ridge sniper, you might think it is a game, a movie, a TV series, or a novel. In fact, it is a movie starring Shi tailong that is often downloaded with Thunder.
7. What is it for? It is very useful. It can be used to expand short texts, which can then be used for classification, clustering, recommendation systems, similarity calculation, semantic understanding, sentiment analysis, public opinion monitoring, anti-spam, and so on.
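As a rough illustration of the expansion step, the sketch below appends the highest-scoring related words to a tokenized short text before it is passed to a downstream task. The function `expand_text` and the `top_k` and `min_score` parameters are hypothetical choices, not part of the method described above; the scores in `related` are taken from the examples in section 6.

```python
def expand_text(tokens, related, top_k=3, min_score=0.05):
    """Append up to `top_k` related words per token, best score first.

    `related` maps a word to {co-occurring word: score S}.
    Words scoring below `min_score` (an assumed cutoff) are skipped.
    """
    expanded = list(tokens)
    seen = set(tokens)
    for tok in tokens:
        ranked = sorted(related.get(tok, {}).items(),
                        key=lambda kv: kv[1], reverse=True)
        added = 0
        for word, score in ranked:
            if added >= top_k or score < min_score:
                break
            if word not in seen:
                expanded.append(word)
                seen.add(word)
                added += 1
    return expanded


# Scores copied from the examples in section 6.
related = {
    "Nameen": {"home textile": 0.500000},
    "Xiangyue": {"hotel": 0.149390, "Beijing": 0.041757, "reservation": 0.010540},
}

print(expand_text(["Xiangyue", "price"], related))
print(expand_text(["Nameen"], related))
```

The expanded token list can then be fed to a classifier or similarity measure in place of the original short text. The cutoff keeps weakly related words from diluting the text.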