I used NLTK to analyze my diary and got the following results (excerpt):
' \xb8\xb0 ', ' \xe5\xbc\xba\xe8\xba ', ' \xe5\xbd\xbc\xe5 ', ' \xb8\xb4 ', ' \xb8\x8a ', ' \xb8\x8b ', ' \xb8\x88 ', ' \xb8\x89 ', ' \xb8\x8e ', ' \xb8\x8f ', ' \xb8\x8d ', ' \xb8\x82 ', ' \xb8\x83 ', ' \xb8\x80 ', ' \xb8\x81 ', ' \xb8\x87 ', ' tend ', ' \xb8\x9a ',
What methods and tools can you recommend for natural language analysis of Chinese?
Reply content:
Recently we have been using NLTK to classify Chinese online product reviews, computing information entropy, pointwise mutual information (PMI), perplexity, and so on (only NLTK provided suitable methods for these).
In my experience, NLTK is perfectly usable for Chinese. The key points are Chinese word segmentation and how the text is represented.
The main difference from English is that Chinese requires word segmentation. Because NLTK generally works at word granularity, you must segment the text first and then hand it to NLTK (there is no need to use NLTK for the segmentation itself; a dedicated segmenter will do, and jieba works very well).
After segmentation, the text becomes a long list of words: [word1, word2, word3, ..., wordn]. You can then apply NLTK's various methods to it, for example FreqDist to count word frequencies, or bigrams to turn the text into bigram form: [(word1, word2), (word2, word3), (word3, word4), ..., (wordn-1, wordn)].
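A minimal sketch of that flow, assuming jieba is installed; the sample sentence and variable names are purely illustrative:

```python
# -*- coding: utf-8 -*-
import jieba
import nltk

text = u"这个手机的屏幕很好，电池也很耐用"   # illustrative review sentence
words = list(jieba.cut(text))                 # segment into [word1, word2, ...]

fdist = nltk.FreqDist(words)                  # word-frequency statistics
print(fdist.most_common(10))

bigram_list = list(nltk.bigrams(words))       # [(w1, w2), (w2, w3), ...]
print(bigram_list[:5])
```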
From there you can compute the information entropy and mutual information of the words in the text.
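As a hedged sketch of how those two quantities can be computed with NLTK (not necessarily the author's exact code): the entropy below is plain Shannon entropy over the unigram distribution, and the word list is a made-up stand-in for real segmented text.

```python
import math
from nltk import FreqDist
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Stand-in for a segmented review; in practice this comes from jieba.cut()
words = [u"屏幕", u"很", u"好", u"电池", u"也", u"很", u"耐用"]

# Shannon entropy (in bits) of the word distribution
fdist = FreqDist(words)
total = float(fdist.N())
entropy = -sum((c / total) * math.log(c / total, 2) for c in fdist.values())

# Pointwise mutual information of adjacent word pairs
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
top_pairs = finder.nbest(bigram_measures.pmi, 5)   # 5 highest-PMI bigrams

print(entropy, top_pairs)
```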
You can then use these statistics for feature selection in machine learning and build a classifier to categorize the texts (the product reviews are just an array of independent comments; there are many sentiment-classification examples online based on NLTK's product-review corpus, though in English, but the overall approach carries over).
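As a rough sketch of this idea (not the author's actual pipeline): a tiny bag-of-words Naive Bayes classifier in NLTK over made-up labeled reviews, with jieba providing the segmentation.

```python
# -*- coding: utf-8 -*-
import jieba
import nltk

# Tiny made-up corpus of (review text, sentiment label) pairs
reviews = [
    (u"屏幕很好，电池也很耐用", "pos"),
    (u"质量太差，用了几天就坏了", "neg"),
]

def word_features(text):
    """Bag-of-words features over the jieba segmentation of `text`."""
    return {word: True for word in jieba.cut(text)}

featuresets = [(word_features(text), label) for text, label in reviews]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

print(classifier.classify(word_features(u"电池非常耐用")))
```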
There is also Python's Chinese encoding problem, which trips up a lot of people. After many failures I have summed up some experience.
Python handles Chinese encoding basically with the following logic:
UTF-8 (input) -> Unicode (processing) -> UTF-8 (output)
Strings processed inside Python should all be Unicode, so the way to solve encoding problems is to decode the input text (whatever its encoding) into Unicode, and then encode it again when it is output.
Since we are usually processing .txt documents, the simplest approach is to save the .txt file as UTF-8, decode it to Unicode in Python (some_text.decode('utf8')), and when writing the result back to a .txt file encode it into UTF-8 again (some_text.encode('utf8')).
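A minimal Python 2-style sketch of that decode/process/encode flow (the file names are placeholders; in Python 3 you would simply pass encoding='utf8' to open() and skip the manual decoding):

```python
# -*- coding: utf-8 -*-
# Python 2 sketch: UTF-8 file -> Unicode in memory -> UTF-8 file
with open('diary.txt') as f:          # file saved as UTF-8
    raw = f.read()                    # byte string
text = raw.decode('utf8')             # decode to Unicode for processing

# ... segment and analyze `text` with jieba / NLTK here ...

with open('result.txt', 'w') as f:
    f.write(text.encode('utf8'))      # encode back to UTF-8 on output
```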
What the OP ran into is really just an encoding problem...
There are a lot of useful Chinese processing packages:
jieba: word segmentation, POS tagging, and TextRank keyword extraction
HanLP: word segmentation, named entity recognition, dependency parsing. There are also FudanNLP and NLPIR.
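For jieba in particular, a quick sketch of the three uses listed above (the sample sentence is just illustrative):

```python
# -*- coding: utf-8 -*-
import jieba
import jieba.posseg as pseg
import jieba.analyse

text = u"小明硕士毕业于中国科学院计算所，后在日本京都大学深造"

print(list(jieba.cut(text)))                         # word segmentation
print([(p.word, p.flag) for p in pseg.cut(text)])    # POS tagging
print(jieba.analyse.textrank(text, topK=5))          # TextRank keywords
```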
Personally I find this easier to use than NLTK. For Chinese segmentation use jieba; I made a small example: nltk - comparing Chinese document similarity.

This has nothing to do with NLTK, though. Switch to Python 3 and these garbled characters disappear! Chinese still has to be UTF-8.

Big love for NLTK! The other packages are mostly tied to fixed tasks, and some are even in Java.

Use: text.decode('GBK')

For segmentation: find a suitable Chinese word segmentation package, e.g. https://Github.com/fxsjy/jieba, because NLTK cannot do Chinese word segmentation itself. I have recently been learning this area as well, and can recommend a Chinese tool that is worth looking into.

I ran into the same problem. Following the book "Natural Language Processing with Python", I loaded my own documents successfully, but the Chinese came out looking just like yours. It should be an encoding setup problem, but I don't know where to set it. There is too little information on this.