How to Use NLTK in Python to analyze and process Chinese characters?

Source: Internet
Author: User
Tags: nltk
Question: I used nltk to analyze my own diary and got results like the following excerpt, full of raw byte escapes instead of Chinese characters:

'\xb8\xb0', '\xe5\xbc\xba\xe8\xba', '\xe5\xbd\xbc\xe5', '\xb8\xb4', '\xb8\x8a', '\xb8\x8b', '\xb8\x88', '\xb8\x89', '\xb8\x8e', '\xb8\x8f', '\xb8\x8d', '\xb8\x82', '\xb8\x83', '\xb8\x80', '\xb8\x81', '\xb8\x87', 'tend', '\xb8\x9a', ...

Which methods and tools can be recommended for analyzing natural language in Chinese?

Reply: I have recently been using nltk to classify Chinese online product reviews and to compute the information entropy, pointwise mutual information, and perplexity of the comments (I still don't understand these concepts very well; I only know that nltk provides the corresponding methods).

I feel that nltk is perfectly usable for Chinese processing; the key points are Chinese word segmentation and how the text is represented.
The main difference between Chinese and English is that Chinese must be segmented first. Because nltk generally processes text at word granularity, you must segment the text before handing it to nltk. (The segmentation itself does not need nltk; any word segmentation package will do, and jieba is highly recommended.)
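A minimal sketch of the segmentation step with jieba (the sample sentence and variable names are made up for illustration):

    # segment a Chinese sentence with jieba before handing it to nltk
    import jieba

    text = u"我今天用nltk分析了自己的日记"
    words = list(jieba.cut(text))   # jieba.cut returns a generator of unicode tokens
    print(words)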
After segmentation, the text becomes a long list of words: [word1, word2, word3, ..., wordn]. You can then process it with nltk's various methods, for example using FreqDist to compute word frequencies, or bigrams to turn the text into adjacent word pairs: [(word1, word2), (word2, word3), (word3, word4), ..., (wordn-1, wordn)].
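Once the text is a segmented word list, FreqDist and bigrams apply to it directly; a small sketch, where the word list is a placeholder standing in for real segmented text:

    import nltk

    words = [u'我', u'今天', u'用', u'nltk', u'分析', u'了', u'自己', u'的', u'日记']

    fdist = nltk.FreqDist(words)        # word frequency counts
    print(fdist.most_common(5))

    pairs = list(nltk.bigrams(words))   # [(word1, word2), (word2, word3), ...]
    print(pairs[:3])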
From there you can compute the information entropy and mutual information of the words in the text.
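A rough sketch of both: the entropy here is computed by hand from a FreqDist, and the pointwise mutual information scores come from nltk's collocation finder (the word list is again a placeholder):

    import math
    import nltk
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    # placeholder segmented word list
    words = [u'我', u'今天', u'用', u'nltk', u'分析', u'了', u'自己', u'的', u'日记', u'我', u'今天']

    # Shannon entropy (in bits) of the word distribution
    fdist = nltk.FreqDist(words)
    total = float(fdist.N())
    entropy = -sum((c / total) * math.log(c / total, 2) for c in fdist.values())
    print(entropy)

    # pointwise mutual information of adjacent word pairs
    finder = BigramCollocationFinder.from_words(words)
    print(finder.score_ngrams(BigramAssocMeasures().pmi)[:5])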
Then you can use these statistics as features for machine learning, build a classifier, and classify the texts. (A product review corpus is just an array of independent comments; there are many sentiment classification examples on the Internet that use the product review corpus in nltk, but only in English. The overall approach is the same for Chinese.)
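A minimal sketch of that classification step, assuming bag-of-words features built from jieba-segmented reviews and nltk's built-in Naive Bayes classifier (the two training reviews and labels are made-up placeholders, not real corpus data):

    import jieba
    import nltk

    def features(review_text):
        # bag-of-words features from a jieba-segmented review
        return {word: True for word in jieba.cut(review_text)}

    train_data = [
        (u"这个手机很好用", "pos"),
        (u"质量太差了，很失望", "neg"),
    ]
    train_set = [(features(text), label) for text, label in train_data]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify(features(u"非常好用")))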

There is also the Python Chinese encoding problem, which troubles many people. After multiple failures I have summarized some experience.
Python can use the following logic to solve Chinese Encoding Problems:
utf8 (input) --> unicode (processing) --> utf8 (output)
Python handles text internally as unicode, so the way to solve encoding problems is to decode the input text (whatever its encoding) into unicode, process it, and then encode the output into whatever encoding is required.
Since the input is usually a txt file, the simplest approach is to save the file as UTF-8, decode it to unicode (sometexts.decode('utf8')), process it, and encode the result back to utf8 (result.encode('utf8')) when writing it out to a txt file.
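A small sketch of that utf8 -> unicode -> utf8 flow, written in the Python 2 style of the decode()/encode() calls above (in Python 3 you would simply open the file with encoding='utf-8' and work with str directly; the file names are placeholders):

    # read raw bytes and decode to unicode for processing
    with open('diary.txt', 'rb') as f:
        text = f.read().decode('utf-8')

    # ... segment with jieba, process with nltk ...

    # encode back to utf-8 when writing the result out
    with open('result.txt', 'wb') as f:
        f.write(text.encode('utf-8'))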

In addition, this article has a very detailed description of using nltk with Chinese and is worth referring to: http://blog.csdn.net/huyoo/article/details/12188573

What the original poster ran into is an encoding problem...
There are many useful Chinese processing packages:
Jieba: word segmentation, part-of-speech tagging, and TextRank
HanLP: word segmentation, named entity recognition, and dependency syntactic analysis
Others: FudanNLP, NLPIR
I personally think these are better than NLTK for Chinese. Chinese word segmentation can be done with jieba. I have made a small example: nltk - comparison of Chinese document similarity.

What you describe has nothing to do with NLTK; with Python 3 these problems do not appear. Chinese must be in UTF-8!
Love NLTK! For other packages, apart from specific tasks, I use Java ones. For GBK-encoded text, decode it first: text.decode('gbk')
Word segmentation: find a Chinese word segmentation package, for example https://github.com/fxsjy/jieba, because nltk cannot segment Chinese by itself. We are also learning about this recently and recommend it as a tool for Chinese processing.

I have encountered the same problem. After reading the book "Python Natural Language Processing" and successfully loading my own documents, the Chinese came out garbled exactly as you show. It should be a matter of encoding settings, but I don't know where to set them; there is too little information about this.
