SnowNLP is a library written in Python that makes it easy to process Chinese text. It covers Chinese word segmentation, POS tagging, sentiment analysis, text classification, keyword extraction, and text similarity calculation.
# -*- coding: utf-8 -*-
from snownlp import SnowNLP

s = SnowNLP('这个东西真心很赞')  # "This thing is really awesome"

print('Word segmentation:')
print(s.words)  # ['这个', '东西', '真心', '很', '赞']
print()

print('POS tagging:')
# On Python 3, s.tags is a zip object; list(s.tags) gives
# [('这个', 'r'), ('东西', 'n'), ('真心', 'd'), ('很', 'd'), ('赞', 'Vg')]
print(s.tags)

print('Sentiment analysis:')
print(s.sentiments)  # about 0.977, the probability that the text is positive
print()

print('Convert to pinyin:')  # Chinese characters to pinyin
print(s.pinyin)  # ['zhe', 'ge', 'dong', 'xi', 'zhen', 'xin', 'hen', 'zan']
print()

print('Traditional to simplified:')  # traditional characters to simplified
s = SnowNLP('「繁體字」「繁體中文」的叫法在臺灣亦很常見。')
print(s.han)  # '「繁体字」「繁体中文」的叫法在台湾亦很常见。'
print()

# English translation of the sample passage:
# Natural language processing is an important direction in the field of
# computer science and artificial intelligence. It studies the theories and
# methods that enable effective communication between humans and computers
# in natural language. Natural language processing is a science that
# integrates linguistics, computer science and mathematics. Research in this
# field therefore involves natural language, that is, the language people use
# every day, so it is closely related to linguistics but differs from it in
# important ways. Natural language processing is not the study of natural
# language in general, but the development of computer systems, especially
# software systems, that can communicate effectively in natural language.
# So it is part of computer science.
text = '''
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
自然语言处理是一门融语言学、计算机科学、数学于一体的科学。
因此，这一领域的研究将涉及自然语言，即人们日常使用的语言，
所以它与语言学的研究有着密切的联系，但又有重要的区别。
自然语言处理并不是一般地研究自然语言，
而在于研制能有效地实现自然语言通信的计算机系统，
特别是其中的软件系统。因而它是计算机科学的一部分。
'''
s = SnowNLP(text)

print('Extract text keywords:')
print(s.keywords(3))  # ['语言', '自然', '计算机'] (language, nature, computer)
print()

print('Extract text summary:')
print(s.summary(3))  # the three highest-ranked sentences of the passage
print()

print('Split into sentences:')
print(s.sentences)
print()

# Three pre-tokenised documents, glossed as:
# [this article, article, true, good], [that article, thesis], [this]
s = SnowNLP([['这篇', '文章', '真', '好'],
             ['那篇', '论文'],
             ['这个']])

print('Term frequency:')
print(s.tf)
print()

print('Inverse document frequency:')
print(s.idf)
print()

print('Text similarity:')
print(s.sim(['文章']))  # [0.38657074230939836, 0, 0]
print(s.sim(['文章', '真']))  # [0.7731414846187967, 0, 0]
Output (Chinese results shown in English translation):
Word segmentation:
['this', 'thing', 'really', 'very', 'awesome']
POS tagging:
<zip object at 0x12638b388>
Sentiment analysis:
0.9769551298267365
Convert to pinyin:
['zhe', 'ge', 'dong', 'xi', 'zhen', 'xin', 'hen', 'zan']
Traditional to simplified:
The terms "traditional characters" and "Traditional Chinese" are also very common in Taiwan.
Extract text keywords:
['language', 'nature', 'computer']
Extract text summary:
['So it is part of computer science', 'Natural language processing is an important direction in the field of computer science and artificial intelligence', 'Natural language processing is a science that integrates linguistics, computer science and mathematics']
Split into sentences:
['Natural language processing is an important direction in the field of computer science and artificial intelligence', 'It studies the theories and methods that enable effective communication between humans and computers in natural language', 'Natural language processing is a science that integrates linguistics, computer science and mathematics', 'Therefore', 'research in this field involves natural language', 'that is, the language people use every day', 'so it is closely related to linguistics', 'but differs from it in important ways', 'Natural language processing is not the study of natural language in general', 'but the development of computer systems that can communicate effectively in natural language', 'especially software systems', 'So it is part of computer science']
Term frequency:
[{'this article': 1, 'article': 1, 'true': 1, 'good': 1}, {'that article': 1, 'thesis': 1}, {'this': 1}]
Inverse document frequency:
{'this article': 0.5108256237659907, 'article': 0.5108256237659907, 'true': 0.5108256237659907, 'good': 0.5108256237659907, 'that article': 0.5108256237659907, 'thesis': 0.5108256237659907, 'this': 0.5108256237659907}
Text similarity:
[0.38657074230939836, 0, 0]
[0.7731414846187967, 0, 0]
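The term frequency, inverse document frequency and similarity figures above can be checked by hand. The following is a minimal standard-library sketch of BM25-style scoring, not SnowNLP's own source: the idf formula log(N - n + 0.5) - log(n + 0.5) and the constants k1 = 1.5 and b = 0.75 are assumptions, chosen because they reproduce the numbers printed above. The function names and the English placeholder tokens are made up for illustration; only the document lengths (4, 2 and 1 tokens) matter for the result.

```python
import math

# Placeholder tokens mirroring the three-document example (4, 2 and 1 tokens).
docs = [["this_article", "article", "true", "good"],
        ["that_article", "thesis"],
        ["this"]]

K1, B = 1.5, 0.75  # assumed BM25 constants, not taken from the post


def term_frequencies(docs):
    """Count how often each token occurs in each document."""
    return [{w: doc.count(w) for w in doc} for doc in docs]


def inverse_document_frequencies(docs):
    """idf(w) = log(N - n + 0.5) - log(n + 0.5), n = documents containing w."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n_docs - n + 0.5) - math.log(n + 0.5)
            for w, n in df.items()}


def bm25_scores(query, docs):
    """Score every document against the query tokens."""
    tf = term_frequencies(docs)
    idf = inverse_document_frequencies(docs)
    avgdl = sum(len(d) for d in docs) / len(docs)  # average document length
    scores = []
    for doc, freqs in zip(docs, tf):
        score = 0
        for w in query:
            if w not in freqs:
                continue  # tokens absent from the document contribute nothing
            f = freqs[w]
            norm = f + K1 * (1 - B + B * len(doc) / avgdl)
            score += idf[w] * f * (K1 + 1) / norm
        scores.append(score)
    return scores


print(term_frequencies(docs))
print(inverse_document_frequencies(docs)["article"])  # about 0.5108
print(bm25_scores(["article"], docs))                 # first entry about 0.3866
print(bm25_scores(["article", "true"], docs))         # first entry about 0.7731
```

Every token occurs in exactly one of the three documents, which is why all seven idf values are identical, and the second query scores exactly twice the first because it matches two tokens with the same idf in the same document.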
About training (word segmentation, POS tagging, sentiment analysis):
from snownlp import seg
seg.train('data.txt')
seg.save('seg.marshal')
# from snownlp import tag
# tag.train('199801.txt')
# tag.save('tag.marshal')
# from snownlp import sentiment
# sentiment.train('neg.txt', 'pos.txt')
# sentiment.save('sentiment.marshal')
PS: The trained model is saved as seg.marshal. To use it, edit data_path in snownlp/seg/__init__.py so that it points to the newly trained file,
or to wherever you keep your own trained model.