Reference book: http://nltk.googlecode.com/svn/trunk/doc/book/
http://blog.csdn.net/tanzhangwen/article/details/8469491
An NLP enthusiast's blog:
http://blog.csdn.net/tanzhangwen/article/category/1297154
1. Downloading data through a proxy
import nltk
nltk.set_proxy("**.com:80")
nltk.download()
2. If calling sents(fileid) fails with "Resource 'tokenizers/punkt/english.pickle' not found", use the NLTK Downloader to obtain the resource:
import nltk
nltk.download()
In the downloader window, open the 'Models' tab, find 'punkt' in the 'Identifier' column, and click Download to install the package.
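The same check can be done without the graphical window. The following is a minimal sketch, assuming the resource path "tokenizers/punkt" and the package identifier "punkt" from the error message above; ensure_resource is a hypothetical helper name, not part of NLTK, and its download parameter exists only so the downloader can be swapped out.

```python
import nltk


def ensure_resource(resource_path, package_id, download=nltk.download):
    """Return True if the resource is available, downloading it if needed.

    ensure_resource is a hypothetical helper, not an NLTK function.
    """
    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.download(package_id) fetches a single package by identifier.
        return bool(download(package_id))


# Example call (needs network access the first time it runs):
# ensure_resource("tokenizers/punkt", "punkt")
```

Calling nltk.download with an identifier skips the interactive window, which is convenient on machines without a display.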
3. Functions for accessing corpus elements
from nltk.corpus import webtext
webtext.fileids()  # IDs of all files in the corpus
webtext.raw(fileid)  # all characters of the given file
webtext.words(fileid)  # all words of the given file
webtext.sents(fileid)  # all sentences of the given file
Example | Description
fileids() | The files of the corpus
fileids([categories]) | The files of the corpus corresponding to these categories
categories() | The categories of the corpus
categories([fileids]) | The categories of the corpus corresponding to these files
raw() | The raw content of the corpus
raw(fileids=[f1,f2,f3]) | The raw content of the specified files
raw(categories=[c1,c2]) | The raw content of the specified categories
words() | The words of the whole corpus
words(fileids=[f1,f2,f3]) | The words of the specified fileids
words(categories=[c1,c2]) | The words of the specified categories
sents() | The sentences of the whole corpus
sents(fileids=[f1,f2,f3]) | The sentences of the specified fileids
sents(categories=[c1,c2]) | The sentences of the specified categories
abspath(fileid) | The location of the given file on disk
encoding(fileid) | The encoding of the file (if known)
open(fileid) | Open a stream for reading the given corpus file
root() | The path to the root of the locally installed corpus
readme() | The contents of the README file of the corpus
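To try these accessors without downloading anything, here is a small sketch that builds a throwaway corpus. It assumes a temporary directory with two made-up files (a.txt and b.txt) and uses PlaintextCorpusReader in place of webtext; sents() is left out because it needs the punkt resource.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Build a tiny corpus on disk; the file names and contents are made up.
root = tempfile.mkdtemp()
samples = {"a.txt": "Hello world. Hello NLTK.", "b.txt": "A second file."}
for name, content in samples.items():
    with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
        handle.write(content)

# Every file matching the regular expression becomes one fileid.
reader = PlaintextCorpusReader(root, r".*\.txt")

ids = reader.fileids()                 # file identifiers in the corpus
raw_a = reader.raw("a.txt")            # raw characters of one file
words_a = list(reader.words("a.txt"))  # tokens of one file
words_all = list(reader.words())       # tokens of the whole corpus

print(ids)
print(words_a)
```

The words() call returns a lazy view over the files, so it is wrapped in list() here to get a plain Python list.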
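4. Some common text-processing functions
Assume text is a sequence of words; the methods from collocations() onward require an nltk.Text object.
from nltk import FreqDist, bigrams
len(text)  # number of words
set(text)  # remove duplicates
sorted(text)  # sort
text.count('a')  # number of occurrences of the given word
text.index('a')  # position of the first occurrence of the given word
FreqDist(text)  # words and their frequencies; keys() gives the words, fdist[key] gives the count
FreqDist(text).plot(50, cumulative=True)  # cumulative frequency plot
bigrams(text)  # all pairs of adjacent words
text.collocations()  # frequent adjacent word pairs in the text
text.concordance("word")  # occurrences of the given word with their context
text.similar("word")  # words that appear in contexts similar to the given word
text.common_contexts(["A", "B"])  # contexts shared by the two words
text.dispersion_plot(['A', 'B', 'C', ...])  # plot comparing where the words occur in the text
text.generate()  # generate a random piece of text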
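As a quick illustration of the counting functions above, here is a minimal sketch on a made-up sentence. The token list is an assumption chosen for the example, and only calls that need no downloaded data are used.

```python
from nltk import FreqDist, bigrams
from nltk.text import Text

# A made-up token list, used only for illustration.
tokens = "the cat sat on the mat and the dog sat on the log".split()
text = Text(tokens)

n_tokens = len(text)            # total number of tokens
vocab = sorted(set(text))       # distinct words in alphabetical order
the_count = text.count("the")   # occurrences of "the"
first_sat = text.index("sat")   # position of the first "sat"

fdist = FreqDist(text)          # word -> frequency
top = fdist.most_common(1)      # the most frequent word and its count

pairs = list(bigrams(tokens))   # adjacent word pairs

print(n_tokens, the_count, first_sat, top, pairs[:2])
```

In recent NLTK versions bigrams() returns a generator, so it is wrapped in list() before indexing.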
NLTK's conditional frequency distributions: commonly used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters.
Example | Description
cfdist = ConditionalFreqDist(pairs) | Create a conditional frequency distribution from a list of pairs
cfdist.conditions() | Alphabetically sorted list of conditions
cfdist[condition] | The frequency distribution for this condition
cfdist[condition][sample] | Frequency for the given sample for this condition
cfdist.tabulate() | Tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions) | Tabulation limited to the specified samples and conditions
cfdist.plot() | Graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions) | Graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2 | Test if samples in cfdist1 occur less frequently than in cfdist2
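The first rows of this table can be tried on a handful of pairs. This is a small sketch with made-up (condition, sample) pairs; the conditions are sorted explicitly because the order returned by conditions() may depend on the NLTK version.

```python
from nltk import ConditionalFreqDist

# Made-up (condition, sample) pairs: a genre and a word seen in it.
pairs = [
    ("news", "the"),
    ("news", "said"),
    ("news", "the"),
    ("romance", "love"),
    ("romance", "the"),
]

cfdist = ConditionalFreqDist(pairs)

conditions = sorted(cfdist.conditions())  # sorted here, not by NLTK
news_dist = cfdist["news"]                # FreqDist for one condition
news_the = cfdist["news"]["the"]          # count of one sample
news_total = news_dist.N()                # total samples under "news"

print(conditions, news_the, news_total)
```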
To be continued