Practical Guide: How to Use the Stanford NLP Toolkit under Python NLTK
Bai Ningsu
November 6, 2016 19:28:43
Summary: NLTK is a natural language toolkit implemented in Python by the Department of Computer and Information Science at the University of Pennsylvania. It collects a large number of public datasets and models and provides a comprehensive, easy-to-use set of interfaces covering word segmentation, part-of-speech tagging (POS tagging), named entity recognition (NER), syntactic parsing, and other NLP tasks. Stanford NLP is an NLP toolkit implemented in open-source Java by the NLP group at Stanford University, which likewise offers solutions to the various problems of the NLP field; combining the two is very convenient and useful. This article mainly introduces how to configure and install Stanford NLP under NLTK (the Natural Language Toolkit) and demonstrates the core Stanford NLP modules, so that readers can easily grasp the material of this chapter. Follow-up articles will use the Great Qin Empire corpus for detailed demonstrations of word segmentation, POS tagging, named entity recognition, syntactic parsing, and dependency parsing. For Python basics, see the author's earlier Python article series.
(Original work; please credit the source when reposting: Practical Guide: How to Use the Stanford NLP Toolkit under Python NLTK)
1 Introduction to NLTK and Stanford NLP
NLTK: a natural language toolkit implemented in Python by the Department of Computer and Information Science at the University of Pennsylvania. It collects a large number of public datasets and models and provides a comprehensive, easy-to-use set of interfaces covering word segmentation, part-of-speech tagging (POS tagging), named entity recognition (NER), syntactic parsing, and other NLP tasks.
Stanford NLP: an NLP toolkit implemented in open-source Java by the NLP group at Stanford University, which likewise offers solutions to the various problems of the NLP field. The Stanford NLP group is a world-renowned research team, so being able to combine the NLTK and Stanford NLP toolkits is great news for natural language developers! In 2004, Steven Bird added support for the Stanford NLP tools to NLTK, which calls external jar files to expose the Stanford toolkit's functionality; this arrangement is very convenient and useful.
This article mainly introduces the following Stanford NLP features available through NLTK:
- Chinese and English word segmentation: StanfordSegmenter, StanfordTokenizer
- Chinese and English part-of-speech tagging: StanfordPOSTagger
- Chinese and English named entity recognition: StanfordNERTagger
- Chinese and English syntactic parsing: StanfordParser
- Chinese and English dependency parsing: StanfordDependencyParser, StanfordNeuralDependencyParser
2 Notes on Installation and Configuration
This article was written against Python 3.5.2 and Java version "1.8.0_111"; note the following points:
- The Stanford NLP toolkit requires Java 8 or later; if an error occurs, check your Java version (see the sketch after this list)
- The configuration here takes Stanford NLP 3.6.0 as an example; if you use a different version, take care to substitute the corresponding file names
- The configuration process takes NLTK 3.2 as an example; if you use NLTK 3.1, note that StanfordSegmenter is not implemented in that older version, while the rest is roughly the same
- For the specific details of the configuration procedure below, refer to: http://nlp.stanford.edu/software/
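One common stumbling block: NLTK's Stanford wrappers locate the Java runtime through the JAVAHOME environment variable. Below is a minimal sketch for checking this; the JDK path shown is an assumption and must be adjusted to your own machine:

import os
import subprocess

# Assumption: change this path to wherever your JDK 8 installation lives.
java_path = r"C:\Program Files\Java\jdk1.8.0_111\bin\java.exe"
os.environ['JAVAHOME'] = java_path  # NLTK's Stanford wrappers read this variable

# Print the installed Java version ("java -version" writes to stderr).
print(subprocess.check_output([java_path, '-version'],
                              stderr=subprocess.STDOUT).decode())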
3 Downloading the Required Stanford NLP Packages
Required downloads: only the following two files are needed. The stanfordNLTK file contains the jar packages and related files that the Stanford NLP toolkit needs under NLTK.
- stanfordNLTK: all the necessary packages and related files, already bundled; a detailed explanation follows below
- jar1.8: if your Java version is already 8 or above, you do not need to download this
The stanfordNLTK directory structure is as follows (everything has already been extracted from the various archives; for readers who are interested, the source archives for each function are listed further below). A path-building sketch follows this list.
- Word segmentation dependencies: stanford-segmenter.jar, slf4j-api.jar, and the related subfiles of the data folder
- Named entity recognition dependencies: classifiers, stanford-ner.jar
- POS tagging dependencies: models, stanford-postagger.jar
- Syntactic parsing dependencies: stanford-parser.jar, stanford-parser-3.6.0-models.jar, classifiers
- Dependency parsing dependencies: stanford-parser.jar, stanford-parser-3.6.0-models.jar, classifiers
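Since every interface below takes one or more of these jar and model paths, it can save typing to build them from a single base directory. A small sketch, assuming the files were extracted to E:\tools\stanfordNLTK\jar as in the examples that follow:

import os

# Assumption: the directory the stanfordNLTK package was extracted into.
STANFORD_DIR = r"E:\tools\stanfordNLTK\jar"

def stanford_path(*parts):
    # Join path components under the Stanford toolkit directory.
    return os.path.join(STANFORD_DIR, *parts)

segmenter_jar = stanford_path("stanford-segmenter.jar")
ner_model = stanford_path("classifiers", "english.all.3class.distsim.crf.ser.gz")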
Archive downloads and source files:
- Word segmentation package (StanfordSegmenter and StanfordTokenizer): download stanford-segmenter-2015-12-09.zip (version 3.6.0)
- POS tagging package: download stanford-postagger-full-2015-12-09.zip (version 3.6.0); extracting it yields stanford-postagger.jar and the models folder
- Named entity recognition package: download stanford-ner-2015-12-09.zip (version 3.6.0); extracting it yields stanford-ner.jar and the classifiers files
- Syntactic parsing and dependency parsing package: download stanford-parser-full-2015-12-09.zip (version 3.6.0); extracting it yields stanford-parser.jar and stanford-parser-3.6.0-models.jar
4 Core Operations of Stanford NLP
4.1 Word Segmentation
StanfordSegmenter Chinese word segmentation: download 52nlp's modified NLTK package nltk-develop, extract it and copy it into your Python directory, go into E:\Python\nltk-develop, open the setup.py file with the Python editor, press F5 to run it, and enter the following code:
>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> segmenter = StanfordSegmenter(
    path_to_jar=r"E:\tools\stanfordNLTK\jar\stanford-segmenter.jar",
    path_to_slf4j=r"E:\tools\stanfordNLTK\jar\slf4j-api.jar",
    path_to_sihan_corpora_dict=r"E:\tools\stanfordNLTK\jar\data",
    path_to_model=r"E:\tools\stanfordNLTK\jar\data\pku.gz",
    path_to_dict=r"E:\tools\stanfordNLTK\jar\data\dict-chris6.ser.gz"
)
>>> sentence = "I started a blog in Blog Garden. My blog is Fucao Weicun, and I wrote some articles on natural language processing."
>>> result = segmenter.segment(sentence)
>>> result
Execution result:
Program notes: StanfordSegmenter initialization parameters:
- path_to_jar: locates the main jar package; the segmentation here depends on stanford-segmenter.jar (note: all of the other Stanford NLP interfaces take a path_to_jar parameter as well)
- path_to_slf4j: locates slf4j-api.jar, which the segmenter also depends on
- path_to_sihan_corpora_dict: set this to the data directory extracted from stanford-segmenter-2015-12-09.zip; that directory provides two models, pku.gz and ctb.gz. Note that when StanfordSegmenter is used for Chinese word segmentation, the return value is not a list but a single string in which the words are separated by spaces, as the sketch below shows.
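Because the return value is a space-delimited string, turning it into a token list is a one-line split; a sketch reusing the result variable from the session above:

# result is one space-delimited string; strip the trailing
# newline and split on whitespace to obtain a list of tokens.
tokens = result.strip().split()
print(tokens)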
StanfordTokenizer English tokenization: related references
Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> from nltk.tokenize import StanfordTokenizer
>>> tokenizer = StanfordTokenizer(path_to_jar=r"E:\tools\stanfordNLTK\jar\stanford-parser.jar")
>>> sent = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks."
>>> print(tokenizer.tokenize(sent))
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>>
Execution Result:
4.2 Named Entity Recognition
StanfordNERTagger English named entity recognition
>>> from nltk.tag import StanfordNERTagger
>>> eng_tagger = StanfordNERTagger(model_filename=r'E:\tools\stanfordNLTK\jar\classifiers\english.all.3class.distsim.crf.ser.gz', path_to_jar=r'E:\tools\stanfordNLTK\jar\stanford-ner.jar')
>>> print(eng_tagger.tag('Rami Eid is studying at Stony Brook University in NY'.split()))
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
Execution result:
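The tagger returns flat (token, label) pairs, so consecutive tokens sharing the same non-O label can be grouped back into whole entities. This grouping is not part of NLTK's API; it is a minimal sketch over the output shown above:

from itertools import groupby

tagged = eng_tagger.tag('Rami Eid is studying at Stony Brook University in NY'.split())

# Merge runs of tokens that carry the same entity label, skipping 'O'.
for label, chunk in groupby(tagged, key=lambda pair: pair[1]):
    if label != 'O':
        print(label, ' '.join(word for word, _ in chunk))
# PERSON Rami Eid
# ORGANIZATION Stony Brook University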
StanfordNERTagger Chinese named entity recognition
>>> result
'Chengdu University of Information Technology, Sichuan Province. I started a blog in Blog Garden. My blog is Fucao Weicun, and I wrote some articles on natural language processing. \r\n'
>>> from nltk.tag import StanfordNERTagger
>>> chi_tagger = StanfordNERTagger(model_filename=r'E:\tools\stanfordNLTK\jar\classifiers\chinese.misc.distsim.crf.ser.gz', path_to_jar=r'E:\tools\stanfordNLTK\jar\stanford-ner.jar')
>>> for word, tag in chi_tagger.tag(result.split()):
	print(word, tag)
Execution result:
4.3 POS Tagging
StanfordPOSTagger English POS tagging
>>> from nltk.tag import StanfordPOSTagger
>>> eng_tagger = StanfordPOSTagger(model_filename=r'E:\tools\stanfordNLTK\jar\models\english-bidirectional-distsim.tagger', path_to_jar=r'E:\tools\stanfordNLTK\jar\stanford-postagger.jar')
>>> print(eng_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))
Execution result:
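Since the output is a list of (word, tag) tuples using Penn Treebank tags, filtering by tag prefix is straightforward; for example, a sketch that keeps only the nouns:

tagged = eng_tagger.tag('What is the airspeed of an unladen swallow ?'.split())

# Penn Treebank noun tags all begin with 'NN' (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)  # expected: ['airspeed', 'swallow']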
StanfordPOSTagger Chinese POS tagging
>>> from nltk.tag import StanfordPOSTagger
>>> chi_tagger = StanfordPOSTagger(model_filename=r'E:\tools\stanfordNLTK\jar\models\chinese-distsim.tagger', path_to_jar=r'E:\tools\stanfordNLTK\jar\stanford-postagger.jar')
>>> result
'Chengdu University of Information Technology, Sichuan Province. I started a blog in Blog Garden. My blog is Fucao Weicun, and I wrote some articles on natural language processing. \r\n'
>>> print(chi_tagger.tag(result.split()))
Execution result:
4.4 Syntactic Parsing: reference documents
StanfordParser English syntactic parsing
>>> from nltk.parse.stanford import StanfordParser
>>> eng_parser = StanfordParser(r"E:\tools\stanfordNLTK\jar\stanford-parser.jar",r"E:\tools\stanfordNLTK\jar\stanford-parser-3.6.0-models.jar",r"E:\tools\stanfordNLTK\jar\classifiers\englishPCFG.ser.gz")
>>> print(list(eng_parser.parse("the quick brown fox jumps over the lazy dog".split())))
Execution result:
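parse() yields standard nltk.tree.Tree objects, so the usual Tree helpers apply in recent NLTK versions; a sketch that binds the result and renders it:

trees = list(eng_parser.parse("the quick brown fox jumps over the lazy dog".split()))
tree = trees[0]

tree.pretty_print()  # ASCII-art rendering of the parse tree in the console
# tree.draw()        # alternatively, opens a graphical tree window (Tk)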
StanfordParser Chinese syntactic parsing
>>> from nltk.parse.stanford import StanfordParser
>>> chi_parser = StanfordParser(r"E:\tools\stanfordNLTK\jar\stanford-parser.jar", r"E:\tools\stanfordNLTK\jar\stanford-parser-3.6.0-models.jar", r"E:\tools\stanfordNLTK\jar\classifiers\chinesePCFG.ser.gz")
>>> sent = u"Beihai has become a rising star in China's opening up"
>>> print(list(chi_parser.parse(sent.split())))
Execution result:
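The same Tree API works for the Chinese parse; for instance, a sketch that collects every noun-phrase (NP) subtree:

trees = list(chi_parser.parse(sent.split()))

# Print the leaves of each NP subtree in the first parse.
for np in trees[0].subtrees(lambda t: t.label() == 'NP'):
    print(' '.join(np.leaves()))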
4.5 Dependency Parsing
StanfordDependencyParser English dependency parsing
>>> from nltk.parse.stanford import StanfordDependencyParser
>>> eng_parser = StanfordDependencyParser(r"E:\tools\stanfordNLTK\jar\stanford-parser.jar",r"E:\tools\stanfordNLTK\jar\stanford-parser-3.6.0-models.jar",r"E:\tools\stanfordNLTK\jar\classifiers\englishPCFG.ser.gz")
>>> res = list(eng_parser.parse("the quick brown fox jumps over the lazy dog".split()))
>>> for row in res[0].triples():
print(row)
Execution result:
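Each triple has the shape ((head_word, head_tag), relation, (dependent_word, dependent_tag)), which can be reformatted into a compact arrow notation; a sketch reusing res from above:

# Render every dependency triple as "head -relation-> dependent".
for (head, head_tag), rel, (dep, dep_tag) in res[0].triples():
    print('%s -%s-> %s' % (head, rel, dep))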
StanfordDependencyParser Chinese dependency parsing
>>> from nltk.parse.stanford import StanfordDependencyParser
>>> chi_parser = StanfordDependencyParser(r"E:\tools\stanfordNLTK\jar\stanford-parser.jar", r"E:\tools\stanfordNLTK\jar\stanford-parser-3.6.0-models.jar", r"E:\tools\stanfordNLTK\jar\classifiers\chinesePCFG.ser.gz")
>>> res = list(chi_parser.parse(u"Sichuan has become a rising star in western China's opening to the world".split()))
>>> for row in res[0].triples():
	print(row)
Execution result:
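The objects returned by the dependency parser are NLTK DependencyGraph instances, so they can also be serialized in CoNLL format; a sketch reusing res from the session above:

# to_conll(4) emits one token per line with four columns:
# word, POS tag, head index, dependency relation.
print(res[0].to_conll(4))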
5 References and Further Reading
- NLTK official website
- NLTK API documentation
- NLTK using the Stanford Chinese word segmenter
- NLTK source on GitHub
"NLP" dry foods! Python NLTK Text Processing in conjunction with Stanford NLP Toolkit