Practical Guide: How to Use the Stanford NLP Toolkit under Python NLTK
Bai Ningsu
November 6, 2016 19:28:43
Summary: NLTK is a natural language toolkit implemented in Python by the Department of Computer and Information Science at the University of Pennsylvania. It collects a large number of public datasets and models and provides a comprehensive, easy-to-use set of interfaces covering word segmentation, part-of-speech tagging (POS tagging), named entity recognition (NER), syntactic parsing, and other NLP tasks. Stanford NLP is an NLP toolkit implemented in open-source Java by the NLP group at Stanford University, which likewise offers solutions to the various problems of the NLP field; combining the two is very convenient and useful. This article mainly introduces how to configure and install Stanford NLP under NLTK (the Natural Language Toolkit) and demonstrates the core Stanford NLP modules, so that readers can easily grasp the material of this chapter. Follow-up articles will use the Great Qin Empire corpus for detailed demonstrations of word segmentation, POS tagging, named entity recognition, syntactic parsing, and dependency parsing. For Python basics, see the author's earlier Python article series.
(Original work; please credit the source when reposting: Practical Guide: How to Use the Stanford NLP Toolkit under Python NLTK)
1 Introduction to NLTK and Stanford NLP
NLTK: a natural language toolkit implemented in Python by the Department of Computer and Information Science at the University of Pennsylvania. It collects a large number of public datasets and models and provides a comprehensive, easy-to-use set of interfaces covering word segmentation, part-of-speech tagging (POS tagging), named entity recognition (NER), syntactic parsing, and other NLP tasks.
Stanford NLP: an NLP toolkit implemented in open-source Java by the NLP group at Stanford University, which likewise offers solutions to the various problems of the NLP field. The Stanford NLP group is a world-renowned research team, so being able to combine the NLTK and Stanford NLP toolkits is great news for natural language developers! In 2004, Steven Bird added support for the Stanford NLP tools to NLTK, which calls external jar files to expose the Stanford toolkit's functionality; this arrangement is very convenient and useful.
This article mainly introduces the following Stanford NLP features available through NLTK:
- Chinese and English word segmentation: StanfordSegmenter, StanfordTokenizer
- Chinese and English part-of-speech tagging: StanfordPOSTagger
- Chinese and English named entity recognition: StanfordNERTagger
- Chinese and English syntactic parsing: StanfordParser
- Chinese and English dependency parsing: StanfordDependencyParser, StanfordNeuralDependencyParser
2 Notes on Installation and Configuration
This article was written against Python 3.5.2 and Java version "1.8.0_111"; note the following points:
- The Stanford NLP toolkit requires Java 8 or later; if an error occurs, check your Java version (see the sketch after this list)
- The configuration here takes Stanford NLP 3.6.0 as an example; if you use a different version, take care to substitute the corresponding file names
- The configuration process takes NLTK 3.2 as an example; if you use NLTK 3.1, note that StanfordSegmenter is not implemented in that older version, while the rest is roughly the same
- For the specific details of the configuration procedure below, refer to: http://nlp.stanford.edu/software/
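One common stumbling block: NLTK's Stanford wrappers locate the Java runtime through the JAVAHOME environment variable. Below is a minimal sketch for checking this; the JDK path shown is an assumption and must be adjusted to your own machine:

import os
import subprocess

# Assumption: change this path to wherever your JDK 8 installation lives.
java_path = r"C:\Program Files\Java\jdk1.8.0_111\bin\java.exe"
os.environ['JAVAHOME'] = java_path  # NLTK's Stanford wrappers read this variable

# Print the installed Java version ("java -version" writes to stderr).
print(subprocess.check_output([java_path, '-version'],
                              stderr=subprocess.STDOUT).decode())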
3 Downloading the Required Stanford NLP Packages
Required downloads: only the following two files are needed. The stanfordNLTK file contains the jar packages and related files that the Stanford NLP toolkit needs under NLTK.
- stanfordNLTK: all the necessary packages and related files, already bundled; a detailed explanation follows below
- jar1.8: if your Java version is already 8 or above, you do not need to download this
The stanfordNLTK directory structure is as follows (everything has already been extracted from the various archives; for readers who are interested, the source archives for each function are listed further below). A path-building sketch follows this list.
- Word segmentation dependencies: stanford-segmenter.jar, slf4j-api.jar, and the related subfiles of the data folder
- Named entity recognition dependencies: classifiers, stanford-ner.jar
- POS tagging dependencies: models, stanford-postagger.jar
- Syntactic parsing dependencies: stanford-parser.jar, stanford-parser-3.6.0-models.jar, classifiers
- Dependency parsing dependencies: stanford-parser.jar, stanford-parser-3.6.0-models.jar, classifiers
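Since every interface below takes one or more of these jar and model paths, it can save typing to build them from a single base directory. A small sketch, assuming the files were extracted to E:\tools\stanfordNLTK\jar as in the examples that follow:

import os

# Assumption: the directory the stanfordNLTK package was extracted into.
STANFORD_DIR = r"E:\tools\stanfordNLTK\jar"

def stanford_path(*parts):
    # Join path components under the Stanford toolkit directory.
    return os.path.join(STANFORD_DIR, *parts)

segmenter_jar = stanford_path("stanford-segmenter.jar")
ner_model = stanford_path("classifiers", "english.all.3class.distsim.crf.ser.gz")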
Archive downloads and source files:
- Word segmentation package (StanfordSegmenter and StanfordTokenizer): download stanford-segmenter-2015-12-09.zip (version 3.6.0)
- POS tagging package: download stanford-postagger-full-2015-12-09.zip (version 3.6.0); extracting it yields stanford-postagger.jar and the models folder
- Named entity recognition package: download stanford-ner-2015-12-09.zip (version 3.6.0); extracting it yields stanford-ner.jar and the classifiers files
- Syntactic parsing and dependency parsing package: download stanford-parser-full-2015-12-09.zip (version 3.6.0); extracting it yields stanford-parser.jar and stanford-parser-3.6.0-models.jar
4 Core Operations of Stanford NLP
4.1 Word Segmentation
StanfordSegmenter Chinese word segmentation: download 52nlp's modified NLTK package nltk-develop, extract it and copy it into your Python directory, go into E:\Python\nltk-develop, open the setup.py file with the Python editor, press F5 to run it, and enter the following code:
>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> segmenter = StanfordSegmenter(
    path_to_jar=r"E:\tools\stanfordNLTK\jar\stanford-segmenter.jar",
    path_to_slf4j=r"E:\tools\stanfordNLTK\jar\slf4j-api.jar",
    path_to_sihan_corpora_dict=r"E:\tools\stanfordNLTK\jar\data",
    path_to_model=r"E:\tools\stanfordNLTK\jar\data\pku.gz",
    path_to_dict=r"E:\tools\stanfordNLTK\jar\data\dict-chris6.ser.gz"
)
>>> sentence = "I started a blog in Blog Garden. My blog is Fucao Weicun, and I wrote some articles on natural language processing."
>>> result = segmenter.segment(sentence)
>>> result
Execution result:
Program notes: StanfordSegmenter initialization parameters:
- path_to_jar: locates the main jar package; the segmentation here depends on stanford-segmenter.jar (note: all of the other Stanford NLP interfaces take a path_to_jar parameter as well)
- path_to_slf4j: locates slf4j-api.jar, which the segmenter also depends on
- path_to_sihan_corpora_dict: set this to the data directory extracted from stanford-segmenter-2015-12-09.zip; that directory provides two models, pku.gz and ctb.gz. Note that when StanfordSegmenter is used for Chinese word segmentation, the return value is not a list but a single string in which the words are separated by spaces, as the sketch below shows.
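Because the return value is a space-delimited string, turning it into a token list is a one-line split; a sketch reusing the result variable from the session above:

# result is one space-delimited string; strip the trailing
# newline and split on whitespace to obtain a list of tokens.
tokens = result.strip().split()
print(tokens)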
StanfordTokenizer English tokenization: related references
Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> from nltk.tokenize import StanfordTokenizer
>>> tokenizer = StanfordTokenizer(path_to_jar=r"E:\tools\stanfordNLTK\jar\stanford-parser.jar")
>>> sent = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks."
>>> print(tokenizer.tokenize(sent))
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>>
Execution Result:
4.2 Named Entity Recognition
StanfordNERTagger English named entity recognition
>>> from nltk.tag import StanfordNERTagger
>>> eng_tagger = StanfordNERTagger(model_filename=r'E:\tools\stanfordNLTK\jar\classifiers\english.all.3class.distsim.crf.ser.gz', path_to_jar=r'E:\tools\stanfordNLTK\jar\stanford-ner.jar')
>>> print(eng_tagger.tag('Rami Eid is studying at Stony Brook University in NY'.split()))
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
Execution result:
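The tagger returns flat (token, label) pairs, so consecutive tokens sharing the same non-O label can be grouped back into whole entities. This grouping is not part of NLTK's API; it is a minimal sketch over the output shown above:

from itertools import groupby

tagged = eng_tagger.tag('Rami Eid is studying at Stony Brook University in NY'.split())

# Merge runs of tokens that carry the same entity label, skipping 'O'.
for label, chunk in groupby(tagged, key=lambda pair: pair[1]):
    if label != 'O':
        print(label, ' '.join(word for word, _ in chunk))
# PERSON Rami Eid
# ORGANIZATION Stony Brook University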
StanfordNERTagger Chinese named entity recognition
>>> result
'Chengdu University of Information Technology, Sichuan Province. I started a blog in Blog Garden. My blog is Fucao Weicun, and I wrote some articles on natural language processing. \r\n'
>>> from nltk.tag import StanfordNERTagger
>>> chi_tagger = StanfordNERTagger(model_filename=r'E:\tools\stanfordNLTK\jar\classifiers\chinese.misc.distsim.crf.ser.gz', path_to_jar=r'E:\tools\stanfordNLTK\jar\stanford-ner.jar')
>>> for word, tag in chi_tagger.tag(result.split()):
	print(word, tag)
Execution result:
4.3 POS Tagging
StanfordPOSTagger English POS tagging
>>> from nltk.tag import StanfordPOSTagger
>>> eng_tagger = StanfordPOSTagger(model_filename=r'E:\tools\stanfordNLTK\jar\models\english-bidirectional-distsim.tagger', path_to_jar=r'E:\tools\stanfordNLTK\jar\stanford-postagger.jar')
>>> print(eng_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))
Execution result:
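Since the output is a list of (word, tag) tuples using Penn Treebank tags, filtering by tag prefix is straightforward; for example, a sketch that keeps only the nouns:

tagged = eng_tagger.tag('What is the airspeed of an unladen swallow ?'.split())

# Penn Treebank noun tags all begin with 'NN' (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)  # expected: ['airspeed', 'swallow']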
StanfordPOSTagger Chinese POS tagging
>>> from nltk.tag import StanfordPOSTagger
>>> chi_tagger = StanfordPOSTagger(model_filename=r'E:\tools\stanfordNLTK\jar\models\chinese-distsim.tagger', path_to_jar=r'E:\tools\stanfordNLTK\jar\stanford-postagger.jar')
>>> result
'Chengdu University of Information Technology, Sichuan Province. I started a blog in Blog Garden. My blog is Fucao Weicun, and I wrote some articles on natural language processing. \r\n'
>>> print(chi_tagger.tag(result.split()))
Execution result:
4.4 Syntactic Parsing: reference documents
StanfordParser English syntactic parsing
>>> from nltk.parse.stanford import StanfordParser
>>> eng_parser = StanfordParser(r"E:\tools\stanfordNLTK\jar\stanford-parser.jar",r"E:\tools\stanfordNLTK\jar\stanford-parser-3.6.0-models.jar",r"E:\tools\stanfordNLTK\jar\classifiers\englishPCFG.ser.gz")
>>> print(list(eng_parser.parse("the quick brown fox jumps over the lazy dog".split())))
Execution result:
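parse() yields standard nltk.tree.Tree objects, so the usual Tree helpers apply in recent NLTK versions; a sketch that binds the result and renders it:

trees = list(eng_parser.parse("the quick brown fox jumps over the lazy dog".split()))
tree = trees[0]

tree.pretty_print()  # ASCII-art rendering of the parse tree in the console
# tree.draw()        # alternatively, opens a graphical tree window (Tk)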
StanfordParser Chinese syntactic parsing
>>> from nltk.parse.stanford import StanfordParser
>>> chi_parser = StanfordParser(r"E:\tools\stanfordNLTK\jar\stanford-parser.jar", r"E:\tools\stanfordNLTK\jar\stanford-parser-3.6.0-models.jar", r"E:\tools\stanfordNLTK\jar\classifiers\chinesePCFG.ser.gz")
>>> sent = u"Beihai has become a rising star in China's opening up"
>>> print(list(chi_parser.parse(sent.split())))
Execution result:
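The same Tree API works for the Chinese parse; for instance, a sketch that collects every noun-phrase (NP) subtree:

trees = list(chi_parser.parse(sent.split()))

# Print the leaves of each NP subtree in the first parse.
for np in trees[0].subtrees(lambda t: t.label() == 'NP'):
    print(' '.join(np.leaves()))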
4.5 Dependency Parsing
StanfordDependencyParser English dependency parsing
>>> from nltk.parse.stanford import StanfordDependencyParser
>>> eng_parser = StanfordDependencyParser(r"E:\tools\stanfordNLTK\jar\stanford-parser.jar",r"E:\tools\stanfordNLTK\jar\stanford-parser-3.6.0-models.jar",r"E:\tools\stanfordNLTK\jar\classifiers\englishPCFG.ser.gz")
>>> res = list(eng_parser.parse("the quick brown fox jumps over the lazy dog".split()))
>>> for row in res[0].triples():
print(row)
Execution result:
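Each triple has the shape ((head_word, head_tag), relation, (dependent_word, dependent_tag)), which can be reformatted into a compact arrow notation; a sketch reusing res from above:

# Render every dependency triple as "head -relation-> dependent".
for (head, head_tag), rel, (dep, dep_tag) in res[0].triples():
    print('%s -%s-> %s' % (head, rel, dep))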
StanfordDependencyParser Chinese dependency parsing
>>> from nltk.parse.stanford import StanfordDependencyParser
>>> chi_parser = StanfordDependencyParser(r"E:\tools\stanfordNLTK\jar\stanford-parser.jar", r"E:\tools\stanfordNLTK\jar\stanford-parser-3.6.0-models.jar", r"E:\tools\stanfordNLTK\jar\classifiers\chinesePCFG.ser.gz")
>>> res = list(chi_parser.parse(u"Sichuan has become a rising star in western China's opening to the world".split()))
>>> for row in res[0].triples():
	print(row)
Execution result:
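The objects returned by the dependency parser are NLTK DependencyGraph instances, so they can also be serialized in CoNLL format; a sketch reusing res from the session above:

# to_conll(4) emits one token per line with four columns:
# word, POS tag, head index, dependency relation.
print(res[0].to_conll(4))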
5 References and Further Reading
- NLTK official website
- NLTK API documentation
- NLTK using the Stanford Chinese word segmenter
- NLTK source on GitHub
"NLP" dry foods! Python NLTK Text Processing in conjunction with Stanford NLP Toolkit