"NLP" dry foods! Python NLTK Text Processing in conjunction with the Stanford NLP Toolkit

Source: Internet
Author: User
Tags: stanford nlp, nltk


Practical guide: How to use the Stanford NLP Toolkit under Python NLTK


Bai Ningsu



November 6, 2016 19:28:43


Summary: NLTK is a natural language toolkit implemented in Python by the Department of Computer and Information Science at the University of Pennsylvania. It collects a large number of public datasets and models and provides a comprehensive, easy-to-use interface covering word segmentation, part-of-speech tagging (POS tagging), named entity recognition (NER), syntactic parsing, and other NLP tasks. Stanford NLP is an open-source NLP toolkit implemented in Java by the NLP group at Stanford University, which likewise provides solutions to the main problems in the NLP field. Combining the two is very convenient and useful. This article mainly explains how to configure and install Stanford NLP under NLTK (the Natural Language Toolkit) and demonstrates the core Stanford NLP modules so that readers can easily grasp this material. Follow-up articles will use the Great Qin Empire corpus to demonstrate word segmentation, POS tagging, named entity recognition, syntactic parsing, and dependency parsing in detail. For Python basics, see the author's earlier Python articles. (Original work; when reposting, please credit the source: Practical guide: How Python NLTK uses the Stanford NLP Toolkit.)
Catalogue


Python NLTK into the Great Qin Empire



Practical guide: How to use the Stanford NLP Toolkit under Python NLTK


1 Introduction to NLTK and Stanford NLP


NLTK: a natural language toolkit implemented in Python by the Department of Computer and Information Science at the University of Pennsylvania. It collects a large number of public datasets and models and provides a comprehensive, easy-to-use interface covering word segmentation, part-of-speech tagging (POS tagging), named entity recognition (NER), syntactic parsing, and other functions of the NLP field.



Stanford NLP: an NLP toolkit implemented in open-source Java by the NLP group at Stanford University, which also provides solutions to the main problems in the NLP field. The NLP group at Stanford University is a world-renowned research group, so being able to combine the NLTK and Stanford NLP toolkits is great news for natural language developers! In 2004, Steve Bird added support for the Stanford NLP tools to NLTK, which calls external jar files to use their functionality. This makes analysis very convenient and useful.



This article mainly introduces the following Stanford NLP features as they are exposed in NLTK (a minimal import sketch follows the list):


    1. Chinese and English word segmentation: StanfordSegmenter, StanfordTokenizer
    2. Chinese and English part-of-speech tagging: StanfordPOSTagger
    3. Chinese and English named entity recognition: StanfordNERTagger
    4. Chinese and English syntactic parsing: StanfordParser
    5. Chinese and English dependency parsing: StanfordDependencyParser, StanfordNeuralDependencyParser
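
For orientation, the following minimal import sketch shows where each of these classes lives in NLTK (module and class names follow NLTK 3.2; it is only a quick reference, not a configuration step from the original post):

>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter  # Chinese word segmentation
>>> from nltk.tokenize import StanfordTokenizer                     # English tokenization
>>> from nltk.tag import StanfordPOSTagger, StanfordNERTagger       # POS tagging and NER
>>> from nltk.parse.stanford import StanfordParser, StanfordDependencyParser, StanfordNeuralDependencyParser
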
2 Notes on installation and configuration


This article uses Python 3.5.2 and Java version "1.8.0_111"; pay attention to the following points:


    • The Stanford NLP toolkit requires Java 8 or later; check your Java version if an error occurs (a quick check is sketched after this list)
    • This article takes Stanford NLP 3.6.0 as the example configuration; if you use a different version, be careful to substitute the corresponding file names
    • The configuration process in this article takes NLTK 3.2 as an example; if you use NLTK 3.1, note that StanfordSegmenter is not implemented in that older version, while the rest is roughly the same
    • For the specific details of the configuration procedure below, refer to: http://nlp.stanford.edu/software/
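
Before configuring anything, a quick environment check can save time. The sketch below is only an illustration (the JDK path is an example and must be replaced with your own installation path); nltk.internals.config_java is NLTK's hook for pointing at a specific java executable:

>>> import subprocess
>>> subprocess.run(["java", "-version"])   # should report 1.8 (Java 8) or later
>>> import nltk.internals
>>> # Only needed if java is not on your PATH; the path below is an example:
>>> nltk.internals.config_java(r"C:\Program Files\Java\jdk1.8.0_111\bin\java.exe")
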
3 Downloading the necessary Stanford NLP packages


Necessary package downloads: only the following two files are needed. The stanfordNLTK file contains the jar packages and related files that the Stanford NLP toolkit needs under NLTK.


    1. stanfordNLTK: I have packaged all the necessary jar packages and related files; they are explained in detail below.
    2. Jar1.8: if your Java version is already 8 or above, you do not need to download this.


The directory structure of stanfordNLTK is as follows (the various compressed files have already been extracted; for interested readers, the source archives for each function are listed below):




    • Word segmentation dependencies: stanford-segmenter.jar, slf4j-api.jar, the related sub-files of the data folder
    • Named entity recognition dependencies: classifiers, stanford-ner.jar
    • POS tagging dependencies: models, stanford-postagger.jar
    • Syntactic parsing dependencies: stanford-parser.jar, stanford-parser-3.6.0-models.jar, classifiers
    • Dependency parsing dependencies: stanford-parser.jar, stanford-parser-3.6.0-models.jar, classifiers


Archive downloads and their contents:


    1. Word segmentation archive: StanfordSegmenter and StanfordTokenizer: download
    2. Named entity recognition archive: download stanford-ner-2015-12-09.zip (version 3.6.0); extracting it yields stanford-ner.jar and the classifiers folder
    3. Syntactic parsing and dependency parsing: download stanford-parser-full-2015-12-09.zip (version 3.6.0); extracting it yields stanford-parser.jar and stanford-parser-3.6.0-models.jar
4 Stanford NLP core operations

4.1 Word segmentation


StanfordSegmenter for Chinese word segmentation: download the NLTK package nltk-develop modified by 52nlp, unzip it and copy it into your Python directory, go into E:\Python\nltk-develop, open the setup.py file in the Python editor, press F5 to run it, and then enter the following code:


>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> segmenter = StanfordSegmenter(
     path_to_jar=r"E:\tools\stanfordNLTK\jar\stanford-segmenter.jar",
     path_to_slf4j=r"E:\tools\stanfordNLTK\jar\slf4j-api.jar",
     path_to_sihan_corpora_dict=r"E:\tools\stanfordNLTK\jar\data",
     path_to_model=r"E:\tools\stanfordNLTK\jar\data\pku.gz",
     path_to_dict=r"E:\tools\stanfordNLTK\jar\data\dict-chris6.ser.gz"
)
>>> # In the original post this is a Chinese sentence; the segmenter expects Chinese input.
>>> sentence = "I started a blog in Blog Garden. My blog is Fucao Weicun, and I wrote some articles on natural language processing."
>>> result = segmenter.segment(sentence)
>>> result


Execution Result :





Program interpretation: StanfordSegmenter initialization parameters:


    • path_to_jar: used to locate the jar package; segmentation in this program depends on stanford-segmenter.jar (note: all the other Stanford NLP interfaces also have a path_to_jar parameter)
    • path_to_slf4j: used to locate slf4j-api.jar, which the segmenter also depends on
    • path_to_sihan_corpora_dict: set this to the data directory inside the extracted stanford-segmenter-2015-12-09.zip; the data directory provides two models, pku.gz and ctb.gz. Note that when StanfordSegmenter is used for Chinese word segmentation, the return value is not a list but a single string in which the words are separated by spaces (see the short sketch after this list).
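
A minimal follow-up sketch, continuing from the result variable above, for turning that space-delimited string into a token list:

>>> tokens = result.split()   # the segmenter returns one space-delimited string
>>> tokens[:5]                # inspect the first few segmented words
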


StanfordTokenizer for English tokenization: related references


Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> from nltk.tokenize import StanfordTokenizer
>>> tokenizer = StanfordTokenizer(path_to_jar=r"E:\tools\stanfordNLTK\jar\stanford-parser.jar")
>>> sent = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks."
>>> print(tokenizer.tokenize(sent))
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> 


Execution Result:




4.2 Named entity recognition


StanfordNERTagger for English named entity recognition


>>> from nltk.tag import StanfordNERTagger
>>> eng_tagger = StanfordNERTagger(model_filename=r'E:\tools\stanfordNLTK\jar\classifiers\english.all.3class.distsim.crf.ser.gz', path_to_jar=r'E:\tools\stanfordNLTK\jar\stanford-ner.jar')
>>> print(eng_tagger.tag('Rami Eid is studying at Stony Brook University in NY'.split()))
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]


Operation result :





StanfordNERTagger for Chinese named entity recognition


>>> result
'Chengdu University of Information Technology, Sichuan Province. I started a blog in Blog Garden. My blog is Fucao Weicun, and I wrote some articles on natural language processing.\r\n'
>>> from nltk.tag import StanfordNERTagger
>>> chi_tagger = StanfordNERTagger(model_filename=r'E:\tools\stanfordNLTK\jar\classifiers\chinese.misc.distsim.crf.ser.gz', path_to_jar=r'E:\tools\stanfordNLTK\jar\stanford-ner.jar')
>>> for word, tag in chi_tagger.tag(result.split()):
    print(word, tag)


Operation result :





4.3 POS tagging


StanfordPOSTagger for English POS tagging


>>> from nltk.tag import StanfordPOSTagger
>>> eng_tagger = StanfordPOSTagger(model_filename=r'E:\tools\stanfordNLTK\jar\models\english-bidirectional-distsim.tagger', path_to_jar=r'E:\tools\stanfordNLTK\jar\stanford-postagger.jar')
>>> print(eng_tagger.tag(‘What is the airspeed of an unladen swallow ?‘.split()))


Operation result :





StanfordPOSTagger for Chinese POS tagging


>>> from nltk.tag import StanfordPOSTagger
>>> chi_tagger = StanfordPOSTagger(model_filename=r'E:\tools\stanfordNLTK\jar\models\chinese-distsim.tagger', path_to_jar=r'E:\tools\stanfordNLTK\jar\stanford-postagger.jar')
>>> result
'Chengdu University of Information Technology, Sichuan Province. I started a blog in Blog Garden. My blog is Fucao Weicun, and I wrote some articles on natural language processing.\r\n'
>>> print(chi_tagger.tag(result.split()))


Operation result :








4.4 Syntactic parsing: reference documents


StanfordParser for English syntactic parsing


>>> from nltk.parse.stanford import StanfordParser
>>> eng_parser = StanfordParser(r"E:\tools\stanfordNLTK\jar\stanford-parser.jar",r"E:\tools\stanfordNLTK\jar\stanford-parser-3.6.0-models.jar",r"E:\tools\stanfordNLTK\jar\classifiers\englishPCFG.ser.gz")
>>> print(list(eng_parser.parse("the quick brown fox jumps over the lazy dog".split())))


Operation result :
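
The original post shows this parse only as a screenshot, which is not reproduced here. If you want to inspect the result yourself, a minimal sketch using NLTK's standard Tree API (continuing from eng_parser above) is:

>>> trees = list(eng_parser.parse("the quick brown fox jumps over the lazy dog".split()))
>>> print(trees[0])    # bracketed constituency tree
>>> trees[0].draw()    # opens a window with a graphical rendering of the tree
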





StanfordParser for Chinese syntactic parsing


>>> from nltk.parse.stanford import StanfordParser
>>> chi_parser = StanfordParser(r"E:\tools\stanfordNLTK\jar\stanford-parser.jar", r"E:\tools\stanfordNLTK\jar\stanford-parser-3.6.0-models.jar", r"E:\tools\stanfordNLTK\jar\classifiers\chinesePCFG.ser.gz")
>>> # In the original post this is a space-segmented Chinese sentence; the Chinese PCFG expects Chinese tokens.
>>> sent = u"Beihai has become a rising star in China's opening up"
>>> print(list(chi_parser.parse(sent.split())))


Operation result :




4.5 Dependency parsing


StanfordDependencyParser for English dependency parsing


>>> from nltk.parse.stanford import StanfordDependencyParser
>>> eng_parser = StanfordDependencyParser(r"E:\tools\stanfordNLTK\jar\stanford-parser.jar",r"E:\tools\stanfordNLTK\jar\stanford-parser-3.6.0-models.jar",r"E:\tools\stanfordNLTK\jar\classifiers\englishPCFG.ser.gz")
>>> res = list(eng_parser.parse("the quick brown fox jumps over the lazy dog".split()))
>>> for row in res[0].triples():
    print(row)


Operation result :
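
The screenshot of the output is likewise not reproduced here. As a rough guide to reading it (based on NLTK's DependencyGraph API, not on the original screenshots): each item yielded by triples() has the form ((governor_word, governor_tag), relation, (dependent_word, dependent_tag)), so it can be consumed like this:

>>> for (gov, gov_tag), rel, (dep, dep_tag) in res[0].triples():
    print("%s/%s --%s--> %s/%s" % (gov, gov_tag, rel, dep, dep_tag))
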





StanfordDependencyParser for Chinese dependency parsing


>>> from nltk.parse.stanford import StanfordDependencyParser
>>> chi_parser = StanfordDependencyParser(r"E:\tools\stanfordNLTK\jar\stanford-parser.jar", r"E:\tools\stanfordNLTK\jar\stanford-parser-3.6.0-models.jar", r"E:\tools\stanfordNLTK\jar\classifiers\chinesePCFG.ser.gz")
>>> # In the original post this is a space-segmented Chinese sentence.
>>> res = list(chi_parser.parse(u"Sichuan has become a rising star in western China's opening to the world".split()))
>>> for row in res[0].triples():
    print(row)


Operation result :




5 References and further reading
    1. NLTK official website
    2. NLTK API documentation
    3. Using the Stanford Chinese word segmenter with NLTK
    4. NLTK source on GitHub





"NLP" dry foods! Python NLTK Text Processing in conjunction with Stanford NLP Toolkit

