Python Natural Language Processing tool summary
Bai Ningsu
November 21, 2016 21:45:26
1 Several Python natural language processing tools
- NLTK: NLTK is the leading tool for processing natural language in Python. It provides an interface to the WordNet lexical resource, along with libraries for classification, tokenization, stemming, tagging, parsing, semantic inference, and more.
- Pattern: Pattern's natural language processing tools include a part-of-speech tagger, n-gram search, a sentiment analyzer, and WordNet. It also supports machine learning with vector space models, clustering, and support vector machines.
- TextBlob: TextBlob is a Python library for processing textual data. It provides simple APIs for common natural language processing tasks such as POS tagging, noun phrase extraction, sentiment analysis, classification, translation, and so on.
- Gensim: Gensim provides topic modeling, document indexing, and similarity retrieval for large corpora. It can process data sets larger than RAM. Its author describes it as "the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text".
- PyNLPl: Its full name is Python Natural Language Processing Library (pronounced "pineapple"). It is a collection of modules for various natural language processing tasks: PyNLPl can be used for n-gram extraction, computing frequency lists and distributions, and building language models. It also offers more complex data structures such as priority queues, and more complex algorithms such as beam search.
- spaCy: This is a commercial-grade open source library. Combining Python and Cython, its natural language processing power reaches industrial strength, and its authors bill it as the fastest, most advanced natural language processing tool in the field.
- Polyglot: Polyglot supports massive multilingual text processing. It offers tokenization for 165 languages, language detection for 196 languages, named entity recognition for 40 languages, POS tagging for 16 languages, sentiment analysis for 136 languages, word embeddings for 137 languages, morphological analysis for 135 languages, and transliteration for 69 languages.
- MontyLingua: MontyLingua is a free, well-trained, end-to-end English text processor. Feed raw English text into MontyLingua and you get back a semantic interpretation of that text. It suits tasks such as information retrieval and extraction, question processing, and question answering. From English text it can extract subject/verb/object tuples, adjectives, noun and verb phrases, people's names, place names, events, dates and times, and other semantic information.
- BLLIP Parser: BLLIP Parser (also known as the Charniak-Johnson parser) is a statistical natural language tool that integrates a generative constituent parser with a maximum-entropy reranker. It includes command-line and Python interfaces.
- Quepy: Quepy is a Python framework for converting natural language into a database query language. Conversions between different kinds of natural language and database query languages can be implemented easily, so with Quepy you can build your own natural language database query system by modifying just a few lines of code. GitHub: https://github.com/machinalis/quepy
- HanLP: HanLP is a Java toolkit made up of a series of models and algorithms that aims to popularize natural language processing in production environments. It is not just a word segmenter: it provides lexical analysis, syntactic analysis, semantic understanding, and other complete functions. HanLP features complete functionality, high performance, a clear architecture, and up-to-date, customizable corpora. Usage documentation: "Python calls the natural language processing package HanLP" and "How a rookie calls HanLP".
2 OpenNLP: Chinese named entity recognition
OpenNLP is the Java natural language processing API under Apache, with complete functionality. Below we walk through the process of using OpenNLP to perform named entity recognition on a Chinese corpus.
First comes the preprocessing: word segmentation, stop-word removal, and so on will not be belabored here. In fact, it is enough for the segmented words to be separated by spaces, since OpenNLP can process a corpus in this English-like form. Some notes on character handling are mentioned below.
Second, we prepare a dictionary for each named entity category. Each dictionary is a text document whose file name is the type name of its category. The following two functions load the words of one category's dictionary and the category's type name, respectively.
/**
 * Load the named entity words from a dictionary file
 *
 * @param nameListFile
 * @return
 * @throws Exception
 */
public static List<String> loadNameWords(File nameListFile) throws Exception {
    List<String> nameWords = new ArrayList<String>();
    if (!nameListFile.exists() || nameListFile.isDirectory()) {
        System.err.println("No file exists");
        return null;
    }
    BufferedReader br = new BufferedReader(new FileReader(nameListFile));
    String line = null;
    while ((line = br.readLine()) != null) {
        nameWords.add(line);
    }
    br.close();
    return nameWords;
}

/**
 * Get the named entity type from the dictionary file name
 *
 * @param nameListFile
 * @return
 */
public static String getNameType(File nameListFile) {
    String nameType = nameListFile.getName();
    return nameType.substring(0, nameType.lastIndexOf("."));
}
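For example, the two helpers combine like this when iterating over a directory of dictionary files (the path below is hypothetical, not from the original project; run this inside a method that declares throws Exception):

// Hypothetical dictionary directory: one text file per entity type, e.g. person.txt
File dictDir = new File("data/namewords");
for (File f : dictDir.listFiles()) {
    String type = getNameType(f);          // e.g. "person" from "person.txt"
    List<String> words = loadNameWords(f); // one entity word per line
    System.out.println(type + " -> " + words.size() + " words");
}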
The training corpus OpenNLP requires looks like this:
xxxxxx <START:person> ???? <END> xxxxxxxxx <START:action> ???? <END> xxxxxxx
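For reference, one concrete line in this format, borrowing the English example from the OpenNLP name finder documentation, looks like this:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .

Each entity is wrapped in <START:type> ... <END> tags, and tokens are separated by single spaces.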
Next comes training the named entity recognition model. Code first:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.featuregen.AggregatedFeatureGenerator;
import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
import opennlp.tools.util.featuregen.TokenFeatureGenerator;
import opennlp.tools.util.featuregen.WindowFeatureGenerator;

/**
 * Chinese named entity recognition model training component
 *
 * @author ddlovehy
 */
public class NamedEntityMultiFindTrainer {

    // Default parameters
    private int iterations = 80;
    private int cutoff = 5;
    private String langCode = "general";
    private String type = "default";

    // Parameters to be set
    private String nameWordsPath; // named entity dictionary path
    private String dataPath;      // path of the already-segmented training corpus
    private String modelPath;     // model storage path

    public NamedEntityMultiFindTrainer() {
        super();
    }

    public NamedEntityMultiFindTrainer(String nameWordsPath, String dataPath, String modelPath) {
        super();
        this.nameWordsPath = nameWordsPath;
        this.dataPath = dataPath;
        this.modelPath = modelPath;
    }

    public NamedEntityMultiFindTrainer(int iterations, int cutoff, String langCode, String type,
            String nameWordsPath, String dataPath, String modelPath) {
        super();
        this.iterations = iterations;
        this.cutoff = cutoff;
        this.langCode = langCode;
        this.type = type;
        this.nameWordsPath = nameWordsPath;
        this.dataPath = dataPath;
        this.modelPath = modelPath;
    }

    /**
     * Generate custom features
     *
     * @return
     */
    public AggregatedFeatureGenerator prodFeatureGenerators() {
        AggregatedFeatureGenerator featureGenerators = new AggregatedFeatureGenerator(
                new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
                new WindowFeatureGenerator(new TokenClassFeatureGenerator(), 2, 2),
                new PreviousMapFeatureGenerator());
        return featureGenerators;
    }

    /**
     * Write the model to disk
     *
     * @param model
     * @throws Exception
     */
    public void writeModelIntoDisk(TokenNameFinderModel model) throws Exception {
        File outModelFile = new File(this.getModelPath());
        FileOutputStream outModelStream = new FileOutputStream(outModelFile);
        model.serialize(outModelStream);
    }

    /**
     * Read out the annotated training corpus
     *
     * @return
     * @throws Exception
     */
    public String getTrainCorpusDataStr() throws Exception {
        // TODO consider persistence and incremental training
        // NameEntityTextFactory builds the annotated training text from the
        // dictionaries and the segmented corpus (defined elsewhere in the project)
        String trainDataStr = null;
        trainDataStr = NameEntityTextFactory.prodNameFindTrainText(
                this.getNameWordsPath(), this.getDataPath(), null);
        return trainDataStr;
    }

    /**
     * Train the model
     *
     * @param trainDataStr the annotated training data as one string
     * @return
     * @throws Exception
     */
    public TokenNameFinderModel trainNameEntitySamples(String trainDataStr) throws Exception {
        ObjectStream<NameSample> nameEntitySample = new NameSampleDataStream(
                new PlainTextByLineStream(new StringReader(trainDataStr)));
        System.out.println("**************************************");
        System.out.println(trainDataStr);
        TokenNameFinderModel nameFinderModel = NameFinderME.train(
                this.getLangCode(), this.getType(), nameEntitySample,
                this.prodFeatureGenerators(),
                Collections.<String, Object> emptyMap(),
                this.getIterations(), this.getCutoff());
        return nameFinderModel;
    }

    /**
     * Overall entry point of the training component
     *
     * @return
     */
    public boolean execNameFindTrainer() {
        try {
            String trainDataStr = this.getTrainCorpusDataStr();
            TokenNameFinderModel nameFinderModel = this.trainNameEntitySamples(trainDataStr);
            // System.out.println(nameFinderModel);
            this.writeModelIntoDisk(nameFinderModel);
            return true;
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }
    }

    // Getters and setters for the fields above are omitted here
}
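With the class above in place, a minimal sketch of kicking off training might look like this (all three paths are placeholders to adapt to your own layout):

NamedEntityMultiFindTrainer trainer = new NamedEntityMultiFindTrainer(
        "data/namewords",    // directory of per-type entity dictionaries (hypothetical path)
        "data/corpus.txt",   // segmented training corpus (hypothetical path)
        "models/zh-ner.bin"  // where the trained model will be written (hypothetical path)
);
System.out.println(trainer.execNameFindTrainer() ? "model trained" : "training failed");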
Note:
- iterations: the number of training iterations. Too few and the training has little effect; too many causes overfitting, so experiment to see what gives the best results;
- cutoff: the feature frequency cutoff, i.e. the minimum number of times a feature must be seen in the training data before the model uses it; setting it to 5 is generally fine;
- langCode and type: the language code and the entity category. Since there is no specific code for Chinese here, langCode is set to "general"; and because we want to train one model that recognizes several kinds of entities, type is set to "default".
Description:
- The prodFeatureGenerators() method generates a custom feature generator, i.e. it chooses the kind of n-gram semantic model to use. The code selects a window that scans two words before and after the current token when computing features (five tokens including the token itself). There may be a deeper, more precise interpretation; corrections are welcome;
- The trainNameEntitySamples() method is the core of training the model. It first turns the annotated training corpus string generated above into a character stream, then calls NameFinderME.train() with the parameters set above, the custom feature generator, and so on. As for the resource mapping parameter, an empty map is passed in by default. (A sketch of loading and using the trained model follows this list.)
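The post stops at training, so as a complement here is a minimal sketch of loading the saved model and tagging a segmented sentence, using only the standard OpenNLP 1.5-era API (the model path and the sentence are placeholders):

import java.io.FileInputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NameFindDemo {
    public static void main(String[] args) throws Exception {
        // Load the model written by writeModelIntoDisk() (hypothetical path)
        TokenNameFinderModel model = new TokenNameFinderModel(
                new FileInputStream("models/zh-ner.bin"));
        NameFinderME finder = new NameFinderME(model);
        // Input must already be segmented, one word per array element
        String[] tokens = "token1 token2 token3".split(" "); // placeholder sentence
        Span[] spans = finder.find(tokens);
        for (Span span : spans) {
            StringBuilder sb = new StringBuilder();
            for (int i = span.getStart(); i < span.getEnd(); i++) {
                sb.append(tokens[i]).append(" ");
            }
            System.out.println(span.getType() + ": " + sb.toString().trim());
        }
    }
}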
Source code: https://github.com/Ailab403/ailab-mltk4j — the test package contains the complete call demo, and the accompanying folder contains the test corpus and the already-trained model.
3 Stanford NLP: Chinese named entity recognition
Use the Stanford Word Segmenter and the Stanford Named Entity Recognizer (NER) to implement Chinese named entity recognition.
1. Introduction to the segmenter
Stanford's word segmenter requires JDK 1.8+. Download stanford-segmenter-2014-10-26 from the link above and unzip it. In its data directory there are two gzipped model files, ctb.gz and pku.gz: CTB is the model trained on the Penn Chinese Treebank, and PKU is the model trained on corpora provided by Peking University. You can of course also train your own; a training example can be found at http://nlp.stanford.edu/software/trainSegmenter-20080521.tar.gz

2. Introduction to NER

Stanford NER is implemented in Java and recognizes entities (person, organization, location). Research published using the software should cite the paper at http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf. On the NER page two archives can be downloaded, stanford-ner-2014-10-26 and stanford-ner-2012-11-11-chinese. Unzipping the two files shows that the default NER handles English; if you need to handle Chinese, it must be handled separately.

3. Using the segmenter and NER together

Create a new Java project in Eclipse and copy the data directory to the project root. Then copy all the contents of stanford-ner-2012-11-11-chinese into the classifiers folder, add stanford-segmenter-3.5.0.jar to the classpath, copy the classifiers folder to the project root, and add stanford-ner-3.5.0.jar and stanford-ner.jar to the classpath. Finally, download stanford-corenlp-full-2014-10-31 from http://nlp.stanford.edu/software/corenlp.shtml and, after unzipping, add stanford-corenlp-3.5.0.jar to the classpath as well. The final Eclipse project structure is as follows. For Chinese NER, the key point is clear from this setup: the output of the Chinese word segmenter is fed to NER as input, and NER then performs the recognition. For ease of testing, this demo uses junit-4.10.jar. The code follows:
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

import org.apache.commons.io.FileUtils;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

/**
 * <p>
 * Description: use Stanford CoreNLP for Chinese word segmentation
 * </p>
 */
public class ZH_SegDemo {

    public static CRFClassifier<CoreLabel> segmenter;

    static {
        // Set the initialization parameters of the CTB segmentation model
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        segmenter = new CRFClassifier<CoreLabel>(props);
        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);
        segmenter.flags.setProperties(props);
    }

    public static String doSegment(String sent) {
        List<String> strs = segmenter.segmentString(sent);
        StringBuilder buf = new StringBuilder();
        for (String s : strs) {
            buf.append(s).append(" ");
        }
        System.out.println("segmented res: " + buf.toString());
        return buf.toString();
    }

    public static void main(String[] args) {
        try {
            // Test file named after the sample news headline in the original post
            String text = FileUtils.readFileToString(
                    new File("Macau 141 people with food poisoning problem oysters"));
            String segmented = doSegment(text);
            System.out.println(segmented);
            ExtractDemo extractDemo = new ExtractDemo();
            System.out.println(extractDemo.doNer(segmented));
            System.out.println("complete!");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
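The ExtractDemo class called in main() is not shown in the excerpt above. Here is a minimal sketch of what it plausibly looks like, assuming the standard Stanford NER API and the chinese.misc.distsim.crf.ser.gz model shipped inside stanford-ner-2012-11-11-chinese (both are assumptions, not confirmed by the original post):

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

/**
 * Minimal NER wrapper: feeds already-segmented text to the Chinese CRF classifier.
 */
public class ExtractDemo {

    private CRFClassifier<CoreLabel> classifier;

    public ExtractDemo() {
        // Model path is an assumption based on the archive contents described above
        classifier = CRFClassifier.getClassifierNoExceptions(
                "classifiers/chinese.misc.distsim.crf.ser.gz");
    }

    public String doNer(String segmentedText) {
        // "inlineXML" wraps recognized entities in tags such as <PERSON>...</PERSON>
        return classifier.classifyToString(segmentedText, "inlineXML", false);
    }
}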
Note that this must run in a JDK 1.8+ environment. The final output is as follows: