OpenNLP: Harnessing Text, Word Segmentation and All That


Author: Bai Ningsu

March 27, 2016 19:55:03

Summary: Libraries for processing strings, character arrays, and other text representations form the basis of most text-handling programs. Most languages ship with a basic string-processing library, and such processing is a prerequisite for any text-processing or natural language processing work; typical examples are word segmentation, part-of-speech tagging, and sentence detection. The tools introduced in this article are aimed at English word segmentation. Among the many English tokenization tools, the author found Apache OpenNLP to compare well in both efficiency and ease of use, and it also provides an open-source API for Java development. The article begins with an introduction to OpenNLP, then walks through six commonly used models, and concludes with usage notes and the Java implementation for each model. Some readers may ask how Chinese word segmentation is handled: the next chapter will separately introduce NLPIR (ICTCLAS), developed by a Chinese Academy of Sciences research team on the basis of hidden Markov models. The content is compiled from multiple documents and books, and the code runs correctly. (This article is original; when reprinting, please cite the source: OpenNLP: Harnessing Text, Word Segmentation and All That)

1 What is OpenNLP, and what "internal strengths" does it have?

What is OpenNLP for?

Wikipedia: The Apache OpenNLP library is a toolkit for natural language text processing. It supports common NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction (identifying proper nouns in a sentence, for example person names), shallow parsing (chunking), syntactic parsing, and coreference resolution. These tasks are usually required to build more advanced text-processing services.

Official documentation: The Apache OpenNLP library is a machine-learning toolkit for processing natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text-processing services. OpenNLP also includes maximum-entropy and perceptron-based machine learning. The goal of the OpenNLP project is to be a mature toolkit for the tasks above; an additional goal is to provide a large number of pre-built models for a variety of languages, together with the annotated text resources those models are derived from.

  • Developer: Apache Software Foundation
  • Stable version: 1.5.2-incubating (November 28, 2011)
  • Development status: Active
  • Programming language: Java
  • Type: Natural language processing
  • Website: http://incubator.apache.org/opennlp/

Usage: OpenNLP supports Windows, Linux, and many other operating systems; this article mainly covers usage under Windows.

1 Command-line interface (CLI): The OpenNLP script uses the JAVA_CMD and JAVA_HOME variables to determine which command is used to execute the Java virtual machine, and the OPENNLP_HOME variable to determine the location of the OpenNLP binary distribution. It is recommended to point this variable at the current OpenNLP version and to add $OPENNLP_HOME/bin or %OPENNLP_HOME%\bin to the PATH variable. Such a configuration makes it easy to invoke OpenNLP. The examples below assume this configuration has been completed. Usage:

$ opennlp ToolName lang-model-name.bin < input.txt > output.txt
When a tool is executed this way, the model is loaded and the tool waits for input from standard input. The input is processed and printed to standard output.

2 Java API: call the API from your own code; the code in the following sections demonstrates this approach.

2 Sentence Detector

Feature description: The OpenNLP sentence detector can detect whether a punctuation character marks the end of a sentence or not. In this sense, a sentence is defined as the longest whitespace-trimmed character sequence between two punctuation marks. The first and the last sentence are exceptions to this rule: the first non-whitespace character is assumed to begin a sentence, and the last non-whitespace character is assumed to end one. The sample text below is then split into its sentences.
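The boundary rule above can be caricatured in a few lines of plain Java. This is only a toy sketch (the class and method names here are invented for illustration); OpenNLP's real detector is model-based and can, for example, avoid splitting after abbreviations, which a bare regex cannot.

```java
// Toy sentence splitter: cut after '.', '!' or '?' followed by whitespace.
// Illustrates the boundary idea only; this is NOT OpenNLP's algorithm.
public class SentenceSketch {
    public static String[] split(String text) {
        // Lookbehind keeps the end-of-sentence punctuation attached to its sentence.
        return text.trim().split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        for (String s : split("It works. Does it scale? Yes.")) {
            System.out.println(s);
        }
    }
}
```

Running it prints "It works.", "Does it scale?" and "Yes." on separate lines.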

API: The sentence detector also provides an API to train a new sentence detection model. Three basic steps are necessary to train it:

    • The application must open a sample data stream
    • Call the SentenceDetectorME.train method
    • Save the SentenceModel to a file, or use it directly

Code implementation:

/**
 * 1. Sentence detector: SentenceDetector
 * The sentence detector detects sentence boundaries. Given the paragraph
 * "Hi. How is it? This is Mike." it returns:
 *   Hi. How is it?
 *   This is Mike.
 * @throws IOException
 * @throws InvalidFormatException
 */
public static void sentenceDetector(String str) throws InvalidFormatException, IOException {
    // Always start with a model; a model is learned from training data
    InputStream is = new FileInputStream("./nlpbin/en-sent.bin");
    SentenceModel model = new SentenceModel(is);
    SentenceDetectorME sdetector = new SentenceDetectorME(model);
    String sentences[] = sdetector.sentDetect(str);
    System.out.println(sentences[0]);
    System.out.println(sentences[1]);
    is.close();
    System.out.println("---------------1------------");
}

Operation Result:

3 Tokenizer

Feature description: The OpenNLP tokenizer segments an input character sequence into tokens. Tokens are usually words, punctuation marks, numbers, and so on. OpenNLP ships several tokenizer implementations:

    • Whitespace tokenizer: every non-whitespace character sequence is identified as a token
    • Simple tokenizer: a character-class tokenizer; sequences of the same character class form a token
    • Learnable tokenizer: a maximum-entropy tokenizer that detects token boundaries with a probabilistic model
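The difference between the first two strategies can be shown with a dozen lines of plain Java. This is a toy sketch with invented names, not OpenNLP's WhitespaceTokenizer or SimpleTokenizer classes: the whitespace strategy keeps "isn't" and "it?" whole, while grouping by character class splits off the punctuation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy contrast of whitespace tokenization vs. grouping runs of the
// same character class (letters/digits vs. everything else).
public class TokenizerSketch {
    public static String[] whitespace(String s) {
        return s.trim().split("\\s+");
    }

    public static List<String> byCharClass(String s) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : s.toCharArray()) {
            // A token ends at whitespace or when the character class changes.
            boolean boundary = cur.length() > 0
                    && (Character.isWhitespace(c)
                        || Character.isLetterOrDigit(c)
                           != Character.isLetterOrDigit(cur.charAt(cur.length() - 1)));
            if (boundary) {
                out.add(cur.toString());
                cur.setLength(0);
            }
            if (!Character.isWhitespace(c)) cur.append(c);
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", whitespace("isn't it?")));   // isn't|it?
        System.out.println(String.join("|", byCharClass("isn't it?"))); // isn|'|t|it|?
    }
}
```

The learnable tokenizer goes further still: a trained model lets it split "isn't" into "is" and "n't", which no fixed character-class rule can do.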

API: The tokenizers can be integrated into an application via the API they define. The shared instance of WhitespaceTokenizer can be retrieved from the static field WhitespaceTokenizer.INSTANCE, and the shared instance of SimpleTokenizer from SimpleTokenizer.INSTANCE in the same way. To instantiate TokenizerME (the learnable tokenizer), a token model must be created first.

Code implementation:

/**
 * 2. Tokenizer
 * Tokens are usually words separated by spaces, but there are exceptions:
 * for example, "isn't" gets split into "is" and "n't", since it is a short
 * form of "is not". The sentence is separated into the following tokens:
 */
public static void tokenize(String str) throws InvalidFormatException, IOException {
    InputStream is = new FileInputStream("./nlpbin/en-token.bin");
    TokenizerModel model = new TokenizerModel(is);
    Tokenizer tokenizer = new TokenizerME(model);
    String tokens[] = tokenizer.tokenize(str);
    for (String a : tokens)
        System.out.println(a);
    is.close();
    System.out.println("--------------2-------------");
}

Operation Result:

4 Name Finder

Feature description: The name finder detects named entities and numbers in text. To detect entities, the name finder needs a model; the model is specific to the language and to the entity type it was trained for. The OpenNLP project offers a number of pre-trained name-finder models, trained on various freely available corpora; they can be downloaded from the model download page. To find names in raw text, the text must first be segmented into sentences and tokens; see the sentence detector and tokenizer sections above for details. It is important that the tokenization of the input text matches the tokenization of the training data.
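The name finder reports each hit as a span over token indices with an exclusive end (roughly printed as `[0..3) person`). The convention is easy to illustrate with a small stand-alone class; this is a toy re-implementation for explanation, not OpenNLP's own Span type.

```java
// Toy version of the span convention used by the name finder:
// a span covers token indices [start, end) -- the end index is exclusive.
public class SpanSketch {
    final int start, end;
    final String type;

    SpanSketch(int start, int end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }

    // Join the tokens the span covers back into the surface text.
    String coveredText(String[] tokens) {
        return String.join(" ", java.util.Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] sentence = {"Mike", "Tom", "Smith", "is", "a", "good", "person"};
        SpanSketch person = new SpanSketch(0, 3, "person");
        System.out.println(person.coveredText(sentence)); // Mike Tom Smith
    }
}
```

The exclusive end is what makes span length simply `end - start`, and adjacent spans share a boundary index without overlapping.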

API: It is recommended to use the training API rather than the command-line tool to train a name finder from within an application. Three basic steps are necessary to train it:

    • The application must open a sample data stream
    • Call the NameFinderME.train method
    • Save the TokenNameFinderModel to a file or database

Code implementation:

/**
 * 3. Name finder
 * As its name suggests, the name finder finds names in text. The example
 * below accepts an array of tokens and finds the names inside it.
 */
public static void findName() throws IOException {
    InputStream is = new FileInputStream("./nlpbin/en-ner-person.bin");
    TokenNameFinderModel model = new TokenNameFinderModel(is);
    is.close();
    NameFinderME nameFinder = new NameFinderME(model);
    String[] sentence = new String[]{"Mike", "Tom", "Smith", "is", "a", "good", "person"};
    Span nameSpans[] = nameFinder.find(sentence);
    for (Span s : nameSpans)
        System.out.println(s.toString());
    System.out.println("--------------3-------------");
}

Operation Result:

5 POS Tagger

Feature description: The part-of-speech tagger assigns each token its word type, based on the token itself and on its context. A token can have multiple possible POS tags, depending on the token and the context. The OpenNLP POS tagger uses a probabilistic model to predict the correct POS tag from the tag set. To limit the possible tags for a token, a tag dictionary can be used, which improves both the tagging quality and the runtime performance of the tagger.
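The tagger's sample output pairs each token with its tag using an underscore, e.g. `Hi._NNP how_WRB`. Decoding that notation is a one-liner; the helper below is a toy illustration for this article, not part of the OpenNLP API.

```java
// Split a "word_TAG" pair at the last underscore. Using lastIndexOf
// keeps a word like "Hi." intact even if a word itself contained '_'.
public class PosSketch {
    public static String[] decode(String pair) {
        int i = pair.lastIndexOf('_');
        return new String[]{pair.substring(0, i), pair.substring(i + 1)};
    }

    public static void main(String[] args) {
        for (String p : "Hi._NNP how_WRB are_VBP you_PRP".split("\\s+")) {
            String[] wordAndTag = decode(p);
            System.out.println(wordAndTag[0] + " -> " + wordAndTag[1]);
        }
    }
}
```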

API: The part-of-speech tagging training API supports training a new POS model. Three basic steps are necessary to train it:

    • The application must open a sample data stream
    • Call the POSTaggerME.train method
    • Save the POSModel to a file or database

Code implementation:

/**
 * 4. POS tagger
 * Example output: Hi._NNP how_WRB are_VBP you?_JJ this_DT is_VBZ mike._NNP
 */
public static void posTag(String str) throws IOException {
    POSModel model = new POSModelLoader().load(new File("./nlpbin/en-pos-maxent.bin"));
    PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent"); // reports throughput
    POSTaggerME tagger = new POSTaggerME(model);
    ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(str));
    perfMon.start();
    String line;
    while ((line = lineStream.read()) != null) {
        String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
        String[] tags = tagger.tag(whitespaceTokenizerLine);
        POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
        System.out.println(sample.toString());
        perfMon.incrementCounter();
    }
    perfMon.stopAndPrintFinalResult();
    System.out.println("--------------4-------------");
}

Operation Result:

6 Chunker

Feature description: Chunking divides text into syntactically related groups of words, such as noun groups and verb groups, but specifies neither their internal structure nor their role in the main sentence.
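The chunker labels each token with a chunk tag in the usual B-/I-/O scheme: B-NP opens a noun phrase, I-NP continues it, and O marks a token outside any chunk. How such a flat tag sequence turns into phrases can be sketched in plain Java (a toy grouping routine, not OpenNLP's ChunkerME):

```java
import java.util.ArrayList;
import java.util.List;

// Group tokens into phrases from B-/I-/O chunk tags (illustration only).
public class ChunkSketch {
    public static List<String> group(String[] tokens, String[] tags) {
        List<String> chunks = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].startsWith("I-") && cur.length() > 0) {
                // I- continues the currently open chunk.
                cur.append(' ').append(tokens[i]);
            } else {
                // B- starts a new chunk; O closes any open chunk.
                if (cur.length() > 0) chunks.add(cur.toString());
                cur.setLength(0);
                if (!tags[i].equals("O")) cur.append(tokens[i]);
            }
        }
        if (cur.length() > 0) chunks.add(cur.toString());
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = {"Mike", "is", "a", "good", "person"};
        String[] tags   = {"B-NP", "B-VP", "B-NP", "I-NP", "I-NP"};
        System.out.println(group(tokens, tags)); // [Mike, is, a good person]
    }
}
```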

API: The chunker provides an API to train new chunker models. The following sample code demonstrates how to use it:

Code implementation:

/**
 * 5. Chunker
 * Partitions a sentence into a set of chunks by using the tokens generated
 * by the tokenizer and the tags generated by the POS tagger.
 */
public static void chunk(String str) throws IOException {
    POSModel model = new POSModelLoader().load(new File("./nlpbin/en-pos-maxent.bin"));
    POSTaggerME tagger = new POSTaggerME(model);
    ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(str));
    String line;
    String whitespaceTokenizerLine[] = null;
    String[] tags = null;
    while ((line = lineStream.read()) != null) {
        whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
        tags = tagger.tag(whitespaceTokenizerLine);
        POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
        System.out.println(sample.toString());
    }
    // Chunker
    InputStream is = new FileInputStream("./nlpbin/en-chunker.bin");
    ChunkerModel cModel = new ChunkerModel(is);
    ChunkerME chunkerME = new ChunkerME(cModel);
    String result[] = chunkerME.chunk(whitespaceTokenizerLine, tags);
    for (String s : result)
        System.out.println(s);
    Span[] span = chunkerME.chunkAsSpans(whitespaceTokenizerLine, tags);
    for (Span s : span)
        System.out.println(s.toString());
    System.out.println("--------------5-------------");
}

Operation Result:

7 Parser

Feature description: The easiest way to try the parser is the command-line tool, which is intended for demonstration and testing purposes only. Download the English chunking parser model from the website and start the parse tool from the command line.
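The parser prints Penn-Treebank-style bracketed trees like the one shown in the code below. Reading the surface sentence back out of such a tree only requires the leaves, i.e. the word in each innermost `(TAG word)` pair. A toy regex sketch (not the OpenNLP Parse API) makes the structure concrete:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract the leaf tokens from a bracketed parse such as
// (TOP (S (NP (NN Programcreek)) (VP (VBZ is) ...)))
public class TreeSketch {
    // An innermost pair is "(TAG word)" with no nested parentheses.
    private static final Pattern LEAF = Pattern.compile("\\(\\S+ ([^()\\s]+)\\)");

    public static List<String> leaves(String tree) {
        List<String> out = new ArrayList<>();
        Matcher m = LEAF.matcher(tree);
        while (m.find()) out.add(m.group(1));
        return out;
    }

    public static void main(String[] args) {
        String tree = "(TOP (S (NP (NN Programcreek)) (VP (VBZ is) (NP (DT a) (JJ useful) (NN website)))))";
        System.out.println(String.join(" ", leaves(tree))); // Programcreek is a useful website
    }
}
```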

Code implementation:

/**
 * 6. Parser
 * Given the sentence "Programcreek is a very huge and useful website.",
 * the parser can return the bracketed parse:
 * (TOP (S (NP (NN Programcreek)) (VP (VBZ is) (NP (DT a) (ADJP (RB very) (JJ huge) (CC and) (JJ useful)))) (. website.)))
 */
public static void parse() throws InvalidFormatException, IOException {
    // http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Parser#Training_Tool
    InputStream is = new FileInputStream("./nlpbin/en-parser-chunking.bin");
    ParserModel model = new ParserModel(is);
    Parser parser = ParserFactory.create(model);
    String sentence = "Programcreek is a very huge and useful website.";
    opennlp.tools.parser.Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);
    for (opennlp.tools.parser.Parse p : topParses)
        p.show();
    is.close();
}

Operation Result:

8 References

1 Official tutorial: Apache OpenNLP Developer Documentation

2 The various models of OpenNLP

3 OpenNLP open-source tools

4 Wikipedia: OpenNLP

