Java Natural Language Processing NLP Toolkit

Source: Internet
Author: User

Natural language processing

1. Java natural language processing: LingPipe

LingPipe is an open source Java toolkit for natural language processing. It currently offers a wide range of features, including topic classification, named entity recognition, part-of-speech tagging, sentence detection, query spell checking, interesting phrase detection, clustering, character language modeling, MEDLINE download/parsing/indexing, database text mining, Chinese word segmentation, sentiment analysis, and language identification, all exposed as APIs.

Download Link: http://alias-i.com/lingpipe/web/download.html

2. Chinese natural language processing toolkit: FudanNLP

FudanNLP is a toolkit developed primarily for Chinese natural language processing. It also includes the machine learning algorithms and datasets used to implement these tasks.

Demo address: http://jkx.fudan.edu.cn/nlp/query

FudanNLP currently implements the following:

    1. Chinese processing tools
      1. Chinese word segmentation
      2. POS tagging
      3. Named entity recognition
      4. Syntactic parsing
      5. Time expression recognition
    2. Information retrieval
      1. Text classification
      2. News clustering
      3. Chinese word segmentation for Lucene
    3. Machine learning
      1. Averaged perceptron
      2. Passive-aggressive algorithm
      3. K-means
      4. Exact inference

Download Link: http://code.google.com/p/fudannlp/downloads/list

3. Natural language processing tool: Apache OpenNLP

OpenNLP is a machine learning based toolkit for processing natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, and parsing.

Download Link: http://opennlp.apache.org/

4. Natural language processing tool: CRF++

CRF++ is a well-known open source conditional random field (CRF) tool, and is widely regarded as the CRF tool with the best overall performance. CRF++ itself is a relatively old tool, but given its good performance it remains very important for natural language processing.

The nlpbamboo Chinese word segmenter is built on this tool.

Download Link: http://sourceforge.net/projects/crfpp/files/
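CRF++ reads training data as one token per line with whitespace-separated columns, the last column holding the label, and an empty line between sentences. For word segmentation, a common convention (an assumption here, not something the text above specifies) is to treat each character as a token and label it B, M, E, or S for the beginning, middle, or end of a word, or a single-character word. A minimal sketch of producing such rows from a pre-segmented sentence follows; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CrfTrainingRowsSketch {

    /** Converts one pre-segmented sentence into "character<TAB>tag" rows. */
    static List<String> toRows(List<String> words) {
        List<String> rows = new ArrayList<>();
        for (String word : words) {
            for (int i = 0; i < word.length(); i++) {
                String tag;
                if (word.length() == 1) {
                    tag = "S";                      // single-character word
                } else if (i == 0) {
                    tag = "B";                      // first character of a word
                } else if (i == word.length() - 1) {
                    tag = "E";                      // last character of a word
                } else {
                    tag = "M";                      // character inside a word
                }
                rows.add(word.charAt(i) + "\t" + tag);
            }
        }
        rows.add(""); // an empty line marks the sentence boundary
        return rows;
    }

    public static void main(String[] args) {
        // Three words meaning "I", "like", "natural language",
        // written as Unicode escapes to keep the source ASCII-only.
        List<String> sentence = Arrays.asList(
            "\u6211", "\u559c\u6b22", "\u81ea\u7136\u8bed\u8a00");
        for (String row : toRows(sentence)) {
            System.out.println(row);
        }
    }
}
```

The rows can be written to a file and passed to the CRF++ training command together with a feature template; the template itself is outside the scope of this sketch.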

5. Stanford CoreNLP (Stanford University NLP)

A very impressive library.

http://search.maven.org/#browse%7c11864822

Since I started studying natural language processing, I have come across and heard of many open source NLP tools. This is a summary to make my own later study easier. Some of these tools I have used myself and others I do not yet know well; I will update this summary once I am familiar with them.

Word segmentation components

1. IK Analyzer

IK Analyzer is an open source, lightweight Chinese word segmentation toolkit written in Java. Since version 1.0 was released in December 2006, IK Analyzer has gone through several versions; the current release is U6. It was originally built on Lucene, and since version 3.0 it has been a general-purpose word segmentation component for Java that is independent of Lucene. Download: http://git.oschina.net/wltea/IK-Analyzer-2012FF. IK supports both fine-grained and intelligent segmentation modes, handles English letters, numerals, and Chinese words, and is compatible with Korean and Japanese characters. User-defined dictionaries are supported through the IKAnalyzer.cfg.xml file, in which you can configure custom extension dictionaries and stop word dictionaries. Dictionaries must be encoded in UTF-8 without a BOM, with one word per line. The configuration file looks like this:

    <properties>
      <comment>IK Analyzer extended configuration</comment>
      <!-- users can configure their own extension dictionary here -->
      <entry key="ext_dict">ext.dic;</entry>
      <!-- users can configure their own extension stop word dictionary here -->
      <entry key="ext_stopwords">stopword.dic;chinese_stopword.dic</entry>
    </properties>
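The configuration above uses the same element layout as Java's XML properties format, so it can be inspected with java.util.Properties. The following is a minimal sketch of reading the dictionary entries, not a description of how IK loads its configuration internally. The XML declaration and DOCTYPE lines in the sample string are additions assumed here because Properties.loadFromXML documents that DOCTYPE; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class IkConfigSketch {

    // Sample configuration. The first two lines are an assumption added
    // for Properties.loadFromXML; they are not part of the listing above.
    static final String SAMPLE_CONFIG =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <comment>IK Analyzer extended configuration</comment>\n"
        + "  <entry key=\"ext_dict\">ext.dic;</entry>\n"
        + "  <entry key=\"ext_stopwords\">stopword.dic;chinese_stopword.dic</entry>\n"
        + "</properties>\n";

    /** Returns the semicolon-separated file names stored under the given key. */
    static List<String> dictionaryFiles(String xml, String key) {
        Properties props = new Properties();
        try {
            props.loadFromXML(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> files = new ArrayList<>();
        for (String name : props.getProperty(key, "").split(";")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                files.add(trimmed); // skip the empty piece after a trailing ';'
            }
        }
        return files;
    }

    public static void main(String[] args) {
        System.out.println(dictionaryFiles(SAMPLE_CONFIG, "ext_dict"));
        System.out.println(dictionaryFiles(SAMPLE_CONFIG, "ext_stopwords"));
    }
}
```

A check like this can help confirm that every file named in the configuration actually exists before the segmenter is started.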

Deploying IK is simple: put IKAnalyzer2012_u6.jar in the project's lib directory and place the IKAnalyzer.cfg.xml file and the dictionary files in src. You can then call it through the API.
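Because dictionaries must be UTF-8 without a BOM and hold one word per line, it can be useful to sanity-check a custom dictionary before deploying it. The sketch below is a standalone loader written under those stated rules; it is not IK's own loading code, and the class and method names are hypothetical. It tolerates a leading BOM rather than rejecting it, which is a design choice made only for this illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

class DictionaryLoaderSketch {

    /** Reads one word per line from a UTF-8 stream, dropping a leading BOM. */
    static Set<String> loadWords(InputStream in) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            boolean firstLine = true;
            while ((line = reader.readLine()) != null) {
                // Java does not strip the BOM itself, so remove it by hand.
                if (firstLine && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                firstLine = false;
                line = line.trim();
                if (!line.isEmpty()) {
                    words.add(line); // duplicates and blank lines are ignored
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        byte[] data = "\uFEFFalpha\nbeta\n\nalpha\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(loadWords(new ByteArrayInputStream(data)));
    }
}
```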

Example code:

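    import java.io.StringReader;
    import org.wltea.analyzer.core.IKSegmenter;
    import org.wltea.analyzer.core.Lexeme;

    /**
     * Segments a string with IK and returns the words joined by spaces.
     * Assumes "logger" is a logging field defined in the enclosing class.
     *
     * @return the segmented words, each followed by a space
     */
    public String spiltWords(String srcString) {
        StringBuffer wordsBuffer = new StringBuffer("");
        try {
            // true selects the intelligent (smart) segmentation mode
            IKSegmenter ik = new IKSegmenter(new StringReader(srcString), true);
            Lexeme lex = null;
            // next() returns null once the input is exhausted
            while ((lex = ik.next()) != null) {
                System.out.print(lex.getLexemeText() + " ");
                wordsBuffer.append(lex.getLexemeText()).append(" ");
            }
        } catch (Exception e) {
            logger.error(e.getMessage());
        }
        return wordsBuffer.toString();
    }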

IK is simple, easy to extend, gives good segmentation results, and is written in Java. Since most of my projects are in Java, it is my preferred tool for word segmentation.

2. ICTCLAS (Chinese Academy of Sciences)

ICTCLAS is a word segmenter developed over several years by the Institute of Computing Technology of the Chinese Academy of Sciences, written in C++. The latest version is named ICTCLAS2013, also known as the NLPIR Chinese word segmentation system; the official website is http://ictclas.nlpir.org/. Its main functions include Chinese word segmentation, POS tagging, named entity recognition, user dictionaries, support for GBK, UTF-8, and BIG5 encodings, new word discovery, and keyword extraction. It can be used through a graphical interface or through API calls.

3. FudanNLP

FudanNLP is a toolkit developed primarily for Chinese natural language processing. It also includes the machine learning algorithms and datasets used to implement these tasks. FudanNLP and the datasets it contains are released under the LGPL 3.0 license.

Key features include:

Information retrieval: Text classification, news clustering.

Chinese Processing: Chinese word segmentation, POS tagging, entity name recognition, keyword extraction, dependency syntax analysis, time phrase recognition.

Structured learning: Online learning, hierarchical classification, clustering, and exact inference.

The tool is written in Java and provides an API. The latest version is FudanNLP-1.6.1, available at: http://code.google.com/p/fudannlp/.

To use it, deploy fudannlp.jar and the jars in lib to your project's lib directory. The models folder holds the model files, mainly for word segmentation, POS tagging, and named entity recognition, along with the dictionaries required for segmentation. The example folder holds sample code that helps you get started quickly. java-docs is the API documentation, and src stores the source code. The PDF document gives a more detailed introduction along with some basic background on natural language processing.

Initialization takes a while when the program first runs, and loading the models uses a large amount of memory. In my experience, the syntactic parsing results are not very accurate.

4. The Stanford Natural Language Processing Group

The Stanford NLP Group is the natural language processing team at Stanford University. It has developed several NLP tools, listed at: http://nlp.stanford.edu/software/index.shtml. The tools it develops include the following:

4.1 Stanford CoreNLP

A Java-based processing tool for English. Download: http://nlp.stanford.edu/software/corenlp.shtml. Its main functions include tokenization, POS tagging, named entity recognition, and syntactic parsing.

I have used it for lemmatizing English words; for the specific application, see the article "Using Stanford CoreNLP for lemmatization."

4.2 Stanford Word Segmenter

This tool uses a CRF (conditional random field) algorithm for word segmentation. It is also written in Java and supports Chinese and Arabic. It officially requires Java 1.6 or later, and at least 1 GB of memory is recommended. Download: http://nlp.stanford.edu/software/segmenter.shtml.

A simple example program:

    import java.util.List;
    import java.util.Properties;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    // Set the segmenter properties.
    Properties props = new Properties();
    // Dictionary directory; an absolute path such as D:/data also works.
    props.setProperty("sighanCorporaDict", "data");
    // Compressed dictionary file; an absolute path also works.
    props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
    // Encoding of the input text.
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");
    // Initialize the segmenter.
    CRFClassifier<CoreLabel> classifier = new CRFClassifier<>(props);
    // Load the segmenter model from the serialized file.
    classifier.loadClassifierNoExceptions("data/ctb.gz", props);
    // flags must be re-set after data is loaded
    classifier.flags.setProperties(props);
    // Segment the text.
    List<String> words = classifier.segmentString("statement content");

4.3 Stanford POS Tagger

A part-of-speech tagging tool written in Java, for English, Chinese, French, Arabic, and German. Download: http://nlp.stanford.edu/software/tagger.shtml. I have not used it yet and need to study it later.

4.4 Stanford Named Entity Recognizer

A named entity recognition tool based on the conditional random field model. Download: http://nlp.stanford.edu/software/CRF-NER.shtml. I have not used it yet and need to study it later.

4.5 Stanford Parser

A syntactic parser that supports English, Chinese, Arabic, and French. Download: http://nlp.stanford.edu/software/lex-parser.shtml. For details on usage, see the article "Using Stanford Parser for Chinese syntactic analysis."

4.6 Stanford Classifier

A classifier written in Java. Download: http://nlp.stanford.edu/software/classifier.shtml. I have not used it yet and need to study it later.

5. Jcseg

Jcseg is a lightweight Chinese word segmenter based on the MMSEG algorithm. It integrates keyword extraction, key phrase extraction, key sentence extraction, and automatic article summarization, and provides a Jetty-based web server so that other major languages can call it directly over HTTP. It also provides word segmentation interfaces for recent versions of Lucene, Solr, and Elasticsearch. Jcseg comes with a jcseg.properties file for quickly configuring a segmenter suited to different scenarios, for example the maximum matching word length, whether to enable Chinese name recognition, whether to append pinyin, and whether to append synonyms.

Project address: https://github.com/lionsoul2014/jcseg
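To give a feel for dictionary-based matching and for what a "maximum matching word length" setting controls, here is a minimal sketch of plain forward maximum matching, the simple strategy that MMSEG builds on. It is not Jcseg's implementation and omits MMSEG's chunk-based disambiguation rules; the class name, method name, and tiny dictionary are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MaxMatchSketch {

    /**
     * Forward maximum matching: at each position take the longest dictionary
     * word of at most maxWordLength characters, else a single character.
     */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        if (maxWordLength < 1) {
            throw new IllegalArgumentException("maxWordLength must be at least 1");
        }
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word or one character.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Dictionary entries are written as Unicode escapes to keep the source ASCII-only.
        Set<String> dict = new HashSet<>(Arrays.asList(
            "\u7814\u7a76",        // "research"
            "\u7814\u7a76\u751f",  // "graduate student"
            "\u751f\u547d",        // "life"
            "\u8d77\u6e90"));      // "origin"
        // The input means "study the origin of life".
        System.out.println(segment("\u7814\u7a76\u751f\u547d\u8d77\u6e90", dict, 3));
    }
}
```

With this dictionary the greedy match picks the three-character word "graduate student" first and leaves a stray single character behind, which is the kind of ambiguity that MMSEG's additional rules are designed to reduce.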
