[Word segmentation] A first trial of ansj_seg, an open-source Chinese word segmenter for Java

Recently I needed to run semantic analysis on the 600,000+ reviews from the Dianping (大众点评) website, so a word segmentation tool was a must. At first I chose the NLPIR Chinese word segmentation system (also known as ICTCLAS2014); the NLPIR tutorial is in "[Word segmentation] NLPIR/ICTCLAS2014 segmentation system: first use of the C++ API on Windows". My intuitive impression, however, was that its segmentation quality was not ideal, so I switched tools. A classmate recommended the Chinese segmenter ansj. I have also been learning Java recently, so a Java jar package suits me well: it is far less of a hassle than a DLL, and adding a jar to the classpath in Eclipse is very simple. With auto-completion on top of that, it is no trouble at all.

Download the jar package:

GitHub home page: ansj word segmentation. It describes itself as a true Java implementation of ICT (ICTCLAS) whose segmentation quality and speed both surpass the open-source version of ICT. Features: Chinese word segmentation, person-name recognition, part-of-speech (POS) tagging, and user-defined dictionaries.

    • Visit http://maven.ansj.org/org/ansj/ and download ansj_seg/, preferably the latest version.
      • If you are using a 1.x version, you also need to download tree_split.jar.
      • If you are using a 2.x version, you also need to download nlp-lang.jar.
    • Import the jars into Eclipse and start writing your program.
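
After importing the jars, a quick way to confirm that they really ended up on the classpath is to try to locate one of the library's classes by name. The sketch below uses only the standard library. The class name it looks for, org.ansj.splitWord.analysis.ToAnalysis, is an assumption based on the 2.x package layout, so adjust it for the version you downloaded.

```java
final class ClasspathCheck {

    /** Returns true if the named class can be located (without initializing it). */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class name (ansj_seg 2.x layout); change it if your version differs.
        String[] required = { "org.ansj.splitWord.analysis.ToAnalysis" };
        for (String name : required) {
            System.out.println(name + (isOnClasspath(name) ? " found" : " MISSING"));
        }
    }
}
```

If the class is reported as missing, the jar was most likely not added to the project's build path in Eclipse.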

For this blog post, I have uploaded the latest versions I downloaded to Baidu Cloud. Share link: http://pan.baidu.com/s/1sjuKMvV password: VCOF. One package is the 1.x version, bundled with tree_split.jar; the other is the 2.x version, bundled with nlp-lang.jar. Both can be used right after download.

ansj user manual:

http://ansjsun.github.io/ansj_seg/

API calling methods:

Basic segmentation method:

Basic segmentation guarantees only the most elementary level of segmentation. Its word granularity is the finest of all the methods, and the dictionary it relies on covers roughly 100,000 words.

Basic segmentation is very fast: on a MacBook Air it can reach about 3 million (300w) characters per second. Its accuracy is also high, but its ability to handle new words is very limited.


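The example sentence means "Let the soldiers have a happy and peaceful Spring Festival."

    import java.util.List;
    import org.ansj.domain.Term;
    import org.ansj.splitWord.analysis.BaseAnalysis;

    List<Term> parse = BaseAnalysis.parse("让战士们过一个欢乐祥和的新春佳节。");
    System.out.println(parse);

result: [让/v, 战士/n, 们/k, 过/ug, 一个/m, 欢乐/a, 祥和/a, 的/uj, 新春/t, 佳节/n, 。/w]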
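
Each entry in the result is a word followed by a slash and its part-of-speech tag. For semantic analysis it is often useful to keep only certain tags, such as nouns and adjectives. The following is a minimal sketch that works on plain strings in that word/tag form, so it runs without the ansj jar. The class and method names are hypothetical and not part of ansj.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PosFilter {

    /** Returns the words whose tag starts with one of the given prefixes. */
    static List<String> keepByTag(List<String> tokens, String... tagPrefixes) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // no tag, or nothing before the slash: skip
            }
            String word = token.substring(0, slash).trim();
            String tag = token.substring(slash + 1);
            for (String prefix : tagPrefixes) {
                if (tag.startsWith(prefix)) {
                    kept.add(word);
                    break;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "让/v", "战士/n", "们/k", "过/ug", "一个/m", "欢乐/a",
                "祥和/a", "的/uj", "新春/t", "佳节/n", "。/w");
        // Keep nouns (n...) and adjectives (a...).
        System.out.println(keepByTag(tokens, "n", "a")); // [战士, 欢乐, 祥和, 佳节]
    }
}
```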

Precise segmentation method (the shopkeeper's pick)

Precise segmentation is the "shopkeeper's pick" of ansj, that is, the method its author recommends.

It strikes a good balance among ease of use, stability, accuracy, and segmentation speed.

If you are trying ansj for the first time and want something that works out of the box, you cannot go wrong with this method.

    import java.util.List;
    import org.ansj.domain.Term;
    import org.ansj.splitWord.analysis.ToAnalysis;

    List<Term> parse = ToAnalysis.parse("让战士们过一个欢乐祥和的新春佳节。");
    System.out.println(parse);

NLP segmentation method

NLP segmentation is the method that always manages to surprise you.

It can recognize out-of-vocabulary words, but it has its drawbacks too: it is slower and less stable. PS: by "slow" I only mean in comparison with ansj's own other methods. It should still run at about 400,000 (40w) characters per second.
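
To put the two speed figures quoted in this article side by side (300w, i.e. 3 million, characters per second for basic segmentation and 40w, i.e. 400,000, for NLP segmentation), here is a rough back-of-the-envelope estimate for a corpus of 600,000 comments. The average comment length of 100 characters is a purely hypothetical assumption, not a number from this article. The rates are the author's rough figures, not measurements.

```java
final class ThroughputEstimate {

    /** Seconds needed to segment totalChars characters at charsPerSecond. */
    static double secondsFor(long totalChars, long charsPerSecond) {
        if (charsPerSecond <= 0) {
            throw new IllegalArgumentException("charsPerSecond must be positive");
        }
        return (double) totalChars / charsPerSecond;
    }

    public static void main(String[] args) {
        long comments = 600_000L;        // corpus size mentioned in the introduction
        long avgCharsPerComment = 100L;  // hypothetical average length (assumption)
        long totalChars = comments * avgCharsPerComment;

        long basicRate = 3_000_000L;     // "300w" characters per second (basic)
        long nlpRate = 400_000L;         // "40w" characters per second (NLP)

        System.out.printf("basic: %.0f s, nlp: %.0f s%n",
                secondsFor(totalChars, basicRate),
                secondsFor(totalChars, nlpRate));
    }
}
```

Under these assumptions the whole corpus takes about 20 seconds with basic segmentation and about 150 seconds with NLP segmentation, so even the "slow" method is fast enough for a data set of this size.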

In my opinion, NLP segmentation is suited to tasks such as named-entity extraction and collecting out-of-vocabulary words, and more generally to any work that involves discovery and analysis over text.

If you do not want part-of-speech tags appended to the output, you can refer to:

Using word2vec to cluster keywords
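
One simple way to get rid of the tags, independent of the reference above, is to cut each printed token at its last slash and join the remaining words with spaces. Whitespace-separated words are also the input format that word2vec tools expect. The sketch below works on plain strings, and its class and method names are hypothetical. When working with Term objects directly, reading the word from the Term itself (for example through a getter such as getName(), assumed here to be available in your ansj version) would avoid the string handling.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

final class TagStripper {

    /** Removes a trailing "/tag"; tokens without a slash are returned unchanged. */
    static String stripTag(String token) {
        int slash = token.lastIndexOf('/');
        return (slash < 0 ? token : token.substring(0, slash)).trim();
    }

    /** Joins the bare words with single spaces, skipping tokens that are empty once stripped. */
    static String toPlainLine(List<String> tokens) {
        StringJoiner line = new StringJoiner(" ");
        for (String token : tokens) {
            String word = stripTag(token);
            if (!word.isEmpty()) {
                line.add(word);
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("欢乐/a", "祥和/a", "的/uj", "新春/t", "佳节/n");
        System.out.println(toPlainLine(tokens)); // 欢乐 祥和 的 新春 佳节
    }
}
```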


This document is licensed under the Creative Commons Attribution-NonCommercial 3.0 license. Reprinting and adaptation are welcome, but the attribution to the author of this article, Lin Yu flying, must be kept. If you have questions, please send me a message.
