Lucene series: (6) Word breaker

Last Update:2016-09-14 Source: Internet

Author: User



1. What is a word breaker

A word breaker (called an analyzer in Lucene) uses an algorithm to split Chinese and English text into terms. These terms form the vocabulary that is searched after the user enters a keyword.

2. Why use a word breaker

What the user types into the search box is usually a keyword taken from a piece of text, and it rarely matches the content stored in the original table word for word. A search engine still has to find the relevant content, so it uses a word breaker to match as much of the original content as possible.
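The difference can be illustrated with a few lines of plain Java. This is only a sketch of the idea, not how Lucene matches documents: the class `MatchSketch`, its method names and the whitespace-based splitting are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; it does not use or imitate the Lucene API.
class MatchSketch {
    // Split on whitespace and lowercase; a stand-in for a real word breaker.
    static Set<String> tokens(String text) {
        Set<String> set = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")));
        set.remove(""); // split() can yield an empty leading element
        return set;
    }

    // True only when the query occurs in the stored text as one literal substring.
    static boolean literalMatch(String stored, String query) {
        return stored.contains(query);
    }

    // True when at least one query token also occurs among the stored tokens.
    static boolean tokenMatch(String stored, String query) {
        Set<String> shared = tokens(query);
        shared.retainAll(tokens(stored));
        return !shared.isEmpty();
    }

    public static void main(String[] args) {
        String stored = "Firefox is a free and open source web browser";
        String query = "open browser";
        System.out.println(literalMatch(stored, query)); // false
        System.out.println(tokenMatch(stored, query));   // true
    }
}
```

The literal comparison fails because the two query words are not adjacent in the stored text, while the token comparison still finds the record.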

3. Word breaker workflow

(1) Split the text into words with the word breaker.

(2) Remove stop words and forbidden words.

(3) If the text contains English, convert the English letters to lowercase, so that searching is not case-sensitive.
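The three steps above can be sketched in plain Java without Lucene. This is a minimal illustration under stated assumptions, not Lucene's implementation: the class name `WorkflowSketch`, the splitting rule and the tiny stop word set are all invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class that mimics the three workflow steps.
class WorkflowSketch {
    // Assumed stop word list for this example (not Lucene's built-in list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "a", "and", "it"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Step 1: split on anything that is not a letter or a digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields an empty element for leading punctuation
            }
            // Step 3: lowercase, so that matching is not case-sensitive.
            // It is done before the stop word check so that "IS" is also removed.
            String token = raw.toLowerCase(Locale.ROOT);
            // Step 2: drop stop words.
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [mozilla, firefox, free, open, source, web, browser]
        System.out.println(analyze("Mozilla Firefox is a free and open source Web browser"));
    }
}
```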

4. Demo: testing common word breakers

This test requires IKAnalyzer3.2.0Stable.jar on the classpath.

package com.rk.lucene.b_analyzer;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

import com.rk.lucene.utils.LuceneUtils;

/**
 * Test the word segmentation of Lucene's built-in and third-party word breakers.
 */
public class TestAnalyzer {

    // Run the given analyzer over the text and print every term it produces.
    private static void testAnalyzer(Analyzer analyzer, String text) throws Exception {
        System.out.println("Currently used word breaker: " + analyzer.getClass());
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        tokenStream.addAttribute(TermAttribute.class);
        while (tokenStream.incrementToken()) {
            TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
            System.out.print(termAttribute.term() + " |");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        // The sample sentences were Chinese in the original article;
        // they appear here in English translation.
        String text = "mozilla firefox, Chinese commonly known as \"Firefox\", is a free and open source web browser it ah";

        // Lucene built-in word breakers
        System.out.println("------------------------Lucene built-in word breaker");
        testAnalyzer(new StandardAnalyzer(LuceneUtils.getVersion()), text);
        testAnalyzer(new FrenchAnalyzer(LuceneUtils.getVersion()), text);
        testAnalyzer(new FrenchAnalyzer(LuceneUtils.getVersion()),
                "mozilla firefox, Chinese commonly known as \"Firefox\", is a free and open source web browser how are you");
        testAnalyzer(new RussianAnalyzer(LuceneUtils.getVersion()), text);
        testAnalyzer(new CJKAnalyzer(LuceneUtils.getVersion()), text);
        testAnalyzer(new ChineseAnalyzer(), text);

        // IKAnalyzer
        System.out.println("------------------------IKAnalyzer word breaker");
        testAnalyzer(new IKAnalyzer(), text);
        testAnalyzer(new IKAnalyzer(), "shanghai tap water from the sea");
    }
}

Output result (the Chinese tokens are shown in English translation, so some of them appear blank):

------------------------Lucene built-in word breaker
Currently used word breaker: class org.apache.lucene.analysis.standard.StandardAnalyzer
mozilla |firefox | | Wen | vulgar |, said | Fire | Fox | is | a | | from | by | and | | release | source | | code | | Web | page | | | | Yes |
Currently used word breaker: class org.apache.lucene.analysis.fr.FrenchAnalyzer
mozill |firefox | | | Vulgar | called | fire | Fox | is | a | | | from | and | | the | source | by the | of the |. | page | | | |it | Yes |
Currently used word breaker: class org.apache.lucene.analysis.fr.FrenchAnalyzer
mozill |firefox | | | Vulgar | called | fire | Fox | is | a | | | from | and | | the | source | by the | of the |. | page | | | |how |are |you | Yes |
Currently used word breaker: class org.apache.lucene.analysis.ru.RussianAnalyzer
mozilla |firefox | Chinese commonly known as | Firefox | is a free and open source Web browser It ah |
Currently used word breaker: class org.apache.lucene.analysis.cjk.CJKAnalyzer
mozilla |firefox | Chinese | vulgar | commonly known as | Firefox | is a | a | self- | free | from | and open | open | source | | Code | Network | Web page | page | Browse | | Yes |
Currently used word breaker: class org.apache.lucene.analysis.cn.ChineseAnalyzer
mozilla |firefox | | Wen | vulgar | called | fire | Fox | is | a | | from | | | to | | source | | code | | Web | page | | | | Yes |
------------------------IKAnalyzer word breaker
Currently used word breaker: class org.wltea.analyzer.lucene.IKAnalyzer
mozilla |firefox | Chinese | commonly known as | Firefox | a | a | | free | Open source | open | Source code | code | web | Browser | Browse | Yes |
Currently used word breaker: class org.wltea.analyzer.lucene.IKAnalyzer
Shanghai | tap water | | from | Sea |
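In this output, StandardAnalyzer and ChineseAnalyzer appear to emit one token per Chinese character, while CJKAnalyzer emits overlapping two-character tokens. The two strategies can be sketched in plain Java as follows. This is only an illustration, not the Lucene source: the class `NGramSketch` and its methods are hypothetical, and Latin letters stand in for a run of Chinese characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class illustrating single-character and two-character splitting.
class NGramSketch {
    // One token per character (per Unicode code point).
    static List<String> unigrams(String run) {
        List<String> out = new ArrayList<>();
        run.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    // Overlapping pairs of adjacent characters.
    static List<String> bigrams(String run) {
        List<String> chars = unigrams(run);
        List<String> out = new ArrayList<>();
        if (chars.size() == 1) {
            out.add(chars.get(0)); // a lone character is kept as-is (a choice made for this sketch)
        }
        for (int i = 0; i + 1 < chars.size(); i++) {
            out.add(chars.get(i) + chars.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        String run = "ABCD"; // stands in for a run of four Chinese characters
        System.out.println(unigrams(run)); // [A, B, C, D]
        System.out.println(bigrams(run));  // [AB, BC, CD]
    }
}
```

Neither strategy needs a dictionary, which is why neither of them can recognise a multi-character word as one unit in the way IKAnalyzer does.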

5. Using the third-party IKAnalyzer word breaker (preferred for Chinese)

Requirements: filter out words such as "say", "the" and "ah" from the example above, and treat "preach wisdom podcast" as a single keyword.

(1) Import the IKAnalyzer core JAR package, IKAnalyzer3.2.0Stable.jar.

(2) Copy the IKAnalyzer.cfg.xml, ext_stopword.dic and mydict.dic files to the src directory of the MyEclipse project and configure them. When configuring, the first line must be left blank.

TestIK.java

package com.rk.lucene.c_ik;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class TestIK {

    // Run the given analyzer over the text and print every term it produces.
    private static void testAnalyzer(Analyzer analyzer, String text) throws Exception {
        System.out.println("Currently used word breaker: " + analyzer.getClass());
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        tokenStream.addAttribute(TermAttribute.class);
        while (tokenStream.incrementToken()) {
            TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
            System.out.print(termAttribute.term() + " |");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        testAnalyzer(new IKAnalyzer(), "mozilla firefox, the Chinese commonly known as \"firefox\", is a free and open source Web browser It ah");
        testAnalyzer(new IKAnalyzer(), "Shanghai tap water from the sea");
        testAnalyzer(new IKAnalyzer(), "preach Wisdom Podcast Education Technology Co., Ltd. is a dedicated to high-quality software development personnel training of it companies");
        testAnalyzer(new IKAnalyzer(), "how are you");
    }
}

Output result:

Currently used word breaker: class org.wltea.analyzer.lucene.IKAnalyzer
mozilla |firefox | chinese | popularly known as Firefox | a | one | a | free | open Source | open | Source code | code | Web page | browser | browse | | |
Currently used word breaker: class org.wltea.analyzer.lucene.IKAnalyzer
Shanghai | tap water | from | The SEA |
Currently used word breaker: class org.wltea.analyzer.lucene.IKAnalyzer
Podcast | education | education | technology | limited | limited | company | one | | home | dedicated | committed to | high Quality | quality | software Development | software | development | Talent Training | talent | training | company |
Currently used word breaker: class org.wltea.analyzer.lucene.IKAnalyzer
how |you |

IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extended configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">/mydict.dic;</entry>
    <!-- Users can configure their own extension stop word dictionaries here -->
    <entry key="ext_stopwords">/ext_stopword.dic</entry>
</properties>

mydict.dic

Wisdom Podcast

ext_stopword.dic

And still from the P.S. to make it right and to give it, and to be let it be compared with the
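To see why an extension dictionary and a stop word dictionary change the result, here is a small dictionary-based segmenter in plain Java. It uses forward maximum matching, a simple textbook technique chosen for this sketch; it is not IKAnalyzer's actual algorithm, and the class `DictSegmenterSketch` is hypothetical. Latin letters stand in for Chinese characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical forward-maximum-matching segmenter; not part of IKAnalyzer.
class DictSegmenterSketch {
    private final Set<String> dictionary; // plays the role of mydict.dic
    private final Set<String> stopWords;  // plays the role of ext_stopword.dic
    private final int maxWordLength;

    DictSegmenterSketch(Set<String> dictionary, Set<String> stopWords) {
        this.dictionary = dictionary;
        this.stopWords = stopWords;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Try the longest candidate first and shrink it until it is
            // a dictionary word or only a single character remains.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            String token = text.substring(pos, end);
            if (!stopWords.contains(token)) {
                tokens.add(token); // stop words are dropped from the output
            }
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("F"));
        Set<String> base = new HashSet<>(Arrays.asList("AB", "DE"));
        Set<String> extended = new HashSet<>(Arrays.asList("AB", "DE", "ABC"));
        System.out.println(new DictSegmenterSketch(base, stop).segment("ABCDEFG"));     // [AB, C, DE, G]
        System.out.println(new DictSegmenterSketch(extended, stop).segment("ABCDEFG")); // [ABC, DE, G]
    }
}
```

Adding "ABC" to the dictionary makes it come out as a single token, which is the effect the mydict.dic entry is meant to have, and "F" never appears because it is listed as a stop word.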
