Lucene series: (6) Word breaker

Source: Internet
Author: User
Tags: split words


1. What is a word breaker?

A word breaker (called an analyzer in Lucene) uses an algorithm to split Chinese and English text into individual terms. These terms form the vocabulary that is searched after the user enters a keyword.


2. Why use a word breaker?

The text a user enters in a search is usually only a keyword, and it differs from the original content stored in the source table. A search engine must still find the relevant content, so it uses a word breaker to match as much of the original content as possible.


3. Word breaker workflow

(1) The word breaker splits the text into words.

(2) Stop words and forbidden words are removed.

(3) Any English letters are converted to lowercase, so the search is not case-sensitive.
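The three steps above can be sketched in plain Java, without Lucene. This is only an illustration under stated assumptions, not Lucene's implementation: the class name WorkflowSketch, the method analyze, and the stop-word list are hypothetical, and splitting on non-alphanumeric characters only suits space-delimited text such as English.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class WorkflowSketch {
    // Hypothetical stop-word list, chosen only for this illustration.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "the"));

    /** Splits the text, lowercases each token, and drops stop words. */
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Step 1: split on every run of characters that are not letters or digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty leading element
            }
            // Step 3: lowercase, so that matching is not case-sensitive.
            String token = raw.toLowerCase(Locale.ROOT);
            // Step 2: discard stop words (checked after lowercasing, so "The" is caught too).
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [mozilla, firefox, free, open, source, web, browser]
        System.out.println(analyze("Mozilla Firefox is a free and open source Web browser"));
    }
}
```

Lowercasing happens before the stop-word lookup in this sketch so that the stop-word list only needs lowercase entries.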


4. Demo: testing common word breakers

This test requires IKAnalyzer3.2.0Stable.jar on the classpath.


package com.rk.lucene.b_analyzer;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

import com.rk.lucene.utils.LuceneUtils;

/**
 * Test the word segmentation effect of Lucene's built-in and third-party word breakers.
 */
public class TestAnalyzer {

    // Sample text (Chinese in the original post, shown here in translation).
    private static final String TEXT =
            "Mozilla Firefox, Chinese commonly known as \"Firefox\", is a free and open source web browser it ah";

    // Prints the analyzer's class and every term it produces for the given text.
    private static void testAnalyzer(Analyzer analyzer, String text) throws Exception {
        System.out.println("Currently used word breaker: " + analyzer.getClass());
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        tokenStream.addAttribute(TermAttribute.class);
        while (tokenStream.incrementToken()) {
            TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
            System.out.print(termAttribute.term() + " |");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        // Lucene built-in word breakers
        System.out.println("------------------------Lucene built-in word breaker");
        testAnalyzer(new StandardAnalyzer(LuceneUtils.getVersion()), TEXT);
        testAnalyzer(new FrenchAnalyzer(LuceneUtils.getVersion()), TEXT);
        testAnalyzer(new FrenchAnalyzer(LuceneUtils.getVersion()),
                "Mozilla Firefox, Chinese commonly known as \"Firefox\", is a free and open source web browser how are you");
        testAnalyzer(new RussianAnalyzer(LuceneUtils.getVersion()), TEXT);
        testAnalyzer(new CJKAnalyzer(LuceneUtils.getVersion()), TEXT);
        testAnalyzer(new ChineseAnalyzer(), TEXT);

        // IKAnalyzer
        System.out.println("------------------------IKAnalyzer word breaker");
        testAnalyzer(new IKAnalyzer(), TEXT);
        testAnalyzer(new IKAnalyzer(), "Shanghai tap water from the sea");
    }
}


Output result:

------------------------Lucene built-in word breaker
Currently used word breaker: class org.apache.lucene.analysis.standard.StandardAnalyzer
mozilla |firefox |  | Wen  | vulgar  |, said  | Fire  | Fox  | is  | a  |  | from  | by  | and  |  | release  | source  |  | code  |  | Web  | page  |  |  |  | Yes  |
Currently used word breaker: class org.apache.lucene.analysis.fr.FrenchAnalyzer
mozill |firefox |  |  | Vulgar  | called  | fire  | Fox  | is  | a  |  |  | from  | and  |  | the  | source  | by the  | of the  |.  | page  |  |  |  |it | Yes  |
Currently used word breaker: class org.apache.lucene.analysis.fr.FrenchAnalyzer
mozill |firefox |  |  | Vulgar  | called  | fire  | Fox  | is  | a  |  |  | from  | and  |  | the  | source  | by the  | of the  |.  | page  |  |  |  |how |are |you | Yes  |
Currently used word breaker: class org.apache.lucene.analysis.ru.RussianAnalyzer
mozilla |firefox | Chinese commonly known as  | Firefox  | is a free and open source Web browser It ah  |
Currently used word breaker: class org.apache.lucene.analysis.cjk.CJKAnalyzer
mozilla |firefox | Chinese  | vulgar  | commonly known as  | Firefox  | is a  | a  | self- | free  | from  | and open  | open  | source  |  | Code  | Network  | Web page  | page  | Browse  |  | Yes  |
Currently used word breaker: class org.apache.lucene.analysis.cn.ChineseAnalyzer
mozilla |firefox |  | Wen  | vulgar  | called  | fire  | Fox  | is  | a  |  | from  |  |  | to  |  | source  |  | code  |  | Web  | page  |  |  |  | Yes  |
------------------------IKAnalyzer word breaker
Currently used word breaker: class org.wltea.analyzer.lucene.IKAnalyzer
mozilla |firefox | Chinese  | commonly known as  | Firefox  | a  | a  |  | free  | Open source  | open  | Source code  | code  | web  | Browser  | Browse  | Yes  |
Currently used word breaker: class org.wltea.analyzer.lucene.IKAnalyzer
Shanghai  | tap water  |  | from  | Sea  |
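In the output above, StandardAnalyzer and ChineseAnalyzer emit one token per Chinese character, while CJKAnalyzer emits overlapping two-character tokens. The difference between the two strategies can be sketched in plain Java. This is an illustration only, not Lucene's code: the class and method names are hypothetical, the sample input is my own (the three characters of the Chinese word for "tap water", written as Unicode escapes), and the sketch assumes characters in the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    /** One token per character (single-character segmentation). */
    public static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    /** Overlapping two-character tokens (bigram segmentation). */
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text); // a lone character is kept as its own token
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical sample: three Chinese characters, escaped to avoid source-encoding issues.
        String text = "\u81ea\u6765\u6c34";
        System.out.println(unigrams(text)); // three one-character tokens
        System.out.println(bigrams(text));  // two overlapping two-character tokens
    }
}
```

Neither strategy knows where real word boundaries are, which is why a dictionary-based word breaker such as IKAnalyzer produces more natural terms for Chinese.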



5. Using the third-party IKAnalyzer word breaker (preferred for Chinese)

Requirements: filter out the words "say", "the", and "ah" from the example above, and treat "preach wisdom podcast" as a single keyword.


(1) Import the IKAnalyzer core JAR package, IKAnalyzer3.2.0Stable.jar.

(2) Copy the IKAnalyzer.cfg.xml, ext_stopword.dic, and mydict.dic files to the src directory of the MyEclipse project and configure them. When configuring, leave the first line blank.
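The effect of an extension dictionary and a stop-word dictionary can be illustrated with a small dictionary-based segmenter. The sketch below uses forward maximum matching, a simple technique swapped in for illustration; it is not IKAnalyzer's actual algorithm. The class name, method name, and the unspaced ASCII sample text (standing in for Chinese, which has no spaces between words) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictionarySketch {
    /**
     * Forward maximum matching: at each position, take the longest dictionary
     * entry that matches, or a single character if nothing matches.
     * Tokens listed in stopWords are dropped from the result.
     */
    public static List<String> segment(String text, Set<String> dictionary, Set<String> stopWords) {
        int maxLen = 1;
        for (String word : dictionary) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxLen);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            String token = text.substring(pos, end);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("wisdom", "podcast", "ah"));
        Set<String> stopWords = new HashSet<>(Arrays.asList("ah"));
        String text = "wisdompodcastah"; // hypothetical unspaced input

        // Base dictionary only: prints [wisdom, podcast]
        System.out.println(segment(text, dictionary, stopWords));

        // After adding an "extension dictionary" entry: prints [wisdompodcast]
        dictionary.add("wisdompodcast");
        System.out.println(segment(text, dictionary, stopWords));
    }
}
```

In this sketch, adding one entry to the dictionary plays the role of mydict.dic, and the stop-word set plays the role of ext_stopword.dic.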


TestIK.java

package com.rk.lucene.c_ik;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class TestIK {

    // Prints the analyzer's class and every term it produces for the given text.
    private static void testAnalyzer(Analyzer analyzer, String text) throws Exception {
        System.out.println("Currently used word breaker: " + analyzer.getClass());
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        tokenStream.addAttribute(TermAttribute.class);
        while (tokenStream.incrementToken()) {
            TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
            System.out.print(termAttribute.term() + " |");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        // Sample texts (Chinese in the original post, shown here in translation).
        testAnalyzer(new IKAnalyzer(),
                "Mozilla Firefox, the Chinese commonly known as \"Firefox\", is a free and open source Web browser It ah");
        testAnalyzer(new IKAnalyzer(), "Shanghai tap water from the sea");
        testAnalyzer(new IKAnalyzer(),
                "preach Wisdom Podcast Education Technology Co., Ltd. is a dedicated to high-quality software development personnel training of it companies");
        testAnalyzer(new IKAnalyzer(), "how are you");
    }
}

Output result:

Currently used word breaker: class org.wltea.analyzer.lucene.IKAnalyzer
mozilla |firefox | chinese | popularly known as Firefox | a | one | a | free | open Source | open | Source code | code | Web page | browser | browse | | |
Currently used word breaker: class org.wltea.analyzer.lucene.IKAnalyzer
Shanghai | tap water | from | The SEA |
Currently used word breaker: class org.wltea.analyzer.lucene.IKAnalyzer
Podcast | education | education | technology | limited | limited | company | one | | home | dedicated | committed to | high Quality | quality | software Development | software | development | Talent Training | talent | training | company |
Currently used word breaker: class org.wltea.analyzer.lucene.IKAnalyzer
how |you |


IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extended configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">/mydict.dic;</entry>
    <!-- Users can configure their own extension stop word dictionaries here -->
    <entry key="ext_stopwords">/ext_stopword.dic</entry>
</properties>
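The DOCTYPE in this file is the one used by the java.util.Properties XML format, so a file in this shape can be read with Properties.loadFromXML. The sketch below shows that, assuming the entry values are semicolon-separated path lists. Whether IKAnalyzer itself loads the file this way is not shown in this article; the class name ConfigSketch and its methods are hypothetical, and the XML is held in a string so the example runs without any files.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class ConfigSketch {
    // In-memory copy of the configuration, so the sketch needs no files.
    public static final String CONFIG_XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
          + "<properties>\n"
          + "  <comment>IK Analyzer extended configuration</comment>\n"
          + "  <entry key=\"ext_dict\">/mydict.dic;</entry>\n"
          + "  <entry key=\"ext_stopwords\">/ext_stopword.dic</entry>\n"
          + "</properties>\n";

    /** Parses a properties XML document into a Properties object. */
    public static Properties load(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Splits a semicolon-separated entry into its non-empty paths. */
    public static List<String> paths(Properties props, String key) {
        List<String> result = new ArrayList<>();
        for (String part : props.getProperty(key, "").split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Properties props = load(CONFIG_XML);
        System.out.println(paths(props, "ext_dict"));      // prints [/mydict.dic]
        System.out.println(paths(props, "ext_stopwords")); // prints [/ext_stopword.dic]
    }
}
```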


mydict.dic

Wisdom Podcast


ext_stopword.dic

And still from the P.S. to make it right and to give it, and to be let it be compared with the







