Using sego, a word segmentation library for Go

Source: Internet
Author: User

The library used is github.com/huichen/sego.


// Load the dictionary
var segmenter sego.Segmenter
// Dictionary that ships with the library
segmenter.LoadDictionary("../github.com/huichen/sego/data/dictionary.txt")

// Segment the text
text := []byte("Use it for rapid development, and it's a true compilation language, we're now opening it up because we think it's very useful and powerful")
segments := segmenter.Segment(text)

// Process the segmentation result.
// Two output modes are supported, normal mode and search mode; see the comments on the SegmentsToString function in the source.
fmt.Println(sego.SegmentsToString(segments, false))

You can filter sensitive words with the following method:
import (
	"fmt"
	"strings"
	"unicode/utf8"
)

var (
	replaceTo = "*"
	// Pre-built run of replacement characters; assumes no single word exceeds 1024 runes.
	replaceByte = []byte(strings.Repeat(replaceTo, 1024))
)

func (self *WordFilter) Filter(input string) (string, error) {
	bin := []byte(input)
	segments := self.segmenter.Segment(bin)
	cleanString := make([]byte, 0, len(bin))
	for _, seg := range segments {
		// Start() and End() are byte offsets into the input.
		word := bin[seg.Start():seg.End()]
		fmt.Println(seg.Token().Text()) // debug output: the matched token
		if self.dirtyWords[strings.ToUpper(string(word))] {
			// Replace each character (rune) of a sensitive word with one "*".
			cleanString = append(cleanString, replaceByte[:utf8.RuneCount(word)]...)
		} else {
			cleanString = append(cleanString, word...)
		}
	}
	return string(cleanString), nil
}
Here dirtyWords is a map[string]bool that records whether a word is sensitive. Because the lookup goes through strings.ToUpper, its keys must be stored in uppercase.


dictionary.txt is the dictionary file; it was copied from github.com/fxsjy/jieba. Each line has three space-separated fields: the word, its frequency, and its part of speech. By default, sego does not load words whose frequency is less than 2.
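To make the file format concrete, here is a minimal sketch of reading such lines with the standard library only. It is not sego's loader: the entry type, parseDictionary, and the three sample lines are assumptions made for illustration, and minFreq simply mirrors the frequency threshold of 2 described above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// entry is a hypothetical record for one dictionary line; it is not a sego type.
type entry struct {
	Word string
	Freq int
	Pos  string
}

// minFreq mirrors the threshold described in the text: lower frequencies are skipped.
const minFreq = 2

// parseDictionary reads "word frequency pos" lines and keeps the
// entries whose frequency is at least minFreq.
func parseDictionary(data string) ([]entry, error) {
	var entries []entry
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue // blank or incomplete line
		}
		freq, err := strconv.Atoi(fields[1])
		if err != nil {
			return nil, fmt.Errorf("bad frequency in %q: %w", sc.Text(), err)
		}
		if freq < minFreq {
			continue // too rare to load
		}
		e := entry{Word: fields[0], Freq: freq}
		if len(fields) > 2 {
			e.Pos = fields[2] // part of speech is optional in this sketch
		}
		entries = append(entries, e)
	}
	return entries, sc.Err()
}

func main() {
	// Made-up sample lines in the word/frequency/part-of-speech layout.
	sample := "中国 3 ns\n罕见词 1 n\n开发 5 v\n"
	entries, err := parseDictionary(sample)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Println(e.Word, e.Freq, e.Pos)
	}
}
```

With the sample above, the line with frequency 1 is dropped and the other two are kept.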


sego provides no way to add extra words. One option is to change the dictionary loading method, LoadDictionary in github.com/huichen/sego/segmenter.go, so that it loads from a NoSQL store (personally I think MongoDB is the best fit; there is no need to put the words in a relational database). Better still would be to add a method for adding new words.
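One way to picture that change is to put the word store behind a small interface that a modified loader could read from. The sketch below is only an illustration under that assumption: Entry, Source, MemorySource, and AddWord are hypothetical names that do not exist in sego, and the in-memory map stands in for a MongoDB collection.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Entry is a hypothetical dictionary record: word, frequency, part of speech.
type Entry struct {
	Word string
	Freq int
	Pos  string
}

// Source is a hypothetical abstraction that a modified loader could
// read from instead of a file; a database-backed type could implement it.
type Source interface {
	Entries() ([]Entry, error)
}

// MemorySource is an in-memory stand-in for a database-backed source.
type MemorySource struct {
	mu    sync.RWMutex
	words map[string]Entry
}

// NewMemorySource returns an empty source ready for use.
func NewMemorySource() *MemorySource {
	return &MemorySource{words: make(map[string]Entry)}
}

// AddWord inserts a word, overwriting any existing entry for it.
func (s *MemorySource) AddWord(word string, freq int, pos string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.words[word] = Entry{Word: word, Freq: freq, Pos: pos}
}

// Entries returns all stored entries, sorted by word for stable output.
func (s *MemorySource) Entries() ([]Entry, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]Entry, 0, len(s.words))
	for _, e := range s.words {
		out = append(out, e)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Word < out[j].Word })
	return out, nil
}

func main() {
	var src Source
	mem := NewMemorySource()
	mem.AddWord("云计算", 10, "n") // made-up sample entries
	mem.AddWord("分词", 8, "n")
	src = mem

	entries, err := src.Entries()
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%s %d %s\n", e.Word, e.Freq, e.Pos)
	}
}
```

Keeping the loader dependent only on the interface means the storage choice (file, MongoDB, or anything else) can change without touching the segmentation code.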
