The library used is github.com/huichen/sego.
Loading the dictionary

```go
var segmenter sego.Segmenter

// Load the dictionary that ships with the library
segmenter.LoadDictionary("../github.com/huichen/sego/data/dictionary.txt")
```
Word segmentation

```go
text := []byte("Use it for rapid development; it is a true compiled language, and we are now open-sourcing it because we think it is very useful and powerful")
segments := segmenter.Segment(text)
```
Processing the segmentation results

Two segmentation modes are supported, normal mode and search mode; see the comments on the SegmentsToString function in the code.

```go
fmt.Println(sego.SegmentsToString(segments, false))
```
Sensitive words can be filtered with a method like the following:
```go
var (
	replaceTo   = "*"
	replaceByte = []byte(strings.Repeat(replaceTo, 1024))
)

func (self *WordFilter) Filter(input string) (string, error) {
	bin := []byte(input)
	segments := self.segmenter.Segment(bin)
	cleanString := make([]byte, 0, len(bin))
	for _, seg := range segments {
		word := bin[seg.Start():seg.End()]
		fmt.Println(seg.Token().Text())
		if self.dirtyWords[strings.ToUpper(string(word))] {
			// Mask the sensitive word with one '*' per rune
			cleanString = append(cleanString, replaceByte[:utf8.RuneCount(word)]...)
		} else {
			cleanString = append(cleanString, word...)
		}
	}
	return string(cleanString), nil
}
```
Here dirtyWords is of type map[string]bool and records whether a word is a sensitive word.
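As a self-contained illustration of the masking step, here is a minimal sketch that uses only the standard library. The whitespace tokenizer (strings.Fields) stands in for sego's segmenter, and the maskDirty helper is an assumption for this sketch, not part of the filter above; the rune-count replacement trick is the same.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// maskDirty replaces each token found in dirtyWords with one '*' per rune,
// mirroring the replaceByte[:utf8.RuneCount(word)] idea in the filter above.
// Tokenization is a plain whitespace split, standing in for sego's segmenter.
func maskDirty(input string, dirtyWords map[string]bool) string {
	out := make([]string, 0)
	for _, tok := range strings.Fields(input) {
		if dirtyWords[strings.ToUpper(tok)] {
			out = append(out, strings.Repeat("*", utf8.RuneCountInString(tok)))
		} else {
			out = append(out, tok)
		}
	}
	return strings.Join(out, " ")
}

func main() {
	dirty := map[string]bool{"BADWORD": true}
	fmt.Println(maskDirty("this badword stays hidden", dirty))
}
```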
The dictionary.txt lexicon file was copied from github.com/fxsjy/jieba. Each line is space-separated: the first field is the word, the second the frequency, and the third the part of speech. By default, sego does not load words with a frequency of less than 2.
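A minimal sketch of parsing one dictionary line in that format; the parseDictLine helper is an assumption for illustration (it is not sego's actual loader), and the minimum-frequency cutoff of 2 mirrors the default behavior described above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDictLine splits one "word frequency part-of-speech" line and reports
// whether the entry would survive a minimum-frequency cutoff of 2.
func parseDictLine(line string) (word string, freq int, pos string, ok bool) {
	fields := strings.Fields(line)
	if len(fields) < 3 {
		return "", 0, "", false
	}
	freq, err := strconv.Atoi(fields[1])
	if err != nil || freq < 2 {
		return "", 0, "", false
	}
	return fields[0], freq, fields[2], true
}

func main() {
	w, f, p, ok := parseDictLine("编译 342 v")
	fmt.Println(w, f, p, ok)
	_, _, _, ok = parseDictLine("rare 1 n")
	fmt.Println(ok) // entries below frequency 2 are skipped
}
```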
There is no built-in way to add extra words at runtime. You can change the dictionary-loading code in github.com/huichen/sego/segmenter.go so that LoadDictionary reads from a NoSQL store (personally I feel MongoDB is the best fit; there is no need to put the lexicon in a relational database), and then add a method for inserting new words.