Simple Filtering of Common Words in Chinese Word Segmentation

Source: Internet
Author: User
First of all, I would like to thank my friend for his simple Chinese word segmentation program. It was his wonderful article that got me started in the search engine field and led me to write this one :)
Now, on to the topic.

Terms: analyzer, token, and highlight.
Implementation Background:
When a query is entered in the search engine's text box, the analyzer splits it into multiple tokens. The search engine then looks up each token in the dictionary to perform matching, weight recording, and other operations.
When a query contains common words, they often cause trouble for the subsequent steps. For example:

When searching for [rich teaching experience], the analyzer splits the query into [rich]-[of]-[teaching]-[experience]. Because highlighting colors every token on the results page, the common word [of] would be highlighted wherever it appears, so such common words need to be filtered out.
Implementation ideas:
Just as you can create a dictionary for word segmentation, you can also create a dictionary of common words to be filtered out. The query is filtered against this dictionary before each segmentation, and the filtered query is then split.
Implementation:
Knowledge point 1: caching configuration files
Many of you have J2EE development experience and are familiar with the classic framework Struts. You have probably noticed that every time you modify struts-config.xml, you have to restart the server for the change to take effect. Why is it designed this way? Wouldn't it be better for changes to the configuration file to take effect immediately?
For small applications, reading the configuration file on every request is indeed a reasonable choice. But when the file is large, as with a dictionary of common words to filter, reloading the dictionary and searching it on every query takes a long time. Loading the filter dictionary once, when the application starts, is therefore a good solution.
Caching the configuration file is very simple: declare the object that holds the filter words (a string[], an IList<string>, or anything else :)) as static. The following is the implementation code:

    using System.Collections.Generic;
    using System.Web;

    public class FilterDemo
    {
        private static string _filterPath;
        private static IList<string> filter = null;

        // The static constructor runs once, before the class is first used
        static FilterDemo()
        {
            _filterPath = HttpContext.Current.Request.MapPath("~/App_Data/file/" + "filter.txt");
            // Obtain the list of strings to be filtered from the specified file
            // (the body of InitFilterFile is not shown here)
            InitFilterFile(_filterPath);
        }
    }

The filter field is declared static to hold the words in the filter dictionary. Once the static method InitFilterFile() has run, filter is initialized and the filter dictionary is loaded. From then on, the list of filter words is read from the cache, that is, the filter variable, rather than by reloading the dictionary file, which improves response speed. Of course, if you add to or modify the dictionary, you must restart the service for the change to take effect. Everything has its pros and cons; that's just how it is :)
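If the restart requirement becomes a nuisance, one possible variation is to keep the static cache but reload it when the dictionary file's last-write time changes. The following is only a minimal sketch under stated assumptions, not the approach used in this article: the class name ReloadingFilterCache and the method GetWords are hypothetical, the file is assumed to be UTF-8 with one word per line, and only a single dictionary path is cached.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: a static cache that reloads when the file changes.
public static class ReloadingFilterCache
{
    private static readonly object Gate = new object();
    private static IList<string> _words = new List<string>();
    private static DateTime _loadedStamp = DateTime.MinValue;

    public static IList<string> GetWords(string path)
    {
        // Checking the timestamp is much cheaper than re-reading the file.
        DateTime stamp = File.GetLastWriteTimeUtc(path);
        lock (Gate)
        {
            if (stamp != _loadedStamp)
            {
                // Assumption: the dictionary is saved as UTF-8.
                _words = new List<string>(File.ReadAllLines(path, Encoding.UTF8));
                _loadedStamp = stamp;
            }
            return _words;
        }
    }
}

public static class ReloadDemo
{
    public static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "of" }, Encoding.UTF8);
        Console.WriteLine(ReloadingFilterCache.GetWords(path).Count); // 1

        File.WriteAllLines(path, new[] { "of", "the" }, Encoding.UTF8);
        // Force a distinct timestamp so the demo does not depend on clock resolution.
        File.SetLastWriteTimeUtc(path, DateTime.UtcNow.AddMinutes(1));
        Console.WriteLine(ReloadingFilterCache.GetWords(path).Count); // 2

        File.Delete(path);
    }
}
```

The trade-off is one file-system timestamp check per lookup in exchange for not having to restart the service after editing the dictionary.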
Knowledge point 2: reading information from text files
Reading from text files is well covered: a quick look at MSDN or a Google search turns up plenty of solutions, so I won't go into the implementation in detail. What I want to emphasize is the file's character encoding, because I did run into problems with it while writing the code. Compare the following two approaches:
Method 1:

    using (StreamReader sr = File.OpenText(path))
    {
        filter = new List<string>();
        string s = "";
        // Read one filter word per line until the end of the file
        while ((s = sr.ReadLine()) != null)
        {
            filter.Add(s);
        }
    }

Method 2:

    using (StreamReader sr = new StreamReader(path, Encoding.Default))
    {
        filter = new List<string>();
        string s = "";
        // Read one filter word per line until the end of the file
        while ((s = sr.ReadLine()) != null)
        {
            filter.Add(s);
        }
    }

The first method uses the static File method OpenText(string path) to obtain a StreamReader for the file. The second method uses the StreamReader constructor directly to do the same thing, and also specifies the character encoding of the stream.
A powerful framework offers several ways to implement the same functionality, and method overloads often provide a more precise route, such as the encoding parameter of the StreamReader constructor. When I used the first method, I got garbled characters. I also edited and re-saved the dictionary and read it again with the first method, which led me to wonder whether the character encoding could be specified on the stream. I then found the second method above, and it worked.
The point I want to make here is this: whatever method you need, the framework may well already provide it. Understanding the framework's style will make implementation go more smoothly :)
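The difference between the two methods comes down to which encoding the reader assumes: File.OpenText reads the file as UTF-8, while the StreamReader constructor accepts an explicit Encoding. The sketch below illustrates passing the encoding explicitly. It is a minimal illustration rather than the article's code: the FilterLoader class and its Load method are hypothetical names, and UTF-16 without a byte order mark merely stands in for whatever non-UTF-8 encoding a real dictionary file might use.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: reads one filter word per line using the given encoding.
public static class FilterLoader
{
    public static List<string> Load(string path, Encoding encoding)
    {
        var words = new List<string>();
        using (var sr = new StreamReader(path, encoding))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                // Assumption: empty lines carry no filter word and are skipped.
                if (line.Length > 0) words.Add(line);
            }
        }
        return words;
    }
}

public static class EncodingDemo
{
    public static void Main()
    {
        // UTF-16 little-endian without a byte order mark, so the reader
        // cannot detect the encoding and must be told what it is.
        var utf16NoBom = new UnicodeEncoding(false, false);
        string path = Path.GetTempFileName();

        // "\u7684" and "\u4e86" are two common Chinese function words.
        File.WriteAllLines(path, new[] { "\u7684", "\u4e86" }, utf16NoBom);

        List<string> words = FilterLoader.Load(path, utf16NoBom);
        Console.WriteLine(words.Count);            // 2
        Console.WriteLine(words[0] == "\u7684");   // True

        File.Delete(path);
    }
}
```

Note that Encoding.Default depends on the platform and runtime, so naming the encoding the dictionary was actually saved in is usually the more predictable choice.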

The last step is to filter the query, which is very simple:

    #region Filter invalid characters and strings - private static string WordFilter(string s)
    /// <summary>
    /// Filter invalid characters and strings
    /// </summary>
    /// <param name="s">source string</param>
    private static string WordFilter(string s)
    {
        foreach (string code in filter)
        {
            // Replace throws on an empty search string, so skip empty entries
            if (string.IsNullOrEmpty(code)) continue;
            // Remove every occurrence of the filter word from the query
            s = s.Replace(code, "");
        }
        return s;
    }
    #endregion

The WordFilter(string s) method returns the filtered query.
Let's see the results:

Problems:
Some characters (or words) in the filter dictionary also appear inside other words. For example, [I] is filtered out of an idiom that begins with [I], so the idiom is no longer recognized as a whole and is split into [view]-[even]-[pity].
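The problem can be reproduced with English stand-ins, and one possible mitigation is to drop filter words after segmentation, by exact token match, instead of deleting substrings beforehand. The sketch below is an illustration under assumptions, not the solution adopted in this article: the names TokenFilterSketch, FilterBeforeSegmentation and FilterTokens are hypothetical, and the token list is assumed to come from whatever analyzer is in use.

```csharp
using System;
using System.Collections.Generic;

public static class TokenFilterSketch
{
    // Pre-segmentation filtering, in the style of WordFilter: removes every
    // occurrence of a filter word, including occurrences inside longer words.
    public static string FilterBeforeSegmentation(string s, IEnumerable<string> stopWords)
    {
        foreach (string w in stopWords)
        {
            if (!string.IsNullOrEmpty(w)) s = s.Replace(w, "");
        }
        return s;
    }

    // Post-segmentation filtering: drops only tokens that are exactly a filter word.
    public static List<string> FilterTokens(IEnumerable<string> tokens, ISet<string> stopWords)
    {
        var kept = new List<string>();
        foreach (string t in tokens)
        {
            if (!stopWords.Contains(t)) kept.Add(t);
        }
        return kept;
    }

    public static void Main()
    {
        var stopWords = new HashSet<string> { "in", "of" };

        // "in" is also removed from inside "teaching".
        Console.WriteLine(FilterBeforeSegmentation("teaching experience", stopWords));
        // prints "teachg experience"

        // Tokens as an analyzer might produce them (assumed input).
        var tokens = new[] { "rich", "of", "teaching", "experience" };
        Console.WriteLine(string.Join("-", FilterTokens(tokens, stopWords)));
        // prints "rich-teaching-experience"
    }
}
```

Filtering tokens keeps whole words intact, at the cost of running the analyzer on the unfiltered query; whether that trade-off is acceptable depends on the analyzer and the dictionary.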

Afterword:
So far, search engine technology has no unified standard, and no single feature has reached the point where there is a generally agreed solution. Most in-site search is still in the exploratory, experimental stage (which, of course, brings us endless fun :)). The solution above is therefore just one immature option among many, but if it gives you a little inspiration or sparks your interest in search engines, I will be very happy :)

Appendix: filter dictionary

Filter.txt
Of
?
?
Ah
Description
Pair
In
And
Yes
Quilt
Most
Institute
That
This
Yes
Set
Yes
And
Cause
Yu
He
She
It
You
Is
The
For
In
To
On
 



A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
1
2
3
4
5
6
7
8
9
0
About
Above
After
Again
All
Also
AM
An
And
Any
Are
As
At
Back
Be
Been
Before
Behind
Being
Below
But
By
Can
Click
Do
Does
Done
Each
Else
Etc
Ever
Every
Few
For
From
Generally
Get
Go
Gone
Has
Have
Hello
Here
How
If
In
Into
Is
Just
Keep
Later
Let
Like
Lot
Lots
Made
Make
Makes
Many
May
Me
More
Most
Much
Must
My
Need
No
Not
Now
Of
Often
On
Only
Or
Other
Others
Our
Out
Over
Please
Put
So
Some
Such
Than
That
The
Their
Them
Then
There
These
They
This
Try
To
Up
Us
Very
Want
Was
We
Well
What
When
Where
Which
Why
Will
With
Within
You
Your
Yourself
