Research on the ICTCLAS Segmentation System (IV): Initial Segmentation

Source: Internet
Author: User

 

After atomic segmentation, the source string has been broken into a sequence of independent, smallest units (atoms). The initial segmentation described below finds all possible combinations of adjacent atoms. The algorithm uses two nested loops. The outer loop traverses every atom. The inner loop appends the following atoms to the current atom one at a time and looks each combination up in the dictionary to see whether it forms a meaningful phrase.

The method can be described as follows:

Given an atomic sequence A(n), 0 <= n < m, where m is the length of the atomic sequence: for each starting index n, determine whether A(n)A(n+1)...A(p) is a phrase, where n < p < m.

In pseudocode:

for (int i = 0; i < m; i++) {
    String s = A[i];
    for (int j = i + 1; j < m; j++) {
        s += A[j];
        if (s is a phrase) {
            add s to the initial splitting list;
            record the part of speech of the phrase;
            record the coordinates and other information of the phrase in the table;
        }
        else
            break;
    }
}
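
The two loops above can be sketched in C++ as follows. This is a minimal illustration, not the ICTCLAS implementation: the `Candidate` struct, the `initialSplit` function, and the use of `std::set` as a dictionary are assumptions made for the example. It also assumes that `col` is one past the index of the last atom, which is inferred from the comparison of `pNextWords->row` with `pCur->col` in `BiGraphGenerate` later in this article, and it keeps every single atom as a candidate, which the pseudocode leaves implicit.

```cpp
#include <set>
#include <string>
#include <vector>

// One candidate word found by the initial split (hypothetical type).
struct Candidate {
    int row;           // index of the first atom
    int col;           // index one past the last atom (assumed convention)
    std::string word;  // the concatenated atoms
};

// Enumerate every run of adjacent atoms accepted by the dictionary.
std::vector<Candidate> initialSplit(const std::vector<std::string>& atoms,
                                    const std::set<std::string>& dict) {
    std::vector<Candidate> out;
    const int m = static_cast<int>(atoms.size());
    for (int i = 0; i < m; ++i) {
        std::string s = atoms[i];
        out.push_back({i, i + 1, s});  // the atom itself is always kept
        for (int j = i + 1; j < m; ++j) {
            s += atoms[j];
            if (dict.count(s) == 0) break;  // same early exit as the pseudocode
            out.push_back({i, j + 1, s});
        }
    }
    return out;
}
```

For the atoms `a`, `b`, `c` and a dictionary containing `ab` and `abc`, the function returns the candidates `a`, `ab`, `abc`, `b`, `c` in that order.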

The data structure after initial splitting is shown in the following figure:

 

Figure 1

Figure 2 shows the result of the first split for the example sentence "he said is true":

Figure 2

The linked-list structure in Figure 1 can also be laid out as a table, as shown in Figure 3:

Figure 3

From Figure 3 we can see that, in the table, the phrases produced by the first split that begin with the same character lie in the same row, the phrases that end with the same character lie in the same column, and the original atoms lie along the axis of symmetry.
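
The row and column layout described above can be sketched in C++ as follows. The `Cell` struct and the `layoutTable` function are hypothetical names for illustration, and the sketch assumes that `row` is the index of the first atom and `col` is one past the index of the last atom, so the table needs m rows and m + 1 columns for m atoms.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One word with its table coordinates (hypothetical type).
struct Cell {
    int row;           // index of the first atom
    int col;           // index one past the last atom (assumed convention)
    std::string word;
};

// Place every word at table[row][col]; empty strings mark unused cells.
std::vector<std::vector<std::string>> layoutTable(const std::vector<Cell>& words, int m) {
    std::vector<std::vector<std::string>> table(m, std::vector<std::string>(m + 1));
    for (const Cell& w : words) {
        assert(w.row >= 0 && w.row < m && w.col >= 0 && w.col <= m);
        table[w.row][w.col] = w.word;
    }
    return table;
}
```

Words that start with the same atom end up in the same row, and words that end with the same atom end up in the same column, which matches the observation made from Figure 3.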

The source code that performs this process is as follows:

bool CSegment::BiSegment(char *sSentence, double dSmoothingPara, CDictionary &dictCore, CDictionary &dictBinary, unsigned int nResultCount)
{
    ......
    // The result of this step is the linked-list structure
    m_graphSeg.GenerateWordNet(sSentence, dictCore, true); // Generate words array
    ......

After the table structure shown in Figure 2 has been generated, the binary word graph is generated from it.

    ....
    // Generate the biword link net
    BiGraphGenerate(m_graphSeg.m_segGraph, aBiwordsNet, dSmoothingPara, dictBinary, dictCore);
    ....

Let us look at this function in more detail:

bool CSegment::BiGraphGenerate(CDynamicArray &aWord, CDynamicArray &aBinaryWordNet, double dSmoothingPara, CDictionary &DictBinary, CDictionary &DictCore)
{
    ......

    // Obtain the length of the linked list
    m_nWordCount = aWord.GetTail(&pTail); // Get tail element and return the words count
    if (m_npWordPosMapTable)
    { // Free buffer
        delete[] m_npWordPosMapTable;
        m_npWordPosMapTable = 0;
    }

    // Allocate an array to store the position of each node in the graph, as shown in Figure 4
    if (m_nWordCount > 0) // Word count is greater than 0
        m_npWordPosMapTable = new int[m_nWordCount]; // Record the position of possible words

    // Start at the head of the linked list, compute the position of each word, and store it in the array
    pCur = aWord.GetHead();
    while (pCur != NULL) // Set the position map of words
    {
        m_npWordPosMapTable[nWordIndex++] = pCur->row * MAX_SENTENCE_LEN + pCur->col;
        pCur = pCur->next;
    }

    // Traverse all nodes and compute the smoothed value between adjacent words
    pCur = aWord.GetHead();
    while (pCur != NULL)
    {
        if (pCur->nPOS >= 0) // It's not an unknown word
            dCurFreqency = pCur->value;
        else // Unknown word
            dCurFreqency = DictCore.GetFrequency(pCur->sWord, 2);

        // Get the first node whose row equals the col of the current node
        aWord.GetElement(pCur->col, -1, pCur, &pNextWords);
        while (pNextWords && pNextWords->row == pCur->col) // Next words
        {
            // Use the @ separator to connect the two words
            strcpy(sTwoWords, pCur->sWord);
            strcat(sTwoWords, WORD_SEGMENTER);
            strcat(sTwoWords, pNextWords->sWord);

            // Look up the frequency of the two linked words
            nTwoWordsFreq = DictBinary.GetFrequency(sTwoWords, 3);
            dTemp = (double)1 / MAX_FREQUENCE;
            // Calculate the smoothed value
            dValue = -log(dSmoothingPara * (1 + dCurFreqency) / (MAX_FREQUENCE + 80000) + (1 - dSmoothingPara) * ((1 - dTemp) * nTwoWordsFreq / (1 + dCurFreqency) + dTemp));
            // -log{a*P(Ci-1) + (1-a)*P(Ci|Ci-1)}, where 0 < a < 1
            if (pCur->nPOS < 0) // Unknown words: P(Wi|Ci); known words: 1
                dValue += pCur->value;

            // Get the position index of the current word and the next word in the position map table
            nCurWordIndex = BinarySearch(pCur->row * MAX_SENTENCE_LEN + pCur->col, m_npWordPosMapTable, m_nWordCount);
            nNextWordIndex = BinarySearch(pNextWords->row * MAX_SENTENCE_LEN + pNextWords->col, m_npWordPosMapTable, m_nWordCount);

            // Insert the two position indexes, the smoothed value, and the part of speech into the binary linked list
            aBinaryWordNet.SetElement(nCurWordIndex, nNextWordIndex, dValue, pCur->nPOS);
            pNextWords = pNextWords->next; // Get next word
        }
        pCur = pCur->next;
    }
    return true;
}

Figure 4
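
The position table used by `BiGraphGenerate` can be sketched as follows. This is only an illustration of the `row * MAX_SENTENCE_LEN + col` encoding and the lookup by binary search. The names `WordNode`, `positionKey`, `buildPosMap`, and `findIndex` are hypothetical, the value 2000 for the sentence-length constant is an assumed stand-in, and the nodes are assumed to arrive sorted by row and then by column, which the use of a binary search in the source implies.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kMaxSentenceLen = 2000;  // assumed stand-in for MAX_SENTENCE_LEN

// Coordinates of one word in the word net (hypothetical type).
struct WordNode {
    int row;
    int col;
};

// Encode (row, col) as a single integer key, as in the source.
int positionKey(int row, int col) {
    assert(col >= 0 && col < kMaxSentenceLen);  // keeps keys unique per (row, col)
    return row * kMaxSentenceLen + col;
}

// Build the key array in list order; the input is assumed sorted by (row, col).
std::vector<int> buildPosMap(const std::vector<WordNode>& nodes) {
    std::vector<int> keys;
    keys.reserve(nodes.size());
    for (const WordNode& n : nodes) keys.push_back(positionKey(n.row, n.col));
    return keys;
}

// Return the index of the word at (row, col) in the key array, or -1 if absent.
int findIndex(const std::vector<int>& keys, int row, int col) {
    const int key = positionKey(row, col);
    auto it = std::lower_bound(keys.begin(), keys.end(), key);
    if (it == keys.end() || *it != key) return -1;
    return static_cast<int>(it - keys.begin());
}
```

The index returned by `findIndex` plays the role of `nCurWordIndex` and `nNextWordIndex` in the source: each word is renumbered by its position in the list, and those numbers become the coordinates of the binary word graph.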

The final key table result is shown in Figure 5:

Figure 5

Figure 6 shows the corresponding two-dimensional chart representation:

Figure 6

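The decimal values represent the degree of coupling between two adjacent words, that is, the possibility of their forming a longer word. Each value is computed as the negative logarithm of a smoothed probability (dValue in BiGraphGenerate), so a smaller value corresponds to a higher probability. The smaller the value, the more likely the two words are to stand as independent words.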
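
The edge value can be isolated into a small function to see how it behaves. The sketch below follows the expression for `dValue` in `BiGraphGenerate`, including the constant 80000 from the source. The function name `edgeWeight` and its parameter names are hypothetical, and the maximum frequency is passed as a parameter because its actual value is not given in this article.

```cpp
#include <cassert>
#include <cmath>

// -log{ a * P(Ci-1) + (1 - a) * P(Ci | Ci-1) }, with 0 < a < 1.
// a            : smoothing parameter (dSmoothingPara in the source)
// curFreq      : frequency of the current word (dCurFreqency)
// twoWordsFreq : frequency of the word pair (nTwoWordsFreq)
// maxFreq      : stand-in for MAX_FREQUENCE, supplied by the caller
double edgeWeight(double a, double curFreq, double twoWordsFreq, double maxFreq) {
    assert(a > 0.0 && a < 1.0);
    const double t = 1.0 / maxFreq;
    const double unigram = (1.0 + curFreq) / (maxFreq + 80000.0);
    const double bigram = (1.0 - t) * twoWordsFreq / (1.0 + curFreq) + t;
    return -std::log(a * unigram + (1.0 - a) * bigram);
}
```

With the other arguments fixed, a larger pair frequency gives a smaller value, so word pairs that are frequently seen together produce cheaper edges in the graph.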

 

 
