Chinese word segmentation algorithm: a Baidu interview question

Source: Internet
Author: User

Problem:
Given a string and an array of words (the dictionary), determine whether the string can be separated completely into words from the dictionary.

Use a dynamic programming algorithm.
I wrote the following code during the interview.

    public static boolean divied2(String s, String[] dict) {
        boolean result = false;
        if (s.length() == 0)
            return true;
        for (int i = 0; i < dict.length; i++) {
            int index = s.indexOf(dict[i]);
            if (index != -1) {
                System.out.println(index);
                // Delete the matched word and join what is left of the string.
                String tmp1 = s.substring(0, index);
                String tmp2 = s.substring(index + dict[i].length(), s.length());
                // Returning here ends the loop at the first dictionary word found.
                return divied2(tmp1 + tmp2, dict);
            }
        }
        return result;
    }

But for the test case

    String[] dict = {"百度一", "百度", "一下", "我就", "知", "道"};
    System.out.println(divied2("百度一下我就知道", dict));

this does not pass. Because the word "百度一" is matched and deleted first, the word "一下" is destroyed, and the remainder "下我就知道" can no longer be separated.
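To make the failure concrete, here is a small sketch of the same find-and-delete step applied to the test string. The class GreedyDeletionDemo and the helper deleteFirst are hypothetical names introduced only for this illustration. They are not part of the interview code.

```java
class GreedyDeletionDemo {
    // Hypothetical helper: removes the first occurrence of word from s,
    // using the same indexOf/substring step as the interview code.
    static String deleteFirst(String s, String word) {
        int index = s.indexOf(word);
        if (index == -1) {
            return s;
        }
        return s.substring(0, index) + s.substring(index + word.length());
    }

    public static void main(String[] args) {
        String s = "百度一下我就知道";
        // Deleting "百度一" first leaves a lone "下" that no dictionary word covers.
        System.out.println(deleteFirst(s, "百度一")); // 下我就知道
        // Deleting "百度" first keeps "一下" intact.
        System.out.println(deleteFirst(s, "百度"));   // 一下我就知道
    }
}
```

Which word is tried first therefore decides the outcome, so a function that commits to the first match cannot be relied on.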

Thinking it over afterwards, the cause is that the return statement terminates the traversal at the first match. After the improvement below, the test passes.
The key change is the |= operation: the results of all branches are ORed together, so the string counts as separable as long as any one branch separates it completely.

    public static boolean divied(String s, String[] dict) {
        boolean result = false;
        if (s.length() == 0)
            return true;
        for (int i = 0; i < dict.length; i++) {
            int index = s.indexOf(dict[i]);
            if (index != -1) {
                System.out.println(index);
                String tmp1 = s.substring(0, index);
                String tmp2 = s.substring(index + dict[i].length(), s.length());
                // OR the branch result instead of returning it, so the loop continues.
                result |= divied(tmp1 + tmp2, dict);
            }
        }
        return result;
    }

The disadvantage is that the time complexity is too high.
With string length m and dictionary size n, the time complexity is roughly n^m.

    public static boolean divied(String s, String[] dict) {
        boolean result = false;
        if (s.length() == 0)
            return true;
        for (int i = 0; i < dict.length; i++) {
            count++; // global counter, incremented once per loop iteration
            int index = s.indexOf(dict[i]);
            if (index != -1) {
                System.out.println(index);
                String tmp1 = s.substring(0, index);
                String tmp2 = s.substring(index + dict[i].length(), s.length());
                result |= divied(tmp1 + tmp2, dict);
                if (result) { // Optimization point
                    return true;
                }
            }
        }
        return result;
    }

Optimization Idea One:
Terminate the loop directly as soon as result is true.
To measure the effect, add a global variable that counts executions.
Without the early exit, the count was about 180 (the exact number depends on the order of the words in the dictionary).
With the early exit, the count was only 21.

When the dictionary order is varied, the count is only about 30 if the string can be separated completely, but 374 if it cannot.

Optimization Idea Two:
If each word appears only once in the string, a word can be removed from the dictionary as soon as it has been found in the string and deleted, which avoids unnecessary loop iterations.
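The post gives no code for this idea, so the following is only a minimal sketch of how it might look. It assumes, as the idea itself does, that no dictionary word is needed more than once. The class and method names (SingleUseDictionary, canSeparate) are hypothetical, and the dictionary is passed as a List so that an entry can be dropped for the recursive call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SingleUseDictionary {
    // Same find-and-delete recursion as above, but a word that has been used
    // is removed from the dictionary handed to the deeper calls.
    // Assumption: each dictionary word occurs at most once in the string.
    static boolean canSeparate(String s, List<String> dict) {
        if (s.isEmpty()) {
            return true;
        }
        for (int i = 0; i < dict.size(); i++) {
            String word = dict.get(i);
            int index = s.indexOf(word);
            if (index != -1) {
                String rest = s.substring(0, index) + s.substring(index + word.length());
                List<String> remaining = new ArrayList<>(dict); // copy, so backtracking stays correct
                remaining.remove(i);                            // the used word is no longer tried
                if (canSeparate(rest, remaining)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("百度一", "百度", "一下", "我就", "知", "道");
        System.out.println(canSeparate("百度一下我就知道", dict));   // true
        System.out.println(canSeparate("百度一下我后就知道", dict)); // false
    }
}
```

Each level of recursion loops over one word fewer than the level above it. Copying the list keeps the sketch simple at the cost of extra allocation.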

Optimization Idea Three:
In fact, the OR operation is only needed for dictionary words that begin with the same character but have different lengths. The program can target this case specifically.

After this improvement, the effect is very good. The running time is essentially the same whether or not the string can be separated completely.
For the following test cases, the count is 44 when the string can be separated and 60 when it cannot.
The time complexity is reduced to about the factorial of the dictionary length.

    // Can be separated:
    String[] dict = {"百度一", "一下", "知", "我就", "百度", "道"};
    System.out.println(divied("百度一下我就知道", dict));

    // Cannot be separated ("后" is not in the dictionary):
    String[] dict = {"百度一", "一下", "知", "我就", "百度", "道"};
    System.out.println(divied("百度一下我后就知道", dict));

    public static boolean divied(String s, String[] dict) {
        boolean result = false;
        if (s.length() == 0)
            return true;
        char start = '+'; // sentinel: no word has matched yet
        for (int i = 0; i < dict.length; i++) {
            count++;
            int index = s.indexOf(dict[i]);
            // Recurse on the first match, then only on words with the same first character.
            if (start == '+' && index != -1 || index != -1 && dict[i].charAt(0) == start) {
                System.out.println(index);
                String tmp1 = s.substring(0, index);
                String tmp2 = s.substring(index + dict[i].length(), s.length());
                start = dict[i].charAt(0);
                result |= divied(tmp1 + tmp2, dict);
                if (result) {
                    return true;
                }
            }
        }
        return result;
    }

Optimization Idea Four:
This refines idea three: recurse only for words that have such a repetition (another word starting with the same character). For all other words, delete the match and continue the loop without recursing.

    public class Divide {
        static int count = 0;

        public static boolean divied(String s, String[] dict) {
            boolean result = false;
            if (s.length() == 0)
                return true;
            char start = '+'; // sentinel: no word has matched yet
            for (int i = 0; i < dict.length; i++) {
                count++;
                int index = s.indexOf(dict[i]);
                if (start == '+' && index != -1) {
                    // First match: delete it in place and remember its first character.
                    String tmp1 = s.substring(0, index);
                    String tmp2 = s.substring(index + dict[i].length(), s.length());
                    s = tmp1 + tmp2;
                    start = dict[i].charAt(0);
                }
                if (index != -1 && dict[i].charAt(0) == start) {
                    String tmp1 = s.substring(0, index);
                    String tmp2 = s.substring(index + dict[i].length(), s.length());
                    s = tmp1 + tmp2;
                    result |= divied(tmp1 + tmp2, dict);
                    if (result) {
                        return true;
                    }
                }
            }
            return result;
        }

        public static void main(String[] args) {
            String[] dict = {"百度一", "百度", "我就", "一下", "知", "道"};
            System.out.println(divied("百度一下我就知道", dict));
            System.out.println(count);
        }
    }

The final result: when the string can be separated completely, the loop count is 6.
When it cannot be separated completely, the loop count is 18.
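
The problem statement calls for dynamic programming, while every version above is a recursion that deletes words from the string. For comparison, here is a minimal sketch of the common prefix-based dynamic programming formulation. It is not the author's code, the names (WordBreakDp, canSegment) are hypothetical, and it assumes that "separated" means splitting the string from left to right into contiguous dictionary words, each of which may be reused.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class WordBreakDp {
    // ok[end] is true when the prefix s[0, end) can be split into dictionary words.
    static boolean canSegment(String s, String[] dict) {
        Set<String> words = new HashSet<>(Arrays.asList(dict));
        int maxLen = 0;
        for (String w : dict) {
            maxLen = Math.max(maxLen, w.length());
        }
        boolean[] ok = new boolean[s.length() + 1];
        ok[0] = true; // the empty prefix is trivially segmented
        for (int end = 1; end <= s.length(); end++) {
            // Only look back as far as the longest dictionary word.
            for (int start = Math.max(0, end - maxLen); start < end; start++) {
                if (ok[start] && words.contains(s.substring(start, end))) {
                    ok[end] = true;
                    break;
                }
            }
        }
        return ok[s.length()];
    }

    public static void main(String[] args) {
        String[] dict = {"百度一", "一下", "知", "我就", "百度", "道"};
        System.out.println(canSegment("百度一下我就知道", dict));   // true
        System.out.println(canSegment("百度一下我后就知道", dict)); // false
    }
}
```

With string length m and longest word length k, this performs at most about m * k set lookups, and the result does not depend on the order of the dictionary.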
