Microsoft Interview 100 Series: string matching algorithm to find substrings that contain a character set

Source: Internet
Author: User

I found this problem very interesting. After reading other blogs I came across an algorithm that looks like a good solution, and I wrote it down here to strengthen my own understanding.

Problem statement:

Implement an advanced character matching algorithm:

Given a long string and a target string, find the substrings of the long string that contain every character of the target string, in any order. For example, for the target string 123, substrings of the form 1******3*****2 or 12******3 must be found. Output all substrings that satisfy the condition, and also find the shortest one. This is similar to a keyword filtering system.

For example, if the target string is "423" and the input long string is "4fsdfk2jfl3fd2jfksd3j4d4d4jkfd4jd3kdf2", the output should be "4fsdfk2jfl3", "2jfksd3j4", and "4jd3kdf2". The shortest of these substrings should also be output.
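To make the matching condition concrete, here is a minimal sketch of a check that a candidate substring contains every character of the target string. The helper name containsAll is hypothetical and does not appear in the original post.

```cpp
#include <string>

// Hypothetical helper (not from the original post): returns true when
// every character of `target` occurs at least once in `candidate`.
bool containsAll(const std::string &candidate, const std::string &target)
{
    bool seen[256] = {false};
    for (unsigned char c : candidate)
        seen[c] = true;          // mark every character the candidate contains
    for (unsigned char c : target)
        if (!seen[c])
            return false;        // a required character is missing
    return true;
}
```

With the target "423", containsAll accepts "4fsdfk2jfl3" and "4jd3kdf2", and rejects "4fsdfk2jf" because it has no 3.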

Basic idea:

Use two variables, front and rear, to point to the head and tail of a substring interval (initially both point to the start of the string). Use int cnt[256]={0} to record how many times each character of the character set appears in the current interval, and a variable count to record how many distinct characters of the character set the interval contains. Keep incrementing rear, updating cnt[] and count, until count equals the size of the character set. Then increment front until the occurrence count of some character in cnt[] drops to 0. The characters at the head of the interval may appear again later in it, so front has to advance until it passes a character that is not repeated further on. At that point a substring that satisfies the condition has been found. Repeating this process yields the substrings that satisfy the condition, and with them the shortest one.
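The front/rear procedure described above can be sketched as a function that collects the windows instead of printing them. This is only an illustration under some assumptions: the names findWindows, inSet and distinct are hypothetical, and the sketch counts distinct target characters itself so that a target with repeated characters does not break the termination test.

```cpp
#include <string>
#include <vector>

// Sketch of the two-pointer scan: returns every window reported by the
// procedure, in the order they are found.
std::vector<std::string> findWindows(const std::string &src, const std::string &des)
{
    bool inSet[256] = {false};   // membership table for the target characters
    int cnt[256] = {0};          // occurrences inside the interval [front, rear]
    int distinct = 0;            // number of distinct target characters
    for (unsigned char c : des)
        if (!inSet[c]) { inSet[c] = true; ++distinct; }

    std::vector<std::string> windows;
    std::size_t front = 0;
    int count = 0;               // distinct target characters currently covered
    for (std::size_t rear = 0; rear < src.size(); ++rear) {
        unsigned char r = src[rear];
        if (!inSet[r]) continue;           // ordinary character: just move on
        if (cnt[r]++ == 0) ++count;        // first occurrence in the interval
        if (count < distinct) continue;    // not all characters covered yet

        // All characters covered: advance front until a character's count hits 0.
        while (true) {
            unsigned char f = src[front];
            if (inSet[f] && --cnt[f] == 0) {
                windows.push_back(src.substr(front, rear - front + 1));
                --count;                   // the interval loses this character
                ++front;
                break;
            }
            ++front;
        }
    }
    return windows;
}
```

Called with the long string and target "423" from the problem statement, the first window found is "4fsdfk2jfl3" and the shortest is "4jd3kdf2". Each index is visited at most once by rear and once by front, so the scan is linear in the length of the long string.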

Key points of the implementation (explained together with the code):

1) Distinguish matching characters from ordinary characters.

A hash table records the matching characters. Because character values range from 0 to 255, an int array of length 256 is enough to represent the set of matching characters. To save space, a bitset achieves the same effect.
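As a rough illustration of the bitset variant mentioned above, the membership table can be built with std::bitset<256>. The function name buildCharSet is an assumption made for this sketch.

```cpp
#include <bitset>
#include <string>

// Hypothetical helper: one bit per possible character value (0-255).
std::bitset<256> buildCharSet(const std::string &des)
{
    std::bitset<256> inSet;
    for (unsigned char c : des)
        inSet.set(c);            // mark c as a matching character
    return inSet;
}
```

A lookup is then inSet.test(c) with c converted to unsigned char. The bitset needs 256 bits, whereas an int array of length 256 typically takes 1024 bytes.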

2) Scan the string. Whenever a matching character is scanned, do the following:

1. If the character has not yet been seen in the current interval, do count++ to record one more matched character, and increment its occurrence count: cnt[char] += 1.

2. If the character has already been seen, only increment its occurrence count: cnt[char] += 1.

3) Once count equals the size of the character set, a matching string has been produced, and the shortest substring must be extracted from it. The extraction is the core of the algorithm:

1. Scan characters starting from front. Skip characters that are not in the character set. For a character that is in the character set, the next step depends on its occurrence count.

2. If the character occurs only once, it is the first character of the substring. Output the substring; the extraction is complete.

3. If the character occurs more than once, the substring starting here cannot be the shortest, because the same character appears again later in the interval and dropping this occurrence still leaves every required character covered. Decrement its occurrence count (cnt[char] -= 1), do front++, and continue scanning from step 1.

4) After a candidate shortest substring has been determined, the scan continues and front must move forward by one. Because the character front pointed to occurred only once, the interval no longer contains it after the move, so the number of matched characters must be decremented: count -= 1.
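The extraction in step 3) can be isolated into a small function. The sketch below assumes the interval starting at front already contains every character of the set and that cnt holds the occurrence counts for that interval; the name shrinkFront and its parameters are hypothetical.

```cpp
#include <string>

// Advances front past duplicated set characters and returns the index
// where the shortest substring ending at the current rear begins.
// Precondition: every set character occurs at least once from `front` on,
// and cnt[] counts the occurrences inside the current interval.
std::size_t shrinkFront(const std::string &src, std::size_t front,
                        int cnt[256], const bool inSet[256])
{
    while (true) {
        unsigned char f = src[front];
        if (inSet[f]) {
            if (cnt[f] == 1)
                return front;    // only occurrence: the substring must start here
            --cnt[f];            // duplicated later on: safe to drop this one
        }
        ++front;                 // skip ordinary or dropped characters
    }
}
```

For the interval "2x2y3z4" with the character set "234", the leading 2 occurs twice, so it is dropped along with x, and the function returns index 2, giving the substring "2y3z4". Step 4) then corresponds to the caller decrementing cnt for that character, decrementing count, and moving front one position further.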

Those are the key points of the algorithm. The code follows:

#include <climits>
#include <cstring>
#include <iostream>
using namespace std;

void MinSubString(const char *src, const char *des)
{
    int min = INT_MAX;    // length of the shortest substring found so far
    int minFront = 0;     // start position of the shortest substring
    int minRear = -1;     // end position of the shortest substring (-1: none found)
    int front = 0, rear = 0;
    int len = (int)strlen(des);   // assumes the characters of des are distinct
    int hashtable[256] = {0};
    int cnt[256] = {0};

    // Map the characters of the character set into hashtable, so that we can
    // tell whether a character of src belongs to the character set.
    for (int i = 0; i < len; i++)
        hashtable[(unsigned char)des[i]] = 1;

    int count = 0;   // number of distinct set characters in [front, rear]
    while (src[rear] != '\0')
    {
        unsigned char r = src[rear];
        if (hashtable[r] == 1)   // the character at rear is in the character set
        {
            if (cnt[r] == 0)     // first occurrence in the current interval
            {
                count++;
                cnt[r]++;
                if (count == len)   // every character of the set has been seen
                {
                    // Extract the shortest substring from [front, rear].
                    while (1)
                    {
                        unsigned char f = src[front];
                        if (hashtable[f] == 1)   // the character at front is in the set
                        {
                            cnt[f]--;
                            if (cnt[f] == 0)
                            {
                                // This character has no other occurrence in the
                                // interval, so [front, rear] satisfies the condition.
                                for (int i = front; i <= rear; i++)   // print this substring
                                    cout << src[i];
                                cout << endl;
                                if (rear - front + 1 < min)
                                {
                                    min = rear - front + 1;
                                    minRear = rear;
                                    minFront = front;
                                }
                                // Start a new substring from the character after
                                // front. The character at front belongs to the set,
                                // and after front++ the interval no longer contains
                                // it, so count must be decremented.
                                count--;
                                front++;
                                break;
                            }
                        }
                        front++;
                    }
                }
            }
            else
                cnt[r]++;
        }
        rear++;   // advance whether or not the current character is in the set
    }

    cout << "Shortest substring: ";
    for (int i = minFront; i <= minRear; i++)
        cout << src[i];
    cout << endl;
}

int main()
{
    // const char *src = "ab1dkj2ksjf3ae32ks1iji2sk1ksl3ab;iksaj1223";
    // const char *src = "2sk1ksl3ab;iksaj1223";
    // const char *src = "ab1dkj2ksjf3ae32ks1iji2sk1ksl3ab;1ik3saj123";
    const char *src = "adhe1jk2jk2jkj1jk2jksd2mjkl3jk1kj4lkkj";
    const char *des = "1234";
    MinSubString(src, des);
    return 0;
}

Reference articles:

http://blog.csdn.net/cxllyg/article/details/7595878: the blog author provides a good approach, and Leonlovezh improves on it in the comments. The improved approach seemed more concise to me, so this article adopts it.

http://www.cnblogs.com/tractorman/p/4064054.html
