Strings (sorting, data structures)

Objective

Chapter 5 of Algorithms deals with strings and the various operations on them. In everyday programming, no language can do without strings; even operations on numeric values are sometimes converted into string operations, and in many settings all the data being processed is strings.

Strings are used this widely, yet Java gives them no special treatment (here I mean only in two respects: sorting and data structures).

Key-indexed counting sort

Before sorting strings, let's look at another interesting and useful sorting method.

For general data, shellsort, merge sort, insertion sort, quicksort, and three-way quicksort already meet our usual needs.

One thing to notice, though, is that all of them depend on an implementation of compareTo(): they always compare first and sort afterwards.

The more complex the data, the higher the cost of each comparison. Consider a common scenario: sorting all the students of a school by class.

A little analysis shows that the key is the class number and the value is the student. There are only R distinct class numbers, R is not large, and the number of students is far greater than the number of keys. Should the Student class implement the Comparator interface, with a three-way quicksort then used to sort data containing so many duplicate keys?

That is one way, but there is a more interesting method, one that sorts without any comparisons.

No comparisons? Then how does it sort? How does it know which element is larger and which is smaller?

In fact it relies on the natural order of the keys themselves: the numbers 1 to 10 are already ordered from small to large. Let's see step by step how this is done.

`int[] count = new int[R + 1];for (int i = 0; i < students.length; i++) { count[students[i].getClassNumber() + 1]++;}`

This step counts how often each class number occurs and stores the counts in the array. The significance of this is:

We know directly how many students are in class 1, how many in class 2, and so on. Suppose class 1, the lowest class number, has 20 students and class 2 has 22. If we take students[1] and find that its class is 2, we can place it directly at index 20 of the result, with no risk of running into another class's range.

In other words, we obtain the index at which each element should be stored, and elements are stored at different indices according to their class number. Note the + 1 in the code: the frequency of key 0 is stored in count[1], the frequency of key 1 in count[2], and so on, so that after the next step count[0] holds the starting index for key 0 and count[1] the starting index for key 1.

The next step converts the frequencies into starting indices.

`for (int i = 0; i < R; i++) { count[i + 1] += count[i];}`

Because count[r] now holds the starting index for class number r, count[0] is 0.

The next step is to move the data into place, which requires an auxiliary array.

`Student[] aux = new Student[students.length];for (int i = 0, length = students.length; i < length; i++) { aux[count[students[i].getClassNumber()]++] = students[i];}`

After each element is stored, the index for its key must advance by one, so each entry of count[] always points to the next free position for that key.

Finally, copy the result back.

`System.arraycopy(aux, 0, students, 0, students.length);`
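Putting the four steps together, a minimal self-contained sketch might look like the following. The Student class, its fields, and the sample data are assumptions made for illustration, and class numbers are assumed to lie in the range 0 to R - 1.

```java
class KeyIndexedCountingDemo {
    // Hypothetical record type: a student with a class number in [0, R).
    static final class Student {
        final String name;
        final int classNumber;
        Student(String name, int classNumber) {
            this.name = name;
            this.classNumber = classNumber;
        }
    }

    // Sorts students by class number without comparing elements; the sort is stable.
    static void sort(Student[] students, int R) {
        int[] count = new int[R + 1];
        for (Student s : students) count[s.classNumber + 1]++;       // 1. count frequencies
        for (int r = 0; r < R; r++) count[r + 1] += count[r];         // 2. frequencies -> start indices
        Student[] aux = new Student[students.length];
        for (Student s : students) aux[count[s.classNumber]++] = s;   // 3. distribute
        System.arraycopy(aux, 0, students, 0, students.length);      // 4. copy back
    }

    // Helper for the demo: "name + classNumber" of every student, space separated.
    static String names(Student[] students) {
        StringBuilder sb = new StringBuilder();
        for (Student s : students) sb.append(s.name).append(s.classNumber).append(' ');
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Student[] students = {
            new Student("Ann", 2), new Student("Bob", 0), new Student("Cid", 1),
            new Student("Dee", 2), new Student("Eve", 0)
        };
        sort(students, 3);
        System.out.println(names(students)); // prints: Bob0 Eve0 Cid1 Ann2 Dee2
    }
}
```

Students with the same class number keep their original relative order, which is the stability that the string sorts later rely on.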

Extension

The core of key-indexed counting is associating each key with an index into count[]. In the example above, the class number is itself a natural index, so no mapping is needed.

Most situations in everyday use are not like this, however. We may need to group by character, for instance. How are such cases handled?

For characters Java still offers a fairly simple approach: a char is implicitly widened to its numeric code (a UTF-16 code unit, which equals the ASCII code for ASCII characters), so a char can be used directly as an array subscript.

What about more complex keys, such as strings? For example, China's provinces might be grouped directly by name rather than by a numeric code. That requires some special handling.

Characters can be converted to and from numbers because a code table exists for them, ASCII or Unicode. In the same way, we can define our own code table.

`public class Alphabet { private String[] alphabet; private int size; public Alphabet(String[] alphabet); public String toStr(int index); public int toIndex(String s); public boolean contains(String s); public int R() { return size; }}`

With such a table, the conversion between keys and indices can be done in both directions.
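One possible way to fill in the Alphabet outline above is sketched here, assuming a HashMap is an acceptable way to map each string to its index. The class name StringAlphabet and the sample province names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

class StringAlphabet {
    private final String[] symbols;                          // index -> symbol
    private final Map<String, Integer> indexOf = new HashMap<>(); // symbol -> index

    StringAlphabet(String[] symbols) {
        this.symbols = symbols.clone();
        for (int i = 0; i < symbols.length; i++) indexOf.put(symbols[i], i);
    }

    String toStr(int index) { return symbols[index]; }

    int toIndex(String s) {
        Integer i = indexOf.get(s);
        if (i == null) throw new IllegalArgumentException("not in alphabet: " + s);
        return i;
    }

    boolean contains(String s) { return indexOf.containsKey(s); }

    int R() { return symbols.length; }

    public static void main(String[] args) {
        StringAlphabet provinces = new StringAlphabet(new String[] {"Beijing", "Hebei", "Shanxi"});
        System.out.println(provinces.toIndex("Hebei") + " of " + provinces.R());
    }
}
```

The index returned by toIndex() can then be used as the key in key-indexed counting, in the same way as the class number earlier.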

LSD (least-significant-digit-first) string sort

Once key-indexed counting is understood, LSD string sort is not hard to follow. It simply applies key-indexed counting to each character position of the strings, from the last position to the first. In the code, maxLength is the length of the longest string.

`User[] aux = new User[users.length]; for (int i = maxLength - 1; i >= 0; i--) { int[] count = new int[125]; /* For data such as ID card numbers, which come in two fixed lengths (15 and 18 digits), a position past the end of a string is given an out-of-range key: when counting positions 15 to 17, the 15-digit entries are skipped and not counted by character again. */ String tempStr; for (int j = 0; j < users.length; j++) { if ((tempStr = users[j].getsGroup()).length() - 1 < i) { count[1]++; } else { count[tempStr.charAt(i) + 2]++; } } for (int j = 0; j < 124; j++) { count[j + 1] += count[j]; } for (int j = 0; j < users.length; j++) { if ((tempStr = users[j].getsGroup()).length() - 1 < i) { aux[count[0]++] = users[j]; } else { aux[count[tempStr.charAt(i) + 1]++] = users[j]; } } System.arraycopy(aux, 0, users, 0, users.length); }`

Sorting proceeds from the lowest-order position to the highest, and the result is correct because key-indexed counting is stable.

Here the original algorithm has been adjusted slightly so that it can sort strings of unequal length. The idea: when the current position i is past the end of a string, the string is counted in count[1]; every real character is shifted by + 2.
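The same idea applied directly to a String[] could look like this. It is only a sketch: it assumes 7-bit ASCII keys (R = 128) and uses a hypothetical helper charAt() that returns -1 past the end of a string, which is slightly different bookkeeping from the User-based code above.

```java
import java.util.Arrays;

class LsdDemo {
    private static final int R = 128; // assumption: 7-bit ASCII keys

    // Key at position d, or -1 when the string is too short.
    private static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    static void sort(String[] a) {
        int maxLength = 0;
        for (String s : a) maxLength = Math.max(maxLength, s.length());
        String[] aux = new String[a.length];
        // One stable key-indexed counting pass per position, right to left.
        for (int d = maxLength - 1; d >= 0; d--) {
            int[] count = new int[R + 2];                             // keys -1..R-1, shifted by +2
            for (String s : a) count[charAt(s, d) + 2]++;             // frequencies
            for (int r = 0; r < R + 1; r++) count[r + 1] += count[r]; // start indices
            for (String s : a) aux[count[charAt(s, d) + 1]++] = s;    // distribute
            System.arraycopy(aux, 0, a, 0, a.length);                 // copy back
        }
    }

    public static void main(String[] args) {
        String[] a = {"bca", "b", "abc", "ab", "", "c"};
        sort(a);
        System.out.println(Arrays.toString(a)); // prints: [, ab, abc, b, bca, c]
    }
}
```

Treating a missing character as -1 makes a shorter string sort before any longer string that extends it, which matches the usual dictionary order.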

MSD (most-significant-digit-first) string sort

MSD string sort is likewise based on key-indexed counting. The differences are that MSD places no requirement on string lengths, which may be unequal, and that it processes the characters from left to right.

`public class MSD { private static int insertbound = 15; private static String[] aux; /* Character at index, or -1 past the end of the string. */ private static int toChar(String str, int index) { return index < str.length() ? str.charAt(index) : -1; } public static void sort(String[] str) { aux = new String[str.length]; sort(str, 0, str.length - 1, 0); } private static void sort(String[] str, int lo, int hi, int d) { if (hi <= lo + insertbound) { /* switch to insertion sort, from lo to hi */ for (int i = lo + 1; i <= hi; i++) { for (int j = i; j > lo && str[j].compareTo(str[j - 1]) < 0; j--) { String t = str[j]; str[j] = str[j - 1]; str[j - 1] = t; } } return; } int[] count = new int[123 + 2]; for (int i = lo; i <= hi; i++) { count[toChar(str[i], d) + 2]++; } for (int i = 0; i < 124; i++) { count[i + 1] += count[i]; } for (int i = lo; i <= hi; i++) { aux[count[toChar(str[i], d) + 1]++] = str[i]; } for (int i = lo; i <= hi; i++) { str[i] = aux[i - lo]; } for (int i = 0; i < 123; i++) { sort(str, lo + count[i], lo + count[i + 1] - 1, d + 1); } }}`

This shows a better way to handle strings of unequal length: the toChar() method. Because it may return -1, every index is uniformly shifted by + 2 when counting.

But several problems remain:

On closer analysis, if the strings differ widely there will be a great many recursive calls, and every call sets up and scans a count array whose size depends on R (123 in this example). As the character set grows, the processing time grows very quickly too. So pay attention to the size of the character set, or the sort may slow down to an unacceptable degree.

Sorting strings that share long common prefixes can be quite slow, because it is hard to reach the switch to insertion sort, while frequency counting, index conversion, and recursion are still needed at every level.

The cases that hurt its efficiency are exactly the ones that occur often, so we need an algorithm that avoids these flaws.

Three-way string quicksort

It can be understood as a variant of standard quicksort adapted to the specific case of strings.

`public class Quick3sort { /* Character at index, or -1 past the end of the string. */ private static int toChar(String str, int index) { return index < str.length() ? str.charAt(index) : -1; } public static void sort(String[] str) { sort(str, 0, str.length - 1, 0); } private static void sort(String[] str, int lo, int hi, int d) { if (lo >= hi) { return; } int cmp = toChar(str[lo], d); int i = lo + 1, lt = lo, gt = hi, t; while (i <= gt) { t = toChar(str[i], d); if (t < cmp) { exchange(str, lt++, i++); } else if (t > cmp) { exchange(str, i, gt--); } else { i++; } } sort(str, lo, lt - 1, d); /* move to the next character only if the strings in the middle part have not ended */ if (cmp >= 0) { sort(str, lt, gt, d + 1); } sort(str, gt + 1, hi, d); } private static void exchange(String[] str, int l, int r) { String temp = str[l]; str[l] = str[r]; str[r] = temp; }}`

Three-way quicksort mainly addresses the inefficiency that ordinary quicksort suffers when there are many duplicate keys. Similarly, three-way string quicksort is meant for large numbers of strings that share prefixes, and for keys drawn from a small range of characters.

Suppose the first 10 characters of all the strings are identical. In the first partition the two outer subarrays have length 0 and the middle subarray is the whole array. No strings are moved; only the single character at the current position is compared, so the common prefix is skipped over very quickly before the real sorting starts.

Trie (word lookup tree)

It needs more space for storage; its advantage is that it is very fast.

The time for a search hit is proportional to the length of the key being looked up.

A search miss needs to examine only a few characters.

A binary search tree compares whole keys, going right when the key is greater and left when it is smaller, one level at a time. Its search time is determined mainly by the height of the tree and by the cost of the comparisons themselves.

Here is a simple implementation idea. Building on the binary tree implementation, a basic trie is quite simple.

Take the 128-character ASCII set as an example.

`public class WST { private static final int R = 128; private static Node root; private class Node { private Object val; private Node[] next = new Node[R]; }}`

The core data structure is an array of nodes. Notice that the strings themselves are not stored. Instead, the characters of a string correspond one by one to indices into the node arrays along a path, and that path is what represents the key.
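As a rough illustration of how keys are inserted and looked up in such a structure, here is a minimal sketch. The method names put and get and the class name TrieDemo are assumptions, not part of the code above, and keys are assumed to be 7-bit ASCII.

```java
class TrieDemo {
    private static final int R = 128; // assumption: 7-bit ASCII keys

    private static final class Node {
        Object val;                    // non-null only where a key ends
        Node[] next = new Node[R];     // one link per possible character
    }

    private Node root;

    void put(String key, Object val) { root = put(root, key, val, 0); }

    private Node put(Node x, String key, Object val, int d) {
        if (x == null) x = new Node();
        if (d == key.length()) { x.val = val; return x; }
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d + 1);
        return x;
    }

    // Returns the value for key, or null on a search miss.
    Object get(String key) {
        Node x = root;
        for (int d = 0; d < key.length() && x != null; d++) {
            x = x.next[key.charAt(d)];  // each character is used as an array index
        }
        return x == null ? null : x.val;
    }

    public static void main(String[] args) {
        TrieDemo t = new TrieDemo();
        t.put("she", 1);
        t.put("shells", 2);
        System.out.println(t.get("she") + " " + t.get("shell") + " " + t.get("shore"));
    }
}
```

In this sketch a lookup for "shell" ends on an existing node whose value is null, while a lookup for "shore" runs into a null link; both return null.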

Its first advantage is that, for a large number of strings with common prefixes, it is friendly in both space and time.

But as R grows there are more and more null links, and these unused links take up a lot of space. If space consumption is ignored, this is undoubtedly the best data structure seen so far.

In my view, however, under real-world conditions space usually does have to be considered, so its superiority is not overwhelming.

Looked at from another angle, though, its strength becomes clear: this data structure also supports the following API.

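`String longestPrefixOf(String s); /* the longest key that is a prefix of s */ Iterable<String> keysWithPrefix(String s); /* all keys that have s as a prefix */ Iterable<String> keysMath(String s); /* all keys that match s, where s may contain wildcards */`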

With the earlier data structures, implementing these operations was quite expensive.
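For comparison, a trie can support two of these operations with a few lines each. The sketch below is an illustration under several assumptions: it stores only a set of keys (an isKey flag instead of a value), assumes ASCII keys, and the class name PrefixTrieDemo and method add are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixTrieDemo {
    private static final int R = 128; // assumption: 7-bit ASCII keys

    private static final class Node {
        boolean isKey;
        Node[] next = new Node[R];
    }

    private final Node root = new Node();

    void add(String key) {
        Node x = root;
        for (int d = 0; d < key.length(); d++) {
            char c = key.charAt(d);
            if (x.next[c] == null) x.next[c] = new Node();
            x = x.next[c];
        }
        x.isKey = true;
    }

    // Longest stored key that is a prefix of s, or null if there is none.
    String longestPrefixOf(String s) {
        Node x = root;
        int best = x.isKey ? 0 : -1;
        for (int d = 0; d < s.length(); d++) {
            x = x.next[s.charAt(d)];
            if (x == null) break;
            if (x.isKey) best = d + 1;
        }
        return best < 0 ? null : s.substring(0, best);
    }

    // All stored keys that start with prefix, in character order.
    List<String> keysWithPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        Node x = root;
        for (int d = 0; d < prefix.length() && x != null; d++) x = x.next[prefix.charAt(d)];
        collect(x, new StringBuilder(prefix), out);
        return out;
    }

    private void collect(Node x, StringBuilder sb, List<String> out) {
        if (x == null) return;
        if (x.isKey) out.add(sb.toString());
        for (char c = 0; c < R; c++) {
            sb.append(c);
            collect(x.next[c], sb, out);
            sb.deleteCharAt(sb.length() - 1);
        }
    }

    public static void main(String[] args) {
        PrefixTrieDemo t = new PrefixTrieDemo();
        for (String k : new String[] {"she", "sells", "shells", "shore", "by"}) t.add(k);
        System.out.println(t.keysWithPrefix("sh") + " " + t.longestPrefixOf("shellsort"));
    }
}
```

Both operations only walk links that already exist in the trie, so their cost depends on the characters examined rather than on the total number of keys.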

Another difference is that deleting a key from a trie need not physically remove a node. A search miss takes two forms: either the link for the next character is null, or the last character of the key is reached in the tree but the value stored there is null. Both count as misses.

In a trie, search time is optimal for both hits and misses. As for space, when all the keys are short:

The number of links needed is about RN (R is the size of the character set, N is the total number of keys stored).

For longer keys:

The number of links needed is about RNw (w is the average key length).

So when you use a trie, be sure to understand the characteristics of the character set and of the keys themselves.

Ternary search trie (TST)

The ternary search trie addresses the large space consumption of the trie: its space usage does not depend on R. It is implemented much like a binary search tree, except that a node has no whole key. The key is broken up into characters, each node holds one character, and a key is represented by a path through the nodes.

`private class Node { Object val; Node left, mid, right; char c;}`

The search path works as follows: if the node's character c matches and the end of the key has not been reached, continue down the middle link mid; on a mismatch, go right if the search character is greater and left if it is smaller. The same API as for the trie can be implemented, but search is slower than in the trie.
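The search path just described can be sketched as follows. The put and get methods and the class name TstDemo are assumptions added for illustration around the Node layout shown above; empty keys are simply rejected.

```java
class TstDemo {
    private static final class Node {
        char c;
        Node left, mid, right;
        Object val;
    }

    private Node root;

    void put(String key, Object val) {
        if (key.isEmpty()) throw new IllegalArgumentException("empty key");
        root = put(root, key, val, 0);
    }

    private Node put(Node x, String key, Object val, int d) {
        char c = key.charAt(d);
        if (x == null) { x = new Node(); x.c = c; }
        if (c < x.c) x.left = put(x.left, key, val, d);
        else if (c > x.c) x.right = put(x.right, key, val, d);
        else if (d < key.length() - 1) x.mid = put(x.mid, key, val, d + 1);
        else x.val = val;
        return x;
    }

    // Returns the value for key, or null on a search miss.
    Object get(String key) {
        if (key.isEmpty()) return null;
        Node x = root;
        int d = 0;
        while (x != null) {
            char c = key.charAt(d);
            if (c < x.c) x = x.left;                          // smaller: go left, same character
            else if (c > x.c) x = x.right;                    // greater: go right, same character
            else if (d < key.length() - 1) { x = x.mid; d++; } // match: go down, next character
            else return x.val;                                // match on the last character
        }
        return null;
    }

    public static void main(String[] args) {
        TstDemo t = new TstDemo();
        t.put("she", 1);
        t.put("sells", 2);
        t.put("shells", 3);
        System.out.println(t.get("sells") + " " + t.get("shell"));
    }
}
```

Each node has three links regardless of the alphabet size, which is where the space saving over the R-way trie comes from.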

It strikes a balance between time and space consumption.

The ternary search trie shows its advantage with large alphabets, especially when the character frequencies within the alphabet are very uneven.

As for improvements: depending on the requirements and the size of R, the root node, or the first few levels of nodes, can be turned into R-way trie nodes.

But one problem remains: substring search. It is a common operation in database queries, as in like '%xxx%', or in a simple Ctrl + F search, where we look for strings that contain a given substring. The structures above are of no help here.

Substring search (KMP algorithm)

In the simplest case, to find a string in a text we usually scan from start to finish. The string being searched for is called the pattern, and every character of the pattern must match. On a mismatch, the pattern starts over from its first character, and the text resumes from the character after the position where the failed attempt began.

This approach is known as brute force. Implementation one:

`public static int search(String pat, String text) { int M = pat.length(); int N = text.length(); for (int i = 0; i <= N - M; i++) { int j; for (j = 0; j < M; j++) { if (text.charAt(i + j) != pat.charAt(j)) { break; } } if (j == M) { return i; } } return N;}`

The algorithm is very simple and easy to understand. It works well in most cases and the search does not take long, but it does not suit every case: its running time depends on the characteristics of the text and of the pattern. In the worst case it needs about NM character comparisons.

Here is another, slightly different brute-force implementation.

`public static int search(String pat, String text) { int M = pat.length(); int N = text.length(); int i, j; for (i = 0, j = 0; i < N && j < M; i++) { if (text.charAt(i) == pat.charAt(j)) { j++; } else { i -= j; j = 0; } } if (j == M) { return i - M; } return N;}`

There is only one for loop; the inner loop is replaced by backing up the indices. This gives us a new way of thinking: using backup to control the index, that is, the position.

In fact, though, once we have the pattern we already hold a certain amount of information that can help solve the problem.

Take the pattern ababac. When we reach pat[5] (c) and the match fails, we can keep moving forward without backing up the text index i. If the mismatched text character is b, the text read so far ends in abab, which is a prefix of the pattern, so we only need to check whether the next text character is a. No backup is needed, and the known information is used to the full.

It is not only the characters already checked that are used; even the mismatched character serves as a piece of information. No character has to be checked twice, which improves efficiency.

But how do we make use of it?

KMP algorithm

Personally I find the KMP algorithm very hard to understand; even with diagrams to follow, it took me three or four days. Let's look at the harder version first. I will not paste the book's text here; the explanation is given only through the comments in the code. If you find it difficult, I recommend reading page 498 of Algorithms, 4th edition, and the pages around it.

There is also an interesting term here: deterministic finite automaton (DFA).

`/* dfa is a two-dimensional array. R is the size of the character set in use and M is the length of the pattern. Each value stored in the array is a state, and states range from 0 to M. */ /* So it is worth understanding what a state is. For the pattern ababac, each time a character is checked and matches, the state advances by one. If all six letters match, the pattern has been found and the state is naturally 6. The state starts at 0, meaning that no character has matched yet and matching starts from scratch. */ int[][] dfa = new int[R][M]; /* pat is the pattern. Setting dfa['a'][0] = 1 means that reading 'a' in state 0 enters state 1, a successful match, because the first character of pat is 'a'. In general dfa[pat.charAt(j)][j] = j + 1; this is the core. */ dfa[pat.charAt(0)][0] = 1; /* The outer loop runs once for each remaining position of the pattern. */ for (int x = 0, j = 1; j < M; j++) { for (int c = 0; c < R; c++) { /* Fill in column j: if character c is read at position j, which state should we go to? For a mismatch this is copied from the restart state x, explained in the last comment. */ dfa[c][j] = dfa[c][x]; } /* A successful match: in state j, reading pat.charAt(j) moves on to state j + 1. */ dfa[pat.charAt(j)][j] = j + 1; /* The key is what x means: x is the state to fall back to if the next match fails. For ababac, before position 3 (b) is handled, the previous character is a, so x = dfa['a'][0] = 1. Whatever the text character at position 3 turns out to be, matching can resume from at least state 1, because that a has already matched. In the next pass, dfa[c][j] = dfa[c][x] then asks: in state 1, which state does c lead to? Reading b in state 1 leads to dfa['b'][1]. A state simply records how many characters are matched at that point. If the pattern were abacac instead, dfa['c'][1] would be 0 and matching would go back to the start. In short: from state x, reading pat.charAt(j), which state do we enter? That becomes the new x. */ x = dfa[pat.charAt(j)][x]; }`

It only takes a few lines of code to get the job done.

`public static int search(String text) { int i = 0, j = 0, N = text.length(), M = pat.length(); for (; i < N && j < M; i++) { j = dfa[text.charAt(i)][j]; if (j == M) { /* i is the index of the last matched character */ return i - M + 1; } } return N;}`
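To try the two fragments together, they can be wrapped in one small class. This is a sketch under assumptions: the class name KmpDemo, the transition() accessor, and the restriction to 7-bit ASCII (R = 128) are not from the code above, and the pattern must be non-empty.

```java
class KmpDemo {
    private static final int R = 128; // assumption: 7-bit ASCII text and pattern
    private final String pat;
    private final int[][] dfa;

    KmpDemo(String pat) {
        if (pat.isEmpty()) throw new IllegalArgumentException("empty pattern");
        this.pat = pat;
        int M = pat.length();
        dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++) dfa[c][j] = dfa[c][x]; // mismatch: copy from restart state
            dfa[pat.charAt(j)][j] = j + 1;                      // match: advance
            x = dfa[pat.charAt(j)][x];                          // update restart state
        }
    }

    // Index of the first occurrence of the pattern in text, or text.length() if absent.
    int search(String text) {
        int N = text.length(), M = pat.length();
        int i, j;
        for (i = 0, j = 0; i < N && j < M; i++) j = dfa[text.charAt(i)][j];
        return j == M ? i - M : N;
    }

    // Exposes a single DFA entry so the table can be inspected.
    int transition(char c, int state) { return dfa[c][state]; }

    public static void main(String[] args) {
        KmpDemo kmp = new KmpDemo("ababac");
        System.out.println(kmp.search("abababacab")); // prints: 2
        System.out.println(kmp.transition('b', 5));   // prints: 4
    }
}
```

The second line of output is the case of the pattern ababac failing at position 5 on a b: the automaton falls back to state 4 instead of 0, and the text index never moves backwards.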

A major advantage of this approach is that it never backs up in the input, while the search time is guaranteed to stay linear even in the worst case. In practice, however, searching for a highly repetitive pattern in highly repetitive text is not common.

It is best suited to input streams of indeterminate length in which backing up is impossible or very costly.

Substring search (BM algorithm)

Among substring search algorithms, the Boyer-Moore (BM) algorithm is generally regarded as one of the most efficient in practice.

KMP runs in O(N + M) time, while BM typically needs only about N/M character comparisons.

BM is much easier to understand than KMP. It works by preprocessing the pattern to determine how far to shift forward on a mismatch.

Suppose we search the text F I N D I N A H A Y S T A C K N N E E D L E for the pattern N E E D L E.

Each attempt starts comparing at the right end of the pattern, which lets as many text characters as possible be skipped. text[5] is N, which differs from pat[5] (E). Scanning the pattern from right to left, the first N is at position 0, so the pattern moves 5 positions to the right.

`F I N D I N A H A Y S T A C K N N E E D L E`
`          N E E D L E`

The next attempt again starts at position 5 of the pattern, which is now aligned with text[10]. text[10] is S, which differs from pat[5], and S does not occur anywhere in the pattern, so the pattern moves forward 6 positions.

`F I N D I N A H A Y S T A C K N N E E D L E`
`                      N E E D L E`

And so on for the next comparison.

The shift distances used here can be obtained by preprocessing the pattern.

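`/* R is the size of the character set; right[c] is the rightmost position of character c in the pattern, from which the distance to advance the text pointer is computed. */ int[] right = new int[R]; /* j is the pointer into the pattern and keeps moving left; i is the pointer into the text and keeps moving right. */ int j, i; /* When the mismatched text character does not occur in the pattern, i should advance by j + 1. Storing -1 for such characters gives j - (-1) = j + 1 automatically. */ for (int c = 0; c < R; c++) { right[c] = -1; } /* M is the length of the pattern. The shift is decided by the first occurrence found when scanning from right to left, so we scan from left to right and let later positions overwrite earlier ones. */ for (j = 0; j < M; j++) { right[pat.charAt(j)] = j; }`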

With this short, easy-to-follow code, the preprocessing at the core of the BM algorithm is done.

But there is a special case that still needs attention:

`. . . . . . . . E L E . . . . . . .`
`          N E E D L E`

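Here the D of the pattern is compared with an E in the text. The computed shift is j - right['E'] = 3 - 5 = -2, which would move the pattern two positions to the left. That must not happen, so the rule is: whenever the computed shift is zero or negative, move the pattern one position to the right and match again.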

`. . . . . . . . E L E . . . . . . .`
`            N E E D L E`

Find:

`public static int search(String text, String pat) { int N = text.length(), M = pat.length(); /* building the right[] array as above is omitted here */ int skip; for (int i = 0; i <= N - M; i += skip) { skip = 0; for (int j = M - 1; j >= 0; j--) { if (text.charAt(i + j) != pat.charAt(j)) { skip = j - right[text.charAt(i + j)]; if (skip <= 0) { skip = 1; } break; } } if (skip == 0) { return i; } } return N;}`
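For reference, preprocessing and search can be combined into one runnable sketch. It uses only the mismatched-character rule described here, not the full Boyer-Moore algorithm, and it assumes 7-bit ASCII input (R = 128); the class name BoyerMooreDemo is made up for the example.

```java
import java.util.Arrays;

class BoyerMooreDemo {
    private static final int R = 128; // assumption: 7-bit ASCII text and pattern

    // Index of the first occurrence of pat in text, or text.length() if absent.
    static int search(String text, String pat) {
        int N = text.length(), M = pat.length();

        // right[c] = rightmost position of c in the pattern, or -1 if c does not occur.
        int[] right = new int[R];
        Arrays.fill(right, -1);
        for (int j = 0; j < M; j++) right[pat.charAt(j)] = j;

        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M - 1; j >= 0; j--) {          // compare right to left
                char t = text.charAt(i + j);
                if (t != pat.charAt(j)) {
                    skip = Math.max(1, j - right[t]);   // never shift left or stay put
                    break;
                }
            }
            if (skip == 0) return i;                    // every character matched
        }
        return N;
    }

    public static void main(String[] args) {
        System.out.println(search("FINDINAHAYSTACKNNEEDLE", "NEEDLE")); // prints: 16
    }
}
```

On this text the pattern is tried at positions 0, 5, 11, and 16 only, which is where the roughly N/M behaviour comes from.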

Substring search (Rabin-Karp algorithm)

The core idea of this algorithm has come up before. It resembles the equals() method: scan the text, repeatedly looking for a substring that equals() the pattern.

But there are several problems:

String's equals() compares character by character, and doing that inside the scanning loop is no faster than brute force.

1: The solution is to take a hash of the pattern, computed much like hashCode. As long as the modulus Q (which plays the role of the hash table size) is large enough, say 10^20, the probability of a collision is only about 1/10^20 and can be ignored.

Here we only need to store the hash value of the pattern.

`public long hash(String key, int M) { long h = 0; for (int i = 0; i < M; i++) { h = (R * h + key.charAt(i)) % Q; } return h;}`

If the text is handled in the same naive way, hashing characters 0-4, then 1-5, then 2-6, and so on, the result is slower than brute force, because besides scanning we also have to compute and compare hash values.

2: So the remaining problem is how to compute these hash values efficiently.

Let t_i stand for text.charAt(i). The value of the M characters of the text starting at position i is:

x_i = t_i * R^(M-1) + t_(i+1) * R^(M-2) + t_(i+2) * R^(M-3) + ... + t_(i+M-1) * R^0

hash(x_i) = x_i % Q

x_(i+1) = (x_i - t_i * R^(M-1)) * R + t_(i+M) * R^0

Writing C1 and C2 for constants (here C1 = R and C2 = R^M), this becomes:

x_(i+1) = C1 * x_i - C2 * t_i + t_(i+M)

This way each successive hash value is obtained in constant time.

The last step is to compare hash values to decide whether the strings are the same. To be completely sure, the substrings whose hash matches can then be compared character by character.
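The rolling update can be sketched as a complete search routine. The choices here are assumptions for illustration: R = 256, a fixed prime Q = 1,000,000,007 (small enough that the intermediate products fit in a long), and a character-by-character check whenever the hashes agree.

```java
class RabinKarpDemo {
    private static final int R = 256;             // assumption: radix used for hashing
    private static final long Q = 1_000_000_007L; // assumption: fixed prime modulus

    // Hash of the first M characters of key.
    private static long hash(String key, int M) {
        long h = 0;
        for (int i = 0; i < M; i++) h = (R * h + key.charAt(i)) % Q;
        return h;
    }

    // Index of the first occurrence of pat in text, or text.length() if absent.
    static int search(String text, String pat) {
        int N = text.length(), M = pat.length();
        if (M > N) return N;

        long RM = 1;                                   // R^(M-1) % Q
        for (int i = 1; i < M; i++) RM = (R * RM) % Q;

        long patHash = hash(pat, M);
        long textHash = hash(text, M);                 // hash of text[0..M-1]
        for (int i = 0; ; i++) {
            // Confirm a hash match by comparing the actual characters.
            if (textHash == patHash && text.regionMatches(i, pat, 0, M)) return i;
            if (i + M >= N) return N;
            // Roll the window: drop text[i], then append text[i + M].
            textHash = (textHash + Q - RM * text.charAt(i) % Q) % Q;
            textHash = (textHash * R + text.charAt(i + M)) % Q;
        }
    }

    public static void main(String[] args) {
        System.out.println(search("3141592653589793", "26535")); // prints: 6
    }
}
```

Because every hash match is verified, this variant never reports a false match; skipping the verification trades that certainty for a strictly constant amount of work per text character.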

The advantage of this algorithm is the compromise it strikes between space and worst-case time. It needs no extra space, and its search time is the same in all cases, about 7N operations.

To summarize:

Java's indexOf() method uses brute force, whose drawback is an O(MN) worst case. KMP is linear and needs extra space; its advantage is that it never backs up, which suits streams. BM also needs extra space and typically takes about N/M comparisons, a speedup by a factor of about M. Rabin-Karp needs no extra space and its running time is linear.

So in ordinary situations brute force is good enough, and the other algorithms can be chosen when the situation calls for them.
