Optimizing Matching Against a Huge Keyword List


Origin of the problem

A problem came up at work a few days ago:

There are 600,000 SMS logs, each about 50 characters long, and 50,000 keywords, each 2-8 characters long and mostly Chinese. The task: extract every keyword contained in these 600,000 records and count the number of hits for each keyword.

This article gives a complete description of my implementation and how I optimized the task from a 10-hour run down to 10 minutes or less. Although the implementation language is PHP, the article is mostly about the ideas, which should be useful regardless of language.

Original – grep

Design

When the task first landed, my mind immediately went to: log + keyword + statistics. Rather than writing code myself, my first thought was the grep command commonly used for log statistics on Linux.

The usage of grep needs no introduction: grep 'keyword' file.log | wc -l conveniently counts the number of lines that hit a keyword, and PHP's exec() function lets us invoke Linux shell commands directly, although there is a security risk when executing dangerous commands.

Code

Pseudo-code:

foreach ($word_list as $keyword) {
    $count = intval(exec("grep '{$keyword}' file.log | wc -l"));
    record($keyword, $count);
}

Running on an old machine (which was, admittedly, really inefficient) took 6 hours; a newer machine would presumably finish in 2-3 hours. But before the new machine could be put to work, the requirements changed, and the story had only just begun.

Original – primitive in both ideas and methods.

Evolution – Regular Expressions

Design

Hardly had I finished when, the next day, the product manager raised a new idea: a data source would later be connected, delivering messages as a data stream rather than a file. They also asked for near-real-time statistics, which overturned my write-to-file-then-count approach. For the sake of extensibility, the unit of statistics was no longer the whole log; instead, batches of n individual messages had to be matched.

At this point I had to bring out the most traditional of tools: regular expressions. Implementing them is not hard, since every language wraps regex matching functions; the focus is on constructing the pattern.

Of course, the pattern here is not difficult to build either: /keyword1|keyword2|.../, simply joining the keywords with |.

Small regex pitfalls

Here are two small pitfalls I ran into:

    • A pattern that is too long causes the match to fail: PHP's PCRE engine has a backtracking limit to prevent the stack from being exhausted and eventually crashing PHP. An overlong pattern makes PCRE detect too much backtracking and abort the match; testing under the default settings puts the maximum pattern length at roughly 32,000 bytes. The pcre.backtrack_limit parameter is the cap on backtracking steps, with a default of 1,000,000; raising it in php.ini, or at the start of the script with ini_set('pcre.backtrack_limit', n);, increases the maximum pattern length for a single match. Of course, you can also simply process the keywords in batches (which is what I did =_=).

    • Special characters in the pattern produce a flood of warnings: during matching, PHP emitted a large number of "unknown modifier 乱码" warnings. Careful inspection revealed that some keywords contained the / character; running every keyword through preg_quote() fixes this.
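To illustrate the second pitfall, here is a minimal Python sketch (the keyword list is made up; re.escape plays the role of PHP's preg_quote()):

```python
import re

# made-up keyword list; "C++" contains regex metacharacters that break a raw pattern
keywords = ["好人", "科学家", "C++"]

try:
    re.compile("|".join(keywords))   # unescaped: "C++" raises a "multiple repeat" error
except re.error as e:
    print("bad pattern:", e)

# escape each keyword first, then join them into one alternation pattern
pattern = re.compile("|".join(re.escape(k) for k in keywords))
hits = pattern.findall("这位科学家会写C++，是个好人")
# hits == ["科学家", "C++", "好人"]
```

Note that in PHP the delimiter character also needs escaping, so preg_quote($keyword, '/') is the safer call when the pattern is wrapped in /.../.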

Code

Pseudo-code:

$end = 0;
$step = 1000; // keywords per block; keep each pattern under the backtrack limit
$pattern = array();
// split the keywords into several smaller patterns first
while ($end < count($word_list)) {
    $tmp_arr = array_slice($word_list, $end, $step);
    $end += $step;
    $pattern[] = implode('|', array_map(function ($kw) {
        return preg_quote($kw, '/');
    }, $tmp_arr));
}

$content = file_get_contents($log_file);
$lines = explode("\n", $content);
foreach ($lines as $line) {
    $matched = array();
    // match the line against each small pattern block
    for ($i = 0; $i < count($pattern); $i++) {
        preg_match_all("/{$pattern[$i]}/", $line, $match);
        $matched = array_merge($matched, $match[0]);
    }
    $matched = array_unique(array_filter($matched));
    dealResult($matched);
}

To get the task done, I gritted my teeth and let this run all night. When I found the next day that it had been running for nearly 10 hours, my heart sank... It was far too slow to meet the requirements at all, and I started thinking about changing the approach.

When the product manager changed the keyword strategy, replaced some keywords, asked for a rerun, and said the keywords would keep being tuned, I completely abandoned the existing scheme. Matching the messages with the keywords was out of the question: one pass with all the keywords per message is intolerably inefficient.

Evolution – as the requirements evolve, so must the implementation.

Awakening – Breaking words

Design

I finally realized that I should be checking the messages against the keywords, not the other way around. If I build a hash table from the keywords and look up the words in each message against that table, a hit means a match; wouldn't that give O(1) lookups?

But how do I split a short message into exactly the words to look up? Word segmentation? Segmentation also takes time, and my keywords are mostly non-semantic strings, so building a lexicon and using a segmentation tool would be a big problem. In the end I thought of 拆词, brute-force splitting.

Why call it splitting? I considered brute-force splitting a sentence into 所有可能 (all possible) words. For example, 我是好人 can be split into 我是、是好、好人、我是好、是好人、我是好人 and so on. My keywords are 2-8 characters long, so the number of candidate words grows rapidly with sentence length. However, punctuation, spaces, and function words (such as 的、是) can be used as separators to break the message into short phrases first, and splitting each phrase then greatly reduces the number of candidate words.
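The brute-force split just described can be sketched as follows (a Python sketch for brevity; candidate_words is a hypothetical helper, not from the original code):

```python
def candidate_words(phrase, min_len=2, max_len=8):
    # every substring of length 2..8 is a candidate keyword
    return [phrase[i:i + n]
            for i in range(len(phrase))
            for n in range(min_len, max_len + 1)
            if i + n <= len(phrase)]

print(candidate_words("我是好人"))
# → ['我是', '我是好', '我是好人', '是好', '是好人', '好人']
```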

In fact, this splitting approach was never fully implemented; it was replaced by the method in the next section. But it is a very feasible idea, so while writing this article I sketched it in pseudo-code for reference. Even if it is not used for keyword matching, it may be useful elsewhere.

Code

$str_list = getStrList($msg);
foreach ($str_list as $str) {
    $keywords = getKeywords($str);
    foreach ($keywords as $keyword) {
        // PHP arrays are hash tables underneath, so this lookup is fast
        if (isset($word_list[$keyword])) {
            record($keyword);
        }
    }
}

/**
 * Split the message into short phrases
 */
function getStrList($msg) {
    $str_list = array();
    $separators = array('，', '。', '的', ...);

    $words = preg_split('/(?<!^)(?!$)/u', $msg);
    $str = array();
    foreach ($words as $word) {
        if (in_array($word, $separators)) {
            $str_list[] = $str;
            $str = array();
        } else {
            $str[] = $word;
        }
    }

    return array_filter($str_list);
}

/**
 * Extract every candidate word from a short phrase
 */
function getKeywords($str) {
    if (count($str) < 2) {
        return array();
    }

    $keywords = array();
    for ($i = 0; $i < count($str); $i++) {
        for ($j = 2; $j < 9; $j++) {
            if ($i + $j > count($str)) { // don't run past the end of the phrase
                break;
            }
            $keywords[] = implode('', array_slice($str, $i, $j));
        }
    }

    return $keywords;
}

Results

We know that in UTF-8 a Chinese character takes three bytes, so to split a string containing both Chinese and English into individual characters, a simple byte-wise split() won't do.

Here preg_split('/(?<!^)(?!$)/u', $msg) splits the string between every pair of characters by matching zero-width positions. The two lookaround groups (?<!^)(?!$) restrict the split point to positions that are neither the very beginning nor the very end (without them, using // directly as the pattern leaves an empty string entry at the start and end of the result). The concept and usage of such groups is covered in my earlier blog post, "Capturing groups and non-capturing groups in PHP regular expressions".
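For comparison, here is the same character-level split in Python, where strings are already sequences of code points, so list() is enough; the regex mirrors the PHP pattern (empty matches in re.split require Python 3.7+):

```python
import re

msg = "我是good人"

# Python 3 strings are code-point sequences, so list() already splits by character
chars = list(msg)

# the zero-width pattern mirrors PHP's preg_split('/(?<!^)(?!$)/u', $msg)
assert re.split(r"(?<!^)(?!$)", msg) == chars == ["我", "是", "g", "o", "o", "d", "人"]
```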

Because this was never fully implemented, I don't know how efficient it really is. Each short message of about 50 characters splits into roughly 10 phrases, yielding perhaps 200 candidate words. Although many of them are meaningless, I believe the efficiency would not be low thanks to the fast hash lookups; I even suspect it could be faster than the final method.

Ultimately this scheme was not used: it puts demands on the sentences themselves, the set of separators is hard to pin down, and, most importantly, it is not elegant... I didn't want to implement it; cataloguing punctuation and modal particles felt cumbersome, and generating so many meaningless words felt like wasted effort.

Awakening – consciousness and thought both awaken.

Final Level –trie Tree

Trie Tree

So I turned to Google for help. Searching for large-scale keyword matching, I found people suggesting the trie tree; I hadn't expected the trie I had just learned to come in handy so soon. I introduced the trie (dictionary tree, 字典树) in my earlier post on spatial indexes and quadtrees, which you can check out.

For the lazy, I've copied my explanation from back then (skip this section if you already know tries).

A dictionary tree, also known as a prefix tree or trie, is an ordered tree for storing associative arrays whose keys are usually strings. Unlike a binary search tree, a key is not stored in a single node; instead it is determined by the node's position in the tree. All descendants of a node share a common prefix, namely the string corresponding to that node, and the root corresponds to the empty string.

We can draw an analogy with a paper dictionary: when we look up a character pronounced huang by pinyin, we first find the h section among a...z, then within it the ha, he, hu parts, then under hu the hua entries, then huan, and finally huang, where neighboring entries differ only in tone. The pronunciation prefixes h, hu, hua, huan, huang get progressively longer, and the longer the prefix, the more precise the lookup. A tree organized like this dictionary is the dictionary tree, that is, the prefix tree.

Design

So how does a trie match keywords? The process is explained below (the original post included a diagram of the trie matching process).

The key points:

Constructing the trie tree

    1. Split the keyword into single characters using the preg_split() function described above. For example, 科学家 splits into the three characters 科、学、家.

    2. After the last character, append a special character `. This character marks the end of a keyword (the pink triangle in the original diagram); it is what identifies a complete keyword (otherwise, we would not know whether matching the two characters 科、学 already counts as a successful match).

    3. Check whether the root has a child node for the first character (科). If it exists, go to step 4; if not, add a node with value 科 under the root.

    4. Check for and add the 学 and 家 nodes in turn.

    5. Add the ` node at the end, and continue with the insertion of the next keyword.

Matching

We then take 这位科学家很了不起 ("this scientist is remarkable") as an example to walk through the match.

    • First we split the sentence into single characters: 这、位、...;

    • Starting from the root, we query the first character 这. No keyword starts with this character, so the character "pointer" moves on until it reaches 科, a character that does have a node under the root;

    • Next we look under the 科 node for the 学 node. When it is found, the matched subtree has reached depth 2, the length of the shortest keyword, so we check whether this node has a ` child: if it does, the match succeeds, the keyword is returned, and the character pointer moves on; if not, we continue looking for the next character under this node;

    • We traverse the sentence this way until the end, and all matching results are returned.
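The construction and matching steps above can be condensed into a small Python sketch (nested dicts stand in for the hash-based child arrays; the backtick end marker follows the article's convention, and, unlike the article's version, this sketch simply tries every start position):

```python
END = "`"  # end-of-keyword marker, as in the article

def build_trie(keywords):
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:                       # step 1: one node per character
            node = node.setdefault(ch, {})  # steps 3-4: reuse or create the child
        node[END] = {}                      # steps 2 and 5: mark the keyword's end
    return root

def match(trie, msg):
    hits = []
    for i in range(len(msg)):               # the character "pointer" over the message
        node, j = trie, i
        while j < len(msg) and msg[j] in node:
            node = node[msg[j]]
            j += 1
            if END in node:                 # end marker found: a keyword matched here
                hits.append(msg[i:j])
    return hits

trie = build_trie(["科学", "科学家", "了不起"])
print(match(trie, "这位科学家很了不起"))
# → ['科学', '科学家', '了不起']
```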

Code

I've put the complete code on GitHub: Trie-github-zhenbianshu; the core parts are shown here.

First comes the design of the tree-node data structure, which is of course also the most critical part:

$node = array(
    'depth' => $depth, // depth, used to determine the number of hits
    'next'  => array(
        $val => $node, // child nodes; PHP's array is a hash table underneath,
                       // which speeds up finding a child node
        ...
    ),
);

Next, inserting child nodes while building the tree:

// the child node is inserted into $node, so it is passed in by reference
private function insert(&$node, $words) {
    if (empty($words)) {
        return;
    }
    $word = array_shift($words);
    // if the child node already exists, keep inserting into it
    if (isset($node['next'][$word])) {
        $this->insert($node['next'][$word], $words);
    } else {
        // if the child node does not exist, construct it first, then insert into it
        $tmp_node = array(
            'depth' => $node['depth'] + 1,
            'next'  => array(),
        );
        $node['next'][$word] = $tmp_node;
        $this->insert($node['next'][$word], $words);
    }
}

Finally, the query operation:

// a global variable could also store the matched characters in place of &$matched
private function query($node, $words, &$matched) {
    $word = array_shift($words);
    if (isset($node['next'][$word])) {
        // the corresponding child node exists, so append it to the result set
        array_push($matched, $word);
        // once the depth reaches the shortest keyword length, check for the end marker
        if ($node['next'][$word]['depth'] > 1 && isset($node['next'][$word]['next']['`'])) {
            return true;
        }
        return $this->query($node['next'][$word], $words, $matched);
    } else {
        $matched = array();
        return false;
    }
}

Results

The result was of course gratifying: with this matching, processing 1,000 messages takes only about 3 seconds. A Java colleague tried the same approach, and Java processed 1,000 messages in 1 second.

Let's analyze why this method is so fast:

    • Regex matching: matching one message against all the keywords takes on the order of key_len * msg_len character comparisons. The regex engine optimizes, of course, but on that baseline the achievable efficiency is limited; as a rough example, 50,000 keywords averaging 5 characters joined into one alternation give a pattern of around 250,000 characters to run against every 50-character message.

    • With the trie, the worst case is msg_len * 9 (最长关键词长度 + 1个特殊字符, i.e. the longest keyword length plus one special character) hash lookups, which only occurs when the message and the longest keyword look like AAAA..., and the probability of that pathological case is obviously tiny.

That's it for optimizing the matching method itself: from matching 10 messages per second to 300, a 30-fold performance improvement, which is huge.

The final level, though not necessarily the ultimate one.

Another Path – Multi-Process

Design

With the matching method optimized, the goal of finishing within 10 minutes was still not reached, so I had to think in other directions.

Whenever we talk about efficiency, 并发 (concurrency) is unavoidable, so the next optimization starts from concurrency. PHP is single-threaded (its multi-threading extensions are poor), so this is not easy to solve with threads; the concurrency direction had to be multi-process.

So how do multiple processes read the same log file? Here are a few options:

    • Add a line counter in each process and pass each process a parameter n; a process handles only the log lines whose line number satisfies line_no % N == n. I'm well practiced at this hack-style reverse distribution, haha. The downside is that every process needs the parameter passed in and allocates memory to read the entire log — not elegant enough.

    • Use the Linux command split -l n file.log output_pre to split the log into files of n lines each, then have multiple processes each read their own files. The drawback is inflexibility: changing the number of processes means re-splitting.

    • Use a Redis list as a queue to stage the logs, and open multiple processes to consume the queue. This requires an extra step of writing the data into Redis, but it scales flexibly and the code is simple and elegant.
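The modulo trick in the first option can be sketched as follows (lines_for_worker is a hypothetical helper in Python; a real worker would count lines while streaming the log rather than hold it all in memory):

```python
def lines_for_worker(lines, n, num_workers):
    # worker n of num_workers handles only lines where line_no % num_workers == n
    return [line for line_no, line in enumerate(lines)
            if line_no % num_workers == n]

logs = ["log %d" % i for i in range(10)]
print(lines_for_worker(logs, 0, 3))  # → ['log 0', 'log 3', 'log 6', 'log 9']
print(lines_for_worker(logs, 1, 3))  # → ['log 1', 'log 4', 'log 7']
```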

In the end, the third approach was used.

Results

This approach has its own bottleneck, which would probably end up at Redis's network IO. I didn't dare launch N processes to challenge the company's Redis; with 10 processes the statistics finished in three to four minutes. Even adding the time to write the data into Redis, it would complete within 10 minutes.

From the start, the product team had pegged the matching speed in hours; when I produced the match results for a new log within 10 minutes and saw the product manager's surprised expression, I felt quietly pleased, haha~

Another path can also carry you farther.

Summarize

There are many ways to solve a problem. I think that before solving all kinds of problems, it pays to learn about many other things, even if you only know what they can be used for. Like a tool rack: stock it with as many tools as possible first, and only then can you pick the most suitable one when a problem arrives. And of course you have to learn to use those tools skillfully, so they can also be used to crack some strange problems.

工欲善其事，必先利其器 — to do good work, one must first sharpen one's tools. To solve performance problems, mastering system-level approaches is not enough; sometimes a change of data structure or algorithm works even better. I feel I'm still a bit weak in this area and will strengthen it slowly; let's encourage each other.
