I have a list of non-everyday English vocabulary, and I need to find which of these words occur most frequently in English articles.
My first thought was to loop over the word list and use substr_count() to count each word's occurrences in turn, but that scans the whole article once per word. Alternatively, I could split the article into words and use the array functions to compute an intersection, but that still doesn't feel ideal.
Any ideas? The application is essentially keyword extraction.
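For reference, a minimal sketch of the substr_count() idea described above; the file name and word list are placeholders, and note that substr_count() matches raw substrings, so it can overcount (e.g. "art" inside "particular"):

$words = array('ubiquitous', 'ephemeral', 'paradigm');   // placeholder word list
$text  = strtolower(file_get_contents('article.txt'));   // placeholder file name

$counts = array();
foreach ($words as $w) {
    // one full scan of the article per word
    $counts[$w] = substr_count($text, strtolower($w));
}
arsort($counts);   // most frequent first
print_r($counts);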
Replies (solutions)
Why would splitting be a problem? Splitting English text into an array of words is very convenient, certainly simpler than Chinese.
I don't really understand your requirement either; for pure frequency counting, array_count_values() is convenient enough.
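For example, a minimal sketch of that idea (splitting on non-letter characters is my own assumption about tokenisation):

$text   = 'The quick brown fox jumps over the lazy dog, and the fox runs on.';
$tokens = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
$freq   = array_count_values($tokens);   // one pass: counts and deduplicates
arsort($freq);
print_r($freq);   // e.g. the => 3, fox => 2, ...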
So you already have a vocabulary list, and now you need to count how many times each stored word occurs in the article.
If so, you can use the trie algorithm (which I have posted before).
Then you only need to scan the article once; of course, you first build the trie from the vocabulary list.
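The Ttrie class itself isn't reproduced in this thread, so here is only a rough, self-contained sketch of the single-scan idea (a hand-rolled trie with made-up function names, ASCII-only, not the posted Ttrie code):

function trie_build(array $words) {
    $root = array();
    foreach ($words as $w) {
        $node = &$root;
        foreach (str_split(strtolower($w)) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = array();
            }
            $node = &$node[$ch];
        }
        $node['#'] = $w;      // end-of-word marker stores the original word
        unset($node);         // break the reference before the next word
    }
    return $root;
}

function trie_count(array $trie, $text) {
    $text   = strtolower($text);
    $len    = strlen($text);
    $counts = array();
    // single left-to-right pass: at each position, walk down the trie
    for ($i = 0; $i < $len; $i++) {
        $node = $trie;
        for ($j = $i; $j < $len && isset($node[$text[$j]]); $j++) {
            $node = $node[$text[$j]];
            if (isset($node['#'])) {
                $w = $node['#'];
                $counts[$w] = isset($counts[$w]) ? $counts[$w] + 1 : 1;
            }
        }
    }
    return $counts;
}

$trie = trie_build(array('fox', 'ox', 'dog'));
print_r(trie_count($trie, 'The quick brown fox jumps over the lazy dog'));
// fox => 1, ox => 1, dog => 1  (substring matches; add word-boundary checks if needed)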
What is the best way to store the vocabulary list: MySQL, JSON, XML, or a plain PHP array?
If the article is about 5KB and the vocabulary list has 1000 words, and I foreach over those 1000 words one by one, matching each against the article, which of these is more efficient and lighter on resources (CPU, RAM):
mysql_query()
json_decode()
simplexml_load_file()
a plain PHP array?
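I can't benchmark it for you, but as a rough sketch of what each option looks like (file names and the XML layout are placeholders), the usual intuition is that a plain PHP array in an included file is parsed once by PHP (and cached by an opcode cache), whereas json_decode() and simplexml_load_file() re-parse the text on every request, and MySQL adds a query round trip; the matching loop afterwards is the same either way:

// Option 1: plain PHP array, e.g. wordlist.php containing  <?php return array('ubiquitous', ...);
$words = include 'wordlist.php';

// Option 2: JSON file, decoded on every request
$words = json_decode(file_get_contents('wordlist.json'), true);

// Option 3: XML file, parsed on every request (assumes <words><word>...</word></words>)
$words = array();
foreach (simplexml_load_file('wordlist.xml')->word as $w) {
    $words[] = (string) $w;
}

// Option 4 (MySQL via the old mysql_query()/mysql_fetch_assoc() API, removed in PHP 7)
// behaves the same once the rows are fetched, plus the cost of the query itself.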
A 5KB article is unlikely to contain 1000 distinct words; is every word in the article on your list?
Even if it were 1000, that is not a large amount, and after removing duplicates it should be much smaller; a single array intersection is enough.
My line of thought: split the article into a word array, then array_count_values() does both the counting and the deduplication.
Then keep only the words that occur often enough (too few occurrences probably don't qualify as meaningful keywords anyway); very few remain, and intersecting those with the existing vocabulary list is enough, as sketched below.
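A minimal sketch of that pipeline, with an arbitrary threshold and placeholder names:

$text     = file_get_contents('article.txt');                // placeholder file name
$wordlist = array('ubiquitous', 'ephemeral', 'paradigm');    // the existing vocabulary list

// 1. split the article into words
$tokens = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

// 2. count occurrences (this also deduplicates)
$freq = array_count_values($tokens);

// 3. drop words that occur too rarely to count as keywords (threshold is arbitrary)
$freq = array_filter($freq, function ($n) { return $n >= 2; });

// 4. keep only words that are on the existing list
$keywords = array_intersect_key($freq, array_flip($wordlist));
arsort($keywords);
print_r($keywords);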
Although the OP is indeed talking about English vocabulary, your algorithm only works for English words, which makes it rather pointless.
What you're saying makes sense.
But I prefer to keep a simple problem simple: since he said English, I thought about it that way; there's no need to spend too much time on the algorithm.
If he had said mixed languages, I probably just wouldn't have replied to this thread, huh?
Quoting Snmr_com's reply on the 4th floor (see above).
I didn't understand the prefix-tree (trie) version, so for now I've chosen to scan the article several times.
A simple example
include 'ttrie.php';

class Wordkey extends Ttrie {
    // callback fired for every matched dictionary word;
    // here the matched text is simply put back unchanged
    function B() {
        $t = array_pop($this->buffer);
        $this->buffer[] = "$t";
    }
}

$p = new Wordkey;
$p->set('Qin Shihuang', 'B');
$p->set('Luoyang', 'B');
$t = $p->match('Qin Shihuang east patrol Luoyang');
echo join('', $t);   // reassemble the matched segments
Output: Qin Shihuang east patrol Luoyang
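Going only by the set()/match()/callback behaviour visible in this snippet (so the details of Ttrie here are my assumption, not its documented API), the same mechanism could in principle count English keywords in a single scan, along these lines:

include 'ttrie.php';

// Assumption: the named callback fires once per matched dictionary word,
// with the matched text on top of $this->buffer, as in the example above.
class WordCounter extends Ttrie {
    public $counts = array();
    function hit() {
        $t = array_pop($this->buffer);
        $this->buffer[] = $t;   // put the matched text back unchanged
        $this->counts[$t] = isset($this->counts[$t]) ? $this->counts[$t] + 1 : 1;
    }
}

$p = new WordCounter;
foreach (array('ubiquitous', 'ephemeral', 'paradigm') as $w) {   // placeholder word list
    $p->set($w, 'hit');
}
$p->match(file_get_contents('article.txt'));   // one scan of the article
arsort($p->counts);
print_r($p->counts);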