Calculate the number of times a word appears

I have a list of uncommon English words and need to count how many times each word occurs; in other words, I need to find the most frequently used words in an English article.
My first thought was to loop over the word array and count each word's occurrences with substr_count, but that rescans the entire article once per word. Another option is to split the article into words and use the array functions to count the intersection, but that still does not feel ideal.

Does anyone have a better idea? The actual application is keyword extraction.
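
For reference, a minimal sketch of the substr_count approach described above (the file name and word list are placeholders; note that it rescans the article once per word and counts raw substrings, so "cat" is also counted inside "category"):

$article = file_get_contents('article.txt');              // placeholder input file
$words   = array('keyword', 'extraction', 'frequency');   // the word list to check

$counts = array();
foreach ($words as $w) {
    // substr_count makes one full pass over the article for every word
    $counts[$w] = substr_count(strtolower($article), strtolower($w));
}
arsort($counts);   // most frequent first
print_r($counts);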


Reply to discussion (solution)

What is difficult about splitting into an array? English is easy to split into an array; at least it is much simpler than Chinese.
In fact, simply counting with array_count_values is convenient.
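
A minimal sketch of that suggestion, assuming simple tokenisation with str_word_count (the sample sentence is just an illustration):

$article = 'The quick brown fox jumps over the lazy dog; the fox wins.';
$words   = str_word_count(strtolower($article), 1);   // split the text into an array of words
$counts  = array_count_values($words);                // word => number of occurrences
arsort($counts);
print_r($counts);   // e.g. 'the' => 3, 'fox' => 2, ...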

So you already have a dictionary, and now you need to count how many times the dictionary words appear in the article?
If so, you can use the trie algorithm (I have posted an implementation before).
You only need to scan the article once; of course, you must build the dictionary trie first.
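
The TTrie class itself is not shown in this thread, so here is a rough standalone sketch of the idea: build a trie from the dictionary, then make one pass over the article, descending the trie from each position and tallying every dictionary word found (word boundaries are deliberately ignored to keep the sketch short):

function build_trie(array $words) {
    $trie = array();
    foreach ($words as $w) {
        $node = &$trie;
        foreach (str_split(strtolower($w)) as $ch) {
            if (!isset($node[$ch])) $node[$ch] = array();
            $node = &$node[$ch];
        }
        $node['#'] = $w;   // marks the end of a dictionary word
        unset($node);
    }
    return $trie;
}

function count_with_trie(array $trie, $text) {
    $text   = strtolower($text);
    $len    = strlen($text);
    $counts = array();
    for ($i = 0; $i < $len; $i++) {
        // follow the trie as far as the text allows from position $i
        $node = $trie;
        for ($j = $i; $j < $len && isset($node[$text[$j]]); $j++) {
            $node = $node[$text[$j]];
            if (isset($node['#'])) {
                $w = $node['#'];
                $counts[$w] = isset($counts[$w]) ? $counts[$w] + 1 : 1;
            }
        }
    }
    return $counts;
}

$dict = array('word', 'count', 'occurrence');
print_r(count_with_trie(build_trie($dict), 'Count each word; every word occurrence gets counted.'));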

In what format is the dictionary best stored: MySQL, JSON, XML, or a plain PHP array?

If the article is about 5 KB and the dictionary contains 1000 words, and the 1000 words are matched one by one inside a foreach loop:

mysql_query()
json_decode()
simplexml_load_file()
a plain PHP array

Which one is more efficient and uses fewer resources (CPU and RAM)?
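
One way to settle this for your own data is simply to measure it; a rough sketch comparing two of the options (the file names are placeholders, and the MySQL and XML variants would be timed the same way):

// time loading the dictionary from JSON vs. from a plain PHP array file
$t0 = microtime(true);
$fromJson = json_decode(file_get_contents('dict.json'), true);   // placeholder file
$jsonTime = microtime(true) - $t0;

$t0 = microtime(true);
$fromArray = include 'dict.php';   // placeholder file that ends with: return array(...);
$arrayTime = microtime(true) - $t0;

printf("json: %.5f s   array: %.5f s   peak memory: %d bytes\n",
       $jsonTime, $arrayTime, memory_get_peak_usage());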

A 5 KB article is unlikely to contain 1000 words; are they all title words?

Even if there are 1000, that is not a large amount, and after removing duplicates there will be even fewer, so an array intersection is enough.

My idea is to split the article into a word array; array_count_values does two jobs at once, counting and de-duplicating.
Then keep only the words that appear at least a certain number of times (words that appear too rarely are probably meaningless?). Very little will be left, and intersecting it with the existing dictionary is enough.
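
A minimal sketch of that pipeline (the threshold of 2, the sample dictionary, and the input file are placeholders):

$article = file_get_contents('article.txt');           // placeholder input
$dict    = array('keyword', 'extraction', 'trie');     // existing dictionary

$words  = str_word_count(strtolower($article), 1);     // split into a word array
$counts = array_count_values($words);                  // counting + de-duplication in one step

// keep only the words that appear often enough
$frequent = array_filter($counts, function ($n) { return $n >= 2; });

// intersect what is left with the existing dictionary
$keywords = array_intersect_key($frequent, array_flip($dict));
arsort($keywords);
print_r($keywords);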

Although the OP only mentions English words, an algorithm that is limited to English words is not very meaningful.


What you said makes sense.
I was just going for a simple solution to his problem: since he says it is English, I approached it that way and did not spend much time thinking about the algorithm.
If he had said it was a mix of languages, I probably would not have replied to this thread at all.

Quoting snmr_com's reply on the 4th floor: a 5 KB article is unlikely to contain 1000 words; are they all title words?

Even if there are 1000, that is not a large amount, and after removing duplicates there will be even fewer, so an array intersection is enough.

My idea is to split the article into a word array; array_count_values does two jobs at once, counting and de-duplicating.
Then keep only the words that appear at least a certain number of times ......

I didn't understand the prefix tree (trie) posted by the moderator, so for the time being I've chosen to scan the article multiple times.

A simple example

include 'ttrie.php';

class wordkey extends TTrie {
    function B() {
        $t = array_pop($this->buffer);   // the matched dictionary word
        $this->buffer[] = "<b>$t</b>";   // wrap it in bold tags
    }
}

$p = new wordkey;
$p->set('Qin Shihuang', 'B');
$p->set('Luoyang', 'B');
$t = $p->match('Qin Shihuang Dongxun Luoyang');
echo join('', $t);

Output (the matched dictionary words come back wrapped in bold tags):
<b>Qin Shihuang</b> Dongxun <b>Luoyang</b>
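
For the counting problem in this thread, the same pattern could presumably be adapted so that the callback tallies matches instead of wrapping them in tags. A sketch under the assumption that TTrie invokes the named method with the matched dictionary word on top of $this->buffer, as the example above suggests (the sample words are placeholders):

include 'ttrie.php';

class wordcount extends TTrie {
    public $counts = array();
    function C() {
        $t = array_pop($this->buffer);   // the matched dictionary word
        $this->buffer[] = $t;            // put it back unchanged
        $this->counts[$t] = isset($this->counts[$t]) ? $this->counts[$t] + 1 : 1;
    }
}

$p = new wordcount;
$p->set('keyword', 'C');
$p->set('extraction', 'C');
$p->match('keyword extraction finds each keyword in a single scan');
print_r($p->counts);   // e.g. 'keyword' => 2, 'extraction' => 1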
