Questions about algorithms and PHP extensions

Source: Internet
Author: User
This post was last edited by ShadowSniper at 2012-11-27 23:30:22.

I have a requirement that I want to implement as a PHP extension:

Find which words from a word table occur in a given string.

For example, the word table is as follows:
---------------
You
He
Me
They
You guys
We
---------------

Input sentence: You and I are good friends with them.

Requirements:
1. Find the words in the sentence that are contained in the word table.
2. The matching mode is controlled by a parameter:
(1) Longest word takes precedence (if "they" is found, there is no need to also report "he")
(2) Return all matches (both "he" and "they" are reported)
(3) Match short words only (only "he" is reported)
3. The result order is controlled by a parameter:
(1) Return words in word table order
(2) Return words in the order they appear in the sentence
4. Building on requirement 3, a further parameter controls whether the results are returned in forward or reverse order.
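
To make the parameters in requirements 2 to 4 concrete, here is a minimal Python sketch of the selection and ordering step only. It assumes the matches have already been found and are passed in as (table_index, start, word) tuples. The function name select_matches and the parameter names mode, order and reverse are hypothetical, not part of any PHP API, and the English words are stand-ins matched as plain substrings.

```python
def select_matches(matches, mode="all", order="sentence", reverse=False):
    """Filter and order matches given as (table_index, start, word) tuples.

    mode:  "longest"  drops a match whose span lies inside a longer match,
           "shortest" drops a match whose span contains a shorter match,
           "all"      keeps everything.
    order: "table" sorts by word-table index, "sentence" by start position.
    """
    def covers(a, b):
        # True if match a is strictly longer than b and its span contains b's span.
        a_start, a_end = a[1], a[1] + len(a[2])
        b_start, b_end = b[1], b[1] + len(b[2])
        return len(a[2]) > len(b[2]) and a_start <= b_start and b_end <= a_end

    if mode == "longest":
        kept = [m for m in matches if not any(covers(o, m) for o in matches)]
    elif mode == "shortest":
        kept = [m for m in matches if not any(covers(m, o) for o in matches)]
    else:
        kept = list(matches)

    if order == "table":
        kept.sort(key=lambda m: m[0], reverse=reverse)
    else:
        kept.sort(key=lambda m: (m[1], m[0]), reverse=reverse)
    return [m[2] for m in kept]


# Matches for the sentence "they met you" against the table
# ["you", "he", "me", "they", "you guys", "we"] (substring matching).
matches = [(0, 9, "you"), (1, 1, "he"), (2, 5, "me"), (3, 0, "they")]
print(select_matches(matches, mode="longest", order="sentence"))
print(select_matches(matches, mode="all", order="table", reverse=True))
```

The pairwise containment check is quadratic in the number of matches, which is fine for one sentence but is a simplification that a C implementation might replace.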


Question 1: Which algorithm would be most efficient?
If the words are returned in word table order, I can scan the word table entry by entry and use the KMP algorithm to look for each word in the sentence. This case is fairly easy.
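
As an illustration of the table-order scan described above, here is a short Python sketch of KMP search applied to each table entry in turn. The helper names build_failure, kmp_find and scan_table are made up for this example, and the English words are stand-ins matched as plain substrings. An extension would do the same thing in C.

```python
def build_failure(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    failure = build_failure(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


def scan_table(word_table, sentence):
    """Return the table words that occur in sentence, in word-table order."""
    return [word for word in word_table if kmp_find(sentence, word) != -1]


table = ["you", "he", "me", "they", "you guys", "we"]
print(scan_table(table, "they met you"))
```

Each lookup is linear in the sentence length plus the word length, so the whole scan costs roughly the table size times the sentence length.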

If the results are returned in the order the words appear in the sentence, I would first search in word table order, record the position at which each word is found in the sentence, and then sort the words by that position.
But this adds a sorting step. How can I avoid the sort? My first thought was to run a standard word segmenter, split the sentence into words, and then look each one up in the word table. The problem is that standard segmentation would probably rely on a third-party tool, and the words it produces will not necessarily agree with the entries in our word table, because our word table is entered by hand. For example, the word table may contain a single entry such as "Silk", while the segmenter splits that same text into two words. In that case the entry is never found.
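
For reference, the record-and-sort approach described above might look like the following Python sketch. The function name find_in_sentence_order is hypothetical, Python's built-in str.find stands in for the KMP search, and only the first occurrence of each word is recorded. The sort runs over the words that were actually found, not over the whole table, so it is usually a small part of the total cost.

```python
def find_in_sentence_order(word_table, sentence):
    """Scan in word-table order, record positions, then sort by position."""
    found = []
    for table_index, word in enumerate(word_table):
        pos = sentence.find(word)  # stand-in for a KMP search
        if pos != -1:
            # The position comes first so that a plain sort orders by it;
            # table_index breaks ties between words starting at the same place.
            found.append((pos, table_index, word))
    found.sort()
    return [word for _, _, word in found]


table = ["you", "he", "me", "they", "you guys", "we"]
print(find_in_sentence_order(table, "they met you"))
```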

I do not know of any better method. Please advise.


Question 2: The word table currently has about 20,000 entries, which is not very large, so I want to keep all of it in memory. Should I allocate that memory with the emalloc provided by Zend, with C's native malloc, or use a third-party in-memory database?

The advantage of emalloc is that PHP's memory manager takes care of the memory for me. The disadvantage is that it takes up a large share of PHP's memory.
I am not really clear on the advantages of malloc. It seems that it would also take up PHP's memory, and the memory is harder to manage, so it might cause PHP to run out of memory.
The advantage of a third-party in-memory database is that it does not take up PHP's memory, but it may be slightly less efficient. With Redis, for example, the extension itself has to open and close the connection to Redis.
------ Solution --------------------
I guess it's a php-fpm process once?
------ Solution --------------------
A small-scale application like this can be implemented directly in PHP code.
Test data from one PHP implementation:
The dictionary contains 52,938 words (modern Chinese common word table, en.txt)
The file to be matched is 19,415 bytes long
2,300 words matched
Elapsed time: 146.212 milliseconds
Dictionary loading takes 146.172 milliseconds
Speed: 15,730 words/second
The algorithm uses a single-array trie.

Although not very fast, it is acceptable.
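
The reply does not show its code, so the following is only a rough Python sketch of the general trie idea, using nested dictionaries instead of the single-array layout mentioned above. The names build_trie, find_words and END are invented for this example, and the English words are stand-ins matched as plain substrings. Because the sentence is scanned from left to right, the matches come out in sentence order without a separate sort and without a third-party segmenter.

```python
END = ""  # sentinel key marking the end of a word; never equals a single character


def build_trie(words):
    """Build a nested-dict trie; node[END] holds the complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def find_words(trie, sentence):
    """Return (start, word) for every table word in sentence, in sentence order."""
    hits = []
    for start in range(len(sentence)):
        node = trie
        i = start
        # Walk down the trie for as long as the sentence keeps matching.
        while i < len(sentence) and sentence[i] in node:
            node = node[sentence[i]]
            i += 1
            if END in node:
                hits.append((start, node[END]))
    return hits


trie = build_trie(["you", "he", "me", "they", "you guys", "we"])
print(find_words(trie, "they met you"))
```

From each start position the walk goes no deeper than the longest table word, so the cost depends on the sentence length and the maximum word length rather than on the number of table entries.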

Of course, writing an extension is better, and it should be faster.
Since an extension is a dynamic link library, its memory stays resident once it is loaded, so your second question is not really an issue.
The development process for an application like yours is roughly as follows:
1. Implement all functional modules in C/C++ as standalone software with its own I/O
2. Write the PHP extension functions that call the interfaces you want to expose
