Develop full-text indexes in a Linux environment using Python

Source: Internet
Author: User
As the amount of information grows, locating specific information efficiently becomes increasingly important. This column discusses the field of full-text indexing, with a focus on the author's public-domain indexer module.

This column explores my Python project, the indexer module, and it has a special purpose: like you, I am always trying to learn more, so this column welcomes your comments and ideas. Your contributions will appear in the project or in future columns. In general, I hope this column will reflect the interests and knowledge of its readers, not just my own. Let's get started.

I hope the indexer module will prove useful to readers even in its early version. This full-text indexer can be used as a stand-alone utility or as a module in a larger project. Its design illustrates principles of reusable object-oriented coding as well as the fundamentals of text indexing (an extremely subtle and rich topic). Although Knuth warned us that "premature optimization is the root of all evil", the whole purpose of an index is to find information quickly, so this column will also discuss performance issues.

The indexer module grew out of a university's need for a good way to search a large number of text and HTML help documents. A smaller motive was my own wish to make use of the letters, news, and archives I have accumulated over the years. Quite simply, indexer lets users locate documents with search criteria that are difficult to specify with regular expressions, and it executes those searches quickly. Although some commercial and free tools do similar work, most of them are aimed at Web indexing. They require a CGI interface (even if only through localhost), they are difficult to install and use, and only one such tool is designed for Python (with a different focus). indexer, on the other hand, is designed to be easy to use. Some of the older and more complex tools are certainly more powerful, but indexer is designed so that it can grow without losing its ease of use.

About search engines

What this column calls a "full-text indexer" belongs to a broader category: the "search engine". For most users, a search engine is something used to locate URLs on the WWW. Indeed, the WWW is surely the largest public document collection in human history, and its informal organization makes a good search engine essential. But other document collections, especially the files on local hard disks, also benefit from search engines. Hierarchical file systems and file-naming conventions are helpful, but they only go so far. Sometimes you simply need to find the documents that contain certain information.

Half the problem for Internet search engines is locating the documents whose content is to be indexed. Although there are many ways to find large numbers of related URLs, no algorithm can enumerate every valid URL. Fortunately, when indexing local documents (as the current version of indexer does), finding every document is easy, because they all reside in known, enumerable locations. Even when you want to index some directory subtrees and not others, the document list can be exact and complete.
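To make the point about enumerable locations concrete, here is a minimal sketch of building such a document list with the standard library's os.walk. The function name list_documents, its parameters, and the default extensions are assumptions made for this illustration; they are not taken from the indexer module.

```python
import os


def list_documents(roots, extensions=(".txt", ".html"), skip_dirs=()):
    """Return a sorted list of matching files under the given root directories.

    `roots`, `extensions`, and `skip_dirs` are hypothetical parameters chosen
    for this sketch.
    """
    suffixes = tuple(ext.lower() for ext in extensions)
    skip = {os.path.abspath(d) for d in skip_dirs}
    found = []
    for root in roots:
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune excluded subtrees in place so os.walk never descends into them.
            dirnames[:] = [
                d for d in dirnames
                if os.path.abspath(os.path.join(dirpath, d)) not in skip
            ]
            for name in filenames:
                if name.lower().endswith(suffixes):
                    found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Because the walk visits every directory that is not explicitly pruned, the resulting list is complete for the chosen subtrees, which is exactly what a crawler on the open Web cannot guarantee.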

There are two different strategies for designing a local search engine. You can examine the actual content of each file at search time to determine whether it matches the search criteria, or you can prepare a database describing the content of each file and then search the database rather than the files themselves. The first approach has the advantage of always being accurate and of always searching exactly where, and for exactly what, you specify. The biggest disadvantage of this ad hoc approach is that it can be extremely slow, and it becomes expensive if you perform many searches.
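The first strategy can be sketched in a few lines. The function below is a hypothetical helper (not part of indexer) that reads every file each time it is called, much as grep does, and reports the files containing a given string.

```python
def adhoc_search(paths, word):
    """Return the paths whose current contents contain `word` (case-insensitive).

    Every call re-reads every file, so results are always up to date,
    but the cost grows with the total size of the files on each search.
    """
    needle = word.lower()
    matches = []
    for path in paths:
        try:
            with open(path, encoding="utf-8", errors="replace") as handle:
                if any(needle in line.lower() for line in handle):
                    matches.append(path)
        except OSError:
            continue  # Unreadable or missing file: skip it.
    return matches
```

Note that this sketch does a plain substring test, so a search for "run" also matches "running"; a real tool would offer whole-word or pattern matching as options.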

The advantage of the second approach is that it is much faster, provided it is implemented well. A single indexing pass produces a summary of the searchable features of each document, so subsequent searches never need to read the documents again. This greatly reduces the cost of searching. The disadvantages are that the database can fall out of sync with the file contents, so it requires periodic re-indexing, and that it occupies additional space (from 1% to 100% of the size of the indexed text, depending on the search features and design choices).
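A minimal sketch of the database strategy is an inverted index: a mapping from each word to the documents that contain it. The names build_index and lookup are assumptions for this illustration, and documents are passed in as strings to keep the example self-contained; this is not a description of how indexer stores its data.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[a-z0-9]+")


def build_index(documents):
    """Build a word -> set-of-document-names mapping.

    `documents` maps a document name to its text. Each text is read once,
    here, and never again during searches.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in WORD_RE.findall(text.lower()):
            index[word].add(name)
    return dict(index)


def lookup(index, word):
    """Return the sorted names of documents containing `word`."""
    return sorted(index.get(word.lower(), ()))


docs = {"a.txt": "Spam and eggs", "b.txt": "ham and cheese"}
index = build_index(docs)
print(lookup(index, "and"))   # ['a.txt', 'b.txt']
print(lookup(index, "spam"))  # ['a.txt']
```

A lookup is now a single dictionary access, regardless of how much text was indexed. The trade-offs described above are also visible: if a.txt changes after build_index runs, the index keeps answering from the old contents until it is rebuilt, and the index itself takes up memory or disk space.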

Examples of the ad hoc approach include "File Find" in Windows, the find and grep tools on Unix-like operating systems (and kfind in KDE), PMSeek.exe and "Find Object" in OS/2, and the "Finder" in MacOS 7. Database approaches include "Fast Find" in Microsoft Office, "QuickFinder" in Corel Office, "Sherlock" in MacOS 8+, and the locate utility on Linux. The BeOS "Find" combines the two approaches, but its function is very limited: it does not perform full-text search. Other operating systems provide similar utilities.

There are many different methods to specify the content to be searched. The following are examples:

Word-frequency searches measure how often a set of search words occurs in a document. The assumption is that, for a given search, a document in which the search terms occur more frequently is a "better" match.

Boolean searches allow complex relationships between the occurrences of words and phrases. For example, "(spam AND eggs) OR (ham AND cheese)" is satisfied by either parenthesized combination, without requiring any words from the other one.

Regular expression searches match patterns, which can be arbitrarily complex. This method is better suited to finding highly structured data than to identifying conceptual content.

Phrase searches simply allow multi-word search terms. A regular expression search can perform the same searches, but so can simpler systems.

Proximity searches look for a set of words or phrases that occur "close" to one another. How close is usually a search option.

Word-stem searches match on the stem of a word rather than the whole word. It is sometimes very effective to treat "run", "runner", "running", and "runs" as related words and find them all, rather than searching for each matching word separately.

Concept searches identify words with similar meanings in order to find documents on similar themes. This type of search requires a thesaurus of some kind to be integrated into the search engine.

Phonetic searches tolerate irregular spelling, which is especially common in English. Rather than using a word as it is spelled in the text, the search converts each word into a canonical form based on its pronunciation, and then compares the converted text with the converted search terms.
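Several of the search types listed above can be sketched on top of one small data structure: an index that records the position of every word in every document. The code below is an illustration under that assumption, with hypothetical function names; it covers word-frequency, Boolean, phrase, and proximity searches, and it does not attempt stemming, concept, or phonetic matching.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positions(documents):
    """Build word -> {document name -> list of word positions}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(name, []).append(pos)
    return index


def docs_with(index, word):
    """Set of documents containing `word`; the building block for Boolean search."""
    return set(index.get(word.lower(), {}))


def rank_by_frequency(index, words):
    """Word-frequency search: order documents by total occurrences of the words."""
    scores = defaultdict(int)
    for word in words:
        for name, positions in index.get(word.lower(), {}).items():
            scores[name] += len(positions)
    return sorted(scores, key=lambda name: (-scores[name], name))


def phrase(index, words):
    """Phrase search: documents where the words occur at consecutive positions."""
    words = [w.lower() for w in words]
    hits = set()
    if not words:
        return hits
    for name, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(name, ())
                   for i, w in enumerate(words)):
                hits.add(name)
                break
    return hits


def near(index, first, second, max_gap=3):
    """Proximity search: documents where the two words are within max_gap positions."""
    a = index.get(first.lower(), {})
    b = index.get(second.lower(), {})
    return {name for name in a.keys() & b.keys()
            if any(abs(p - q) <= max_gap for p in a[name] for q in b[name])}


DOCS = {
    "breakfast": "spam and eggs and more spam",
    "lunch": "ham and cheese sandwich",
    "mixed": "eggs with cheese",
}
idx = build_positions(DOCS)

# Boolean: (spam AND eggs) OR (ham AND cheese), expressed with set operations.
boolean_hits = ((docs_with(idx, "spam") & docs_with(idx, "eggs"))
                | (docs_with(idx, "ham") & docs_with(idx, "cheese")))
print(sorted(boolean_hits))                         # ['breakfast', 'lunch']
print(rank_by_frequency(idx, ["spam", "eggs"]))     # ['breakfast', 'mixed']
print(sorted(phrase(idx, ["ham", "and", "cheese"])))  # ['lunch']
print(sorted(near(idx, "eggs", "cheese", max_gap=2)))  # ['mixed']
```

Storing positions costs more space than storing document names alone, which is one reason the size of an index depends so heavily on which search features it supports.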