Develop full-text indexes in a Linux environment using Python

Last Update:2017-08-06 Source: Internet

Author: User


As the amount of information increases, it becomes increasingly important to locate specific information efficiently. This column discusses the field of full-text indexing and focuses on the author's public domain indexer module.

This column explores my Python project, the indexer module, and it has a special purpose: like you, I am always trying to learn more, so this column welcomes your comments and ideas. Your contributions may appear in the project or in future columns. In general, I hope this column will reflect the interests and knowledge of its readers, not just my own. Let's get started.

I hope that the indexer module will prove useful to readers even in its early version. This full-text indexer can be used as a standalone utility or as a module in a larger project. Its design illustrates principles of reusable object-oriented coding as well as the fundamentals of text indexing (a surprisingly subtle and rich topic). Although Knuth warned us that "premature optimization is the root of all evil", the purpose of an index is to find information quickly, so this column will also discuss performance issues.

The indexer module grew out of a university's need for a good way to search a large number of text and HTML help documents. A smaller motive of my own was the wish to make use of the letters, news, and archives I have accumulated over the years. Quite simply, indexer lets users locate documents that are difficult or impossible to specify with regular expressions, and to do so quickly. Although some commercial and free tools do similar work, most of them are aimed at Web indexing. They need a CGI interface (even if only through LOCALHOST) and are difficult to install and use, and only one of them (with a different focus) was written for Python. indexer, on the other hand, is designed to be easy to use. Of course, some of the older and more complex tools are more powerful, but indexer is designed so that its features can grow without losing that ease of use.

About search engines

What this column calls a "full-text indexer" belongs to a broader category: the "search engine". For most users, a search engine is something used to locate URLs on the WWW. Indeed, the WWW is surely the largest public document repository in human history, and its informal organization makes a good search engine essential. But other document collections, especially the files on local hard disks, also benefit from search engines. Hierarchical file systems and file naming conventions are good, but they only go so far. Sometimes you simply need to find the documents that contain certain information.

Half the problem for Internet search engines is locating the documents whose content is to be indexed. Although there are many ways to find a great number of related URLs, there is no algorithm that lists every valid URL. Fortunately, when indexing local documents (as the current version of indexer does), finding all the documents is easy, because they all live in known, enumerable locations. Even when you want to index some directory subtrees and not others, the document list can be exact and complete.
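A minimal sketch of how local documents might be enumerated, assuming the standard os.walk function. The function name list_documents and its exclude_dirs and extensions parameters are hypothetical, chosen only for illustration; they are not taken from the indexer module.

```python
import os

def list_documents(root, exclude_dirs=(), extensions=(".txt", ".html")):
    """Return a sorted list of document paths found under root.

    exclude_dirs and extensions are hypothetical parameters for this sketch:
    directory names to skip entirely, and file suffixes to keep.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded subtrees in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for name in filenames:
            if name.lower().endswith(tuple(extensions)):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Calling list_documents("/some/root", exclude_dirs=("cache",)) would list every .txt and .html file under the root except those below any directory named cache. Pruning by directory name is a simplification; a real tool might match full paths instead.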

There are two different strategies for designing a local search engine. You can examine the actual content of each file at search time to determine whether it matches the search criteria, or you can prepare a database describing the content of each file and then search the database rather than the files themselves. The advantage of the first approach is that it is always accurate and always searches exactly where, and for exactly what, you want. The big disadvantage of this ad hoc approach is that it is extremely slow, and it becomes costly if you perform many searches.

The advantage of the second approach is that, if implemented well, it is much faster. A single indexing pass produces a summary of the searchable features of each document, so subsequent searches never need to read the documents again. This makes each search much cheaper. The disadvantages are that the database can fall out of sync with the file contents, so it needs periodic re-indexing, and that it occupies additional space (from 1% to 100% of the size of the indexed text, depending on the search features and design choices).
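The two strategies can be contrasted with a small sketch. It assumes the documents are already held in a dictionary mapping names to text (a stand-in for files on disk), and the function names are hypothetical; this is not the indexer module's actual design, only one simple way to build a word-to-document mapping.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def scan_search(docs, word):
    """Ad hoc strategy: reread every document on every search."""
    word = word.lower()
    return sorted(name for name, text in docs.items() if word in tokenize(text))

def build_index(docs):
    """Database strategy: one pass that maps word -> {document name: count}."""
    index = defaultdict(dict)
    for name, text in docs.items():
        for w in tokenize(text):
            index[w][name] = index[w].get(name, 0) + 1
    return index

def index_search(index, word):
    """Look a word up in the prebuilt index without touching the documents."""
    return sorted(index.get(word.lower(), {}))
```

Both searches return the same document names, but index_search is a single dictionary lookup, while scan_search tokenizes every document again. The cost is that the index must be rebuilt whenever the documents change.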

Examples of the ad hoc approach include "File Find" in Windows, the find and grep tools on Unix-like operating systems (and kfind in KDE), "PMSeek.exe" and "Find Object" in OS/2, and the "Finder" in MacOS 7. Examples of the database approach include "Fast Find" in Microsoft Office, "QuickFinder" in Corel Office, "Sherlock" in MacOS 8+, and the locate utility on Linux. The BeOS "Find" combines the two approaches, but it is very limited and is not a full-text search. Other operating systems provide similar utilities.

There are many different methods to specify the content to be searched. The following are examples:

Word frequency indicates how often a set of search words occurs in a document. The assumption is that, for a given search, documents in which the search terms occur more frequently are "better" matches.

Boolean search allows complex relationships between the occurrences of words and phrases. For example, in "(spam AND eggs) OR (ham AND cheese)", either parenthesized combination satisfies the query without the document having to contain the words from the other.

Regular expression search matches (possibly complex) patterns. This method is better suited to finding highly structured data than to identifying conceptual content.

Phrase search is simply a search that allows multi-word terms. Regular expression search can do the same job, but so can some simpler systems.

Proximity search looks for a set of words or phrases that occur "close" to one another. How close is usually a search option.

Stem search sometimes uses word stems rather than whole words. It is sometimes very effective to treat "run", "runner", "running", and "runs" as related words and find them all, instead of trying to match each word separately.

Concept search finds documents on similar themes by recognizing words with similar meanings. This type of search requires that some kind of thesaurus be integrated into the search engine.

Phonetic search allows for irregular spellings, especially in English. Rather than using the spelling found in the text, the search converts each word into a canonical form based on its pronunciation, and then compares the converted text with the converted search terms.
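Several of the search styles listed above can be sketched in a few lines of Python. Everything here is illustrative: the function names are hypothetical, the documents are assumed to be held in memory as a dictionary of name to text, and crude_stem is a deliberately naive suffix stripper rather than a real stemming algorithm.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_counts(docs):
    """Map each document name to a Counter of the words it contains."""
    return {name: Counter(tokenize(text)) for name, text in docs.items()}

def docs_with(counts, word):
    """Set of document names that contain the given word."""
    return {name for name, c in counts.items() if word in c}

def boolean_example(counts):
    """(spam AND eggs) OR (ham AND cheese) as set intersection and union."""
    return (docs_with(counts, "spam") & docs_with(counts, "eggs")) | (
        docs_with(counts, "ham") & docs_with(counts, "cheese"))

def rank_by_frequency(counts, words):
    """Order matching documents by the relative frequency of the search words."""
    scores = {}
    for name, c in counts.items():
        hits = sum(c[w] for w in words)  # Counter returns 0 for absent words
        if hits:
            scores[name] = hits / sum(c.values())
    return sorted(scores, key=lambda n: (-scores[n], n))

def within(text, a, b, max_gap=3):
    """Proximity: True if words a and b occur at most max_gap words apart."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

def crude_stem(word):
    """Naive suffix stripping, for illustration only (not a real stemmer)."""
    for suffix in ("ning", "ner", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Representing each term's matches as a set makes Boolean queries a matter of intersection and union. With crude_stem, the words "run", "runner", "running", and "runs" all reduce to "run", so indexing stems instead of whole words would let one query find all four; a production system would use an established stemming algorithm instead.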

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
