Developing full-text indexes in Python on Linux

As the amount of available information grows, efficiently locating specific information becomes ever more important. This article discusses the field of full-text indexing and focuses on the author's public-domain indexer module.
  
This article explores my Python project, the indexer module, and it has a special purpose: I always try to learn as much from you as you learn from me, so this column welcomes your comments and ideas. Your contributions may appear in the project or in future columns. In general, I hope this column will reflect readers' interests and knowledge, not just my own. Let's get started.
  
I hope the indexer module will prove useful to readers even in this early version. This full-text indexer can be used as a stand-alone utility or as a module in a larger project. Its design illustrates principles of reusable object-oriented coding as well as the fundamentals of text indexing (an extremely subtle and rich topic). Although Knuth advised us that "premature optimization is the root of all evil", the whole point of an index is to find information quickly, so this column will also discuss performance issues.
  
The indexer module grew out of a need to find a good way to search a large collection of text and HTML help documents, plus a smaller wish to search the letters, news postings, and archives I have accumulated over the years. Its use is very simple: indexer lets users locate documents by specifying search words, and it executes those searches quickly. Some commercial packages and free tools do similar work, but most of them are aimed at Web indexing; they require a CGI interface (even if only via localhost) and are difficult to install and use. Only one that I know of is written in Python (and it has a different focus). indexer, on the other hand, is designed above all to be easy to use. Some of the older, more complex packages are certainly more powerful, but indexer is designed to grow in capability without losing its ease of use.
  
   About search engines
The category "full-text indexer" belongs to a broader one: "search engine". For most users, a search engine is something used to locate URLs on the World Wide Web. Indeed, the WWW must be the largest public document store in human history, and its informal organization makes a good search engine indispensable. But other document collections, especially files on a local hard disk, also benefit from search engines. Hierarchical file systems and file-naming conventions help, but they only go so far; sometimes you just need to find the documents that contain certain information.
  
Half the problem for an Internet search engine is locating the documents whose contents are to be indexed. There are many ways to find candidate URLs, but no algorithm enumerates every valid one. Fortunately, when indexing local documents (as the current version of indexer does), finding all the documents is easy, since they live in clearly specified locations. When you want to index some directory subtrees but not others, the document list can be exact, with nothing omitted.
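Enumerating every local document under some directory subtrees amounts to a directory walk. A minimal sketch (the root path and the extension filter are my own illustrative choices, not indexer's actual configuration):

```python
import os

def find_files(root, extensions=(".txt", ".html")):
    """Yield the path of every indexable file under a directory subtree."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)

# Example: list everything indexable under the current directory
for path in find_files("."):
    print(path)
```

Because the walk is rooted at explicit directories, the resulting document list is precise, unlike URL discovery on the Web.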
  
There are two distinct strategies for designing a local search engine. You can read the actual contents of each file at search time to determine whether it matches the search criteria, or you can prepare a database summarizing the contents of each file and then search the database rather than the files themselves. The advantage of the first approach is that it is always accurate: you always find exactly what you are looking for, exactly where it is now. The big disadvantage of this ad hoc approach is that it is extremely slow, and costly if you perform many searches.
  
The advantage of the second approach is that, implemented properly, it is much faster. One indexing pass collects the searchable features of each document, so subsequent searches never need to read the documents again; this makes each search cheap. The disadvantages are that the database can fall out of sync with the file contents, requiring periodic re-indexing, and that it occupies extra space (from 1% to 100% of the size of the indexed text, depending on the search features and design choices).
  
Examples of the ad hoc approach include "File Find" in Windows, the find and grep utilities on Unix-like systems (and kfind under KDE), PMSeek.exe and "Find Object" in OS/2, and "Find" in the MacOS 7 Finder. Database-backed approaches include "Fast Find" in Microsoft Office, "QuickFinder" in Corel Office, "Sherlock" in MacOS 8+, and the locate utility on Linux. The BeOS "Find" is a hybrid of the two approaches, but its capability is very limited: it does not do full-text search. Other operating systems provide similar utilities.

There are many different methods to specify the content to be searched. The following are examples:
  
Word-frequency searches rank documents by how often a series of words appears in each document, on the assumption that, for a given search, a document in which the search terms occur more frequently is a "better" match.
  
Boolean searches allow complex relationships among words and phrases. For example, in "(spam AND eggs) OR (ham AND cheese)", either parenthesized pair can satisfy the query on its own, without any word from the other pair needing to appear.
  
Regular-expression searches match (possibly very complex) patterns. This method is better suited to finding highly structured data than to identifying conceptual content.
  
Phrase searches simply allow searching for multi-word terms. A regular-expression search could accomplish the same thing, but a phrase search does it within a much simpler system.
  
Proximity searches look for a series of words or phrases that occur "close" to each other; how close is usually a search option.
  
Word-stem searches sometimes match a word's root rather than the whole word. Treating "run", "runner", "running", and "runs" as related words and finding all of them is sometimes more useful than searching separately for each word that meets the criteria.
  
Conceptual searches identify words with similar meanings in order to find documents on similar topics. This type of search requires integrating some kind of thesaurus into the search engine.
  
Phonetic ("soundex"-style) searches cope with irregular spelling, a particular problem in English. Rather than using the spelling found in the text, words are transformed into a canonical spelling based on their pronunciation; the transformed text is then compared against the similarly transformed search terms.
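Several of the strategies above can be layered on the same word-occurrence data. As an illustration of frequency-of-occurrence ranking, here is a toy sketch over a hypothetical word-to-counts mapping (the data layout and names are my own, not indexer's):

```python
# Hypothetical per-file word counts: word -> {fileid: count}
occurrences = {
    "linux":  {1: 3, 2: 10},
    "kernel": {1: 1, 2: 2, 3: 7},
}

def rank(words):
    """Rank fileids by total occurrences of the query words, highest first."""
    scores = {}
    for w in words:
        for fileid, count in occurrences.get(w, {}).items():
            scores[fileid] = scores.get(fileid, 0) + count
    return sorted(scores, key=scores.get, reverse=True)

print(rank(["linux", "kernel"]))  # file 2 (12 hits) before 3 (7) before 1 (4)
```

A real engine would normalize these scores by document length, as indexer optionally does, so long documents do not win simply by being long.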
  
   About indexer
indexer works from a database of word occurrences. Version 0.1x (alpha/beta) can only search for documents containing every whole word in the search; as an option, it can rank matching documents by the frequency of the search words, relative to document length. indexer could be extended in a number of directions; some extensions are logical and straightforward, others more complicated.
  
Boolean capability is simple, and implementing it is planned. Because indexer keeps track of which documents contain which words (and how many times), it would be easy to add logic for including or excluding files based on the presence or absence of search words. In fact, the current behavior effectively puts an implicit AND between every search word. (My hunch is that the large majority of real searches take this "x AND y AND z" form.)
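The implicit AND described above can be sketched as intersecting the fileid sets of each query word. The data layout mirrors the nested-dictionary description later in the article, but the names here are my own:

```python
def find_all(index, words):
    """Return the fileids containing every word: implicit 'x AND y AND z'."""
    if not words:
        return set()
    # The keys of each inner dictionary are the fileids containing the word.
    sets = [set(index.get(w, {})) for w in words]
    result = sets[0]
    for s in sets[1:]:
        result &= s          # set intersection implements AND
    return result

occurrences = {                  # word -> {fileid: count}
    "python": {1: 4, 2: 1},
    "index":  {1: 2, 3: 5},
}
print(find_all(occurrences, ["python", "index"]))  # only fileid 1 has both
```

OR and NOT would map just as directly onto set union and difference, which is why Boolean capability is an easy extension.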
  
Regular expressions will almost certainly never be added to indexer. As far as I know, no database-backed search system stores anything that would let it list the files matching an arbitrary regular expression. For practical purposes, regular-expression matching has to be done ad hoc; that is what grep is for.
  
Phrase and proximity searches are not implemented yet, but implementing them would not be difficult. Basically, in addition to the frequency of each word in each file, one would also collect a list of the word's offsets within each file; phrases and proximity could then be inferred from those lists. My feeling, though, is that this would greatly increase both the size of the database and the search time.
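Collecting per-file word offsets, as suggested, makes phrase detection a matter of checking for consecutive positions. A sketch under that assumption (the offset layout and function names are mine, not indexer's):

```python
def index_offsets(text):
    """Map each word to the list of its word positions in the text."""
    positions = {}
    for i, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(i)
    return positions

def has_phrase(positions, phrase_words):
    """True if the phrase words occur at consecutive offsets."""
    for start in positions.get(phrase_words[0], []):
        if all(start + k in positions.get(w, [])
               for k, w in enumerate(phrase_words[1:], 1)):
            return True
    return False

pos = index_offsets("free text search in pure Python")
print(has_phrase(pos, ["text", "search"]))   # True
print(has_phrase(pos, ["search", "text"]))   # False
```

A proximity search would relax the "consecutive" test to "within N positions", which is why both features fall out of the same offset data.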
  
Stemming and conceptual searches could probably fit within the existing framework, but they would take a lot of work. Stemming could actually shrink the database, since only word stems, rather than every inflected form, would need to be stored; but transforming the words requires external, language-specific dictionaries and inflection rules.
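A crude suffix-stripping stemmer illustrates the idea. Real stemmers (Porter's algorithm, for example) are far more careful; this toy version and its suffix list are entirely my own:

```python
def crude_stem(word):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ning", "ners", "ner", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("run", "runs", "runner", "running"):
    print(w, "->", crude_stem(w))   # all four reduce to "run"
```

Indexing stems instead of surface forms is what lets "run", "runner", "running", and "runs" all share a single database entry.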
  
   Indexer Programming
I recommend downloading the source code for indexer (see Resources at the end of this article). It is a single file with detailed comments, almost a tutorial in itself.
  
What follows are some remarks about the program's structure. Note that documents are numbered: each document is associated with an integer "fileid".
  
indexer maintains a Python dictionary whose keys are words and whose values are themselves dictionaries; each inner dictionary's keys are fileids, and its values are the number of times the word appears in the file with that fileid. Lookups in a Python dictionary are very efficient, and the extra cost of associating each fileid with an actual file name is small.
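The nested-dictionary layout just described can be pictured directly. The file names here are invented for illustration:

```python
# word -> {fileid: count}: the core structure
occurrences = {
    "python": {0: 4, 1: 1},
    "index":  {0: 2},
}
# fileid -> file name: a small companion mapping
fileids = {0: "/home/user/notes.txt", 1: "/home/user/mail.txt"}

# How many times does "python" appear in notes.txt (fileid 0)?
count = occurrences["python"][0]
print(count)                                    # 4
# Which files contain "index"?
print([fileids[i] for i in occurrences["index"]])
```

Two cheap dictionary lookups answer either question, which is what makes the design fast.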
  
Overall, indexer contains an abstract class called GenericIndexer. The most important methods defined in GenericIndexer are add_files() and find(). The save_index() method is also important if the storage mechanism requires finalization (most do).
  
What makes GenericIndexer abstract is that it cannot itself be instantiated; only its subclasses can be, after a bit of further work. The term "abstract" comes from C++, where it can be part of a class's formal declaration. In Python there is no such formal declaration; a class's "abstractness" is only a convention the class developer offers to its users. That is the Python way: the language does not enforce data hiding, member visibility, inheritance requirements, and the like, but it honors such conventions where they make sense. GenericIndexer, however, does a fairly good job of enforcing its recommendation, because many of its methods consist of the line "raise NotImplementedError". In particular, __init__() calls load_index(), which is one of these "NotImplemented" methods.
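This advisory style of abstraction looks roughly like the following simplified sketch (not indexer's actual code; the NullIndexer subclass anticipates the joke class mentioned below):

```python
class GenericIndexer:
    """Abstract base: subclasses must supply the storage mechanism."""
    def __init__(self):
        self.load_index()      # fails unless a subclass overrides load_index()

    def load_index(self):
        raise NotImplementedError("subclass must implement load_index()")

    def save_index(self):
        raise NotImplementedError("subclass must implement save_index()")

class NullIndexer(GenericIndexer):
    """A concrete subclass that 'stores' nothing."""
    def load_index(self):
        self.occurrences = {}
    def save_index(self):
        pass

indexer = NullIndexer()        # fine: load_index() is implemented
# GenericIndexer() would raise NotImplementedError
```

Nothing in Python forbids instantiating GenericIndexer, but because __init__() immediately calls an unimplemented method, trying to do so fails at once, which enforces the convention in practice.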

Derivatives of GenericIndexer provide the actual implementations of index storage. The most practical of these is ZPickleIndexer, which combines zlib with cPickle to store compressed dictionaries. Partly for fun, and partly because of some surprising performance test results (see the benchmark test module), I created a number of other SomethingIndexer classes: if you want them, classes based on shelve, XML, flat files, and plain cPickle are ready to use. It would even be possible (if pointless) to create a NullIndexer derivative that efficiently stores every index to /dev/null and re-indexes from scratch at the start of every search.
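The ZPickleIndexer storage idea, serialize the dictionary, then compress it, can be sketched in a few lines. In Python 3 the cPickle module of the article's era is simply pickle:

```python
import pickle
import zlib

occurrences = {"python": {0: 4}, "index": {0: 2, 1: 1}}

# save: serialize the dictionary, then compress the byte stream
blob = zlib.compress(pickle.dumps(occurrences))

# load: decompress, then deserialize (only unpickle data you trust)
restored = pickle.loads(zlib.decompress(blob))
assert restored == occurrences
print(len(blob), "bytes (compressed)")
```

Word-occurrence dictionaries are highly repetitive, so zlib typically shrinks them dramatically, which is presumably why this combination won the benchmarks.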
  
While the concrete (as opposed to "abstract") SomethingIndexer classes implement load_index() and save_index(), they inherit from a "mixin class" SomethingSplitter. At present the only such class is TextSplitter, but others will follow. A SomethingSplitter provides the all-important splitter() method, which takes a text string and breaks it into words. This is much harder than you might think; what is and is not a word is a very subtle distinction. In the future I would like to create derivatives of TextSplitter, such as XMLSplitter, TeXSplitter, and similar classes. For now, we try to find text words in a comparatively naive way.
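A minimal splitter() in the spirit of TextSplitter might use a regular expression for word characters. This toy version is my own and deliberately sidesteps the subtleties just mentioned (hyphenation, markup, numbers, apostrophes at word edges):

```python
import re

def splitter(text):
    """Split a text string into lowercase words (a deliberately naive version)."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(splitter("Can't stop -- won't stop!"))
# ["can't", 'stop', "won't", 'stop']
```

An XMLSplitter or TeXSplitter would differ mainly in stripping markup before (or while) applying a rule like this, which is exactly why splitting is factored into a mixin rather than baked into each indexer class.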
  
The "mixin class" is an interesting concept and often a good design choice. A class like TextSplitter (or its future derivatives) contains functionality useful to many different concrete classes. Like an abstract class, a mixin is not meant to be instantiated directly (here as a matter of good sense rather than prohibition: the mixin raises no NotImplementedError). Unlike an abstract class, however, a mixin does not try to provide the overall framework for its children; it simply contributes one piece of functionality for other classes to inherit.