Design of an Information Visualization Demo (2): Index & Search
Author: angry little FOX ~ 2011-10-29
Blog: http://blog.csdn.net/MONKEY_D_MENG
This is one post in a series. For more information, see Part 1: Design of an Information Visualization Demo (1): Architecture Design.
I. Information Retrieval
The wave of cloud computing has swept across the world, and everyone is talking about the cloud. It seems that, almost overnight, massive information and big data have filled our world. Now that the volume of information has expanded so dramatically, cloud storage has become extremely important. But what is the point of storing such a huge amount of information? The answer: to convert this data into useful information and knowledge that can be put to wide use, in business management, production control, market analysis, engineering design, scientific exploration, and more.
With such a wealth of data, powerful analysis tools are needed to seek out and discover knowledge. Massive data keeps growing rapidly inside ever larger databases; without powerful tools, understanding it is far beyond human capability, and decision makers find it difficult to extract valuable knowledge from it. The result is that the data is rich while the information is poor, and the data collected in large databases turns into a "data grave".
To analyze and process massive data, a prerequisite is that the information be organized in some way so that analysts can quickly find what is relevant; this process is information retrieval. How to analyze the data once it has been found is generally the territory of data mining and machine learning, which this article will not discuss for now.
You could search information resources directly, scanning the content in sequence and matching it against the retrieval request. This approach is direct, simple, and easy to implement, and for small data volumes the results are not too bad. In a massive data environment, however, such scanning is extremely time-consuming and absolutely undesirable.
Sequential scanning of unstructured data is slow, while searching structured data is relatively fast, because structured data has a certain structure that lets us apply tricks, namely search optimization algorithms, to speed things up. So the key to the problem becomes clear: we should try to turn unstructured data into structured data. This idea is so natural that it forms the basic idea of information retrieval: extract part of the information from unstructured data and give it structure, then design efficient search algorithms and mechanisms on top of it so that search becomes relatively fast. The structured information extracted from unstructured data and then reorganized is what the field of information retrieval calls an index. For example, we have all used dictionaries. Without the pinyin or radical lookup tables, finding a word in a dictionary is not a happy experience; with the pinyin or radical tables, you can locate the word quickly. Those pinyin and radical lookup tables are precisely an index for the dictionary.
II. Search Requirements
In software development, requirements are the most elusive and volatile thing; it is fair to say they change as soon as they are spoken. I worked as a dev at MSRA/STC for two months. During that period the requirements never formed a unified document and were passed along only by word of mouth, so mis-transmission and misunderstanding were common. This is not to say that development at MSRA/STC is not standardized, but rather that the author, as a summer intern, was only building a demo and did not receive that much attention. In such a situation, communication becomes extremely important. Later, some of our requirements were worked out from the application scenarios of two of the devs, and were finally recognized by everyone. One of them was to provide expression-level information retrieval.
For ease of illustration, the requirements are simplified as follows: there are thousands of trees in the demo, each tree contains hundreds of nodes, and each node contains a sequence of Features (a rough data-model sketch follows the list). We need to retrieve:
(1) Which Trees contain a given Feature?
(2) Which Nodes contain a given Feature?
(3) Which Features are contained in a given Tree?
(4) Fuzzy Feature search, such as finding the Trees or Nodes corresponding to "PerStream%".
(5) Expression-level search, for example ([FeatureName]=PerStreamBM25F_Body | [FeatureID]=18) & ([TreeID]=1 | [TreeID]=2).
III. Index Design
If you have studied search engines, or used the open-source full-text search framework Lucene, it is not hard to understand why indexes need to be built, or the basic structure of an inverted index. Based on the author's accumulated theory and practice in information retrieval, the following indexing schemes are given:
(1) Inverted index: Feature --> Tree
(2) Inverted index: Feature --> Node
(3) Forward index: Tree --> Feature
(4) Trie index (dictionary tree / key tree): fuzzy Feature search
In this demo, we enter a Feature and retrieve all the corresponding Trees or Nodes. The raw information records that each Tree contains several Nodes and each Node contains a Feature sequence. Without an index, we would have to traverse all the Trees, traverse all the Nodes of each Tree, and check whether each Node's Feature sequence contains the target Feature, until every Tree had been scanned. Such an inefficient retrieval method is so intolerable that, even without ever having studied information retrieval, you would want to design a more efficient scheme to solve the problem.
So let us analyze why sequential scanning is slow. Essentially, the information we want to search for is not organized the same way as the information stored in the raw data. The raw data stores which Nodes each Tree contains and which Features each Node contains, that is, the mapping from Tree to Node and from Node to Feature. What we want to find is which Trees or Nodes contain a given Feature, that is, starting from a known Feature and locating the corresponding Trees or Nodes: the mapping from Feature to Tree or Node. The two directions are opposite, and that is why sequential scanning is slow. If, instead, the index stores the mapping from Feature to Tree or Node, the search speed improves dramatically.
This is exactly how the inverted index arises, and the concrete implementation hardly needs elaboration: each Feature corresponds to a Tree list or a Node list. To find the Trees or Nodes that contain two Features at the same time, simply take the intersection of the Tree or Node lists of the two Features. For more than two Features, you get the idea...
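As a minimal sketch of this idea (in Java, with map and method names of my own choosing rather than the demo's actual code), the inverted index is just a map from each Feature to the set of IDs of the Trees that contain it, and a two-Feature query becomes a set intersection:

```java
import java.util.*;

public class InvertedIndexSketch {
    // Feature name -> IDs of the Trees that contain it
    private final Map<String, Set<Integer>> featureToTrees = new HashMap<>();

    // Called while scanning the raw data once: record that treeId contains feature.
    public void add(String feature, int treeId) {
        featureToTrees.computeIfAbsent(feature, f -> new HashSet<>()).add(treeId);
    }

    // Trees that contain a single Feature.
    public Set<Integer> search(String feature) {
        return featureToTrees.getOrDefault(feature, Collections.emptySet());
    }

    // Trees that contain both Features: intersect the two posting lists.
    public Set<Integer> searchBoth(String f1, String f2) {
        Set<Integer> result = new HashSet<>(search(f1));
        result.retainAll(search(f2));
        return result;
    }
}
```

The Node-level inverted index works the same way, with Node IDs in place of Tree IDs.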
To enter a Tree and retrieve the Features it contains, a simple forward index of the form Tree --> Feature list is enough.
What is interesting is that Feature retrieval supports fuzzy matching. For example, entering "PerStream%" retrieves the Trees or Nodes corresponding to Features prefixed with PerStream. The index scheme I chose is a trie, also called a dictionary tree, key tree, or shared-prefix tree in data structure terms. The same structure can be used to implement the smart-suggestion feature of an input method. Of course, it only supports prefix matching; if you first reverse the strings and then build the trie over the reversed strings, suffix matching can be implemented as well, but that is another story...
I briefly introduced the trie design in my earlier blog post "Design of Core Data Structures and Algorithms for Input Methods"; see that article for details. In short, the trie stores the mapping from a fuzzy Feature prefix to a Tree list or Node list, thereby implementing fuzzy Feature search.
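For illustration, here is a minimal trie sketch in Java (my own simplification, not the code from that earlier post): each Feature name is inserted character by character, every prefix node remembers the Trees whose Features pass through it, and a fuzzy query such as "PerStream%" just walks down to the prefix node and returns that set.

```java
import java.util.*;

public class FeatureTrie {
    private static class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        Set<Integer> treeIds = new HashSet<>(); // Trees whose Features pass through this node
    }

    private final TrieNode root = new TrieNode();

    // Insert one Feature name and the Tree that contains it.
    public void insert(String feature, int treeId) {
        TrieNode cur = root;
        for (char c : feature.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new TrieNode());
            cur.treeIds.add(treeId); // every prefix node remembers the Tree
        }
    }

    // Fuzzy (prefix) search, e.g. prefixSearch("PerStream") for "PerStream%".
    public Set<Integer> prefixSearch(String prefix) {
        TrieNode cur = root;
        for (char c : prefix.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return Collections.emptySet();
        }
        return cur.treeIds;
    }
}
```

Storing the Tree IDs on every prefix node trades memory for a prefix query that needs no subtree traversal; a more memory-frugal variant would store IDs only at terminal nodes and collect the subtree on demand.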
IV. Index Optimization
Although the index structure is basically settled at the conceptual level (inverted indexes, a forward index, and a trie index), different people will implement the same idea differently. For example, to store the Feature --> Tree list mapping, one person might simply use an ArrayList, while another might keep the list sorted in ascending order. If we then want to find the Trees that contain two Features at the same time, we need to intersect two Tree lists. If the lists are sorted, the intersection can be done in O(n) time; if not, it will be painfully slow. This is just an example; my point is that the same idea may seem clear, yet the efficiency and effect of different implementations can differ greatly.
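For the sorted-list variant mentioned above, the O(n) intersection is the classic two-pointer merge; a generic sketch, not the demo's code:

```java
import java.util.ArrayList;
import java.util.List;

public class SortedIntersection {
    // Intersect two ascending lists of Tree IDs in O(n + m) time.
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i), y = b.get(j);
            if (x == y) { result.add(x); i++; j++; }
            else if (x < y) i++;
            else j++;
        }
        return result;
    }
}
```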
Combined with the actual development scenario, we should design the most efficient structures and algorithms we can. My implementation uses neither an ArrayList nor sorting; instead, a BitMap is introduced to store the Tree list. BitMap[100] = 1 means the Tree with ID = 100 is in the Tree list, and BitMap[99] = 0 means the Tree with ID = 99 is not. The reason for this design came from analyzing the scenario at the time: there were only a few thousand Trees, a small number, so they could all fit in one BitMap, with the bit position identifying the Tree ID. To some extent this also saves storage, because the ArrayList corresponding to a Feature could be quite large. At the same time there is no need to sort, and to intersect two lists you can simply AND the two BitMaps, which is convenient and fast.
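In Java terms the same trick can be sketched with java.util.BitSet (the demo was not necessarily written in Java; this is only an illustration of the idea):

```java
import java.util.BitSet;

public class BitMapIntersection {
    public static void main(String[] args) {
        BitSet treesWithF1 = new BitSet(); // bit i set <=> Tree with ID=i contains Feature 1
        BitSet treesWithF2 = new BitSet();

        treesWithF1.set(99);   // Tree 99 contains Feature 1
        treesWithF1.set(100);  // Tree 100 contains Feature 1
        treesWithF2.set(100);  // Tree 100 contains Feature 2

        BitSet both = (BitSet) treesWithF1.clone();
        both.and(treesWithF2); // intersection is a single AND over the bit maps

        System.out.println(both); // {100}: only Tree 100 contains both Features
    }
}
```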
This is an example of performance optimization tailored to a specific scenario. Of course, what I describe here is just a basic practice; to optimize further, you still need to think carefully throughout development.
So far we have an index scheme, and the index structure has been built up step by step. But have you noticed that something is missing? All of the index data lives in memory. And what is memory? An unreliable storage medium: once the power is cut, everything is erased and all data is lost. If the server crashes one day, is all the index data simply gone? Do we have to reload the raw data and rebuild every index after the server restarts? That may be tolerable for small data volumes, but what about massive data, where rebuilding an index could take days? In other words, we need one more step: index persistence!
Index persistence means writing the in-memory indexes to disk in a certain format and organization. That format will inevitably differ from the in-memory one: for example, how should the inverted index table be stored on disk, and how can the service load it quickly at start-up to reconstruct the complete inverted index in memory? Of course, persistence is not that easy to do well; Lucene's index format is a good reference, and I did not spend much time studying it during my two-month internship. With the BitMap optimization, the index is not only fast to query but also small, around 2 MB, and for an internship demo it is acceptable to ignore disaster recovery.
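A very crude way to persist such an index, nothing like Lucene's real segment format but enough to make the idea concrete, is to serialize the whole table to disk and read it back at start-up. A sketch, assuming the index is held as a map from Feature name to a BitMap of Tree IDs (a HashMap<String, BitSet> here):

```java
import java.io.*;
import java.util.BitSet;
import java.util.HashMap;

public class IndexPersistence {
    // Write the Feature -> Tree BitMap table to disk.
    public static void save(HashMap<String, BitSet> index, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(index); // both HashMap and BitSet are Serializable
        }
    }

    // Rebuild the in-memory index from disk after a restart.
    @SuppressWarnings("unchecked")
    public static HashMap<String, BitSet> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<String, BitSet>) in.readObject();
        }
    }
}
```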
Still, that does not stop us from thinking further. With truly massive data, a huge index cannot be fully loaded into memory. In that case we also need to cache the currently hottest indexes in memory to keep retrieval fast, while the disk serves as the sediment layer that stores the least frequently used indexes. That is, of course, a difficult but interesting problem.
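The original design never specified a caching policy, so purely as an assumed illustration, such a hot-index layer could be sketched as an LRU cache that keeps only the most recently used posting lists in memory:

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Keeps at most MAX_ENTRIES posting lists in memory; the least recently
// used one is evicted when the limit is exceeded.
public class HotIndexCache extends LinkedHashMap<String, BitSet> {
    private static final int MAX_ENTRIES = 10_000;

    public HotIndexCache() {
        super(16, 0.75f, true); // accessOrder = true -> LRU ordering
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
        return size() > MAX_ENTRIES;
    }
}
```

In a real system the evicted entries would be flushed to, and reloaded from, the persisted index on disk.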
V. Search Design
The purpose of building an index is, of course, to search: users do not need to know about the index, but they must be able to search. At the beginning, our search was positioned as single-phrase search only. At the author's suggestion, we settled on expression-level search requirements, for example ([FeatureName]=PerStreamBM25F_Body | [FeatureID]=18) & ([TreeID]=1 | [TreeID]=2); a single phrase is, of course, just the simplest expression.
Our expressions are assembled with the &, |, ! and ( ) operators. However, an expression cannot be used for retrieval directly; to understand what it means, we need an expression parsing component. The parsing algorithm is implemented with reverse Polish notation, and that content is left for the third part of this series, the algorithms article, where it is described in detail. Please stay tuned to this blog ~
Given that we search by expression, how should the search classes be defined? I thought about it for a morning and came up with a scheme I find rather neat, which I would like to share. The top-level abstract class is Query, with an abstract method search(). WordQuery, NotQuery, and BinaryQuery derive directly from Query; WordQuery and NotQuery have a single operand, while BinaryQuery has two. AndQuery and OrQuery derive from BinaryQuery and correspond to the & and | operations respectively. A simple class inheritance diagram is as follows:
Query: the abstract base class for search, providing the abstract method search().
WordQuery: derived from Query; finds a given word or phrase, the basic search class.
NotQuery: derived from Query; corresponds to the operator !.
BinaryQuery: derived from Query; holds two Query operands.
AndQuery: derived from BinaryQuery; represents the AND of two queries, corresponding to the operator &.
OrQuery: derived from BinaryQuery; represents the OR of two queries, corresponding to the operator |.
query1 & query2: returns AndQuery(query1, query2).
query1 | query2: returns OrQuery(query1, query2).
!query: returns NotQuery(query).
With this design, the search expression ([FeatureName]=PerStreamBM25F_Body | [FeatureID]=18) & ([TreeID]=1 | [TreeID]=2) corresponds to the following objects:
Query1: WordQuery([FeatureName]=PerStreamBM25F_Body)
Query2: WordQuery([FeatureID]=18)
Query3: WordQuery([TreeID]=1)
Query4: WordQuery([TreeID]=2)
Expression: AndQuery(OrQuery(Query1, Query2), OrQuery(Query3, Query4))
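A compact sketch of this class hierarchy in Java (the search() here returns a BitMap of Tree IDs to match the index above, but the concrete signatures are my guess, not the demo's code):

```java
import java.util.BitSet;
import java.util.Map;

abstract class Query {
    abstract BitSet search(Map<String, BitSet> index, int universeSize);
}

class WordQuery extends Query {
    private final String term;                     // e.g. "[TreeID]=1"
    WordQuery(String term) { this.term = term; }
    BitSet search(Map<String, BitSet> index, int universeSize) {
        BitSet hits = index.get(term);
        return hits == null ? new BitSet() : (BitSet) hits.clone();
    }
}

class NotQuery extends Query {
    private final Query inner;                     // single operand, operator !
    NotQuery(Query inner) { this.inner = inner; }
    BitSet search(Map<String, BitSet> index, int universeSize) {
        BitSet result = inner.search(index, universeSize);
        result.flip(0, universeSize);              // complement within all Tree IDs
        return result;
    }
}

abstract class BinaryQuery extends Query {
    protected final Query left, right;             // two operands
    BinaryQuery(Query left, Query right) { this.left = left; this.right = right; }
}

class AndQuery extends BinaryQuery {
    AndQuery(Query l, Query r) { super(l, r); }
    BitSet search(Map<String, BitSet> index, int universeSize) {
        BitSet result = left.search(index, universeSize);
        result.and(right.search(index, universeSize));   // operator &
        return result;
    }
}

class OrQuery extends BinaryQuery {
    OrQuery(Query l, Query r) { super(l, r); }
    BitSet search(Map<String, BitSet> index, int universeSize) {
        BitSet result = left.search(index, universeSize);
        result.or(right.search(index, universeSize));    // operator |
        return result;
    }
}
```

Building AndQuery(OrQuery(Query1, Query2), OrQuery(Query3, Query4)) from the four WordQuery objects above and calling search() on it, with universeSize set to the total number of Trees, evaluates the whole expression against the index.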
VI. Search Optimization
This part was not actually implemented at the time, but I did think about it. Anyone who has written C knows that the && and || operators short-circuit, and that short-circuiting can be used to optimize conditional statements. Our search is no different. For the retrieval request [FeatureName]=PerStreamBM25F_Body & [FeatureID]=18, if no Tree satisfies [FeatureName]=PerStreamBM25F_Body, there is no need to run the [FeatureID]=18 search at all. Similarly, a request such as [FeatureID]=1 & [FeatureID]!=1 can be filtered out directly, because the result is necessarily empty.
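A hedged sketch of what that short circuit could look like inside an AND query, reusing the Query and BinaryQuery types from the sketch in section V (again my own illustration, not the original code):

```java
import java.util.BitSet;
import java.util.Map;

// Reuses Query / BinaryQuery from the class-hierarchy sketch in section V.
class ShortCircuitAndQuery extends BinaryQuery {
    ShortCircuitAndQuery(Query left, Query right) { super(left, right); }

    @Override
    BitSet search(Map<String, BitSet> index, int universeSize) {
        BitSet leftHits = left.search(index, universeSize);
        if (leftHits.isEmpty()) {
            // Short circuit: the left side already matches nothing,
            // so there is no need to evaluate the right sub-query at all.
            return leftHits;
        }
        leftHits.and(right.search(index, universeSize));
        return leftHits;
    }
}
```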
Of course, optimization is a deep subject; to do better, you can only keep practicing and keep thinking ~
VII. Summary
Index & Search is, in my personal view, the cleverest piece of the demo and the most elegant part of the code. The whole demo was built and refined step by step, and writing the code was a genuinely happy and fulfilling thing, especially for this requirement: it went through the full cycle of requirements analysis, solution design, coding, functional testing, and later performance optimization, and I gained quite a lot from it. Of course, this article does not paste the demo's actual code; it talks only about ideas, design, and the overall framework. It is broad, but hopefully not hollow. That is all, in short.
In the third part of this series, the algorithms article, I will also present the concrete parsing algorithm for the search expressions. Please stay tuned to this blog ~