How to create a secondary index over column values is a common question in Cassandra. This post describes one way to implement it. It is certainly not the only way, but it should be of interest to experienced Cassandra users. The approach described here does not use super columns at all, so it avoids the complexity and constraints that come with them. It should also be pointed out that Cassandra 0.7 and later versions implement native secondary indexes, which makes much of the following easier. Even so, the idea is a useful way to think about secondary indexes in Cassandra and can still be applied in many scenarios.
First, let's assume the following scenario. There is a container (such as a department) that holds a large number of items (such as the users in a department). Each item has an arbitrary set of properties, and you want to search for items in the container by property value. Items can also be members of another container, but that case is not considered here.
In Cassandra, one way to model this is with two column families (hereinafter CF). The first CF, named Item_Properties, describes the properties of each item. It uses the simplest Cassandra data model: each item is a row in Item_Properties, looked up by a key, which in this example is a UUID. Within the row, each column name is a property name of the item and the column value is the value of that property.
CF: Item_Properties
Key: item_id
Compare: BytesType
Name          | Value
property_name | property_value
...           | ...
The second CF, named Container_Items, holds the set of items in each container. Each column name in a Container_Items row is the row key of an item in Item_Properties, and the column value is the timestamp at the time of insertion. This is often hard to grasp at first in Cassandra: although a column family is usually compared to a table in a relational database and a row to a record, each row can itself be used as a simple table, or even as a join table. The rows of Container_Items can grow considerably. Since each column takes about 42 bytes (UUID + timestamp), versions earlier than Cassandra 0.7 allow a maximum of about 40 million items per row. That may be a reasonable limit for the users in a department, but it is unacceptable if you store something like status updates (such as tweets) this way, since they would certainly exceed it. This restriction no longer exists in Cassandra 0.7 and later versions, where a row can store up to 2 billion columns.
CF: Container_Items
Key: container_id
Compare: TimeUUIDType
Name    | Value
item_id | insertion_timestamp
...     | ...
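To make the two layouts concrete, here is a minimal sketch that models both column families as nested Python dictionaries (row key, then column name, then column value). It is not Cassandra client code: the helper names `add_item` and `items_in` and the sample data are hypothetical, and a plain dict does not reproduce the TimeUUIDType column ordering.

```python
import time
import uuid

# Hypothetical in-memory stand-ins for the two column families.
# Each CF maps a row key to a dict of {column name: column value}.
item_properties = {}   # row key: item_id, columns: property_name -> property_value
container_items = {}   # row key: container_id, columns: item_id -> insertion timestamp


def add_item(container_id, properties):
    """Store an item's properties and register the item in a container."""
    item_id = uuid.uuid1()  # time-based UUID, in the spirit of TimeUUIDType
    item_properties[item_id] = dict(properties)
    inserted_at = int(time.time() * 1_000_000)  # microseconds since the epoch
    container_items.setdefault(container_id, {})[item_id] = inserted_at
    return item_id


def items_in(container_id):
    """Return the item_ids of a container, i.e. the column names of its row."""
    return list(container_items.get(container_id, {}))


alice = add_item("engineering", {"name": "alice", "city": "Paris"})
bob = add_item("engineering", {"name": "bob", "city": "Oslo"})
```

Listing the members of a container is cheap with this layout, but finding the members whose `city` is `Paris` would mean reading every item row, which is the problem the index column families address.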
So far, this is a fairly basic Cassandra data model. Things get more complicated when you want to search for items in a container by a specified property value. To do that, you need to manage your own indexes, which goes well beyond Cassandra's simplest designs. Two more column families are needed. The first stores the actual index and uses the container_id plus the name of the indexed property from Item_Properties as its row key. Its structure is as follows:
CF: Container_Items_Property_Index
Key: container_id + property_name
Compare: compositecomparer.CompositeType
Name                                               | Value
composite(property_value, item_id, entry_timestamp) | item_id
...                                                | ...
The indexing technique described here differs a little from others in how each index column is composed. Cassandra provides a set of column types used to sort the columns within a row, and you specify the sort type when the CF is created. Cassandra also allows custom column types, such as the composite type shown above. A composite column lets us combine several components into one column name and sort by it. This lets us create unique columns even when the indexed value itself is not unique, by adding extra components to tell the columns apart.
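The effect of a composite column name can be sketched with Python tuples, which compare component by component. This only approximates what a composite comparator does on the server; the item ids, values, and timestamps below are made up for illustration.

```python
# Two items share the value "Paris". A column named only "Paris" would be
# overwritten by the second insert, so item_id and entry_timestamp are added
# to the name to keep every column unique.
entries = [
    ("Paris", "item-2", 1002),
    ("Oslo", "item-1", 1001),
    ("Paris", "item-3", 1003),
]

# One index row: composite column name -> item_id (hypothetical data).
row = {}
for property_value, item_id, entry_timestamp in entries:
    row[(property_value, item_id, entry_timestamp)] = item_id

# Tuples sort by property_value first, then item_id, then entry_timestamp,
# so all columns for the same value end up next to each other.
ordered_names = sorted(row)
```

Because equal values sit next to each other in sort order, a single contiguous slice of the row can return every item that has a given value.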
The last problem is what happens when a property value changes and the index must be updated. The simple answer is to insert the new value as a column in the Container_Items_Property_Index column family and delete the column for the old value. However, because of Cassandra's eventual consistency model and its lack of transactions, simply reading the previous value from Item_Properties and then updating it does not let you reliably delete the old value from the Container_Items_Property_Index index. To do this, we maintain a list of previous values for the specific property of a given item and use it to remove those values from the index before adding the new value to the index. These are stored in the following CF:
CF: Container_Item_Property_Index_Entries
Key: container_id + item_id + property_name
Compare: LongType
Name            | Value
entry_timestamp | property_value
...             | ...
These columns are deleted after they are read, so these rows never grow very large. In most cases they hold no more than one or two columns, and they grow larger only if the property is modified more frequently. It's a really good idea to make sure you understand why this CF is necessary, because you can use variations of it to solve a lot of problems with "eventual consistency" datastores.
In summary, there are two basic operations: (1) setting a property value for an item in a container, and (2) getting the list of items in a container that match a specific property value. They look like this:
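One way to see why the entries CF helps is to simulate two writers that both read the old state before either one writes. The sketch below is a deliberately simplified, single-process model using Python sets and dicts. It ignores replication, tombstones, and Cassandra's own timestamps, and the `apply_update` helper and all values are hypothetical.

```python
# Hypothetical state for the "city" property of one item.
item_value = {"city": "Oslo"}      # what Item_Properties holds
index = {("Oslo", "item-1", 1)}    # composite column names in the index row
entries = {1: "Oslo"}              # index entries: entry_timestamp -> value

# Both writers read the entries before either of them has written anything.
seen_by_a = dict(entries)
seen_by_b = dict(entries)


def apply_update(seen, new_value, timestamp):
    """Remove the index columns the writer saw, then add the new one."""
    for old_timestamp, old_value in seen.items():
        index.discard((old_value, "item-1", old_timestamp))
        entries.pop(old_timestamp, None)
    item_value["city"] = new_value
    index.add((new_value, "item-1", timestamp))
    entries[timestamp] = new_value


apply_update(seen_by_a, "Paris", 2)
apply_update(seen_by_b, "Rome", 3)
# item_value now holds only "Rome", but the index still has a "Paris" column.
# Reading Item_Properties alone would never reveal that stale column.
stale_after_race = sorted(index)

# The entries row recorded both values, so the next update can clean up both.
apply_update(dict(entries), "Berlin", 4)
```

In this model the stale `Paris` column survives the race but is removed by the next update, because the entries row still lists it.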
To set the value (property_value) of a property (property_name) for an item in a container:
1. Take the current timestamp as the timestamp of the new index entry;
2. Call get_slice on container_item_property_index_entries with container_id + item_id + property_name as the key to fetch the previously recorded entries;
3. Call batch_mutate to apply the following changes as one batch:
   Delete from container_items_property_index the index columns that correspond to the entries found in the previous step;
   Delete those entries from container_item_property_index_entries;
   Insert a column into item_properties (the column name is property_name and the value is property_value);
   Insert the new index column into container_items_property_index;
   Insert the new value into container_item_property_index_entries so that it can be removed on a later update;
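The steps above can be sketched against in-memory dictionaries. This is an illustration under stated assumptions, not client code: `set_property` and its `now` parameter are hypothetical, row keys are modelled as tuples rather than concatenated strings, and the group of mutations only stands in for a single batch_mutate call.

```python
import time

# Hypothetical in-memory stand-ins: row key -> {column name: column value}.
item_properties = {}
container_items_property_index = {}
container_item_property_index_entries = {}


def set_property(container_id, item_id, property_name, property_value, now=None):
    # Step 1: timestamp for the new index entry.
    entry_timestamp = now if now is not None else int(time.time() * 1_000_000)

    # Step 2: read the previously recorded index entries for this property.
    entries_key = (container_id, item_id, property_name)
    previous = dict(container_item_property_index_entries.get(entries_key, {}))

    # Step 3: the mutations below stand in for one batch_mutate call.
    index_key = (container_id, property_name)
    index_row = container_items_property_index.setdefault(index_key, {})
    entries_row = container_item_property_index_entries.setdefault(entries_key, {})
    for old_timestamp, old_value in previous.items():
        index_row.pop((old_value, item_id, old_timestamp), None)  # old index column
        entries_row.pop(old_timestamp, None)                      # consumed entry
    item_properties.setdefault(item_id, {})[property_name] = property_value
    index_row[(property_value, item_id, entry_timestamp)] = item_id
    entries_row[entry_timestamp] = property_value
    return entry_timestamp


set_property("engineering", "item-1", "city", "Oslo", now=1)
set_property("engineering", "item-1", "city", "Paris", now=2)
```

After the second call, the index row for `("engineering", "city")` holds only the `Paris` column, and the entries row holds only the entry written at timestamp 2.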
To query the items in a container by property_value:
1. Call get_slice on the container_items_property_index column family with container_id + property_name as the key to find the columns that match property_value.
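The lookup can be sketched as a range scan over sorted composite column names. The example below emulates the slice with Python's `bisect` module on an in-memory row. The `find_items` helper, the sample data, and the trick of appending `"\x00"` to form the upper bound are assumptions of this sketch, not part of any Cassandra API.

```python
import bisect

# Hypothetical snapshot of one index row, keyed by container_id + property_name.
# Column names are (property_value, item_id, entry_timestamp); values are item_ids.
index_row = {
    ("Oslo", "item-1", 1001): "item-1",
    ("Paris", "item-2", 1002): "item-2",
    ("Paris", "item-3", 1003): "item-3",
}


def find_items(row, property_value):
    """Emulate a slice over the columns whose first component matches."""
    names = sorted(row)
    # (value,) sorts before every (value, item_id, timestamp) tuple.
    start = bisect.bisect_left(names, (property_value,))
    # value + "\x00" is the smallest string greater than value.
    finish = bisect.bisect_left(names, (property_value + "\x00",))
    return [row[name] for name in names[start:finish]]


matches = find_items(index_row, "Paris")
```

Since the column values are item ids, the result can be used directly as row keys into Item_Properties to load the matching items.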
This looks like a lot of steps, but in practice they can all be encapsulated in a middleware layer so that they are invisible from the outside. The implementation of the composite column comparator can be found in CassandraCompositeType on GitHub.
A simple implementation of the indexing technique described above can be found on GitHub...
Update: Mike Malone points out that, since Cassandra already stores a timestamp along with the column value, it's redundant to store it in the column value as well. It can be omitted in the Container_Items and Container_Item_Property_Index_Entries column families, which would reduce storage space by about 20%.
This translation is rough, so please refer to the original article: http://www.anuff.com/2010/07/secondary-indexes-in-cassandra.html