Summary
This article introduces indexes and materialized views in Cassandra. Some points assume a basic understanding of Cassandra, such as how data is distributed across the nodes of a cluster. If something is unclear, you can read the earlier articles in this column, or email me at cnstonefang@gmail.com to discuss.
Why is it called a secondary index?
CREATE TABLE User (
ID bigint,
name text,
email text,
PRIMARY KEY (ID)
);
In many documents, a Cassandra index is also called a secondary index. The name is relative to the primary index. When the user table above is created, a primary index is built by default from the primary key, here the ID column, so you can query a user's information by ID. Unlike in a relational database, however, you cannot look up a user's ID by email. To support such a query, you can create a secondary index on email.
CREATE INDEX email_index on user (email);
When you create an index, Cassandra creates a hidden table to store the index data, conceptually like this:
CREATE TABLE email_index (
email text,
ID bigint,
PRIMARY KEY (email, ID)
);
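The hidden table can be pictured with a small Python model. This is only a sketch under stated assumptions, not Cassandra's implementation: the `Node` and `Cluster` classes and the modulo placement rule in `owner` are hypothetical. It shows a lookup by email as two steps (index, then base table), and contrasts a query by ID, which goes to one node, with a query by email, which may have to ask every node.

```python
from collections import defaultdict


class Node:
    """A toy node: the rows it owns plus a node-local email index."""

    def __init__(self):
        self.users = {}                      # ID -> row, the primary index
        self.email_index = defaultdict(set)  # email -> IDs stored on this node only

    def insert(self, user_id, name, email):
        old = self.users.get(user_id)
        if old is not None:
            # An update must also remove the stale entry from the hidden index.
            self.email_index[old["email"]].discard(user_id)
        self.users[user_id] = {"ID": user_id, "name": name, "email": email}
        self.email_index[email].add(user_id)

    def find_by_email(self, email):
        # Step 1: the index yields primary keys. Step 2: read the base rows.
        ids = self.email_index.get(email, set())
        return [self.users[i] for i in sorted(ids)]


class Cluster:
    def __init__(self, node_count):
        self.nodes = [Node() for _ in range(node_count)]

    def owner(self, user_id):
        # Hypothetical placement rule (assumption); real Cassandra hashes
        # the partition key to a token instead.
        return self.nodes[user_id % len(self.nodes)]

    def insert(self, user_id, name, email):
        self.owner(user_id).insert(user_id, name, email)

    def find_by_id(self, user_id):
        # The primary index is global: exactly one node is consulted.
        return self.owner(user_id).users.get(user_id)

    def find_by_email(self, email):
        # The secondary index is local: any node might hold a match.
        rows = []
        for node in self.nodes:
            rows.extend(node.find_by_email(email))
        return rows
```

Note that `insert` writes both the base table and the index, which is why updates to an indexed column cost an extra write.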
The data in this secondary index table is node-local: each node indexes only the rows it stores itself and keeps the index data alongside them. The primary index, by contrast, is global. When you query by primary index columns, every node on the Cassandra ring knows which nodes hold the data. When you query by secondary index columns, no node on the ring knows which nodes hold the matching data, so all nodes may have to be queried. That is why many people say Cassandra secondary indexes are inefficient. In practice, of course, Cassandra is not so crude as to query every node at once: in a 1000-node cluster, the query coordinator could not cope if it contacted all nodes for every query.
Secondary Index Query
Conceptually, Cassandra queries all nodes, and each node runs a local query. Without a secondary index, a query that restricts a non-key column without specifying the partition key would have to fully scan every partition on every node, so Cassandra rejects it unless you append ALLOW FILTERING, which is not recommended in a production environment. Once a secondary index exists on the column, an equality query on that column is accepted without a partition key; ALLOW FILTERING is then needed only when the query also restricts other, non-indexed columns.
Local query: the local query on each node is simple and direct. The node looks up the secondary index column value in the hidden index table to get the matching primary keys, and then reads those rows from the original table.
Cluster query: to query across nodes, Cassandra uses a fairly elaborate algorithm, driven by the token ranges of the partition keys, to optimize range scans. This algorithm is not specific to secondary indexes; it applies to all range scans.
The basic idea of the algorithm is to query iteratively. In each round, a value called concurrency_factor determines how many nodes are queried. If the rows returned so far are not enough, concurrency_factor is increased and another round runs, until the result set is large enough.
Note that Cassandra queries these nodes by token range, so the returned result set is in no particular order.
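The iterative scan can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's code: the function `range_scan` and its `query_range` callback are hypothetical, and it assumes concurrency_factor starts at 1 and grows by one per round, whereas the real coordinator's sizing rule may differ.

```python
def range_scan(ranges, limit, query_range):
    """Scan token ranges in ring order until `limit` rows are collected.

    ranges      -- token ranges in ring order
    query_range -- callable returning the list of matching rows for one range
    Returns (rows, rounds), where rounds is the number of rounds used.
    """
    results = []
    concurrency_factor = 1  # assumption: start by querying a single range
    position = 0
    rounds = 0
    while position < len(ranges) and len(results) < limit:
        batch = ranges[position:position + concurrency_factor]
        for token_range in batch:  # a real coordinator would issue these in parallel
            results.extend(query_range(token_range))
        position += len(batch)
        rounds += 1
        concurrency_factor += 1  # assumption: widen the next round by one range
    return results[:limit], rounds
```

With a small limit and matches early in the ring, only a few ranges are touched; the whole ring is scanned only when matches are rare or the limit is large. Rows come back in token-range order, not in any order meaningful to the application.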
Notes
Although Cassandra optimizes range queries, secondary index queries are undeniably relatively inefficient. The best practice is to include a primary index condition in the secondary index query, for example partition = xxx, partition IN (xx, yy), or token(partition) >= xxx AND token(partition) <= yyy.
When to use
A secondary index suits a column that is populated in many rows (Cassandra does not require every column to be stored for every row) and that has a reasonably wide range of values.
On the other hand, the following kinds of columns are not a good fit:
1. Frequently updated or deleted columns
Cassandra stores tombstones in the index up to a limit of 100K cells. Once this limit is exceeded, queries that use the indexed value fail.
In addition, the index data lives in a hidden table, so every update or delete of the column has to be written to the hidden table as well as the main table.
2. Columns with very few distinct values (low cardinality), such as a bool column
Indexing such a column makes little sense. The index table has only two partitions, so if the main table holds a lot of data, each index partition becomes very large.
3. Columns with very many distinct values (high cardinality), such as the example above, where each ID corresponds to one email
If you index email, a query by email matches only one value. In the best case, the first node you query has it. In the worst case, you have to check every node to find it.
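One rough way to see where a column falls is to count how many index partitions it would produce and how large the biggest one would be. The helper below is a hypothetical sketch that treats each distinct indexed value as one partition of the hidden index table, as described above; it is not a Cassandra API.

```python
from collections import Counter


def index_partition_stats(column_values):
    """Summarize the index partitions a column would produce.

    Each distinct value is treated as one index partition (assumption),
    and its size is the number of rows holding that value.
    """
    sizes = Counter(column_values)
    return {
        "rows": len(column_values),
        "partitions": len(sizes),
        "largest": max(sizes.values(), default=0),
    }


# Low cardinality: a bool column over 1000 rows.
flags = [i % 2 == 0 for i in range(1000)]
# High cardinality: one unique email per row.
emails = [f"user{i}@example.com" for i in range(1000)]

low = index_partition_stats(flags)
high = index_partition_stats(emails)
```

Here `low` describes two partitions of 500 rows each, while `high` describes 1000 partitions of one row each. Neither extreme is a good fit, and where the acceptable middle lies depends on the data volume and the cluster.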
Items 2 and 3 may leave some readers confused: a very low value range is unsuitable for an index, and a very high value range is unsuitable too, yet no standard is given for what counts as high or low, or how to judge it. In fact, many parts of Cassandra have this problem: there is no rigorous, precise definition. Users have to strike the balance themselves, analyzing performance against the actual table design and data distribution to arrive at a table design that suits the application.
The difference between a new table and a materialized view
To satisfy its queries, a Cassandra application often needs new tables, materialized views, or indexes.
The characteristics of indexes were covered above. A new table introduces data redundancy, which is perfectly acceptable in a distributed storage system, but compared with a view, a new table means more data to maintain yourself. Some situations cannot be solved by either views or indexes, such as the low-cardinality case mentioned above. A view does not help there because it is global: the view data for such a column lands on only a few fixed nodes, which creates hot spots.
In addition, view updates are asynchronous.
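The hot-spot effect can be illustrated with a toy placement model. This is a sketch under assumptions: `node_for` is a hypothetical stand-in for Cassandra's partitioner (it hashes with MD5 and takes a modulo, which is not how token ranges are really assigned), and the 10-node cluster and 10000 rows are made-up numbers.

```python
import hashlib
from collections import Counter


def node_for(partition_key, node_count):
    # Hypothetical placement (assumption): hash the key, then pick a node.
    digest = hashlib.md5(repr(partition_key).encode()).hexdigest()
    return int(digest, 16) % node_count


def rows_per_node(partition_keys, node_count):
    """Count how many rows each node would store for the given keys."""
    return Counter(node_for(key, node_count) for key in partition_keys)


ids = list(range(10000))               # base table partitioned by ID
flags = [i % 2 == 0 for i in ids]      # view partitioned by a bool column

base_layout = rows_per_node(ids, 10)   # rows spread across the cluster
view_layout = rows_per_node(flags, 10) # rows pile up on at most two nodes
```

Because the view has only two distinct partition keys, all of its rows end up on at most two nodes, no matter how large the cluster is.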
Readers interested in Cassandra are welcome to join the group (104822562) and study together.
Reference
http://www.planetcassandra.org/blog/cassandra-native-secondary-index-deep-dive/
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useWhenIndex.html
http://www.datastax.com/dev/blog/materialized-view-performance-in-cassandra-3-x
https://wiki.apache.org/cassandra/WritePathForUsers