Cassandra Data Model

Source: Internet
Author: User
Tags cassandra create index

On the surface, the Cassandra data model resembles that of a relational database, and its query language, CQL, is very similar to SQL.

Underneath, however, the data model is closer to a multi-layer key-value structure, which differs greatly from a relational database.

This article is based on: [Cqlsh 5.0.1 | Cassandra 3.11.2 | CQL Spec 3.4.4 | Native Protocol V4]

Contents:

    • Multilayer KV Structure
      • Query
      • Sort
      • Aggregation
    • ALLOW FILTERING
    • Secondary index

Multilayer KV Structure

The Cassandra data model is made up of keyspaces (similar to databases in a relational database), column families (similar to tables in a relational database), primary keys, and columns.

A column family should not be pictured as a table in a relational database, but as a multi-layered key-value structure:

Map<PartitionKey, SortedMap<ClusteringKey, Column>>

We use CQL to describe:

CREATE TABLE table1 (
    key1 int,
    key2 int,
    content text,
    PRIMARY KEY ((key1), key2)
);

In the table (column family) created above, key1 is the partition key and key2 is the clustering key; together, key1 and key2 make up the primary key.
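The nested map can be made concrete with a small in-memory model. The following is only a sketch of the Map<PartitionKey, SortedMap<ClusteringKey, Column>> idea for table1, not Cassandra's storage engine; the `insert` helper and the sample values are assumptions made for illustration.

```python
from bisect import insort

# Hypothetical stand-in for table1:
# partition key (key1) -> list of (key2, content), kept sorted by key2.
table1 = {}

def insert(key1, key2, content):
    """Upsert a row: writing an existing (key1, key2) overwrites its content."""
    partition = table1.setdefault(key1, [])
    for i, (existing_key2, _) in enumerate(partition):
        if existing_key2 == key2:
            partition[i] = (key2, content)
            return
    insort(partition, (key2, content))  # keeps the partition sorted by key2

insert(1, 20, "b")
insert(1, 10, "a")
insert(2, 5, "c")
insert(1, 20, "B")  # same primary key (1, 20), so this overwrites "b"
```

After these calls, the partition for key1 = 1 holds its rows in key2 order regardless of insertion order, and the primary key (1, 20) appears only once.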

Cassandra also supports more complex table structures:

CREATE TABLE table2 (
    pkey1 int,
    pkey2 int,
    ckey1 int,
    ckey2 int,
    content text,
    PRIMARY KEY ((pkey1, pkey2), ckey1, ckey2)
);

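The data structure at this time can be described as follows. The partition key columns pkey1 and pkey2 are hashed together to locate a partition, so they act as a single composite key:

Map<(pkey1, pkey2), SortedMap<ckey1, SortedMap<ckey2, content>>>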

As a distributed database, Cassandra uses the partition key to determine which node of the cluster stores each row. The clustering key determines the sort order of data within a partition.
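A rough sketch of partition placement is shown below. It rests on several assumptions: the node names are hypothetical, md5 stands in for Cassandra's default Murmur3 partitioner, and a simple modulo replaces the real token ring. The point it illustrates is only that the whole partition key is hashed as one unit, while clustering keys play no part in placement.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster

def node_for(partition_key):
    """Pick a node from a hash of the whole partition key (md5 as a stand-in)."""
    raw = repr(partition_key).encode("utf-8")
    token = int.from_bytes(hashlib.md5(raw).digest()[:8], "big")
    return NODES[token % len(NODES)]  # simplification of the token ring

# table2's partition key is the pair (pkey1, pkey2); ckey1 and ckey2 are not hashed.
owner = node_for((1, 1))
```

Every row with pkey1 = 1 and pkey2 = 1 maps to the same owner, which is why the rows of one partition can be kept sorted together on one node.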

Query

The characteristics of the Cassandra data model are described below, using table2 as an example.

As the map structure described above suggests, we cannot query on non-primary-key columns (such as content in table2):

SELECT * FROM table2 WHERE content='a'; -- error
SELECT * FROM table2 WHERE pkey1 = 1 AND content='a'; -- error
SELECT * FROM table2 WHERE pkey1 = 1 AND pkey2 = 1; -- right

When querying on primary key columns such as ((pkey1, pkey2), ckey1, ckey2), keep the following in mind:

    • A partition key supports only exact-match queries (=, IN), not range queries (<, >, >=, <=). Note: Cassandra does not support != in a WHERE clause.
    • With a composite partition key, a query must provide exact values for every partition key column. pkey1 and pkey2 are hashed together to locate a partition, so restricting only pkey1 or only pkey2 is rejected unless ALLOW FILTERING is added.
    • Queries involving a clustering key must provide exact values for all partition key columns.
    • Clustering keys cannot be skipped: to query on ckey2, you must provide an exact value for ckey1.

(There is no need to memorize these rules: once you understand Cassandra's internal data structure, it is easy to see which queries are possible.)
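The clustering key rules above follow from how a sorted structure can be searched. As a sketch, assume one partition of table2 is a list of (ckey1, ckey2) pairs in sorted order; the sample rows and both helper functions are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# One partition of table2 (pkey1=1, pkey2=1), sorted by (ckey1, ckey2).
partition = [(0, 5), (1, 0), (1, 3), (1, 7), (2, 1), (2, 3)]

def slice_ckey1_eq_ckey2_gt(rows, ckey1, ckey2_min):
    """ckey1 = ? AND ckey2 > ?: one contiguous slice, found by binary search."""
    lo = bisect_right(rows, (ckey1, ckey2_min))
    hi = bisect_left(rows, (ckey1 + 1,))  # first row whose ckey1 is larger
    return rows[lo:hi]

def positions_ckey2_eq(rows, ckey2):
    """ckey2 = ? alone: matches are scattered, so every row must be examined."""
    return [i for i, (_, c2) in enumerate(rows) if c2 == ckey2]

contiguous = slice_ckey1_eq_ckey2_gt(partition, 1, 0)  # [(1, 3), (1, 7)]
scattered = positions_ckey2_eq(partition, 3)           # [2, 5]
```

Fixing ckey1 narrows the search to one contiguous run, inside which ckey2 is sorted. Without a value for ckey1, rows with the same ckey2 sit at unrelated positions (2 and 5 here), and only a scan can find them.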

Specific examples follow.

Only the partition key is involved:

SELECT * FROM table2 WHERE pkey1=1 AND pkey2=1; -- right
SELECT * FROM table2 WHERE pkey1 = 1; -- error: pkey2 is not restricted
SELECT * FROM table2 WHERE pkey2=1; -- error: pkey1 is not restricted

A clustering key is involved:

SELECT * FROM table2 WHERE pkey1=1 AND pkey2=1 AND ckey1>0; -- right
SELECT * FROM table2 WHERE pkey1=1 AND ckey1>0; -- error
SELECT * FROM table2 WHERE ckey1=1; -- error

Multiple clustering keys are involved:

SELECT * FROM table2 WHERE pkey1=1 AND pkey2=1 AND ckey1=1 AND ckey2>0; -- right
SELECT * FROM table2 WHERE pkey1=1 AND pkey2=1 AND ckey2>0; -- error
SELECT * FROM table2 WHERE pkey1=1 AND pkey2=1 AND ckey1>0 AND ckey2=1; -- error

Sort

Cassandra can sort query results by clustering key, but the sorting capability is also very limited:

SELECT * FROM table2 WHERE pkey1=1 AND pkey2=1 ORDER BY ckey1; -- right
SELECT * FROM table2 WHERE pkey1=1 AND pkey2=1 ORDER BY ckey1, ckey2; -- right
SELECT * FROM table2 WHERE pkey1=1 AND pkey2=1 ORDER BY ckey2; -- error

ORDER BY, like a query involving a clustering key, requires exact values for all partition key columns (using the = or IN operator). This is because Cassandra only supports sorting data on a single node.

A sort involving multiple clustering keys must follow the clustering key order without skipping any key; that is, ORDER BY ckey1, ckey2 is allowed, but ORDER BY ckey2 is not.
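One way to see why ORDER BY ckey2 is rejected is to check which orderings a partition already satisfies as stored. The sketch below assumes a partition held in (ckey1, ckey2) order; the rows and the `is_sorted` helper are illustrative only.

```python
# One partition of table2 in its stored order: (ckey1, ckey2, content).
partition = [(1, 9, "a"), (2, 4, "b"), (2, 6, "c"), (3, 1, "d")]

def is_sorted(rows, key):
    """Return True if rows are already in ascending order of key(row)."""
    keys = [key(r) for r in rows]
    return keys == sorted(keys)

# ORDER BY ckey1, ckey2 matches the stored order, so no sorting work is needed.
by_ckey1_ckey2 = is_sorted(partition, key=lambda r: (r[0], r[1]))

# ORDER BY ckey2 does not: the partition would have to be re-sorted in full.
by_ckey2_only = is_sorted(partition, key=lambda r: r[1])
```

Here `by_ckey1_ckey2` is True and `by_ckey2_only` is False: the stored order can be streamed back as is, while an ordering on ckey2 alone would require sorting the whole partition at query time.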

By default, clustering keys are stored in ascending order, so ORDER BY ckey1 ASC and ORDER BY ckey1 ASC, ckey2 ASC return rows in their stored order. The only other order a query can request is the exact reverse, with DESC on every listed clustering key.

This is because Cassandra can only return query results in the order inherent to the SortedMap on each node, or in its reverse. We can, however, customize the sort order when creating the table:

CREATE TABLE table2 (
    pkey1 int,
    pkey2 int,
    ckey1 int,
    ckey2 int,
    content text,
    PRIMARY KEY ((pkey1, pkey2), ckey1, ckey2)
) WITH CLUSTERING ORDER BY (ckey1 DESC, ckey2 ASC);

Aggregation

Cassandra allows grouping (GROUP BY) on primary key columns, following the order in which they are defined:

SELECT count(*) FROM table2 GROUP BY pkey1, pkey2; -- right
SELECT count(*) FROM table2 GROUP BY pkey1, pkey2, ckey1, ckey2; -- right
SELECT count(*) FROM table2 GROUP BY pkey1; -- error: only part of the partition key
SELECT count(*) FROM table2 WHERE pkey1=1 GROUP BY pkey2; -- error

Without a WHERE clause, this kind of query aggregates data across multiple nodes.

For queries with a WHERE condition, Cassandra only supports aggregating data on a single node, meaning that exact values for the partition key must be provided:

SELECT count(*) FROM table2 WHERE pkey1=1 AND pkey2=1 GROUP BY ckey1; -- right
SELECT count(*) FROM table2 WHERE pkey1=1 AND pkey2=1 GROUP BY ckey1, ckey2; -- right
SELECT count(*) FROM table2 WHERE pkey1=1 AND pkey2=1 GROUP BY ckey2; -- error
SELECT count(*) FROM table2 WHERE pkey1=1 GROUP BY pkey2; -- error

This is consistent with the characteristics of the Cassandra query.
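A plausible way to picture the GROUP BY restriction is as single-pass grouping over rows that are already sorted. The sketch below uses itertools.groupby, which only merges adjacent equal keys; the sample partition and both helper functions are assumptions for illustration, not Cassandra's implementation.

```python
from itertools import groupby

# Rows of one partition (pkey1=1, pkey2=1) in stored order: (ckey1, ckey2).
partition = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 1)]

def count_group_by_ckey1(rows):
    """Single pass: equal ckey1 values are adjacent, so each run is one group."""
    return [(ckey1, sum(1 for _ in run))
            for ckey1, run in groupby(rows, key=lambda r: r[0])]

def runs_when_grouping_by_ckey2(rows):
    """Same single pass keyed on ckey2: equal values are not adjacent."""
    return [ckey2 for ckey2, _ in groupby(rows, key=lambda r: r[1])]

counts = count_group_by_ckey1(partition)       # [(1, 2), (2, 3), (3, 1)]
runs = runs_when_grouping_by_ckey2(partition)  # [1, 2, 1, 2, 3, 1]
```

Grouping by ckey1 yields one run per value, so counts can be emitted as rows stream by. Grouping by ckey2 alone breaks the value 1 into three separate runs, so a correct result would require buffering or re-sorting the whole partition.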

Cassandra supports aggregate functions such as count, sum, avg, min, and max, as well as SELECT DISTINCT on partition key columns; it does not support the HAVING clause.

ALLOW FILTERING

Above we mentioned some queries that are not supported by Cassandra:

SELECT * FROM table2 WHERE ckey1=1;

You can see the Cassandra error message:

Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING

With Cassandra's multi-layer nested key-value structure, the matching rows cannot be located through a key; the only way to complete this query is to search all the data.

Suppose a table holds 1 million rows and the rows with ckey1=1 account for only 5% of them; Cassandra may still complete the query in an acceptable time. However, Cassandra cannot find those rows without searching all the data, and it cannot predict how expensive that will be, so it requires the operator to add ALLOW FILTERING to explicitly permit the full scan:

SELECT * FROM table2 WHERE ckey1=1 ALLOW FILTERING;

There is a very detailed official description of ALLOW FILTERING; see "ALLOW FILTERING explained".

The author reminds you that ALLOW FILTERING can consume a lot of time and resources; use this feature sparingly in a production environment.
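The difference in cost between a keyed lookup and a filtering scan can be sketched with a toy model. Everything below is hypothetical: `table2` is a plain dictionary standing in for the nested map, and the two functions only count how many rows each approach has to touch.

```python
# Hypothetical in-memory model of table2:
# (pkey1, pkey2) -> {(ckey1, ckey2): content}
table2 = {
    (1, 1): {(1, 1): "a", (2, 1): "b"},
    (1, 2): {(1, 5): "c"},
    (2, 1): {(3, 1): "d", (1, 9): "e"},
}

def by_partition_key(pkey1, pkey2):
    """Keyed lookup: touches only the rows of one partition."""
    return list(table2.get((pkey1, pkey2), {}).items())

def filter_ckey1(ckey1):
    """ALLOW FILTERING style scan: visits every row of every partition."""
    examined, matches = 0, []
    for partition_key, rows in table2.items():
        for clustering_key, content in rows.items():
            examined += 1
            if clustering_key[0] == ckey1:
                matches.append((partition_key, clustering_key, content))
    return examined, matches

examined, matches = filter_ckey1(1)  # examines all 5 rows to return 3
```

In this model the scan examines all 5 rows to return 3, while `by_partition_key(1, 1)` touches only the 2 rows of its partition. On a real cluster the scanned rows are also spread across nodes, which is why the cost grows with the size of the table rather than with the size of the result.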

Secondary index

In addition to querying by primary key, we can create secondary indexes on clustering key columns and on regular (non-key) columns.

A secondary index is an additional key-value mapping that can be used to find data directly from an indexed column.
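As a sketch of that additional mapping, the index can be modeled as a second dictionary from the indexed value to the primary keys that hold it. The sample rows, the `idx_content` name used as a Python variable, and the `select_by_content` helper are all assumptions for illustration; a real index is maintained by each node for its own data.

```python
from collections import defaultdict

# Hypothetical rows of table2: (pkey1, pkey2, ckey1, ckey2) -> content.
rows = {
    (1, 1, 1, 1): "a",
    (1, 1, 2, 1): "b",
    (2, 1, 1, 9): "a",
}

# The "index": a second map from the indexed value to the primary keys holding it.
idx_content = defaultdict(list)
for primary_key, content in rows.items():
    idx_content[content].append(primary_key)

def select_by_content(value):
    """Exact-match lookup through the index, then fetch each row by primary key."""
    return [(pk, rows[pk]) for pk in idx_content.get(value, [])]

found = select_by_content("a")
```

Because the index is keyed on the exact value, an equality lookup is a direct hit, while a range condition on the indexed column has no such shortcut in this model.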

To create an index:

CREATE INDEX idx_ckey1 ON table2(ckey1);
CREATE INDEX idx_content ON table2(content);

To query using an index:

SELECT * FROM table2 WHERE ckey1=1; -- right
SELECT * FROM table2 WHERE content='a'; -- right
SELECT * FROM table2 WHERE ckey1>0; -- error
SELECT * FROM table2 WHERE pkey1=1 AND ckey1=1; -- error

As the examples show, an index supports exact-match queries on a single indexed column; range queries on that column, and queries that combine it with only part of the primary key or with another index, are rejected.

Before using an index, please read the official documentation on when indexes are appropriate.
