Contents
- No query language
- No referential integrity
- Secondary indexes
- Sorting becomes a design decision
- Denormalization
Cassandra's data model and query methods differ from those of an RDBMS in many ways, and it is important to keep these differences in mind.
SQL is the standard query language for relational databases, but Cassandra has no query language. Instead, you access its data through an API built on Thrift, the RPC serialization mechanism it uses.
Cassandra has no concept of referential integrity, and therefore no concept of joins. In a relational database, you can declare a foreign key in one table that references the primary key of a record in another table. Cassandra does not enforce anything of the kind. Storing the IDs of related entities elsewhere is still a common design requirement, and you can do it, but operations such as cascading deletes are not available.
Secondary indexes are a genuinely useful feature. Suppose, for example, that you need to find the unique ID of a hotel with a certain attribute. In a relational database, the query might look like this:
SELECT hotelID FROM Hotel WHERE name = 'Clarion Midtown';
This is the query you issue when you know the name of the hotel but not its unique ID. Given this query, a relational database performs a full table scan, checking the name column of each row for the value you want. If the table is large, the query can be slow. The relational answer is to create an index on that column, which acts as a copy of that part of the data and lets the database retrieve it faster. Because the hotel ID is already a primary key constraint, it is indexed automatically; that is the primary index. The index created on the name column is therefore a secondary index, and Cassandra does not currently support secondary indexes.
To do the same thing in Cassandra, you need to create another column family to store query information. You can create a column family to store the hotel names and map them to the hotel IDs. The second column family actually acts as an explicit secondary index.
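The manual index described above can be sketched in a few lines. This is a minimal in-memory illustration, not Cassandra client code: the two dictionaries stand in for column families, and the names `hotel`, `hotel_by_name`, `insert_hotel`, `find_hotel_ids`, and the sample hotel IDs are all hypothetical.

```python
# Hypothetical in-memory stand-ins for two column families.
hotel = {}          # row key: hotel ID   -> columns describing the hotel
hotel_by_name = {}  # row key: hotel name -> {hotel ID: ""} (the manual index)


def insert_hotel(hotel_id, name, city):
    """Write the hotel row and its index entry together.

    The application, not the database, keeps the two in sync.
    """
    hotel[hotel_id] = {"name": name, "city": city}
    hotel_by_name.setdefault(name, {})[hotel_id] = ""


def find_hotel_ids(name):
    """Look up hotel IDs by name using the index, with no full scan."""
    return sorted(hotel_by_name.get(name, {}))


insert_hotel("AZC_043", "Clarion Midtown", "Phoenix")
insert_hotel("NYN_042", "Waldorf Hotel", "New York")
print(find_hotel_ids("Clarion Midtown"))
```

The cost of this design is that every write to the hotel data must also update the index, and deletions must clean up both places.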
Secondary index support is being added in Cassandra 0.7, which lets you create an index on column values. If you want to find all users who live in a given city, for example, this support removes the need to create and maintain the index column family by hand.
- Sorting becomes a design decision
In an RDBMS, you can easily change the order of returned records by using ORDER BY in a query. The default sort order is not configurable; by default, records are returned in the order in which they were written. To change the order, you only modify the query, and you can sort on any set of columns. In Cassandra, sorting works differently: it is a design decision. The column family definition contains a CompareWith configuration element, which specifies how columns are sorted within a row when data is read. It cannot be changed per query.
In an RDBMS, you are constrained to sorting by the data type stored in each column. Cassandra stores only byte arrays, so sorting by a declared column type does not apply. What you can do is sort as if the column were one of several types (ASCII, Long integer, TimestampUUID, lexicographic order, and so on). If you need to, you can also plug in your own comparator.
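To see why the comparator matters when everything is stored as bytes, consider the following sketch. It is plain Python rather than Cassandra code: the key functions `as_bytes` and `as_long` are hypothetical stand-ins for a raw-bytes comparator and a long-integer comparator, and the column names are assumed to be 8-byte big-endian signed integers.

```python
import struct


def as_bytes(raw: bytes) -> bytes:
    """Compare column names as raw, unsigned byte strings."""
    return raw


def as_long(raw: bytes) -> int:
    """Interpret the same 8 bytes as a big-endian signed 64-bit integer."""
    return struct.unpack(">q", raw)[0]


# The same stored byte arrays, written in arbitrary order.
column_names = [struct.pack(">q", n) for n in (2, -1, 10)]

# Decode back to integers only to make the resulting order readable.
byte_order = [as_long(c) for c in sorted(column_names, key=as_bytes)]
long_order = [as_long(c) for c in sorted(column_names, key=as_long)]

print(byte_order)  # negative values sort last under byte-wise comparison
print(long_order)  # numeric order under the long-integer interpretation
```

The stored bytes are identical in both cases; only the interpretation chosen at design time changes the order in which they come back.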
Cassandra also has no equivalent of SQL's ORDER BY and GROUP BY clauses. There is a query type called SliceRange, introduced in Chapter 4, which resembles ORDER BY in that it lets you reverse the order of the results.
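As a rough illustration of what a slice with reversal does, here is a simplified in-memory sketch. The parameter names `start`, `finish`, and `count` mirror the fields of the Thrift SliceRange structure, and `reversed_` stands in for its `reversed` flag; the function `get_slice` itself is hypothetical and compares column names as ordinary strings. It assumes, as the Thrift API does, that an empty bound means "unbounded" and that in reversed mode `start` is the higher bound.

```python
def get_slice(columns, start="", finish="", reversed_=False, count=100):
    """Return up to `count` (name, value) pairs whose names fall in the range.

    `columns` maps column name to value. The order is fixed by the
    comparator (here: plain string comparison); the caller can only
    choose the bounds, the direction, and the limit.
    """
    names = sorted(columns, reverse=reversed_)

    def in_range(name):
        if reversed_:
            # Walking downwards: start is the upper bound, finish the lower.
            return (start == "" or name <= start) and (finish == "" or name >= finish)
        return (start == "" or name >= start) and (finish == "" or name <= finish)

    return [(n, columns[n]) for n in names if in_range(n)][:count]


row = {"a": 1, "b": 2, "c": 3, "d": 4}
print(get_slice(row, start="b", finish="d"))
print(get_slice(row, start="d", finish="b", reversed_=True, count=2))
```

Unlike ORDER BY, nothing here lets the caller pick a different sort column at query time.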
In relational database design, we are often taught the importance of normalization. This is not an advantage when working with Cassandra, which performs best when the data model is denormalized. In practice, many companies end up denormalizing their relational data as well, for two main reasons. The first is performance. When they cannot get the performance they need from queries that join many tables over years of accumulated data, they denormalize around the queries they know they will run. This ends up working, but it goes against the way relational databases are meant to be designed. It ultimately raises the question of whether a relational database is the best approach under those circumstances.
The second reason relational databases get denormalized is that a business document's structure sometimes has to be preserved. That is, you have an enclosing table that references many external tables whose data may change over time, but you need to keep the enclosing document as a historical snapshot. The common example is an invoice. You already have customer and product tables, and you might think the invoice could simply reference them. In practice you should never do this: customer or price information may change, and you would then lose the integrity of the invoice as it stood on the invoice date, because later changes to those tables would appear to have applied at invoicing time. That can invalidate audits and reports, break the law, and cause other problems.
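The snapshot requirement can be sketched as follows, using an invoice line as the billing document. This is a minimal illustration with hypothetical names (`Product`, `InvoiceLine`, `make_invoice_line`): the line copies the product name and price at creation time instead of holding a reference to the product record.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: float


@dataclass(frozen=True)
class InvoiceLine:
    # Denormalized copies of the product data as it was at invoicing time.
    product_name: str
    unit_price: float
    quantity: int


def make_invoice_line(product, quantity):
    """Copy the current product data into the invoice line."""
    return InvoiceLine(product.name, product.price, quantity)


widget = Product("widget", 10.0)
line = make_invoice_line(widget, 3)

# A later price change must not rewrite history.
widget.price = 12.5
print(line.unit_price * line.quantity)
```

Had the line stored only a reference to `widget`, the total would silently change along with the price.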
In the relational world, denormalization violates Codd's normal forms, and we try hard to avoid it. In Cassandra, denormalization is perfectly normal. It is not required when your data model is simple, but there is no need to be afraid of it.
The key point is that the approach of modeling the data first and writing the queries afterwards no longer applies. In Cassandra, you define the queries first and organize the data around them. Think about the most common query paths your application will use, and then create the column families you need to support them.
Critics consider this a serious problem. But it is perfectly reasonable to think about how an application will query its data when you design the database; this is usually done with relational databases too. If the queries are not anticipated correctly, you will run into problems in Cassandra and in a relational database alike. Query patterns may of course change over time, and then you have to rework the data. But this is no different from having defined the wrong tables, or needing additional tables, in a relational database.
An article is also available on how Cloudkick uses Cassandra to store performance monitoring metric data.