Thoughts on the Cassandra Data Model

Source: Internet
Author: User

Cloud storage solutions built on NoSQL technology are maturing, but SQL-database thinking still dominates. As a result, people often try to solve NoSQL data modeling problems with SQL habits. Based on the author's experience developing and deploying Cassandra projects, this article offers some brief guidance on NoSQL modeling. It does not cover the specific syntax of data modeling; for that, please refer to the Apache website. Cassandra has been promoted to a top-level project of the Apache organization and is currently being developed and upgraded very quickly. Apache has released version 1.0 beta. Different versions of Cassandra have different characteristics, so different data models can be used depending on the version you develop against.




The Cassandra model can be represented as a five-dimensional array whose dimensions are the keyspace, the column family (or super columns), the key, the columns, and the value. When retrieving data with the get syntax, the key is the coarsest dimension you can fetch by, and the column is the finest.
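
The five dimensions can be pictured with nested dictionaries. The following is only a minimal in-memory sketch, not Cassandra's API: the keyspace name `School`, the column family `Students`, the sample rows, and the helper `get` are all assumptions made for illustration.

```python
# Hypothetical in-memory stand-in for the five dimensions:
# keyspace -> column family -> row key -> column name -> value.
store = {
    "School": {                      # keyspace (illustrative name)
        "Students": {                # column family (illustrative name)
            "s001": {"name": "Li", "grade": "3", "class": "302"},
            "s002": {"name": "Wang", "grade": "3", "class": "301"},
        }
    }
}

def get(keyspace, column_family, key, column=None):
    """Mimic a get: return a whole row by key, or one column of that row."""
    row = store[keyspace][column_family][key]
    return row if column is None else row[column]

whole_row = get("School", "Students", "s001")           # coarsest fetch: by key
one_value = get("School", "Students", "s001", "class")  # finest fetch: one column
```

Fetching by key returns every column of the row, while naming a column narrows the result to a single value.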





Retrieval can be understood as follows: because the key index is the default index, only exact-match lookups are supported. If we want the fastest way to find the students of a class, the simplest method is to use the class code as the key. If we want the fastest way to find the students of a grade, we use the grade code as the key instead.
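
As a rough illustration of modeling the class code as the key, the sketch below keeps one row per class and one column per student. The dictionary `students_by_class`, the student ids, and the helper `students_in_class` are hypothetical names, and a plain dictionary stands in for the key index.

```python
# Illustrative layout: the row key is the class code, and each column
# holds one student (column name = student id, value = student name).
students_by_class = {
    "301": {"s002": "Wang"},
    "302": {"s001": "Li", "s003": "Zhao"},
}

def students_in_class(class_code):
    """Exact-match lookup on the row key; other rows are never scanned."""
    return students_by_class.get(class_code, {})

found = students_in_class("302")
missing = students_in_class("303")   # unknown key: nothing to return
```

The lookup works only when the exact class code is known, which is the limitation the surrounding text describes.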





Such a solution poses a problem. In applications built on relational databases, we are accustomed to using conditional statements to pick the required data out of the entire data set, for example where grade = "3" or where class = "302". With SQL, one simple model can serve the whole application. Cassandra does not seem to offer the same.





That is indeed the case. In Cassandra 0.6 it is difficult to solve this problem as simply as in SQL, unless you write application code. But if that is the only way we can think about it, we are still reasoning with the experience and methods of SQL. The experience gained from relational database applications, such as the three classic normal forms, primary keys, foreign keys, dictionary tables, and schemas, no longer provides guidance here. So we need to change that mindset.





Cassandra is not a relational database, and its column family (CF) design is driven by the queries it must serve. This view has two layers: 1. data storage is considered from the application's perspective; 2. the key to the data model lies at the second level, the column family. For example, to query the students of a grade, organize the data in a CF built around the grade; to query the students of a class, organize the data in a CF built around the class.
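
The query-driven approach can be sketched as one column family per query, with every write going to both. This is an in-memory illustration only; `students_by_grade`, `students_by_class`, and `add_student` are hypothetical names rather than anything defined by Cassandra.

```python
from collections import defaultdict

# Two hypothetical column families, one per query the application needs.
students_by_grade = defaultdict(dict)   # row key: grade code
students_by_class = defaultdict(dict)   # row key: class code

def add_student(student_id, name, grade, class_code):
    """Write the same student into both column families (denormalized)."""
    students_by_grade[grade][student_id] = name
    students_by_class[class_code][student_id] = name

add_student("s001", "Li", "3", "302")
add_student("s002", "Wang", "3", "301")
add_student("s003", "Zhao", "2", "201")

grade_3 = students_by_grade["3"]      # serves "all students in grade 3"
class_302 = students_by_class["302"]  # serves "all students in class 302"
```

The cost of this design is duplicated data and an extra write per student; the benefit is that each query becomes a single key lookup.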





This seems incredible, and it is for people with extensive SQL experience. But why choose NoSQL technology over a SQL database during high-level system design in the first place? The simplest reason is that the SQL database cannot cope. When a SQL database has to maintain complex constraint relationships over large volumes of data, the performance overhead affects the business.





The above is only a simple example and does not deserve too much attention. Before using a NoSQL database, we should re-examine our deep-rooted habit of thinking in terms of SQL databases and OLTP applications. The NoSQL field has the CAP theorem: a distributed system cannot simultaneously satisfy the three requirements of consistency, availability, and partition tolerance; it can satisfy at most two of them.





This theorem has faced various doubts and challenges, but in the absence of a better theoretical foundation there is no harm in using it. Under this theorem Cassandra is classified as an AP system: it provides availability and partition tolerance. Availability: every operation always returns within a bounded time, that is, the system is available at all times. Partition tolerance: when the network is partitioned (for example, by a disconnected link), the separated parts of the system continue to function properly. These ideas are also difficult to grasp from a SQL background.





Different applications have different emphases. In traditional OLTP applications we must focus on the ACID properties. In a non-transactional distributed system our focus is on the availability of the system: for example, when some nodes in the system fail, the business must still remain available. The technical term for this is "BASE" (basically available, soft state, eventual consistency).





With the continuous improvement and upgrading of Cassandra, we have more freedom in designing the data model. Compared with previous versions, the most important update in Cassandra 0.7, apart from bug fixes and performance improvements, is the secondary index. The key index is the primary index, and a column index is a secondary index. The same problem can be solved more conveniently with a secondary index.





In Cassandra 0.7, an index on column values is called a "secondary index", to distinguish it from the index on keys in a column family. A secondary index lets us query by value, and it is built automatically in the background without blocking reads or writes. A secondary index can also be used in range queries. In this way we can improve on the original scheme with a solution that is more intuitive, more maintainable, and easier to program against.





When using a secondary index, note that it works as a nested-loop filter over a so-called primary data set, the one selected by the main equality condition. This means a query cannot consist of a single range condition on an index, such as class > 302; otherwise you will only get the error: No indexed columns present in index clause with operator EQ.





You need to add a primary equality condition on an indexed column when using secondary indexes. If you add an index on grade, you can use the condition where grade = "3" and class > 302. This may not seem easy to understand at first. Seen from another angle, the secondary index actually fetches all the rows that satisfy the equality and then loops through them to apply the remaining conditions. What used to be done in code now only requires an index.
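
The two-step behavior described here (index lookup on the equality, then a loop that filters the candidates) can be imitated in a few lines. This is a simplified simulation under stated assumptions, not Cassandra's implementation: the sample rows, `grade_index`, `OPS`, `indexed_query`, and the wording of the raised error are all invented for illustration.

```python
import operator

rows = {
    "s001": {"grade": "3", "class": 302},
    "s002": {"grade": "3", "class": 305},
    "s003": {"grade": "2", "class": 201},
}

# Hypothetical secondary index on "grade": column value -> set of row keys.
grade_index = {}
for key, cols in rows.items():
    grade_index.setdefault(cols["grade"], set()).add(key)

OPS = {"EQ": operator.eq, "GT": operator.gt}

def indexed_query(clauses):
    """clauses: list of (column, op, value); one EQ on the indexed column is required."""
    eq = [c for c in clauses if c[0] == "grade" and c[1] == "EQ"]
    if not eq:
        raise ValueError("an EQ clause on an indexed column is required")
    candidates = grade_index.get(eq[0][2], set())   # step 1: index lookup
    result = []
    for key in sorted(candidates):                  # step 2: loop and filter
        if all(OPS[op](rows[key][col], val) for col, op, val in clauses):
            result.append(key)
    return result

matches = indexed_query([("grade", "EQ", "3"), ("class", "GT", 302)])
```

A query holding only the range clause, such as `indexed_query([("class", "GT", 302)])`, raises the error in this sketch, because there is no equality to select the candidate set from.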





An understanding of the Cassandra model gives us more elegant solutions to application development problems. Typically, Cassandra can only provide exact-value lookups on the key. What if you need a range query on the key? There are two methods: 1. modify the model by pushing the key dimension down to the column dimension, then build a secondary index and query through it; 2. change the data distribution strategy: with an order-preserving strategy such as OrderPreservingPartitioner, you can run key range queries.
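
Why an order-preserving distribution makes key ranges possible can be shown with a sorted list. The sketch below is an assumption-laden illustration: the class-code keys and the helper `key_range` are invented, and an MD5 digest merely stands in for a hash-based distribution strategy.

```python
import bisect
import hashlib

keys = ["301", "302", "303", "401", "402"]

# Order-preserving placement: keys are kept in their natural sort order,
# so a key range is one contiguous slice.
ordered = sorted(keys)

def key_range(start, end):
    """Return the keys with start <= key <= end, found by binary search."""
    lo = bisect.bisect_left(ordered, start)
    hi = bisect.bisect_right(ordered, end)
    return ordered[lo:hi]

# Hash-based placement: keys are ordered by a hash token, so neighbouring
# keys end up scattered and a key range is no longer contiguous.
by_token = sorted(keys, key=lambda k: hashlib.md5(k.encode()).hexdigest())

grade_3_classes = key_range("300", "399")
```

With ordered keys the range is a cheap slice; under hash ordering the same range would require examining every key.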





Data modeling is influenced not only by the development tools but also by the type of application. OLAP applications use redundant data models more than OLTP applications do. One reason is that the same object is viewed at different granularities; another is that storage space is traded for query time. These factors should therefore be considered when modeling for Cassandra.





Author: Guangzhou Xu Zhenyu Electronic Development Co., Ltd.

(Responsible editor: Lu Guang)
