Cassandra distributed database in detail, Part 2: Data structures and reading and writing


Cassandra Data storage structure

The data in Cassandra is divided into three main types:

    1. Commitlog: records the data and operations submitted by the client. It is persisted to disk so that data which has not yet been persisted elsewhere can be recovered from it.
    2. Memtable: data written by the user, kept in an in-memory form whose object structure is described in detail later. There is actually another form, BinaryMemtable, which Cassandra currently does not use, so it is not introduced here.
    3. SSTable: data persisted to disk, divided into three files: Data, Index, and Filter.
Commitlog data format

The Commitlog has only one data format: bytes laid out in a fixed pattern are written into an I/O buffer that is flushed to disk periodically for persistence. The configuration-file section of the previous article already mentioned the two persistence modes of the Commitlog, Periodic and Batch. Their data format is the same; the difference is that the former flushes asynchronously and the latter synchronously, so the frequency with which data reaches disk differs. The related class structure diagram for the Commitlog is as follows:

Figure 1. Related class structure diagram of Commitlog

Its persistence strategy is also very simple: the RowMutation object submitted by the user is serialized into a byte array, then both the object and the byte array are handed to a LogRecordAdder object, which calls CommitLogSegment's write method to complete the write. The code of this write method is as follows:

Listing 1. CommitLogSegment.write
public CommitLogSegment.CommitLogContext write(RowMutation rowMutation, Object serializedRow)
{
    long currentPosition = -1L;
    ...
    Checksum checkum = new CRC32();
    if (serializedRow instanceof DataOutputBuffer)
    {
        DataOutputBuffer buffer = (DataOutputBuffer) serializedRow;
        logWriter.writeLong(buffer.getLength());
        logWriter.write(buffer.getData(), 0, buffer.getLength());
        checkum.update(buffer.getData(), 0, buffer.getLength());
    }
    else
    {
        assert serializedRow instanceof byte[];
        byte[] bytes = (byte[]) serializedRow;
        logWriter.writeLong(bytes.length);
        logWriter.write(bytes);
        checkum.update(bytes, 0, bytes.length);
    }
    logWriter.writeLong(checkum.getValue());
    ...
}

The main function of this code: if the ID of the current ColumnFamily has not yet been marked as flushed, a CommitLogHeader entry is generated for this ID, recording the position in the current Commitlog file, and the header is serialized again, overwriting the previous header. The header may contain the IDs of several ColumnFamilies in the RowMutation whose data has not yet been flushed to disk. If the ID already exists, the serialized result of the RowMutation object is written directly into the Commitlog file buffer, followed by a CRC32 checksum. The format of the byte array is as follows:

Figure 2. Commitlog file Array structure

The IDs of the different ColumnFamilies are all recorded in the header; the purpose is to make it easy to determine whose data has not yet been flushed to disk.

The purpose of the Commitlog is to recover data that has not yet been written to disk. How is data recovered from what is stored in the Commitlog file? The code is in the recover method:

Listing 2. CommitLog.recover
public static void recover(File[] clogs) throws IOException
{
    ...
    final CommitLogHeader clHeader = CommitLogHeader.readCommitLogHeader(reader);
    int lowPos = CommitLogHeader.getLowestPosition(clHeader);
    if (lowPos == 0)
        break;
    reader.seek(lowPos);
    while (!reader.isEOF())
    {
        try
        {
            bytes = new byte[(int) reader.readLong()];
            reader.readFully(bytes);
            claimedCRC32 = reader.readLong();
        }
        ...
        ByteArrayInputStream bufIn = new ByteArrayInputStream(bytes);
        Checksum checksum = new CRC32();
        checksum.update(bytes, 0, bytes.length);
        if (claimedCRC32 != checksum.getValue())
        {
            continue;
        }
        final RowMutation rm = RowMutation.serializer().deserialize(new DataInputStream(bufIn));
    }
    ...
}

The idea of this code is to deserialize the header of the Commitlog file into a CommitLogHeader object, find in that header the lowest position whose RowMutation has not yet been written back to disk, read the serialized RowMutation data in order starting from that position, deserialize it back into RowMutation objects, and then replay the data in those objects into the Memtable rather than writing it directly to disk. The operation of the Commitlog can be expressed clearly as follows:

Figure 3. Commitlog data format change process

Memtable in-memory structure

The Memtable's in-memory data structure is relatively simple: each ColumnFamily corresponds to a unique Memtable object, so a Memtable mainly maintains a ConcurrentSkipListMap<DecoratedKey, ColumnFamily>. When a new RowMutation object arrives, the Memtable simply checks whether a <DecoratedKey, ColumnFamily> entry already exists in this structure. If not, it is added; if it does, the ColumnFamily corresponding to that key is taken out and its Columns are merged. The Memtable-related class structure diagram is as follows:

Figure 4. Memtable-related class structure diagram
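
To make the merge rule just described concrete, here is a minimal, hypothetical sketch of the Memtable idea. It is not Cassandra's actual Memtable class; the ToyColumn and ToyMemtable types and their merge-by-timestamp rule are simplifications introduced purely for illustration.

import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative stand-in: a "row" here is just a sorted map of column name -> (value, timestamp).
class ToyColumn {
    final byte[] value;
    final long timestamp;
    ToyColumn(byte[] value, long timestamp) { this.value = value; this.timestamp = timestamp; }
}

class ToyMemtable {
    // One sorted, concurrent map keyed by row key, mirroring the
    // ConcurrentSkipListMap<DecoratedKey, ColumnFamily> mentioned above.
    private final ConcurrentNavigableMap<String, ConcurrentSkipListMap<String, ToyColumn>> rows =
            new ConcurrentSkipListMap<>();

    // Apply one mutation: create the row if absent, otherwise merge the column in.
    void apply(String rowKey, String columnName, byte[] value, long timestamp) {
        ConcurrentSkipListMap<String, ToyColumn> row =
                rows.computeIfAbsent(rowKey, k -> new ConcurrentSkipListMap<>());
        // Last-writer-wins per column: keep whichever column has the newer timestamp.
        row.merge(columnName, new ToyColumn(value, timestamp),
                (oldCol, newCol) -> newCol.timestamp >= oldCol.timestamp ? newCol : oldCol);
    }
}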

The data in the Memtable is flushed to the local disk based on the corresponding configuration parameters in the configuration file. These parameters have been described in detail in the previous article.

Many sources mention that Cassandra's write performance is very good. The reason is that Cassandra writes data into the Memtable, and the Memtable is an in-memory data structure, so Cassandra is essentially writing to memory. Figure 5 describes how a key/value pair is written into the Memtable data structure in Cassandra.

Figure 5. Data is written to the Memtable

SSTable data format

Each time a piece of data is added to the Memtable, the program checks whether the Memtable has met the conditions for being written to disk; if so, this Memtable is written to disk. Let's look at the classes involved in this process. The related classes are shown in Figure 6:

Figure 6. Sstable Persistence class structure diagram

After the Memtable meets the flush conditions, it creates an SSTableWriter object and then takes all the <DecoratedKey, ColumnFamily> entries out of the Memtable, writing the serialized form of each ColumnFamily object into a DataOutputBuffer. Next, based on the DecoratedKey and the DataOutputBuffer, SSTableWriter writes the Data, Index, and Filter files.

The Data file format is as follows:

Figure 7. sstable Data File Structure

The Data file simply organizes the byte arrays mentioned above. Next the Index file is written; what data goes into the Index?

In fact, the Index file records every key together with the starting offset of that key's data in the Data file, as Figure 8 shows:

Figure 8. INDEX file Structure

The Index file is thus an index over keys; currently only keys are indexed, not Super Columns or Columns, so looking up a Column is slower than looking up a key.
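
As a rough illustration of this key-to-offset idea (not the actual SSTableWriter code; the class and method names below are invented), the Index can be thought of as a sorted map from row key to the starting offset of that row in the Data file:

import java.util.TreeMap;

// Hypothetical picture of the Index file's role: row key -> offset of the row in the Data file.
class ToyKeyIndex {
    private final TreeMap<String, Long> keyToDataOffset = new TreeMap<>();

    // Called while the Data file is being written, once per row.
    void record(String rowKey, long dataFileOffset) {
        keyToDataOffset.put(rowKey, dataFileOffset);
    }

    // At read time: where to seek in the Data file, or null if the key is not in this SSTable.
    Long lookup(String rowKey) {
        return keyToDataOffset.get(rowKey);
    }
}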

After the Index file has been written, the Filter file is written; its contents are the serialized result of a BloomFilter object. Its file structure is shown in Figure 9:

Figure 9. Filter file Structure

The BloomFilter object essentially corresponds to a hash algorithm that can quickly determine whether a given key is in the current SSTable. The BloomFilter object of each SSTable is kept in memory, and the Filter file is a persisted copy of that BloomFilter. The data formats of the three files can be expressed clearly as follows:

Figure 10. Sstable Data Format Conversion
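
To show why the Filter makes negative lookups cheap, here is a deliberately tiny Bloom-filter sketch with two hash functions. It is only an illustration of the principle, not the BloomFilter class that Cassandra actually serializes into the Filter file.

import java.util.BitSet;

// Toy Bloom filter: may report false positives, never false negatives.
class ToyBloomFilter {
    private final BitSet bits;
    private final int size;

    ToyBloomFilter(int size) { this.size = size; this.bits = new BitSet(size); }

    private int h1(String key) { return Math.floorMod(key.hashCode(), size); }
    private int h2(String key) { return Math.floorMod(key.hashCode() * 31 + 17, size); }

    void add(String key) { bits.set(h1(key)); bits.set(h2(key)); }

    // false => the key is definitely not in this SSTable, so its Index and Data files can be skipped.
    // true  => the key might be present; the Index still has to be consulted.
    boolean mightContain(String key) { return bits.get(h1(key)) && bits.get(h2(key)); }
}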


After the three files have been written, one more thing must be done: the Commitlog header mentioned earlier is updated to record that the current ColumnFamily's data up to this position has been written to disk, so that replay no longer needs to start from it.

While a Memtable is being written to disk, it is placed in the memtablesPendingFlush container so that the data it holds can still be read correctly during reads; this is mentioned again later when data reading is introduced.


Writing of data

There are two steps to writing data to Cassandra:

    1. Find the nodes where this data should be saved.
    2. Write the data to those nodes. When writing a piece of data, the client must specify the Keyspace, ColumnFamily, Key, Column Name, and Value, as well as a Timestamp and the consistency level of the write (a client-side sketch follows this list).
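
For reference, a client-side write looks roughly like the sketch below. It assumes the 0.6-era Thrift interface (Cassandra.Client, ColumnPath, ConsistencyLevel) and the sample Keyspace1/Standard1 schema; the exact method signatures differ between Cassandra versions, so treat this as an assumption rather than the definitive API.

// Assumes 0.6-era Thrift-generated classes; signatures vary by Cassandra version.
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnPath;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class InsertExample {
    public static void main(String[] args) throws Exception {
        TTransport transport = new TSocket("localhost", 9160);   // default Thrift port
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();

        // Keyspace, Key, ColumnFamily + Column Name, Value, Timestamp, ConsistencyLevel
        ColumnPath path = new ColumnPath("Standard1");            // hypothetical column family
        path.setColumn("name".getBytes("UTF-8"));
        client.insert("Keyspace1", "key1", path, "value".getBytes("UTF-8"),
                      System.currentTimeMillis(), ConsistencyLevel.ONE);

        transport.close();
    }
}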

The main related classes involved in data writing are as follows:

Figure 11. Insert Related class diagram

The overall write logic is as follows:

When CassandraServer receives data to be written, it first creates a RowMutation object and a QueryPath object that holds the ColumnFamily, Column Name, or Super Column Name. All the data submitted by the user is then saved in the Map<String, ColumnFamily> structure of the RowMutation object. Next, the nodes in the cluster that should hold the data are computed from the submitted key: the key is converted to a token, and a binary search over the token ring of the entire cluster finds the node closest to that token. If the user has specified that the data should be stored with multiple replicas, nodes equal in number to the replica count are returned in order around the token ring. This yields a basic list of nodes; Cassandra then checks whether these nodes are working properly and, if not, looks for replacement nodes. It also checks whether any node is bootstrapping, in which case that node must be considered as well, until a final list of target nodes is formed. Finally, the data is sent to these nodes.
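
The node-selection rule just described can be sketched as follows. This is a simplified illustration of walking a sorted token ring, not Cassandra's actual partitioner or replication-strategy code; in particular the token is reduced to a plain long for readability.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Simplified ring: token -> node address. Real tokens come from the partitioner
// (for example a hash of the key); a long is used here purely for illustration.
class ToyTokenRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long token, String node) { ring.put(token, node); }

    // Return replicationFactor nodes, starting at the first token >= the key's token
    // and continuing clockwise around the ring.
    List<String> getReplicas(long keyToken, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        Long token = ring.ceilingKey(keyToken);
        if (token == null) token = ring.firstKey();            // wrap around the ring
        while (replicas.size() < Math.min(replicationFactor, ring.size())) {
            replicas.add(ring.get(token));
            token = ring.higherKey(token);
            if (token == null) token = ring.firstKey();        // keep walking around the ring
        }
        return replicas;
    }
}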

The next step is to save the data to the Memtable and the Commitlog. Depending on the consistency level the user specified, results may be returned asynchronously or synchronously; if a node fails to respond, the data is sent again. Figure 12 is the sequence diagram for writing a piece of data into the Memtable after Cassandra receives it.

Figure 12. Timing diagram for Insert operation



Reading of data

Cassandra's write performance is better than its read performance. Why is writing so much faster than reading? The reason is that Cassandra's design principle is to make writing as fast and convenient as possible, at the cost of read performance. This is evident just from looking at how Cassandra stores data: data is first written to the Memtable, and later the data in the Memtable is flushed to disk; the flush saves data sequentially without checking uniqueness, and data is only written, never deleted in place (the deletion rule is described later); finally, the multiple sequentially structured SSTable files are merged. Every one of these steps makes writing faster.

Now think about how this design affects reading. First, the data structures differ: the structures in the Memtable and in the SSTables are certainly different, yet what is returned to the user must be uniform, so a conversion is unavoidable. Second, the data lives in multiple files: the data being looked up may be in the Memtable or in any of the SSTables, and if there are ten SSTables, each of the ten may need to be searched, even though the BloomFilter algorithm can quickly determine which SSTables contain the specified key. The data may also be in a Memtable that is in the middle of being converted to an SSTable, which has to be checked as well; in short, wherever the data may exist, it must be looked for there. Finally, the data found may already have been deleted, yet there is no way to avoid retrieving it.

The following is a related class diagram of reading data:

Figure 13. Reading related class diagrams

Based on the class diagram above, the read logic is: CassandraServer creates a ReadCommand object, which holds all the conditions the user specified for the records to fetch. This is then handed to a WeakReadLocalCallable thread, which searches for the data in the ColumnFamilyStore object, including the Memtable and the SSTables. The data found is assembled into a Row and returned, and the query is complete. This query logic can be represented by the following sequence diagram:

Figure 14. Query Data Timing Diagram


One thing to note is that the ColumnFamily for a key must be looked up in at least three places. The first is the Memtable; the second is memtablesPendingFlush, the temporary container holding Memtables that are being converted to SSTables; the third is the SSTables. The SSTable lookup is the most complex: the key is first compared against the Filter of each SSTable. This Filter holds hash values for all the keys contained in that SSTable file, and the hash algorithm can quickly determine whether the specified key is in the SSTable. The Filter values are kept in memory, so it can be determined quickly which SSTables contain the key being queried. Next, the key's position is looked up in the corresponding SSTable's Index; as we know from the Index file storage structure described earlier, the Index stores the key's specific offset in the Data file. With this offset, the corresponding bytes can be read directly from the Data file and deserialized to obtain the target ColumnFamily. Because of the way Cassandra stores data, the values corresponding to the same key may exist in more than one SSTable, so the lookup is not finished until all SSTable files have been searched and the results merged with those found in the two kinds of Memtable; only then is the final queried value obtained.
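
Putting the three lookup locations together, the read flow can be sketched roughly as below. Every type here is a simplified stand-in invented for illustration (plain maps and a predicate for the Bloom filter); the real ColumnFamilyStore code works with Memtable, SSTableReader, and ColumnFamily objects and merges them by timestamp.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.function.Predicate;

// Illustrative read flow: memtable -> memtables pending flush -> every SSTable whose filter may match.
class ToyReadPath {
    NavigableMap<String, String> memtable;                    // row key -> row data
    List<NavigableMap<String, String>> memtablesPendingFlush; // memtables being flushed
    List<ToySSTable> sstables;

    static class ToySSTable {
        Predicate<String> filter;        // in-memory Bloom filter: "might this SSTable contain the key?"
        Map<String, Long> index;         // key -> offset in the Data file
        Map<Long, String> dataFile;      // offset -> row data (stands in for a disk seek + read)
    }

    List<String> read(String key) {
        List<String> fragments = new ArrayList<>();           // every version of the row that is found
        if (memtable.containsKey(key)) fragments.add(memtable.get(key));
        for (NavigableMap<String, String> m : memtablesPendingFlush)
            if (m.containsKey(key)) fragments.add(m.get(key));
        for (ToySSTable t : sstables) {
            if (!t.filter.test(key)) continue;                // definitely absent: skip this SSTable
            Long offset = t.index.get(key);                   // filter said "maybe": consult the Index
            if (offset != null) fragments.add(t.dataFile.get(offset));  // read and deserialize
        }
        return fragments;                                     // the final answer merges these fragments
    }
}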

In addition, what was described above is the worst case, in which the query hits no cache at all; Cassandra of course provides multiple levels of caching for query operations. The first level caches query results directly; its configuration entry is RowsCached under the Keyspace, and a query looks in this cache first. The second level of caching corresponds to the SSTable Index file and caches the key index directly; this is configured with KeysCached, also under the Keyspace, and if this cache hits, one I/O against the Index file is saved. The last level of caching is mmap-ing disk files into memory, which improves the efficiency of disk I/O; because of address-space limits, if the Data file is too large this technique can only be used on 64-bit machines.


Deletion of data

From the data write rules described above, you can imagine that deleting data in Cassandra is a troublesome thing. Why? The reasons are as follows:

    1. The data has multiple copies, saved on multiple nodes.
    2. The data exists in multiple structures: it is written to the Commitlog, the Memtable, and the SSTables, and their data structures are all different.
    3. Data timeliness is inconsistent: because this is a cluster, data transmission between nodes inevitably has delay.

Besides these three points there are other difficulties, for example that an SSTable persists data sequentially: if a section in the middle is deleted, how should the remaining data be moved? These problems are very hard, and if the design is unreasonable, performance will be very poor.

This section discusses how Cassandra solves these problems.

CassandraServer's interface for deleting data has only one method, remove; the following is the source of the remove method:

Listing 3. CassandraServer.remove
public void remove(String table, String key, ColumnPath column_path,
                   long timestamp, ConsistencyLevel consistency_level)
{
    checkLoginDone();
    ThriftValidation.validateKey(key);
    ThriftValidation.validateColumnPathOrParent(table, column_path);
    RowMutation rm = new RowMutation(table, key);
    rm.delete(new QueryPath(column_path), timestamp);
    doInsert(consistency_level, rm);
}

A careful comparison with the insert method shows that only one line is different: the insert method calls rm.add, whereas here it is rm.delete. So what does rm.delete do? Here is the source code of the delete method:

Listing 4. RowMutation.delete
public void delete(QueryPath path, long timestamp)
{
    ...
    if (columnFamily == null)
        columnFamily = ColumnFamily.create(table_, cfName);
    if (path.superColumnName == null && path.columnName == null)
    {
        columnFamily.delete(localDeleteTime, timestamp);
    }
    else if (path.columnName == null)
    {
        SuperColumn sc = new SuperColumn(path.superColumnName,
                DatabaseDescriptor.getSubComparator(table_, cfName));
        sc.markForDeleteAt(localDeleteTime, timestamp);
        columnFamily.addColumn(sc);
    }
    else
    {
        ByteBuffer bytes = ByteBuffer.allocate(4);
        bytes.putInt(localDeleteTime);
        columnFamily.addColumn(path, bytes.array(), timestamp, true);
    }
}

The main logic of this code: if a single Column under the specified key is deleted, the Value of that Column is set to the current system time and the Column's isMarkedForDelete attribute is set to true; if all Columns under the key are deleted, the deletion-time attributes of the ColumnFamily itself are set. This new piece of data is then processed exactly as in the insert path.

The idea is now obvious: data already in the ConcurrentSkipListMap collection is updated by writing different data under the same key. This approach works well and achieves the following:

    1. It simplifies the data operation logic, unifying add, modify, and delete into one code path.
    2. It solves the three difficulties mentioned earlier, because data is modified in the same way that it is produced; a bit of a taste of its own medicine.

But two problems remain: this only writes new data marking the old, it does not actually delete anything, and because SSTables are built from the data in the Memtable, the same data is likely to be stored in several different SSTables. How is this solved? Indeed, Cassandra does not immediately delete the data you want deleted; it merely filters out records whose isMarkedForDelete is true before returning query results. This guarantees that deleted data can no longer be found, and you do not need to care when it is really removed. Actually removing the data is a complex process in Cassandra: the real deletion happens during SSTable compaction. The purpose of compaction is to unify all the data under the same key into one SSTable file, which also solves the problem of the same data living in many places. During compaction, Cassandra decides, according to its rules, which data should really be deleted.
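
A minimal sketch of the tombstone idea described above (hypothetical types, not Cassandra's actual Column or compaction code): deletes only write a marker, reads filter the markers out, and compaction is where marked data is finally dropped.

import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Toy column carrying the tombstone flag set by the delete path described above.
class MarkedColumn {
    final byte[] value;
    final long timestamp;
    final boolean isMarkedForDelete;
    MarkedColumn(byte[] value, long timestamp, boolean deleted) {
        this.value = value; this.timestamp = timestamp; this.isMarkedForDelete = deleted;
    }
}

class TombstoneExample {
    // Reads simply hide tombstoned columns from the caller; the data itself is still on disk
    // until compaction rewrites the SSTable without it.
    static Map<String, MarkedColumn> visibleColumns(Map<String, MarkedColumn> row) {
        return row.entrySet().stream()
                .filter(e -> !e.getValue().isMarkedForDelete)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                                          (a, b) -> a, TreeMap::new));
    }
}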


Compaction of SSTables

SSTable compaction is really an extension of how Cassandra writes data. The write and read paths described earlier have some limitations. For example, during writing, Memtables of a certain size are continuously flushed to disk, so the flushing produces many SSTable files of roughly the same size, and this cannot go on indefinitely. During reading, too many SSTable files inevitably hurt efficiency: the more SSTables there are, the slower queries become. There is also the problem that the Columns belonging to one key may be scattered across multiple SSTables. And since Cassandra's deletes are also writes, this invalid data has to be dealt with as well.

In view of these problems, the SSTable files must be merged. The ultimate goal of merging is to bring together all the values corresponding to one key: combining what should be combined, applying what has been modified, and dropping what has been deleted. The data for that key is then written into a contiguous region of a single SSTable.

The timing of SSTable compaction is controlled by Cassandra; the ideal number of SSTable files to compact at once is 4 to 32. When a new SSTable file is added, Cassandra computes the average size of the current SSTable files; when the new SSTable's size falls between 0.5 and 1.5 times that average, Cassandra invokes the compaction routine to compact those SSTables, and as part of the result the key Index is rebuilt. This process can be illustrated as follows:

Figure 15. Data compaction
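
The size-based trigger described above can be illustrated roughly as follows. This is a simplified sketch of the bucketing rule, not the actual CompactionManager code; the 0.5/1.5 factors and the 4-to-32 file count are simply taken from the text above.

import java.util.ArrayList;
import java.util.List;

// Group SSTables whose sizes lie within 0.5x .. 1.5x of a bucket's running average;
// a bucket holding at least 4 files (up to 32 at a time) would be handed to compaction.
class ToyCompactionBuckets {
    static List<List<Long>> bucket(List<Long> sstableSizes) {
        List<List<Long>> buckets = new ArrayList<>();
        List<Double> averages = new ArrayList<>();
        for (long size : sstableSizes) {
            boolean placed = false;
            for (int i = 0; i < buckets.size(); i++) {
                double avg = averages.get(i);
                if (size >= 0.5 * avg && size <= 1.5 * avg) {
                    buckets.get(i).add(size);
                    averages.set(i, (avg * (buckets.get(i).size() - 1) + size) / buckets.get(i).size());
                    placed = true;
                    break;
                }
            }
            if (!placed) {                 // start a new bucket for this size
                List<Long> b = new ArrayList<>();
                b.add(size);
                buckets.add(b);
                averages.add((double) size);
            }
        }
        return buckets;
    }
}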


Summary

This article first described the main storage formats of data in Cassandra, including the formats of data in memory and on disk, then described how Cassandra processes that data, covering addition, deletion, and modification, which are essentially the same operation. Finally, data compaction was introduced.
