Two kinds of big data storage solutions are currently available: row storage and column storage. The two compete fiercely in the industry, and the contest centers on which can process massive data more efficiently while ensuring security, reliability, and integrity. Judging by current trends, traditional relational databases have largely dropped out of this contest because they cannot meet the storage capacity and computing requirements involved. Among well-known big data processing software, Hadoop's HBase uses column storage, MongoDB is a document-type row store, and lexst is a binary row store. I will not discuss the technology or the pros and cons of these products here. Instead, I will analyze the storage characteristics of row store and column store based on the physical characteristics of mechanical disks, along with the problems that result and their solutions.
I. Structure layout
Row-store data arrangement
Column-store data arrangement
The gray-background section of each table shows the row and column structure, and the white-background section shows the physical distribution of the data. Both kinds of data are stored from top to bottom and from left to right. On the hard disk, each horizontal line is one storage unit: in a row store it holds one complete record, and in a column store it holds the same field from multiple records. Row storage takes a row of records as its basic unit, while column storage takes a column dataset (also called a column family) as its unit. In a row store, reads and writes follow the same process, running from the first column through to the last. A column store reads one or all of the values in a column dataset. On writes, a row is split into its columns, and each column value is appended to the end of the corresponding column.
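The two arrangements can be sketched in a few lines of Python. This is only an illustration: the three-field schema (`id`, `age`, `score`), the sample records, and the function names are assumptions made for the example and are not taken from any of the products mentioned above.

```python
import struct

# Hypothetical schema for illustration: (id: int32, age: int32, score: float64)
RECORDS = [(1, 30, 88.5), (2, 41, 92.0), (3, 25, 79.25)]
ROW_FORMAT = "<iid"  # one complete record: 4 + 4 + 8 = 16 bytes, no padding


def row_layout(records):
    """Row store: the fields of each record are laid out contiguously."""
    return b"".join(struct.pack(ROW_FORMAT, *rec) for rec in records)


def column_layout(records):
    """Column store: the values of each field are laid out contiguously."""
    ids, ages, scores = zip(*records)
    n = len(records)
    return {
        "id": struct.pack(f"<{n}i", *ids),
        "age": struct.pack(f"<{n}i", *ages),
        "score": struct.pack(f"<{n}d", *scores),
    }


rows = row_layout(RECORDS)
cols = column_layout(RECORDS)

# The second record sits at byte offset 16 in the row layout.
second_record = struct.unpack_from(ROW_FORMAT, rows, 16)
# All ages sit side by side in the column layout.
all_ages = struct.unpack("<3i", cols["age"])
```

Both layouts hold the same 48 bytes of data; only the order on disk differs.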
II. Comparison
As the tables above show, a row store completes a write in a single operation. If the write goes through the operating system's file system, it either succeeds or fails as a whole, so data integrity can be determined. A column store splits a row into individual columns before saving, so it performs noticeably more writes than a row store. Add the time the disk head needs to move and position itself for each of those writes, and the actual cost is higher still. Row store therefore has a large advantage for writes.
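The difference in write counts can be made concrete with a small sketch. `CountingFile` is a hypothetical in-memory stand-in for a disk file, and the schema is assumed for illustration; the sketch counts write calls only and does not model head movement.

```python
import io
import struct


class CountingFile(io.BytesIO):
    """In-memory stand-in for a disk file that counts write calls."""

    def __init__(self):
        super().__init__()
        self.write_calls = 0

    def write(self, data):
        self.write_calls += 1
        return super().write(data)


# Hypothetical schema for illustration: (id: int32, age: int32, score: float64)
RECORDS = [(1, 30, 88.5), (2, 41, 92.0), (3, 25, 79.25)]
COLUMN_FORMATS = ("<i", "<i", "<d")

# Row store: one write per record.
row_file = CountingFile()
for record in RECORDS:
    row_file.write(struct.pack("<iid", *record))

# Column store: one write per column per record, each to a different file.
column_files = [CountingFile() for _ in COLUMN_FORMATS]
for record in RECORDS:
    for f, fmt, value in zip(column_files, COLUMN_FORMATS, record):
        f.write(struct.pack(fmt, value))

row_writes = row_file.write_calls
column_writes = sum(f.write_calls for f in column_files)
```

With three records of three columns each, the row store issues 3 writes and the column store issues 9, for the same number of bytes.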
Data modification is in fact also a write process. The difference is that a modification first marks the old record on disk as deleted. A row store then writes the new data at one specified position, whereas a column store writes to several column locations on disk separately, so the number of writes is still a multiple of the row store's, in proportion to the number of columns. Data modification is therefore also dominated by row store.
When reading, a row store generally reads a row in full. If only a few columns are needed, the rest are redundant, and they are usually removed in memory to keep processing time short. A column store reads part or all of one column dataset at a time; to read several columns, it has to move the head and locate the next column before it can continue.
The two also differ in how data is distributed. Every value in a column is of the same type, so there is no ambiguity. For example, if a column's data type is integer (INT), its dataset must consist entirely of integers, which makes parsing very easy. Row storage is much more complex: a record holds several data types, so parsing requires frequent conversion between them, which consumes a lot of CPU and adds parsing time. Column store's parsing process is therefore better suited to analyzing big data.
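As a rough illustration of the read and parsing differences, the sketch below extracts a single column from each layout. The schema, the sample data, and the function names are assumptions for the example.

```python
import struct

# Hypothetical schema for illustration: (id: int32, age: int32, score: float64)
ROW_FORMAT = "<iid"
RECORDS = [(1, 30, 88.5), (2, 41, 92.0), (3, 25, 79.25)]

row_bytes = b"".join(struct.pack(ROW_FORMAT, *r) for r in RECORDS)
age_bytes = struct.pack("<3i", *(r[1] for r in RECORDS))


def ages_from_rows(data):
    """Row store: decode every mixed-type record, then drop unwanted fields."""
    return [age for _id, age, _score in struct.iter_unpack(ROW_FORMAT, data)]


def ages_from_column(data):
    """Column store: one homogeneous type, decoded in a single pass."""
    count = len(data) // struct.calcsize("<i")
    return list(struct.unpack(f"<{count}i", data))


from_rows = ages_from_rows(row_bytes)
from_column = ages_from_column(age_bytes)
```

Both calls return the same ages, but the row path has to read and decode all 48 bytes, while the column path touches only the 12 bytes that hold the requested field.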
III. Optimization
Clearly, both storage formats have advantages and disadvantages. A row-store write completes in one operation, takes less time than a column-store write, and ensures data integrity; the drawback is that reads produce redundant data. With a small amount of data this impact can be ignored, but with a large amount it may hurt processing efficiency. Column store is inferior to row store in write efficiency and in guaranteeing data integrity. Its advantage is that reads produce no redundant data, which still matters a great deal in big data fields with lower integrity requirements, such as Internet applications.
The improvements focus on two areas: avoiding redundant data in row-store reads, and raising the read/write efficiency of column store.
How can we remedy their shortcomings while preserving their advantages?
Improving row store: To reduce redundant data, first avoid creating redundant columns when defining the data. Second, optimize the stored record structure so that once data is read from disk into memory, it can be decomposed quickly and the redundant columns discarded. Bear in mind that even the lowest-end CPU and memory today are far faster than a mechanical disk, and with high-end hardware this step is faster still.
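One possible record structure for fast decomposition, given here only as a sketch, is to prefix each record with a table of field offsets so that a needed column can be sliced out in memory without decoding the others. The header layout and the names `encode_record` and `read_field` are assumptions made for illustration and do not describe any particular product's format.

```python
import struct


def encode_record(fields):
    """Encode a list of byte strings as: field count (uint16),
    end offset of each field (uint32 each), then the field bytes."""
    ends, pos = [], 0
    for field in fields:
        pos += len(field)
        ends.append(pos)
    header = struct.pack("<H", len(fields)) + struct.pack(f"<{len(fields)}I", *ends)
    return header + b"".join(fields)


def read_field(record, index):
    """Slice one field out of an encoded record without decoding the rest."""
    (count,) = struct.unpack_from("<H", record, 0)
    if not 0 <= index < count:
        raise IndexError("field index out of range")
    body_start = 2 + 4 * count
    (end,) = struct.unpack_from("<I", record, 2 + 4 * index)
    if index == 0:
        start = 0
    else:
        (start,) = struct.unpack_from("<I", record, 2 + 4 * (index - 1))
    return record[body_start + start:body_start + end]


record = encode_record([b"alice", b"30", b"engineer"])
age = read_field(record, 1)
```

The offset table costs a few bytes per record, but it lets the reader skip redundant columns with two small lookups instead of parsing every field in order.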
Two improvements to column store:
1. Install multiple hard disks in one machine and read and write them in parallel with multiple threads. Operating several disks in parallel reduces contention for disk reads and writes, and it clearly improves processing efficiency. The drawback is that more hard disks are needed, which raises the investment cost; operators running large-scale data processing need to weigh this seriously.
2. To address data integrity during writes, consider adding a "rollback" mechanism similar to that of a relational database. When a column fails to be written, everything already written for that record is invalidated. Adding hash code verification further protects data integrity.
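The rollback and hash verification idea from the second improvement might look roughly like the following. This is a minimal in-memory sketch under stated assumptions: the `ColumnStore` class, its methods, and the choice of SHA-256 are hypothetical, and a real system would have to handle on-disk files and crashes as well.

```python
import hashlib


class ColumnStore:
    """Toy column store: one Python list per column, plus a hash per row."""

    def __init__(self, column_names):
        self.columns = {name: [] for name in column_names}
        self.checksums = []

    @staticmethod
    def _digest(values):
        joined = "\x1f".join(repr(v) for v in values)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def append_row(self, row):
        # Remember each column's length so a failed write can be undone.
        savepoint = {name: len(vals) for name, vals in self.columns.items()}
        try:
            for name, values in self.columns.items():
                values.append(row[name])  # KeyError if the column is missing
            self.checksums.append(self._digest(row[n] for n in self.columns))
        except Exception:
            # Rollback: discard whatever this row already wrote.
            for name, length in savepoint.items():
                del self.columns[name][length:]
            raise

    def verify(self, index):
        """Recompute the hash of a stored row and compare it to the saved one."""
        current = self._digest(vals[index] for vals in self.columns.values())
        return current == self.checksums[index]


store = ColumnStore(["id", "name", "age"])
store.append_row({"id": 1, "name": "alice", "age": 30})

try:
    store.append_row({"id": 2, "name": "bob"})  # "age" is missing
    rolled_back = False
except KeyError:
    rolled_back = True
```

After the failed second write, the `id` and `name` columns are truncated back to one entry each, so no partial row is left behind.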
The two storage solutions also share one improvement. Frequently writing small amounts of data is hard on the disk. A better approach is to hold and sort the data in memory temporarily and, once a certain amount has accumulated, write it to disk in a single pass, which takes less time overall. Mechanical disks currently write at roughly 20 MB to 50 MB per second, so writing in batches in this way works well.
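A minimal sketch of this buffering approach follows. The `BufferedWriter` class, its byte threshold, and the sort key are assumptions chosen for illustration; `io.BytesIO` stands in for the disk file.

```python
import io


class BufferedWriter:
    """Collect small writes in memory, sort them, and flush them in one pass."""

    def __init__(self, sink, flush_threshold):
        self.sink = sink
        self.flush_threshold = flush_threshold  # bytes to accumulate before flushing
        self.pending = []
        self.pending_bytes = 0
        self.flushes = 0

    def write(self, key, payload):
        self.pending.append((key, payload))
        self.pending_bytes += len(payload)
        if self.pending_bytes >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        self.pending.sort(key=lambda item: item[0])
        self.sink.write(b"".join(payload for _, payload in self.pending))
        self.pending.clear()
        self.pending_bytes = 0
        self.flushes += 1


sink = io.BytesIO()
writer = BufferedWriter(sink, flush_threshold=8)
for key, payload in [(3, b"cc"), (1, b"aa"), (2, b"bb"), (0, b"zz")]:
    writer.write(key, payload)  # the fourth write reaches 8 bytes and flushes
writer.write(5, b"x")
writer.flush()  # flush the remainder explicitly
```

Five small writes reach the sink as two sorted batches instead of five separate disk operations.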
IV. Summary
Given their characteristics, neither storage format can be a perfect solution. If data integrity and reliability are the primary concern, row storage is the best choice; column storage can approach that goal only after adding disks and improving the software design. If the workload is mainly storing data, the write performance of row store is far higher than that of column store. Column store is best suited to applications that frequently read a single column dataset. If you read several columns at a time, either solution can work depending on circumstances. With row storage, reduce or avoid redundant columns. With column storage, to keep reads and writes efficient, save each column's data on a different disk where possible and read and write in parallel with multiple threads, which avoids disk contention and improves processing efficiency. Whichever solution you choose, it is worth storing data of the same kind together. This is an effective way to reduce head movement on the disk and shorten data read time.
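The one-column-per-disk arrangement with parallel threads can be sketched as follows. The directories here are created under a temporary folder purely for illustration; the assumption is that in a real deployment each directory would be a mount point on a separate physical disk. The function names and sample columns are hypothetical.

```python
import os
import struct
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_column(path, values, fmt):
    """Write one column's values to its own file."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}{fmt}", *values))


def read_column(path, fmt):
    """Read one column's values back from its own file."""
    with open(path, "rb") as f:
        data = f.read()
    count = len(data) // struct.calcsize("<" + fmt)
    return list(struct.unpack(f"<{count}{fmt}", data))


# Hypothetical columns: name -> (values, struct format character)
columns = {"id": ([1, 2, 3], "i"), "score": ([88.5, 92.0, 79.25], "d")}

with tempfile.TemporaryDirectory() as base:
    paths = {}
    for name in columns:
        # Assumption: each directory stands in for a separate physical disk.
        disk_dir = os.path.join(base, f"disk_{name}")
        os.makedirs(disk_dir)
        paths[name] = os.path.join(disk_dir, name + ".col")

    with ThreadPoolExecutor(max_workers=len(columns)) as pool:
        # One thread per column, so each disk is written independently.
        writes = [pool.submit(write_column, paths[n], *columns[n]) for n in columns]
        for future in writes:
            future.result()
        reads = {n: pool.submit(read_column, paths[n], columns[n][1]) for n in columns}
        loaded = {n: future.result() for n, future in reads.items()}
```

On a single disk the threads would still compete for the same head, so any benefit depends on the columns actually living on different devices.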
Link: http://www.infoq.com/cn/articles/bigdata-store-choose