These are my reading notes for Chapter 7, "The End of Disk? SSD and In-Memory Databases," of the book Next Generation Databases: NoSQL, NewSQL, and Big Data.
Is the disk dead?
Bill Gates said in 1981:
640K of memory should be enough for anybody.
In 2001, he denied ever having made that statement:
I've said some stupid things and some wrong things, but not that.
Since the birth of the first databases, database experts have been trying to avoid disk I/O, because disk is several orders of magnitude slower than memory and CPU.
While CPU speed, memory capacity, and disk density have all followed Moore's law, disk speed lags far behind because of the disk's mechanical nature.
The arrival of SSDs brought a leap in disk performance, and in recent years they have become a mainstream technology for improving database performance. At the same time, growing memory capacities and falling prices have made memory an alternative way to accelerate databases: small databases can be placed entirely in memory, and larger databases can be accommodated by memory clusters. In demanding performance environments, memory is even more attractive than SSD.
Disks date from the 1950s, and despite ever-increasing density, their core architecture is basically unchanged: an actuator arm moves the read/write head to a track on a spinning platter to read the data.
After the mechanical disk came the solid-state drive (SSD). An SSD has no moving parts, so its I/O latency is low. Currently, most SSDs are implemented with NAND flash.
SSDs dramatically improve read performance: a random read on a mechanical disk takes about 4 milliseconds, while a high-end SSD needs only about 25 microseconds, roughly 160 times faster.
But SSDs do not suit all workloads equally; for modification operations, SSDs generally perform poorly.
An SSD stores information in cells. In single-level cell (SLC) flash, each cell holds one bit; in multi-level cell (MLC) flash, each cell usually holds two, or sometimes three, bits. Cells are organized into pages (typically 4 KB), and pages into blocks (typically 256 pages per block).
Read operations, and initial writes to empty pages, require only single-page I/O. However, changing the contents of a page requires erasing and rewriting an entire block.
This is the SSD's weakness: write performance is much lower than read performance, and may not be much better than that of a mechanical disk.
Flash SSD is significantly slower when performing writes than when performing reads.
Because flash cells survive only a limited number of erase cycles, manufacturers use "wear leveling" to prolong drive life, distributing writes evenly across blocks and reducing the overhead of erase operations.
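To make the asymmetry concrete, here is a toy Python sketch of why an in-place page update is so much more expensive than an initial write. The numbers mirror the figures above, and the model deliberately ignores the remapping a real flash translation layer would do:

```python
# Toy model of NAND flash write amplification; numbers mirror the text above.
PAGE_SIZE = 4 * 1024      # 4 KB pages
PAGES_PER_BLOCK = 256     # 256 pages per block -> 1 MB erase unit

def cost_initial_write():
    """An initial write to an empty page programs just that page."""
    return {"page_reads": 0, "block_erases": 0, "page_writes": 1}

def cost_in_place_update():
    """Rewriting one page in place forces a whole-block cycle:
    read the other pages out, erase the block, write everything back.
    (A real flash translation layer avoids much of this by remapping
    updates to fresh pages, which is also how wear leveling works.)"""
    return {
        "page_reads": PAGES_PER_BLOCK - 1,   # preserve the neighbors
        "block_erases": 1,                   # erase the whole block
        "page_writes": PAGES_PER_BLOCK,      # rewrite every page
    }

print("initial write:   ", cost_initial_write())
print("in-place update: ", cost_in_place_update())
```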
Will SSDs replace disks? For now, although the cost of SSDs keeps falling, the cost of disks is falling as well. SSDs win on cost per I/O, disks on cost per gigabyte. A reasonable outcome is a combination of memory, SSD, and disk.
SSDs shine at random I/O, while mechanical disks handle sequential writes quite well, so moving a database onto SSDs does not necessarily improve performance. For example, when a transaction commits it writes the redo log, which is a sequential write; a write-intensive database may therefore be a poor fit for SSDs.
In-memory database
SSDs change database architecture only gradually, but as memory prices fall and capacities grow (see http://www.jcmit.com/memoryprice.htm), putting the entire database into memory brings a revolutionary architectural change.
Small-to-medium databases can already fit entirely in memory, and larger databases can use memory clustering techniques to exploit the performance of RAM.
Traditional databases use memory as a cache to improve performance, but operations such as commit and checkpoint still touch the physical disk. An in-memory database therefore needs architectural changes: the cache layer disappears, and new persistence patterns are required.
In-memory databases typically use the following techniques to ensure that data is not lost (a toy sketch of techniques 2 and 3 follows the list):
1. Data replication
2. Write the entire database image to disk (snapshot or checkpoint)
3. Write transactions to the transaction log or journal
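As a minimal illustration of techniques 2 and 3, here is a toy Python key-value store; the class name, file names, and `durable` flag are all made up for this sketch, not any product's API:

```python
import json, os

class ToyInMemoryStore:
    """Toy key-value store persisted by snapshot + append-only journal."""

    def __init__(self, snap="store.snap", journal="store.journal"):
        self.snap, self.journal, self.data = snap, journal, {}

    def put(self, key, value, durable=False):
        self.data[key] = value                      # commit in memory
        with open(self.journal, "a") as j:          # technique 3: journal
            j.write(json.dumps({"k": key, "v": value}) + "\n")
            if durable:                             # synchronous commit
                j.flush()
                os.fsync(j.fileno())                # wait for the disk

    def checkpoint(self):
        with open(self.snap, "w") as s:             # technique 2: snapshot
            json.dump(self.data, s)
        open(self.journal, "w").close()             # journal no longer needed

    def recover(self):
        if os.path.exists(self.snap):
            with open(self.snap) as s:              # load last snapshot
                self.data = json.load(s)
        if os.path.exists(self.journal):
            with open(self.journal) as j:           # replay later commits
                for line in j:
                    rec = json.loads(line)
                    self.data[rec["k"]] = rec["v"]
```

The `durable` flag previews the trade-off discussed below for TimesTen and Redis: an fsync on every commit guarantees durability but costs a disk round trip per transaction.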
TimesTen
TimesTen is one of the earlier relational in-memory databases; the company was founded in 1995 and acquired by Oracle in 2005.
Early on, TimesTen adhered to the ANSI SQL standard; after the acquisition it was extended with PL/SQL support.
A TimesTen database is placed entirely in memory, with checkpoints and transaction logs ensuring that data is not lost.
Unlike Oracle, in the default mode log writes are asynchronous, which can lose data on failure, so the durability in ACID is not strictly guaranteed. A synchronous mode can be specified instead, but it hurts performance.
The TimesTen architecture is as follows:
Redis
Unlike the relational TimesTen, Redis (Remote Dictionary Server) is a key-value in-memory store.
Redis emerged in 2009 and was subsequently sponsored by VMware and then Pivotal.
Redis lookups support only the primary key; secondary (two-level) queries are not supported. Values are usually strings or collections of strings, such as lists, hash maps, and so on.
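For example, with the redis-py client (assuming redis-py 3.x and a server on localhost; the key names are made up):

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumes a local server

# Plain string value, addressed by its key.
r.set("user:1000:name", "alice")
print(r.get("user:1000:name"))                # b'alice'

# List value: push entries, then read a range back.
r.rpush("user:1000:logins", "2015-01-01", "2015-01-02")
print(r.lrange("user:1000:logins", 0, -1))

# Hash (map) value: field/value pairs stored under a single key.
r.hset("user:1000", mapping={"name": "alice", "city": "Beijing"})
print(r.hgetall("user:1000"))
```

Note that there is no way to ask, say, "which users live in Beijing" without maintaining an index key of your own; that is what the lack of secondary queries means in practice.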
Although Redis is designed to hold the entire database in memory, it can also spill to disk to accommodate larger databases, a capability TimesTen does not offer.
It is possible for Redis to operate on datasets larger than available memory by using its virtual memory feature. When this is enabled, Redis will "swap out" older key values to a disk file. Should the keys be needed, they'll be brought back into memory. This option obviously involves a significant performance overhead, since some key lookups will result in disk I/O.
Redis's persistence strategy is similar to TimesTen's: it also uses snapshots and a journal (which Redis calls the Append Only File, or AOF). Redis supports asynchronous replication as well, but that does not guarantee against data loss.
The AOF can be configured to sync after every operation, which is similar to TimesTen's durable commit, and which likewise has an impact on performance.
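The relevant settings live in redis.conf and can also be applied at runtime; a small sketch with redis-py (the policy descriptions follow the Redis documentation):

```python
import redis

r = redis.Redis()

# Enable the append-only file and pick a sync policy:
#   always   - fsync after every write: safest, slowest
#              (comparable to a TimesTen durable commit)
#   everysec - fsync once per second: may lose ~1 s of writes
#   no       - leave flushing to the OS: fastest, least safe
r.config_set("appendonly", "yes")
r.config_set("appendfsync", "everysec")
```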
The architecture of Redis is as follows:
Redis is popular among developers as a simple, high-performance key-value store that performs well without expensive hardware. It lacks the sophistication of some other nonrelational systems such as MongoDB, but it works well on systems where the data will fit into main memory, or as a caching layer in front of a disk-based database.
HANA
SAP introduced HANA in 2010, positioning it primarily for BI, though it can also be used for OLTP.
HANA is also a relational database. Its tables can be either column-oriented or row-oriented, but each table must pick one: typically OLTP tables choose rows and BI tables choose columns. As discussed later, Oracle 12c In-Memory supports both row and column formats for the same table.
Row-oriented tables are guaranteed to stay in memory; column-oriented data is, by default, loaded from disk on demand. In Oracle In-Memory, by comparison, row data goes through the traditional caching machinery while column data is handled much as in HANA.
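The difference between the two orientations is easy to sketch in Python (a toy illustration of the layouts, not either vendor's actual storage format):

```python
# The same three-row table in the two physical orientations.
rows = [  # row store: each record contiguous -> cheap single-record OLTP access
    ("alice", "Beijing",  100),
    ("bob",   "Shanghai", 200),
    ("carol", "Beijing",  150),
]

columns = {  # column store: each attribute contiguous -> cheap scans/aggregates
    "name":   ["alice", "bob", "carol"],
    "city":   ["Beijing", "Shanghai", "Beijing"],
    "amount": [100, 200, 150],
}

# OLTP-style access: fetch one whole record.
print(rows[1])

# BI-style access: aggregate one attribute over all records,
# touching only the 'amount' column.
print(sum(columns["amount"]))
```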
HANA's persistence strategy is similar to TimesTen's and Redis's: it periodically writes memory to savepoint files, which are then merged into the data files.
Durability of transactions is guaranteed by writing the transaction log on commit, and SAP recommends placing HANA's logs on SSD to limit the I/O impact.
The architecture of Hana is as follows:
The diagram illustrates that a write to a column-store table is first stored as uncompressed rows (the L1 delta store) and then converted to compressed columns (the L2 delta store):
Transactions to columnar tables are buffered in this delta store. Initially, the data is held in row-oriented format (the L1 delta). Data then moves to the L2 delta store, which is columnar in orientation but relatively lightly compressed and unsorted. Finally, the data migrates to the main column store, in which data is highly compressed and sorted.
HANA thus touches the disk when writing the log on commit and when loading columnar tables from disk on demand.
VoltDB
TimesTen, HANA, and Redis all inevitably read and write the disk, while VoltDB claims to be a pure in-memory database: transactions commit in memory and are then protected through in-memory replication. For example, to ensure a k-safety factor of 2, the data must be replicated on at least two additional machines.
VoltDB also supports disk-based logging, which can be synchronous or asynchronous.
VoltDB supports the relational model, but the best fit is data that can be partitioned by key. When transactions touch only a single partition, scalability improves greatly; aggregate operations that must reach across multiple nodes come at a cost.
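A sketch of that partitioning idea (the node count and hash scheme here are illustrative, not VoltDB's implementation):

```python
NUM_PARTITIONS = 3  # illustrative cluster size

def partition_for(key: str) -> int:
    """Route a row to a partition by hashing its partition key."""
    return hash(key) % NUM_PARTITIONS

partitions = [dict() for _ in range(NUM_PARTITIONS)]

# Single-partition writes: each touches exactly one node.
for cust, balance in [("c1", 100), ("c2", 250), ("c3", 75)]:
    partitions[partition_for(cust)][cust] = balance

# Single-key read: one node, the cheap case.
print(partitions[partition_for("c2")].get("c2"))

# Global aggregate: must visit every partition and combine
# partial results, the costly multi-node case.
print(sum(sum(p.values()) for p in partitions))
```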
More details can be found in the article.
Oracle 12c "In-Memory Database"
In-Memory is an Oracle 12c option that uses columnar storage to supplement the disk-based row store, aimed primarily at countering HANA. Not all of the data has to be in memory. A table can have both row and column formats at once, transparently to the user, with no application changes required.
The architecture of the Oracle 12c In-memory is as follows:
The IMCU (In-Memory Columnar Unit) holds the column-store data, while the Snapshot Metadata Unit (SMU) tracks the validity of the data in each IMCU (whether it has been modified by an OLTP application) so that it stays synchronized with the row-based data.
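Turning the option on for a table is a single DDL statement; here is a hedged sketch using the cx_Oracle driver (the credentials, DSN, and table name are placeholders, and the instance must have a non-zero INMEMORY_SIZE configured):

```python
import cx_Oracle

# Placeholder credentials and DSN; the instance must have a
# non-zero inmemory_size for population to happen.
conn = cx_Oracle.connect("scott", "tiger", "dbhost/orcl")
cur = conn.cursor()

# Ask Oracle to maintain a columnar in-memory copy of the table.
# The row-format storage on disk and in the buffer cache is untouched,
# which is why no application change is needed.
cur.execute("ALTER TABLE sales INMEMORY")
```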
Oracle's paper "Oracle's In-Memory Database Strategy for OLTP and Analytics" is a good description of the differences between TimesTen and the In-Memory option.
Berkeley Data Analytics Stack and Spark
If TimesTen and HANA are the in-memory answer to the relational database, and Redis the in-memory answer to key-value stores, then Spark is the in-memory answer to Hadoop.
Hadoop has become the cornerstone of big data processing; its traditional disk-based MapReduce approach suits batch processing of unstructured and semi-structured data. The advent of Spark brings Hadoop into the realm of near-real-time processing.
In 2011, AMPLab was established at UC Berkeley to address advanced analytics and machine learning problems in big data environments. Out of it came the Berkeley Data Analytics Stack (BDAS), including Spark, Mesos (cluster management, similar to YARN), and Tachyon (a memory-centric distributed file system), of which Spark is the most widely used.
Spark is a memory-based, fault-tolerant distributed processing framework that raises the level of abstraction above MapReduce, improving development efficiency while avoiding the disk I/O that dogs traditional MapReduce.
Spark works closely with Hadoop and can use HDFS for persistence.
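A minimal PySpark sketch of this style of processing (the HDFS paths are placeholders):

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

# Read from HDFS (placeholder path)...
lines = sc.textFile("hdfs:///data/input.txt")

# ...then transform entirely in memory: split into words,
# pair each with a count of 1, and sum counts per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.cache()                                # keep it in memory for reuse
counts.saveAsTextFile("hdfs:///data/output")  # persist results to HDFS
```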
Spark also supports SQL:
Spark SQL provides a Spark-specific implementation of SQL for ad hoc queries, and there is also an active effort to allow Hive (the SQL implementation in the Hadoop stack) to generate Spark code.
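For instance, with the SparkSession entry point in PySpark (the path and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# Load a JSON file (placeholder path) and register it for SQL.
events = spark.read.json("hdfs:///data/events.json")
events.createOrReplaceTempView("events")

# Ad hoc SQL over the in-memory DataFrame.
spark.sql("""
    SELECT user_id, COUNT(*) AS n
    FROM events
    GROUP BY user_id
    ORDER BY n DESC
    LIMIT 10
""").show()
```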
The framework for Spark, Hadoop, and the Berkeley Data Analytics stack is as follows:
Cloudera, Hortonworks, and MapR have all integrated Spark.
Spark is implemented on the JVM, and it can hold data as strings, Java objects, or key-value pairs.
Although Spark prefers to process data in memory, it can still operate when the data cannot fit entirely into memory.
Spark does not target OLTP workloads, so it has no concept of a transaction log.
Spark can also access JDBC-compliant databases, which covers almost all relational databases, as well as NoSQL stores such as HBase and Cassandra.
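A sketch of the JDBC path (the URL, credentials, and table are placeholders, and the JDBC driver jar must be on Spark's classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-sketch").getOrCreate()

# Pull a relational table into Spark over JDBC.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/shop")  # placeholder
          .option("dbtable", "orders")
          .option("user", "report")
          .option("password", "secret")
          .load())

orders.groupBy("status").count().show()
```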
The processing flow of Spark is as follows:
Conclusion
It is too early to declare the disk dead; after all, disks keep a cost advantage for cold and warm data, which matters all the more in the big data era.
When performance requirements outweigh storage costs, SSDs and in-memory databases are worth considering, and when the whole database fits in memory, the in-memory database is the more attractive of the two.
In short, no technology solves every problem. Depending on your needs, budget, and other factors, what you will most often see is a combination of disk, SSD, and memory technologies.