Vertica: DBMS schema-based Columnstore Data Warehouse

Source: Internet
Author: User

Introduced

Vertica (HP) is a DBMS-based database system suitable for read-intensive analytical database applications such as data warehouses, and the full name of the white paper is called Verticaanalytic database. As can be seen from the name, Vertica represents its data storage is a column, analytic is suitable for analytical needs, DB represents itself a database, support SQL.

Advantage

Compared to traditional relational database systems and other column-based data (silos) libraries, Vertica has the most critical advantages of the following three points.

Column Storage

Vertica the data on disk is Columnstore, it is obvious that the columnstore can avoid unnecessary columns when the data is read, reduce the IO bandwidth, and the Columnstore matching compression algorithm can save the amount of disk storage. Vertica's column storage also has its optimizations in terms of compression, see below.

Aggressivecompression

This is understood as aggressive compression. In fact, in terms of compression, Vertica for different data types, using a variety of different compression methods, so that the original disk on the large amount of IO overhead and CPU compression work to achieve a better tradeoff. The compression rate is approximately 90%. In addition, Vertica's write data and read data are separated, the read data in a compressed state exists on the disk, the written data is first in memory, the asynchronous merge to disk, this is also mentioned below.

Multi-Backup

Because of the high compression ratio (90%), it is possible to free up enough disk space for backup. This is crucial. First, backup makes Vertica fault tolerant, and backups on multiple machines can improve Ad-hoc query performance. Second, the ability to fault-tolerant by backup distinguishes between Vertica and traditional databases by logging and two-phase commit. The reason to do a long backup is because of the optimization of its columnstore compression is better, this is more than the other Columnstore data Warehouse has advantages.

Each cell that is stored after the table is split is called projection, and each projection is sorted by one or some attribute, and the different copies are sorted differently, so this is useful for queries, which are also mentioned later in this section.

Of course, Vertica compression, column-oriented storage, and table-torn storage are transparent to the user. Vertica provides a standard SQL interface to front-end users and is compatible with existing ETL,REPORTING,BI tools, making it easier for other business systems to migrate to Vertica.

Vertica also has no special requirements for hardware, which can be run on inexpensive clusters or on any ready-made Linux machine, using local disks for storage.

In addition to some of the advantages mentioned above, Vertica has the following advantages in terms of performance, scalability, usability, and ease of use.

shared-nothing,grid-based Database Schema

Vertica can be efficiently expanded on ordinary machines. In fact, in the data backup this block, Vertica provides k-safety way, k+1 is the number of backups, adjustable, and each copy is full data, which will be mentioned later.

Hybrid data storage

The mix here refers to memory and disk. On the one hand, the newly inserted data is written to the memory structure optimized for writes, so it supports persistent, high-performance concurrent writes, and on the other hand, does not affect real-time query performance because the read data comes from disk. In fact, write data in memory is chaotic, so write throughput will be higher, and read data on disk is orderly.

Physical Database design Tools

This is to say that when using Vertica, it provides a convenient tool to help users design physical tables, which are mentioned below. and provide k-safety mechanism to ensure fault tolerance.

High performance compatible acid database

There are lightweight transactions and concurrency control scheme, which are loaded for queries and data. and has a k-safety-based failure recovery model, rather than a traditional log-based.

Easy deployment, monitoring, management

Slightly

Architecture

In the image above for the vertica of a single node, we see that queries and updates are going to be called hybrid storage, namely Write-optimizedstore (hereinafter referred to as WOS) and read-optimized Store (hereinafter referred to as ROS). WoS in memory, the corresponding data is written, the data inside is not sorted or compressed. Ros is a sorted and compressed data that exists on disk and provides a format that supports efficient querying. The Tuple mover is responsible for migrating WOS data to Ros, sorting and compressing the data in WOS memory in batches and moving it to Ros, the disk, which is also guaranteed to be efficient. Within WoS and Ros, data is stored in a column-oriented pattern.

The following diagram shows how the table is transformed into the column form of the physical storage after it has been deposited into the vertica, which is cut into projections.


Vertica's Database Designer helps to choose a reasonable way to cut into physical storage, which is the advantage of the above mentioned ease of use, and the following is a brief reference to how to do it.

It seems that redundant backup storage on multiple machines per projection is a waste of disk, but Vertica's aggressive compression provides a 90% compression rate, so don't worry.

The following shows a general flow of data merging from WoS to Ros,tuple mover.


Detailed interpretation, is generally asynchronous and bulk of the merge, and projection is vertically cut into segments, will be convenient and efficient merging.


Performance

This section analyzes and contrasts the performance benefits of Vertica in a slightly deeper context.

Column storage aspects

For ordinary queries, the main is disk access and CPU cycles occupy the time, which is long is the query time, or is a bottleneck. In that row of storage, it is vertica an important advantage to use compression to fully utilize CPU resources, reduce IO overhead, and achieve reasonable tradeoff on CPU and IO.

Other database systems can reduce time in some query scenarios by supporting materialized views or data cubes, similar to doing some pre-calculations and preprocessing to optimize query speed. However, this has limited support for query scenarios and is still less than the query performance provided by Vertica.


Compression aspects

In addition to providing RLE (Run-length encoding), Vertica also provides a delta encoding for continuous data and an efficient LEMPELZIV implementation that is suitable for most of the column data that is not in the same order, or for unsorted data. Provides a special way to compress floating-point and time data.

Ros aspects

The compression and sequencing of ROS itself makes data queries on disk more efficient. In addition, Ros is dense packed does not waste disk pages. Traditional databases tend to leave the page empty enough to continue inserting data without re-grooming, and disk space utilization is not the highest. Ros this part of the work is a tuple mover asynchronous, in the merger phase control should be relatively easy to control. In the query, it will also pre-read large chunks of Ros query volume.


Sort aspect

As mentioned earlier, Vertica through multi-backup to do ha and fail recovery, different from traditional database log-based failure recovery method, the writing process is more expensive than Vertica. In addition, the additional sorting method speeds up the query, and Vertica chooses the best way to do the sorting.

Parallelshared-nothing on off-the-shelf hardware design

Not much to say, convenient linear expansion, no special requirements for hardware.


A benckmark, as of Vertica.

Management

The previously mentioned DB Designer will help the user to divide the table, making a better physical storage choice, the principle of


It refers to the logical schema, sample data and the query statement three things to make a decision. and DB Designer can incrementally redesign the physical schema for the database.

Some points on the HA guarantee of failure recovery and k-safety. Business and traditional database through logging and two-phase commit to ensure transactional way to do failure recovery, rollback and so on, Vertica by backup k+1 Part implementation, as long as one does not hang off, you can restore from its full copy, fault tolerance is high, and based on high compression rate, We're not worried about the disk being so tight. This also allows Vertica to provide a hot-swapping (hot swap) node, which makes it easier to remove nodes and add new nodes.

Traditional database is record-at-a-time or bulk loading way to insert new data, Vertica differs from this is the ability to provide continuous loading function, the query node is often in snapshot isolation mode, So it's read-only in a sense, so it doesn't have to be locked when it's written. And the data in the WOS does not need to be sorted or compressed, bulk write throughput is relatively high.


Summarize

Vertica compared with the traditional database system and other column Data warehouse system, there are obvious advantages in performance, there are some similarities and differences in design, more suitable for ad-hoc query, OLAP type of job. Overall, Vertica reduces IO overhead through columnstore, coupled with efficient compression, which saves disk space, Vertica multi-backup to ensure high availability, and multi-backup to enhance query performance. In the use and operation of the angle, vertica with tools to help users do physical table storage, can provide standard SQL interface, but also compatible with existing BI, ETL tools to facilitate the migration of Vertica, and Vertica deployment of hardware has no special requirements, can be linearly extended

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.