Vertica: DBMS schema-based Columnstore Data Warehouse

Source: Internet
Author: User

Introduced

Vertica (HP) is a DBMS-based database system suitable for read-intensive analytical database applications, such as data warehouses, and the full name of the white paper is called Verticaanalytic database. As can be seen from the name, Vertica represents its data storage is a column, analytic is suitable for analytical needs, DB represents itself a database, support SQL.


Advantage

Compared to traditional relational database systems and other column-based data (silos) libraries, Vertica has the most critical advantages of the following three points.

Column Storage

Vertica is a columnstore of data on disk, it is obvious that columnstore can avoid unnecessary columns when the data is read, reduce IO bandwidth, and the Columnstore mate compression algorithm can save disk storage. Vertica's column storage also has its optimizations in terms of compression, see below.

Aggressivecompression

This is understood as aggressive compression. In fact, in terms of compression, Vertica for different data types, using a variety of different compression methods, so that the original disk on the large amount of IO overhead and CPU compression work to achieve a better tradeoff. The compression rate is approximately 90%. In addition, Vertica's write data and read data are separated, the read data in a compressed state exists on the disk, the written data is first in memory, the asynchronous merge to disk, this is also mentioned below.


Multi-Backup

Because the compression ratio is very high (90%), you can free up enough disk space to do the backup. This is crucial. First, backup makes Vertica fault tolerant, and backups on multiple machines can improve Ad-hoc query performance. Second, it is possible to fault-tolerant by backup, which makes it different from Vertica and traditional databases through logging and two-phase commits. The reason to do a long backup is because its columnstore compression is better than the optimization, this is more than the other Columnstore data Warehouse has advantages.

Each cell that is stored after the table is split is called projection, and each projection is sorted by one or some attribute, and the different copies are sorted differently, so this is useful for queries, which are also mentioned later in this section.

Of course, Vertica compression, column-oriented storage, and table-torn storage are transparent to the user. Vertica provides a standard SQL interface to front-end users and is compatible with existing ETL,REPORTING,BI tools, making it easier for other business systems to migrate to Vertica.


Vertica also has no special requirements for hardware, which can be run on inexpensive clusters, or on any existing Linux machine, using local disks for storage.

In addition to some of the advantages mentioned above, Vertica has the following advantages in terms of performance, scalability, usability, and ease of use.

shared-nothing,grid-based Database Schema

Vertica can be efficiently expanded on ordinary machines. In fact, in the data backup this piece, Vertica provides the k-safety way, k+1 is the backup number, is adjustable, and each copy is the complete data, later will mention.


Hybrid data storage

The mix here refers to memory and disk. On the one hand, the newly inserted data is written to the memory structure optimized for writes, so it supports persistent, high-performance concurrent writes, and on the other hand, does not affect real-time query performance because the read data comes from the disk. In fact, write data in memory is chaotic, so write throughput will be higher, and read data on disk is orderly.

Physical Database design Tools

This means that when using Vertica, it provides a handy tool to help users design physical tables, as described below. and provide k-safety mechanism to ensure fault tolerance.

High performance compatible acid database

There are lightweight transactions and concurrency control scheme for querying and data loading. and has a k-safety-based failure recovery model, rather than a traditional log-based.

Easy deployment, monitoring, management

Slightly


Architecture


To vertica the schema of a single node, we see that queries and updates are going to be called hybrid storage, namely Write-optimizedstore (hereinafter referred to as WOS) and read-optimized Store (hereinafter referred to as ROS). WoS in memory, the corresponding data is written, the data inside is not sorted or compressed. Ros is a sorted and compressed data that exists on disk and provides a format that supports efficient querying. The Tuple mover is responsible for migrating WOS data to Ros, sorting and compressing the data in WOS memory in batches and moving it to Ros, the disk, which is also guaranteed to be efficient. Within WoS and Ros, data is stored in a column-oriented pattern.

The following diagram shows how the table is transformed into the column form of the physical storage after it has been deposited into the vertica, which is cut into projections.


Vertica's Database Designer helps to choose a reasonable way to cut into physical storage, which is the advantage of the above mentioned ease of use, and the following will briefly refer to how to do it.

It looks like a waste of disk for every projection redundant backup storage on multiple machines, but Vertica's aggressive compression provides a 90% compression rate, so don't worry.

The following shows a rough flow of how data is merged from WoS to Ros,tuple mover.


Specifically not specific interpretation, is generally asynchronous and bulk of the merge, and projection is vertically cut into segments, will be convenient and efficient merging.


Performance

This section slightly delves into and contrasts the performance benefits of Vertica.

Column storage aspects

For ordinary queries, the main is disk access and CPU cycles occupied time, which time is the query time, or is the bottleneck. In that row of storage, it is vertica a very important advantage to use compression to fully utilize CPU resources, reduce IO overhead, and achieve reasonable tradeoff on CPUs and IO.

Other database systems can reduce time in some query scenarios by supporting materialized views or data cubes, similar to doing some estimation and preprocessing to optimize query speed. However, this has limited support for query scenarios and is still less than the query performance provided by Vertica.


Compression aspects

In addition to providing RLE (Run-length encoding), the Vertica also provides a delta encoding for continuous data and an efficient LEMPELZIV implementation, which is ideal for sorting out most of the values that are not the same column data, or unsorted data. Provides a special way to compress floating-point and time data.

Ros aspects

The compression and sequencing of ROS itself makes data queries on disk more efficient. In addition, Ros is dense packed does not waste disk pages. Traditional databases tend to leave the page very empty so that they can continue to insert data without being collated, and disk space utilization is not the highest. Ros this part of the work is a tuple mover asynchronous, in the merger phase control should be more easy to control. In the query, it will also pre-read the larger size of the large Ros.


Sort aspect

As mentioned earlier, Vertica through multi-backup to do ha and fail recovery, different from traditional database log-based failure recovery method, the writing process is more expensive than Vertica. In addition, the additional sorting method speeds up the query, and Vertica chooses the best way to do the sorting.

Parallelshared-nothing on off-the-shelf hardware design

Not much to say, convenient linear expansion, no special requirements for hardware.


A benckmark of Vertica, for example.



Management

The previously mentioned DB Designer will help the user to divide the table and do better physical storage options, such as the following,


It participates in the logical schema, the sample data and the query statement three things to make a decision. and DB Designer can incrementally design the physical schema for the database again.

Some points on the HA guarantee of failure recovery and k-safety. Business and traditional database through logging and two-phase commit to ensure transactional way to do failure recovery, rollback and so on, Vertica by backup k+1 part of the implementation, just to have one not hanging off, on the line from its full copy recovery, fault tolerance is high, and based on high compression rate, We don't have to worry about the disk being so tight. This also allows Vertica to provide hot-swapping (hot swap) nodes, which are relatively easy to remove nodes and new nodes.

Traditional database is record-at-a-time or bulk loading way to insert new data, Vertica is different from this can provide continuous loading function, the query node is often in snapshot isolation mode, So it's read-only in a sense, so it doesn't have to be locked when it's written. And the data in the WOS does not need to be sorted or compressed, and the bulk write throughput is relatively high.


Summarize

Vertica compared with the traditional database system and other column Data warehouse system, there are obvious advantages in performance, there are some similarities and differences in design, which is more suitable for ad-hoc query, OLAP type of job. In general, Vertica reduces IO overhead through columnstore, combined with efficient compression, greatly saves disk space, based on which Vertica uses multiple backups to ensure high availability, and multiple backups can enhance query performance. In the use and operation of the angle, vertica with tools to help users do physical table storage, can provide standard SQL interface, but also compatible with existing BI, ETL tools to facilitate the migration of jobs to Vertica, and vertica deployment of hardware has no special requirements, can be linearly extended.



Complete the full text:)


Vertica: DBMS schema-based Columnstore Data Warehouse

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.