Understanding column-oriented storage in a few diagrams
The parallel execution of SQL queries is derived from Dremel and Impala, so we take this opportunity to learn more about relational databases and the parallel computation of relational algebra.
Speedup and scaleup
Speedup means using twice the hardware to cut the execution time in half.
Scaleup means using twice the hardware to complete twice the work in the same time.
But things are often not that simple, and doubling the hardware also brings other problems:
- longer startup time and more communication overhead from the additional CPUs,
- and the data skew problem caused by parallel computing.
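The two definitions above can be written as simple ratios. The following is a minimal sketch; the function names and the timings are hypothetical and only illustrate how overhead pulls the measured value below the ideal.

```python
def speedup(time_on_small_system: float, time_on_large_system: float) -> float:
    """Same job, more hardware: ratio of the two elapsed times."""
    return time_on_small_system / time_on_large_system


def scaleup(time_small_job_small_system: float,
            time_large_job_large_system: float) -> float:
    """N-times job on N-times hardware: ratio of the two elapsed times."""
    return time_small_job_small_system / time_large_job_large_system


# Hypothetical timings in seconds (assumptions, not measurements).
ideal_speedup = speedup(100.0, 50.0)    # twice the hardware, half the time
actual_speedup = speedup(100.0, 62.5)   # startup, communication and skew cost time
ideal_scaleup = scaleup(100.0, 100.0)   # twice the work in the same time

print(ideal_speedup, actual_speedup, ideal_scaleup)
```

With twice the hardware, a speedup of 2 or a scaleup of 1 is the ideal linear case; anything lower reflects the overheads listed above.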
Multi-processor architecture
Shared memory: any CPU can access any memory (globally shared) and any disk.
- The advantage is simplicity,
- The disadvantages are poor scalability and low availability.
Shared disk: any CPU can access any disk, but can only access its own main memory.
- The advantages are better availability and scalability,
- The disadvantages are implementation complexity and potential performance problems.
Shared nothing: any CPU can only access its own main memory and disk.
- The advantages are again scalability and availability,
- The disadvantages are implementation complexity and the difficulty of load balancing.
Hybrid: the system as a whole is a shared-nothing architecture, but each node may use another architecture internally.
- This combines the benefits of multiple architectures.
Data partitioning
The purpose of data partitioning is to let the database read and write data in parallel, making the most of the available I/O capacity.
Common partitioning algorithms are round-robin, range partitioning, and hash partitioning.
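As a rough illustration, the three partitioning strategies can be sketched as follows. The function names, the CRC32 hash, and the sample keys are assumptions made for the example, not something any particular database prescribes.

```python
import bisect
import zlib


def round_robin_partition(rows, n):
    """Deal rows out to n partitions in turn, like dealing cards."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def range_partition(rows, key, boundaries):
    """Place each row by comparing its key with sorted boundary values.

    Partition i holds keys below boundaries[i]; the last partition
    holds everything at or above the final boundary.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_right(boundaries, key(row))].append(row)
    return parts


def hash_partition(rows, key, n):
    """Place each row by a hash of its key, so equal keys always meet."""
    parts = [[] for _ in range(n)]
    for row in rows:
        digest = zlib.crc32(str(key(row)).encode("utf-8"))
        parts[digest % n].append(row)
    return parts


ids = list(range(1, 9))  # hypothetical keys 1..8
rr = round_robin_partition(ids, 3)
rg = range_partition(ids, key=lambda x: x, boundaries=[3, 6])
hs = hash_partition(ids, key=lambda x: x, n=3)
print(rr, rg, hs, sep="\n")
```

Round-robin spreads load evenly but does not help lookups; range partitioning helps range scans but is sensitive to skew; hash partitioning suits equality lookups and joins on the key.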
Parallelism of relational operations
The properties of relational algebra itself allow relational operations to be parallelized.
Parallel query processing is divided into four main steps:
- Translation: a relational algebra expression is translated into a query tree.
- Optimization: the join order is rearranged and different join algorithms are selected to minimize execution cost.
- Parallelization: the query tree is transformed into a physical operator tree and loaded onto the processors.
- Execution: the final execution plan is run in parallel.
First, a SQL statement is translated into a query tree.
Then, according to table sizes, indexes, and so on, the join order is rearranged and the appropriate algorithms are selected.
There are several common join algorithms:
- Nested loop join: the idea is simple, equivalent to a two-level loop. The outer loop runs over the driving table, and rows that satisfy the join condition are returned.
It applies when the driving table is small (after filtering) and the join field on the driven (inner) table is indexed. It is inefficient when both tables are large.
    for each row R1 in the outer table
        for each row R2 in the inner table
            if R1 joins with R2
                return (R1, R2)
- Sort-merge join: the idea is also simple. Sort both inputs by the join field, then merge the sorted inputs.
When the join field contains duplicate values, each duplicated value forms a group. Whether the join field is already sorted and how many values are repeated determine the efficiency of sort-merge.
It is useful when both tables are large, and is highly efficient when there is a clustered index on the join field (which means the data is already in sorted order). The algorithm's cost is mainly disk I/O.
- Hash join: similar to the way sort-merge handles duplicate values, except that a hash function is used for the partitioning.
The idea is to scan the small table and build a hash table (the build stage; the small table is also called the build table), then scan the large table row by row and compare (the probe stage; the large table is also called the probe table).
It applies when both tables are large and have no index. The limitation is that it only works for equi-joins. The algorithm's cost is mainly CPU.
- In addition, for subqueries there are algorithms such as semi join and anti join.
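To make the build/probe and sort/merge ideas concrete, here is a minimal in-memory sketch of a hash join and a sort-merge join for equi-joins. The function names and the sample tables are hypothetical, and a real engine would work on disk pages rather than Python lists.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join: build a hash table on the small input, probe with the large one."""
    table = defaultdict(list)
    for b in build_rows:                       # build stage
        table[build_key(b)].append(b)
    result = []
    for p in probe_rows:                       # probe stage
        for b in table.get(probe_key(p), []):
            result.append((b, p))
    return result


def sort_merge_join(left, right, left_key, right_key):
    """Equi-join: sort both inputs on the join key, then merge them."""
    left = sorted(left, key=left_key)
    right = sorted(right, key=right_key)
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left_key(left[i]), right_key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Duplicate values form a group on each side; pair the groups.
            i_end = i
            while i_end < len(left) and left_key(left[i_end]) == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right_key(right[j_end]) == lk:
                j_end += 1
            result.extend((l, r) for l in left[i:i_end] for r in right[j:j_end])
            i, j = i_end, j_end
    return result


# Hypothetical sample data: (dept_id, name) and (emp_id, lastname, dept_id).
departments = [(10, "Sales"), (20, "Ops")]
employees = [(1, "Smith", 10), (2, "Jones", 20), (3, "Johnson", 10), (4, "Lee", 30)]

hj = hash_join(departments, employees, lambda d: d[0], lambda e: e[2])
smj = sort_merge_join(departments, employees, lambda d: d[0], lambda e: e[2])
print(sorted(hj) == sorted(smj))
```

Both functions return the same pairs; they differ in where the work goes (hashing in memory versus sorting), which matches the CPU versus disk trade-off described above.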
Next, the query tree is turned into a physical operator tree, which is the actual execution plan.
Finally, according to the resource situation of the cluster, the plan is dispatched to the appropriate nodes for parallel computation.
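One way to picture the parallel step is a partitioned join: both inputs are hash-partitioned on the join key, and each pair of partitions is joined independently. The sketch below uses threads in one process as a stand-in for cluster nodes; all names and data are hypothetical, and it is not how Dremel or Impala schedule work.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(rows, key, n):
    """Hash-partition rows so that equal join keys land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts


def local_join(left_part, right_part, left_key, right_key):
    """Join one pair of partitions with a small in-memory hash table."""
    index = {}
    for l in left_part:
        index.setdefault(left_key(l), []).append(l)
    return [(l, r) for r in right_part for l in index.get(right_key(r), [])]


def parallel_join(left, right, left_key, right_key, workers=4):
    """Join matching partitions concurrently and concatenate the results."""
    left_parts = partition_by_key(left, left_key, workers)
    right_parts = partition_by_key(right, right_key, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(local_join, lp, rp, left_key, right_key)
            for lp, rp in zip(left_parts, right_parts)
        ]
        return [pair for f in futures for pair in f.result()]


# Hypothetical sample data: (dept_id, name) and (emp_id, lastname, dept_id).
departments = [(10, "Sales"), (20, "Ops")]
employees = [(1, "Smith", 10), (2, "Jones", 20), (3, "Johnson", 10), (4, "Lee", 30)]

joined = parallel_join(departments, employees,
                       lambda d: d[0], lambda e: e[2], workers=3)
print(sorted(joined))
```

Because rows with the same key always fall into the same partition, no partition needs data from another one, which is the property that lets the join run on separate nodes.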
Five major storage models
Yesterday I was discussing with a colleague whether Sybase is a relational database. My colleague said Sybase is a column store and should count as NoSQL, while I had always remembered Sybase as a relational database. I looked it up and found that my colleague was talking about Sybase IQ, which is a column store, while I was talking about Sybase SQL Server, which is a relational database. I then came across this article online and used it to catch up on several database models.
The database market is segmenting. Row databases no longer meet every requirement, and many needs are better served by in-memory databases and column databases. Column databases have their own strengths in three areas: data analysis, mass storage, and BI.
1. Relational database (row-based database): MySQL, Sybase, Oracle
Definition: the relational model stores data as records (rows, or tuples); records are stored in tables, and tables are defined by a schema.
- Each column in the table has a name and a type, and all records in the table conform to the table definition.
- SQL is a specialized query language that provides syntax for finding records that match a condition, for example through table joins.
- Table joins can query records across multiple tables based on the relationships between the tables.
Storage format: a row database stores the data values of one row together, then stores the next row, and so on.
For example, take the following table:
EmpId | Lastname | Firstname | Salary
1 | Smith | Joe | 40000
2 | Jones | Mary | 50000
3 | Johnson | Cathy | 44000
A row database stores it as:
1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
Characteristics:
- Space is allocated using a row-oriented storage architecture,
- Mainly suitable for processing small batches of data,
- Commonly used for online transaction processing.
It cannot meet the following three requirements:
- highly concurrent reads and writes against the database,
- efficient storage of and access to massive amounts of data,
- high scalability and high availability of the database.
In a word: not suitable for distributed, high-concurrency, massive-data workloads.
2. Column storage: Sybase IQ, C-Store, Vertica, HBase
Definition: what is a column database? A column database stores data using a column-oriented storage architecture.
- A column store writes all of the data in a column together as a stream,
- It is mainly suitable for batch data processing and ad hoc queries.
Storage format:
A column database stores the data values of one column together, then stores the next column, and so on.
1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
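The two storage formats can be reproduced from the same table with a few lines of Python. This is only a sketch of the layout idea; the function names are made up, and real engines write binary pages rather than comma-separated text.

```python
# The employee table from the example above.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def serialize_row_store(rows):
    """Write all values of one row together, then the next row."""
    return "".join(",".join(str(v) for v in row) + ";" for row in rows)


def serialize_column_store(rows):
    """Write all values of one column together, then the next column."""
    columns = zip(*rows)  # transpose rows into columns
    return "".join(",".join(str(v) for v in col) + ";" for col in columns)


row_layout = serialize_row_store(rows)
column_layout = serialize_column_store(rows)
print(row_layout)
print(column_layout)
```

A query such as the total of all salaries only needs the last segment of the column layout, whereas in the row layout it has to walk through every record.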
Characteristics:
- Queries are fast, because a query needs to read fewer blocks;
- The compression ratio is high, because values of the same type are stored together.
- Loading is fast.
- Data modeling becomes simpler.
- But inserts and updates are slow, so it is not well suited to data that changes all the time, precisely because it is stored by column.
From this you can see that it is a good choice for DSS (decision support systems), BI, data marts, and data warehouses, and that it is not suited to OLTP.
Examples are Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel, and Infobright.
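To see why storing a column's values together helps compression, here is a small sketch using run-length encoding. Run-length encoding is just one common technique chosen for illustration; it is an assumption here, not a statement about what any of the products above use, and the sample column is hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column: values of one type, with many adjacent repeats.
department = ["Sales", "Sales", "Sales", "Ops", "Ops", "Sales"]

encoded = rle_encode(department)
print(encoded)
print(rle_decode(encoded) == department)
```

In a row layout the same values would be interleaved with names and salaries, so runs like these would not appear.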
3. Key-value storage: Cassandra, HBase, Bigtable
That is, key-value storage, referred to as KV storage.
- It is one form of NoSQL storage.
- Its data is organized, indexed, and stored as key-value pairs.
- KV storage is well suited to business data that does not involve many data or business relationships. It can also effectively reduce the number of disk reads and writes, and offers better read/write performance than SQL database storage.
A typical example is the SSTable (Sorted String Table).
In fact, map and hash_map in the C++ STL, and Hashtable and HashMap in Java, are also key-value stores.
- But they only support in-memory operation,
- lookups in map are relatively inefficient,
- the point is that they are just simple data structures,
- they cannot provide large-scale storage or distribution,
- and modifying data is relatively inefficient.
SSTable solves these problems.
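The core idea of a sorted string table can be sketched in a few lines: keys are kept sorted and immutable, point lookups use binary search, and newer tables are merged over older ones. The class below is a simplified in-memory illustration with hypothetical names; a real SSTable lives on disk with block indexes, which this sketch leaves out.

```python
import bisect


class SortedStringTable:
    """Immutable run of key-value pairs kept in key order."""

    def __init__(self, items):
        pairs = sorted(dict(items).items())
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def get(self, key, default=None):
        """Point lookup by binary search over the sorted keys."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default

    def scan(self, start, end):
        """Range scan over keys in [start, end)."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))

    def items(self):
        return list(zip(self._keys, self._values))


def merge(older, newer):
    """Combine two tables; values from the newer table win on key conflicts."""
    merged = dict(older.items())
    merged.update(newer.items())
    return SortedStringTable(merged.items())


# Hypothetical keys and values.
t1 = SortedStringTable([("user:2", "Jones"), ("user:1", "Smith")])
t2 = SortedStringTable([("user:2", "Jones-Lee"), ("user:3", "Johnson")])
merged = merge(t1, t2)
print(merged.get("user:2"), merged.scan("user:1", "user:3"))
```

Updates never modify an existing table; they are written to a new one and reconciled by merging, which avoids the in-place modification cost mentioned above.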
A key-value store is in effect a kind of distributed table system.
Distributed key-value systems include Cassandra, HBase, and Bigtable.
Note: HBase is also a column-oriented store.
4. Document storage
Document storage supports access to structured data, but unlike the relational model it has no mandatory schema.
In effect, a document store holds data as bundles of key-value pairs.
- In this case, the application adopts some conventions about the bundles it retrieves,
- or uses the storage engine's ability to group documents into collections to manage the data.
Unlike the relational model, the document storage model supports nested structures.
For example, the document storage model supports XML and JSON documents, and the value of a field can itself contain other documents.
The document storage model also supports arrays and column-value keys.
Unlike key-value stores, document storage is aware of the internal structure of a document.
- This allows the storage engine to support secondary indexes directly, allowing efficient queries on any field.
- Support for nested documents lets the query language search nested objects; XQuery is an example. MongoDB implements similar functionality by allowing JSON field paths to be specified in queries.
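Querying by field path can be illustrated with plain Python dictionaries. The dotted path syntax below resembles the way field paths are commonly written, but the helper functions are hypothetical and are not the API of MongoDB or any other store; the documents are made-up sample data.

```python
def get_path(document, path, default=None):
    """Follow a dotted field path through nested dicts and lists."""
    current = document
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif (isinstance(current, list) and part.isdigit()
              and int(part) < len(current)):
            current = current[int(part)]
        else:
            return default
    return current


def find(collection, path, value):
    """Return the documents whose field at `path` equals `value`."""
    return [doc for doc in collection if get_path(doc, path) == value]


# Hypothetical collection: no fixed schema, nested documents and arrays.
employees = [
    {"name": "Smith", "address": {"city": "Boston"}, "skills": ["sql", "etl"]},
    {"name": "Jones", "address": {"city": "Austin"}},
    {"name": "Johnson", "skills": ["bi"]},
]

in_boston = find(employees, "address.city", "Boston")
first_skill = get_path(employees[0], "skills.0")
print([doc["name"] for doc in in_boston], first_skill)
```

Documents that lack the field are simply skipped, which is how a schema-free collection can still be queried on any field.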
MongoDB offers relatively rich SQL-like query capabilities and some ACID support. It is more often introduced, however, in the context of log collection and storage, distributed storage of small files, and data storage for microblog-style Internet applications.
5. Graph database
A graph database stores information about vertices and edges, and some also support adding annotations.
A graph database can be used to model things such as social graphs and real-world objects. The content of the IMDb (Internet Movie Database) site forms a complex graph in which actors and films are interlinked.
The query language of a graph database is typically used to find paths between endpoints in the graph, or properties of a path between endpoints. Neo4j is a typical graph database.
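A path query of this kind can be sketched with a breadth-first search over an adjacency list. The graph below is a made-up actor and film network in the spirit of the IMDb example; the function is a generic illustration and not Neo4j's query language.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    if start == goal:
        return [start]
    previous = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical graph: actors are linked to the films they appear in.
graph = {
    "Actor A": ["Film X"],
    "Film X": ["Actor A", "Actor B"],
    "Actor B": ["Film X", "Film Y"],
    "Film Y": ["Actor B", "Actor C"],
    "Actor C": ["Film Y"],
}

path = shortest_path(graph, "Actor A", "Actor C")
print(path)
```

The returned path alternates between actors and films, showing how two actors are connected through shared films.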