About Apache CarbonData 1.3


CarbonData is a high-performance big data storage solution that supports fast filtered lookups and ad hoc OLAP analysis. It has been deployed in production at more than 20 enterprises, with the largest single cluster holding trillions of rows. Because different analysis scenarios in the big data domain have different requirements, and business demands for analytical flexibility keep growing, data is often stored redundantly across multiple systems. CarbonData addresses this with a converged storage scheme in which a single copy of the data supports multiple application scenarios. Features such as multi-level indexing, dictionary encoding, pre-aggregation, dynamic partitioning, and quasi-real-time data query improve I/O scanning and computing performance, achieving second-level response for analysis over trillions of rows. It is mainly suitable for filtered query scenarios (for example, spatio-temporal queries), and group-by aggregation is particularly fast.

Key features of CarbonData 1.3.0:

1. Support for integration with Spark 2.2.1

CarbonData 1.3.0 supports integration with Spark 2.2.1, the latest stable Spark release at the time of this version.

2. Pre-aggregation: flexible, high-performance multidimensional analysis without up-front data modeling

The most significant difference between the pre-aggregation feature in CarbonData 1.3.0 and the cube approach of traditional BI systems is that users neither have to model cubes in advance nor modify any SQL statements: OLAP queries are accelerated while detail data remains queryable, so a single copy of the data serves multiple application scenarios. Usage is as follows:

a) Create the main table:

CREATE TABLE sales (
  order_time TIMESTAMP,
  user_id STRING,
  sex STRING,
  country STRING,
  quantity INT,
  price BIGINT)
STORED BY 'carbondata'

b) Create a pre-aggregated table based on the main table sales above:

CREATE DATAMAP agg_sales
ON TABLE sales
USING "preaggregate"
AS
SELECT country, sex, sum(quantity), avg(price)
FROM sales
GROUP BY country, sex

c) Users do not need to change their SQL query statements; queries are still issued against the main table. When a query request is received, CarbonData's cost-based optimizer automatically rewrites the SQL into a query on a pre-aggregated table. For example, the following statements hit the pre-aggregated table, significantly improving query performance:

SELECT country, sex, sum(quantity), avg(price) FROM sales GROUP BY country, sex;
-- hit: matches the aggregation table exactly
SELECT sex, sum(quantity) FROM sales GROUP BY sex;
-- hit: partial query against the aggregation table
SELECT country, avg(price) FROM sales GROUP BY country;
-- hit: partial query against the aggregation table
SELECT country, sum(price) FROM sales GROUP BY country;
-- hit: avg(price) in the aggregation table is produced as sum(price)/count(price), so sum(price) also hits
SELECT sex, avg(quantity) FROM sales GROUP BY sex;
-- miss: requires creating a new pre-aggregated table
SELECT max(price), country FROM sales GROUP BY country;
-- miss: requires creating a new pre-aggregated table
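The two misses above could be served by one additional pre-aggregated table. A sketch (the DATAMAP name agg_sales2 is an assumption, not part of the release notes; whether a single DATAMAP covers both queries depends on the optimizer's rewrite):

```sql
-- Hypothetical second DATAMAP covering avg(quantity) and max(price)
CREATE DATAMAP agg_sales2
ON TABLE sales
USING "preaggregate"
AS
SELECT country, sex, avg(quantity), max(price)
FROM sales
GROUP BY country, sex
```

Because avg is stored internally as sum and count per (country, sex) group, avg(quantity) grouped by sex alone can also roll up from this table.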

d) In version 1.3.0, the supported pre-aggregation expressions are SUM, AVG, MAX, MIN, and COUNT.

e) Measured on 100 million rows, query performance improves more than fivefold, and the larger the data set, the greater the improvement. Refer to the example: /apache/carbondata/examples/PreAggregateTableExample.scala

3. Time-dimension pre-aggregation for time-series data, with automatic roll-up

For time-series data, CarbonData has a built-in understanding of the time dimension and provides an easy-to-use syntax for creating pre-aggregated tables. Usage is as follows:

a) Create the main table:

CREATE TABLE sales (
  order_time TIMESTAMP,
  user_id STRING,
  sex STRING,
  country STRING,
  quantity INT,
  price BIGINT)
STORED BY 'carbondata'

b) Create aggregation tables at year, month, day, hour, and minute granularity, respectively:

-- Create the year-granularity aggregation table
CREATE DATAMAP agg_year
ON TABLE sales
USING "timeseries"
DMPROPERTIES (
  'event_time'='order_time',
  'year_granularity'='1')
AS SELECT order_time, country, sex, sum(quantity), avg(price)
FROM sales GROUP BY order_time, country, sex

-- Create the month-granularity aggregation table
CREATE DATAMAP agg_month
ON TABLE sales
USING "timeseries"
DMPROPERTIES (
  'event_time'='order_time',
  'month_granularity'='1')
AS SELECT order_time, country, sex, sum(quantity), avg(price)
FROM sales GROUP BY order_time, country, sex

-- Create the day-granularity aggregation table
CREATE DATAMAP agg_day
ON TABLE sales
USING "timeseries"
DMPROPERTIES (
  'event_time'='order_time',
  'day_granularity'='1')
AS SELECT order_time, country, sex, sum(quantity), avg(price)
FROM sales GROUP BY order_time, country, sex

-- Create the hour-granularity aggregation table
CREATE DATAMAP agg_sales_hour
ON TABLE sales
USING "timeseries"
DMPROPERTIES (
  'event_time'='order_time',
  'hour_granularity'='1')
AS SELECT order_time, country, sex, sum(quantity), avg(price)
FROM sales GROUP BY order_time, country, sex

-- Create the minute-granularity aggregation table
CREATE DATAMAP agg_minute
ON TABLE sales
USING "timeseries"
DMPROPERTIES (
  'event_time'='order_time',
  'minute_granularity'='1')
AS SELECT order_time, country, sex, sum(quantity), avg(price)
FROM sales GROUP BY order_time, country, sex

c) Users do not have to create aggregation tables for every time granularity; the system supports automatic roll-up. For example, once a day-granularity aggregation table exists, year- and month-granularity GROUP BY aggregations are answered by rolling the pre-aggregated day-granularity values up to year and month:

-- Create the day-granularity aggregation table
CREATE DATAMAP agg_day
ON TABLE sales
USING "timeseries"
DMPROPERTIES (
  'event_time'='order_time',
  'day_granularity'='1')
AS SELECT order_time, country, sex, sum(quantity), avg(price)
FROM sales GROUP BY order_time, country, sex

(Year- and month-granularity aggregate queries can then be rolled up from the agg_day table created above:)


SELECT timeseries(order_time, 'month'), sum(quantity)
FROM sales GROUP BY timeseries(order_time, 'month')

SELECT timeseries(order_time, 'year'), sum(quantity)
FROM sales GROUP BY timeseries(order_time, 'year')

This feature is in Alpha. The time granularity must currently be set to 1: for example, aggregation by 1 day is supported, but 3-day or 5-day granularities are not yet supported and will arrive in a later version. Automatic roll-up is supported across year, month, day, hour, and minute granularities.

4. Streaming quasi-real-time ingestion, improving data timeliness and avoiding the small-file problem

In many business scenarios, data timeliness, that is, how soon after data is generated it can be queried, is a very important metric. Among existing open-source big data solutions, few support both real-time ingestion and real-time query, and real-time ingestion is usually accompanied by the production of many small files. As a result, multiple storage systems are typically combined: for example, HBase handles real-time ingestion, with daily exports to Parquet files analyzed by Impala. Data timeliness is then delayed by a day, and multiple clusters must be managed, making maintenance very difficult.

In CarbonData 1.3.0, using row-and-column hybrid storage and integration with Spark Structured Streaming, users can import data into a CarbonData table and query it in real time. Because the real-time data and the historical data live in the same table, there is no need to switch back and forth between tables: querying a single table returns both real-time and historical data.

a) Receive data in real time:

val readSocketDF = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9099)
  .load()

b) Write the data to the CarbonData table:

val qry = readSocketDF.writeStream
  .format("carbondata")
  .trigger(ProcessingTime("5 seconds"))
  .option("checkpointLocation", tablePath.getStreamingCheckpointDir)
  .option("dbName", "default")
  .option("tableName", "carbon_table")
  .start()

(Refer to the example: /apache/carbondata/examples/CarbonStructuredStreamingExample.scala)

5. Support for standard Hive partitions, compatible with Spark and Hive partitioning

Hive static and dynamic partitioning have been in use for many years and are familiar to most developers. CarbonData 1.3.0 supports standard Hive partitioning, making it easier to migrate big data applications to CarbonData.

Because CarbonData has indexing capabilities, users can combine partitioning with SORT_COLUMNS to build a multidimensional key (MDK) index within each partition. This achieves the effect of a multi-level index and satisfies fast filtered queries on any combination of dimensions, so that one copy of the data serves many application scenarios.

For example, create the following table with productDate as the partition column, so that data is partitioned by day, and build a multidimensional index via SORT_COLUMNS. Queries can then filter quickly on any combination of productDate, productName, storeProvince, and storeCity.

CREATE TABLE IF NOT EXISTS productSalesTable (
  productName STRING,
  storeProvince STRING,
  storeCity STRING,
  saleQuantity INT,
  revenue INT)
PARTITIONED BY (productDate DATE)
STORED BY 'carbondata'
TBLPROPERTIES('SORT_COLUMNS' = 'productName, storeProvince, storeCity')
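With this layout, a query that filters on the partition column and a sort column prunes both partitions and index blocks. A sketch (the date and product values are illustrative):

```sql
-- Partition pruning on productDate plus MDK index filtering on productName
SELECT storeCity, sum(saleQuantity)
FROM productSalesTable
WHERE productDate = '2018-02-01' AND productName = 'p1'
GROUP BY storeCity
```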

6. Support for CREATE TABLE AS SELECT syntax

CTAS (CREATE TABLE AS SELECT) lets users create a new CarbonData table from a SELECT on a Parquet, Hive, or CarbonData table, which makes it possible to convert Parquet- or Hive-format data into CarbonData format.

CREATE TABLE carbon_table STORED BY 'carbondata' AS SELECT * FROM parquet_table

7. Support for querying specified import batches

Each batch of data imported into CarbonData produces a segment. In 1.3.0, users can specify segment IDs to query the data of particular segments, that is, to query a specified import batch.

a) Query the segment ID list:
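The segment list can be inspected with CarbonData's SHOW SEGMENTS command (database and table names are placeholders):

```sql
SHOW SEGMENTS FOR TABLE <database_name>.<table_name>
```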

b) Set the segment IDs to query:

SET carbon.input.segments.<database_name>.<table_name> = <list of segment IDs>
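For example (the database default, table carbon_table, and segment IDs are illustrative), to query only segments 1 and 2 and then restore querying of all segments:

```sql
SET carbon.input.segments.default.carbon_table = 1,2;
SELECT count(*) FROM carbon_table;
-- Reset so that subsequent queries read all segments
SET carbon.input.segments.default.carbon_table = *;
```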

(Refer to the example: /apache/carbondata/examples/QuerySegmentExample.scala)

8. Apache CarbonData official website: carbondata.apache.org

1.3.0 release: https://dist.apache.org/repos/dist/release/carbondata/1.3.0
