1. HBase: research notes on multiple records under the same rowkey
With VERSIONS = 3, HBase keeps at most three versions of each inserted cell.
A problem appeared when copying data from the table "verticaldatatable" in one HBase cluster into another table: some rowkeys had more than 10 versions, but only 3 were kept (the inserts themselves succeed; older versions are silently discarded).
Investigation showed that the table had been created with the default of VERSIONS = 3. VERSIONS is a property of the column family, so it must be modified there: alter 'tablename', {NAME => 'columnfamily', VERSIONS => '300'}
Useful: if VERSIONS = 1, only one version is kept per cell, which avoids duplicate records under the same rowkey. Usually, though, scan queries in HBase combine TIMERANGE with VERSIONS.
hbase(main):079:0> create 'scores', {NAME => 'course', VERSIONS => 2}   # column family keeps 2 versions
hbase(main):080:0> put 'scores', 'Tom', 'course:math', '97'
hbase(main):082:0> put 'scores', 'Tom', 'course:math', '100'
hbase(main):026:0> scan 'scores'
ROW    COLUMN+CELL
 Tom   column=course:math, timestamp=1394097651029, value=100
1 row(s) in 0.0110 seconds
# by default, scan returns only the record with the latest timestamp
hbase(main):032:0> scan 'scores', {VERSIONS => 2}
ROW    COLUMN+CELL
 Tom   column=course:math, timestamp=1394097651029, value=100
 Tom   column=course:math, timestamp=1394097631387, value=97
1 row(s) in 0.0130 seconds
# both retained versions are returned
hbase(main):029:0> alter 'member', {NAME => 'info', VERSIONS => 2}   # modify VERSIONS
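The version-retention behavior shown in the shell session can be sketched as a toy Python model (this is an illustration of the VERSIONS semantics only, not the HBase client API; the class and method names are made up for the example):

```python
from collections import defaultdict

class VersionedTable:
    """Toy model of HBase's per-column-family VERSIONS setting:
    each (rowkey, column) cell keeps at most `versions` entries,
    newest first; older entries are discarded on insert."""

    def __init__(self, versions=3):
        self.versions = versions
        self.cells = defaultdict(list)  # (rowkey, column) -> [(ts, value), ...]

    def put(self, rowkey, column, value, ts):
        cell = self.cells[(rowkey, column)]
        cell.append((ts, value))
        cell.sort(reverse=True)       # newest timestamp first
        del cell[self.versions:]      # drop versions beyond the limit

    def scan(self, rowkey, column, versions=1):
        # like `scan 'table', {VERSIONS => n}`: return up to n newest versions
        return self.cells[(rowkey, column)][:versions]

t = VersionedTable(versions=2)
t.put('Tom', 'course:math', '97', ts=1394097631387)
t.put('Tom', 'course:math', '100', ts=1394097651029)
t.put('Tom', 'course:math', '95', ts=1394097600000)  # oldest, evicted on insert
print(t.scan('Tom', 'course:math'))              # newest version only
print(t.scan('Tom', 'course:math', versions=2))  # both kept versions
```

As in the shell session, a plain scan sees only the newest value, while asking for more versions exposes the retained history up to the column family's limit.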
2. Hive: in-table deduplication solution
insert overwrite table store
select t.p_key, t.sort_word
from (
  select p_key, sort_word,
         row_number() over (distribute by p_key sort by sort_word) as rn
  from store
) t
where t.rn = 1;
This is a typical in-table deduplication pattern in Hive: p_key is the deduplication key, sort_word is the sort basis (generally a time column), and rn is the rank; keeping only rn = 1 retains one row per key.
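The same row_number pattern can be tried outside Hive. A minimal sketch using SQLite in place of Hive (window functions require SQLite >= 3.25, as bundled with recent Python builds; Hive's DISTRIBUTE BY ... SORT BY corresponds to PARTITION BY ... ORDER BY here, and the table contents are made-up sample data):

```python
import sqlite3

# Toy stand-in for the Hive table `store`: duplicate p_key values with
# different sort_word timestamps; we want to keep one row per p_key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store (p_key TEXT, sort_word TEXT);
    INSERT INTO store VALUES
        ('a', '2015-01-01'), ('a', '2015-02-01'),
        ('b', '2015-03-01');
""")

# Same idea as the Hive query: rank rows within each p_key, keep rn = 1.
rows = conn.execute("""
    SELECT p_key, sort_word FROM (
        SELECT p_key, sort_word,
               ROW_NUMBER() OVER (PARTITION BY p_key ORDER BY sort_word) AS rn
        FROM store
    ) t
    WHERE t.rn = 1
""").fetchall()
print(sorted(rows))  # one row per p_key, the earliest sort_word for each
```

Sorting descending in the OVER clause would instead keep the latest row per key, which is the more common choice when sort_word is an update timestamp.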
3. Research on using Spark SQL to extract from multiple data sources simultaneously: historical data (DBMS) and the big-data platform
At Spark Summit 2014, Databricks announced that it was abandoning Shark's development in favor of Spark SQL, on the grounds that Shark inherited too much from Hive and had hit an optimization bottleneck.
On March 13, 2015, Databricks released version 1.3.0; the biggest highlight of this release is the newly introduced DataFrame API (reference here).
Currently HDP supports Spark 1.2.0 (Spark SQL appeared in version 1.1.0).
An example program for Apache Spark 1.2.0 on YARN with HDP 2.2 is here.
HDP 2.2 supports Spark 1.2.0; its features remain to be tested, especially Spark SQL, and the known bugs in the current version should be understood in advance.
Data source Support:
The external data source API supports a variety of simple formats, such as JSON, Avro, and CSV, and also provides smarter support for Parquet, ORC, and so on. Through this API, developers can also connect external systems such as HBase to Spark via JDBC. An external data source can be mounted as a temporary table, avoiding the hassle of fully loading the data first.
Save Results:
Unified load/save API
In Spark 1.2.0, there are not many convenient options for saving the results held in a SchemaRDD. Some of the common ones include:
- rdd.saveAsParquetFile(...)
- rdd.saveAsTextFile(...)
- rdd.toJSON.saveAsTextFile(...)
- rdd.saveAsTable(...)
Spark SQL tables must be cached with cacheTable("tableName"); otherwise you will not get the benefits of columnar storage.
Use the JDBC data source and the JSON data source together, joining the tables to find the traffic log of the most recently registered users.
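A minimal analog of that cross-source join can be sketched without a Spark cluster, using SQLite in place of the JDBC source and the json module in place of the JSON data source (the table layout, column names, and sample records below are all made up for illustration):

```python
import json
import sqlite3

# Stand-in for the JDBC side: a relational table of registered users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (uid INTEGER, registered TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "2015-01-10"), (2, "2015-03-01"), (3, "2015-02-15")])

# Stand-in for the JSON side: one traffic-log record per line (JSON Lines).
log_lines = [
    '{"uid": 1, "url": "/home"}',
    '{"uid": 2, "url": "/signup"}',
    '{"uid": 2, "url": "/docs"}',
]
logs = [json.loads(line) for line in log_lines]

# Find the most recently registered user, then join against the logs,
# mimicking a Spark SQL query over the two mounted data sources.
(recent_uid,) = conn.execute(
    "SELECT uid FROM users ORDER BY registered DESC LIMIT 1").fetchone()
recent_logs = [r["url"] for r in logs if r["uid"] == recent_uid]
print(recent_uid, recent_logs)
```

In Spark SQL proper, both sides would be registered as temporary tables and the join expressed in one SQL statement; the point here is only the shape of the cross-source query.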
Test HDP by verifying the maximum Spark version it supports, then test the extraction integration on top of it.
Spark SQL 1.2 combined with HDP 2.2