Spark SQL 1.2 combined with HDP 2.2


1. HBase: a solution for multiple records under the same rowkey

With VERSIONS = 3 (the HBase default), at most three versions of a cell are retained.

A problem arose when inserting data from the table "verticaldatatable" in one HBase cluster into another table: some rowkeys had more than 10 versions, but only 3 could be inserted (i.e., retained) successfully.

Investigation showed that when the table was created, VERSIONS defaulted to 3. VERSIONS is a column-family property, so it must be changed on the table's column family, e.g. alter 'tablename', {NAME => 'columnfamily', VERSIONS => '300'}. Usefully, the reverse also holds: setting VERSIONS => 1 keeps only a single version, which avoids duplicate records under the same rowkey. Note, however, how VERSIONS (and TIMERANGE) behave in a scan query:
hbase(main):079:0> create 'scores', {NAME => 'course', VERSIONS => 2}    # keep up to 2 versions
hbase(main):080:0> put 'scores', 'Tom', 'course:math', '97'
hbase(main):082:0> put 'scores', 'Tom', 'course:math', '100'
hbase(main):026:0> scan 'scores'
ROW                 COLUMN+CELL
 Tom                column=course:math, timestamp=1394097651029, value=100
1 row(s) in 0.0110 seconds
# By default, scan returns only the record with the latest timestamp
hbase(main):032:0> scan 'scores', {VERSIONS => 2}
ROW                 COLUMN+CELL
 Tom                column=course:math, timestamp=1394097651029, value=100
 Tom                column=course:math, timestamp=1394097631387, value=97
1 row(s) in 0.0130 seconds
# Both versions are returned
hbase(main):029:0> alter 'member', {NAME => 'info', VERSIONS => 2}       # modify VERSIONS on an existing table
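As a concept check, HBase's per-cell version retention can be modeled in a few lines of Python (a toy model, not an HBase client; all names and data are hypothetical):

```python
from collections import defaultdict, deque

class ToyHTable:
    """Toy model of HBase cell-version retention (illustration only)."""
    def __init__(self, versions=3):
        self.versions = versions                 # max versions kept per cell
        self.cells = defaultdict(deque)          # (rowkey, column) -> version list

    def put(self, rowkey, column, timestamp, value):
        cell = self.cells[(rowkey, column)]
        cell.appendleft((timestamp, value))      # newest version first
        while len(cell) > self.versions:
            cell.pop()                           # oldest version is discarded

    def scan(self, rowkey, column, versions=1):
        # like `scan 'table'` vs `scan 'table', {VERSIONS => n}`
        return list(self.cells[(rowkey, column)])[:versions]

t = ToyHTable(versions=3)
for ts in range(1, 11):                          # insert 10 versions of one cell
    t.put('Tom', 'course:math', ts, 90 + ts)

print(len(t.cells[('Tom', 'course:math')]))      # 3 -- only 3 versions survive
print(t.scan('Tom', 'course:math'))              # newest version only
```

This mirrors the original problem: 10 versions written, only VERSIONS of them kept.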
2. Hive: in-table deduplication solution
insert overwrite table store
select t.p_key, t.sort_word
from (
    select p_key,
           sort_word,
           row_number() over (distribute by p_key sort by sort_word) as rn
    from store
) t
where t.rn = 1;

This is the typical Hive in-table deduplication pattern: p_key is the deduplication key, sort_word is the sort basis (usually a timestamp), and rn is the rank assigned within each p_key group; keeping only the rows with rn = 1 removes the duplicates.
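The same keep-one-row-per-key logic can be sketched in plain Python (the data below is hypothetical, purely for illustration):

```python
from itertools import groupby

# Hypothetical (p_key, sort_word) pairs; sort_word plays the role of a timestamp.
rows = [
    ('user1', '2015-03-01'),
    ('user1', '2015-03-02'),   # duplicate p_key, later timestamp
    ('user2', '2015-01-15'),
    ('user2', '2015-01-15'),   # exact duplicate
    ('user3', '2015-02-20'),
]

# "distribute by p_key sort by sort_word": sort by key, then by timestamp,
# then keep only the first row (rn = 1) within each p_key group.
rows.sort(key=lambda r: (r[0], r[1]))
deduped = [next(group) for _, group in groupby(rows, key=lambda r: r[0])]
print(deduped)  # exactly one row per p_key
```

Each group keeps its lowest-sorting row, just as the window query keeps rn = 1.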

3. Research: using Spark SQL to extract simultaneously from multiple data sources (a legacy DBMS and the big data platform)

At Spark Summit 2014, Databricks announced that it was abandoning development of Shark in favor of Spark SQL, on the grounds that Shark inherited too much from Hive and had hit optimization bottlenecks.

On March 13, 2015, Databricks released version 1.3.0; the biggest highlight of this release is the newly introduced DataFrame API.

HDP currently supports Spark 1.2.0 (Spark SQL itself first appeared earlier, in the 1.x line).

An example program for Apache Spark 1.2.0 on YARN with HDP 2.2 is available.

HDP 2.2 supports Spark 1.2.0; its features, especially Spark SQL, still await testing, so it is worth understanding the known bugs in the current version in advance.

Data source Support:

The external data source API supports a variety of simple formats such as JSON, Avro, and CSV, and also enables smarter, format-aware support for Parquet, ORC, and the like. Through this API, developers can also use JDBC to connect external systems such as HBase to Spark. An external data source can be attached as a temporary table, avoiding the hassle of fully loading the data first.

Save Results:

Unified Load/save API

In Spark 1.2.0, there are not many convenient options for saving the results held in a SchemaRDD. Some common ones include:

    • rdd.saveAsParquetFile(...)
    • rdd.saveAsTextFile(...)
    • rdd.toJSON.saveAsTextFile(...)
    • rdd.saveAsTable(...)

Spark SQL tables must be cached with cacheTable("tableName"); otherwise you will not enjoy the benefits of the columnar in-memory store.

Example: use the JDBC data source and the JSON data source to join tables, finding the traffic log of the most recently registered users.
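In Spark SQL this would mean registering both sources as temporary tables and joining them. The idea can be sketched in plain Python, with an in-memory SQLite database standing in for the JDBC source and line-delimited JSON standing in for the JSON source (all table names, fields, and data below are hypothetical):

```python
import json
import sqlite3

# Stand-in for the JDBC source: a registered-users table in a DBMS.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE users (name TEXT, registered TEXT)')
db.executemany('INSERT INTO users VALUES (?, ?)', [
    ('alice', '2015-03-01'), ('bob', '2015-01-10'), ('carol', '2015-03-05')])

# Stand-in for the JSON source: one traffic-log record per line.
log_lines = [
    '{"name": "alice", "bytes": 120}',
    '{"name": "bob",   "bytes": 300}',
    '{"name": "carol", "bytes": 80}',
]
traffic = [json.loads(line) for line in log_lines]

# Join: traffic for users registered on or after a (hypothetical) cutoff date.
recent = {name for (name,) in
          db.execute("SELECT name FROM users WHERE registered >= '2015-03-01'")}
result = [rec for rec in traffic if rec['name'] in recent]
print(result)  # traffic records for recently registered users only
```

The point is the shape of the workflow, not the libraries: two heterogeneous sources, both exposed as queryable tables, joined in one place.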

Next, test the HDP setup: verify the highest Spark version it supports, then test the performance of extraction and integration.

