1. HBase: research notes on multiple records under the same rowkey
With VERSIONS = 3, HBase keeps at most three versions of each inserted cell.
A problem appeared when copying data from the table "verticaldatatable" in one HBase cluster into another table: some rowkeys had more than 10 versions, but only 3 were kept (the inserts themselves succeed; older versions are silently discarded).
Investigation showed that the table had been created with the default of VERSIONS = 3. VERSIONS is a property of the column family, so it must be modified there: alter 'tablename', {NAME => 'columnfamily', VERSIONS => '300'}
Useful: if VERSIONS = 1, only one version is kept per cell, which avoids duplicate records under the same rowkey. Usually, though, scan queries in HBase combine TIMERANGE with VERSIONS.
hbase(main):079:0> create 'scores', {NAME => 'course', VERSIONS => 2}   # column family keeps 2 versions
hbase(main):080:0> put 'scores', 'Tom', 'course:math', '97'
hbase(main):082:0> put 'scores', 'Tom', 'course:math', '100'
hbase(main):026:0> scan 'scores'
ROW    COLUMN+CELL
 Tom   column=course:math, timestamp=1394097651029, value=100
1 row(s) in 0.0110 seconds
# by default, scan returns only the record with the latest timestamp
hbase(main):032:0> scan 'scores', {VERSIONS => 2}
ROW    COLUMN+CELL
 Tom   column=course:math, timestamp=1394097651029, value=100
 Tom   column=course:math, timestamp=1394097631387, value=97
1 row(s) in 0.0130 seconds
# both retained versions are returned
hbase(main):029:0> alter 'member', {NAME => 'info', VERSIONS => 2}   # modify VERSIONS
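The version-retention behavior shown in the shell session can be sketched as a toy Python model (this is an illustration of the VERSIONS semantics only, not the HBase client API; the class and method names are made up for the example):

```python
from collections import defaultdict

class VersionedTable:
    """Toy model of HBase's per-column-family VERSIONS setting:
    each (rowkey, column) cell keeps at most `versions` entries,
    newest first; older entries are discarded on insert."""

    def __init__(self, versions=3):
        self.versions = versions
        self.cells = defaultdict(list)  # (rowkey, column) -> [(ts, value), ...]

    def put(self, rowkey, column, value, ts):
        cell = self.cells[(rowkey, column)]
        cell.append((ts, value))
        cell.sort(reverse=True)       # newest timestamp first
        del cell[self.versions:]      # drop versions beyond the limit

    def scan(self, rowkey, column, versions=1):
        # like `scan 'table', {VERSIONS => n}`: return up to n newest versions
        return self.cells[(rowkey, column)][:versions]

t = VersionedTable(versions=2)
t.put('Tom', 'course:math', '97', ts=1394097631387)
t.put('Tom', 'course:math', '100', ts=1394097651029)
t.put('Tom', 'course:math', '95', ts=1394097600000)  # oldest, evicted on insert
print(t.scan('Tom', 'course:math'))              # newest version only
print(t.scan('Tom', 'course:math', versions=2))  # both kept versions
```

As in the shell session, a plain scan sees only the newest value, while asking for more versions exposes the retained history up to the column family's limit.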
2. Hive: in-table deduplication solution
insert overwrite table store
select t.p_key, t.sort_word
from (
  select p_key, sort_word,
         row_number() over (distribute by p_key sort by sort_word) as rn
  from store
) t
where t.rn = 1;
This is a typical in-table deduplication pattern in Hive: p_key is the deduplication key, sort_word is the sort basis (generally a time column), and rn is the rank; keeping only rn = 1 retains one row per key.
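The same row_number pattern can be tried outside Hive. A minimal sketch using SQLite in place of Hive (window functions require SQLite >= 3.25, as bundled with recent Python builds; Hive's DISTRIBUTE BY ... SORT BY corresponds to PARTITION BY ... ORDER BY here, and the table contents are made-up sample data):

```python
import sqlite3

# Toy stand-in for the Hive table `store`: duplicate p_key values with
# different sort_word timestamps; we want to keep one row per p_key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store (p_key TEXT, sort_word TEXT);
    INSERT INTO store VALUES
        ('a', '2015-01-01'), ('a', '2015-02-01'),
        ('b', '2015-03-01');
""")

# Same idea as the Hive query: rank rows within each p_key, keep rn = 1.
rows = conn.execute("""
    SELECT p_key, sort_word FROM (
        SELECT p_key, sort_word,
               ROW_NUMBER() OVER (PARTITION BY p_key ORDER BY sort_word) AS rn
        FROM store
    ) t
    WHERE t.rn = 1
""").fetchall()
print(sorted(rows))  # one row per p_key, the earliest sort_word for each
```

Sorting descending in the OVER clause would instead keep the latest row per key, which is the more common choice when sort_word is an update timestamp.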
3. Research on using Spark SQL to extract from multiple data sources simultaneously: historical data (DBMS) and the big-data platform
At Spark Summit 2014, Databricks announced that it was abandoning Shark's development in favor of Spark SQL, on the grounds that Shark inherited too much from Hive and had hit an optimization bottleneck.
On March 13, 2015, Databricks released version 1.3.0; the biggest highlight of this release is the newly introduced DataFrame API (reference here).
Currently HDP supports Spark 1.2.0 (Spark SQL appeared in version 1.1.0).
An example program for Apache Spark 1.2.0 on YARN with HDP 2.2 is here.
HDP 2.2 supports Spark 1.2.0; its features remain to be tested, especially Spark SQL, and the known bugs in the current version should be understood in advance.
Data source Support:
The external data source API supports a variety of simple formats, such as JSON, Avro, and CSV, and also provides smarter support for Parquet, ORC, and so on. Through this API, developers can also connect external systems such as HBase to Spark via JDBC. An external data source can be mounted as a temporary table, avoiding the hassle of fully loading the data first.
Save Results:
Unified load/save API
In Spark 1.2.0, there are not many convenient options for saving the results held in a SchemaRDD. Some of the common ones include:
- rdd.saveAsParquetFile(...)
- rdd.saveAsTextFile(...)
- rdd.toJSON.saveAsTextFile(...)
- rdd.saveAsTable(...)
Spark SQL tables must be cached with cacheTable("tableName"); otherwise you will not get the benefits of columnar storage.
Use the JDBC data source and the JSON data source together, joining the tables to find the traffic log of the most recently registered users.
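A minimal analog of that cross-source join can be sketched without a Spark cluster, using SQLite in place of the JDBC source and the json module in place of the JSON data source (the table layout, column names, and sample records below are all made up for illustration):

```python
import json
import sqlite3

# Stand-in for the JDBC side: a relational table of registered users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (uid INTEGER, registered TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "2015-01-10"), (2, "2015-03-01"), (3, "2015-02-15")])

# Stand-in for the JSON side: one traffic-log record per line (JSON Lines).
log_lines = [
    '{"uid": 1, "url": "/home"}',
    '{"uid": 2, "url": "/signup"}',
    '{"uid": 2, "url": "/docs"}',
]
logs = [json.loads(line) for line in log_lines]

# Find the most recently registered user, then join against the logs,
# mimicking a Spark SQL query over the two mounted data sources.
(recent_uid,) = conn.execute(
    "SELECT uid FROM users ORDER BY registered DESC LIMIT 1").fetchone()
recent_logs = [r["url"] for r in logs if r["uid"] == recent_uid]
print(recent_uid, recent_logs)
```

In Spark SQL proper, both sides would be registered as temporary tables and the join expressed in one SQL statement; the point here is only the shape of the cross-source query.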
Test HDP by verifying the maximum Spark version it supports, then test the extraction integration on top of it.
Spark SQL 1.2 combined with HDP 2.2