Background: With the advent of the Big Data era, more and more data flows into the Hadoop ecosystem, and the ability to extract valuable information from terabytes or even petabytes of data becomes ever more important to a product and a company. During the rapid development of the Hadoop ecosystem a number of open-source data analysis engines have emerged, such as Hive, Spark SQL, Impala and Presto, and alongside them ...
One: Parquet best practices for Spark SQL. 1. In the past, the industry's big-data analysis technology stack was generally organized as one of two pipelines: A) data source -> MR/Hive/Spark (the ETL stage) -> HDFS -> Parquet on HDFS -> Spark SQL/Impala -> result service (the result can be placed in a DB, or exposed as a data service via JDBC/ODBC); B) result ...
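A minimal spark-shell sketch of pipeline A, with invented paths, column names, and database coordinates; sc and sqlContext are the objects the shell predefines:

    import sqlContext.implicits._

    // ETL stage: read raw text data from HDFS, parse it, and land it as Parquet.
    val raw = sc.textFile("hdfs:///data/raw/events/*")
    val events = raw.map(_.split("\t"))
      .map(p => (p(0), p(1), p(2).toLong))
      .toDF("user_id", "event_type", "ts")
    events.write.parquet("hdfs:///data/warehouse/events_parquet")

    // Analysis stage: Spark SQL (or Impala) scans the Parquet data; the small
    // aggregated result is pushed to a serving database over JDBC.
    val result = sqlContext.read.parquet("hdfs:///data/warehouse/events_parquet")
      .groupBy("event_type").count()
    val props = new java.util.Properties()
    props.setProperty("user", "root")
    props.setProperty("password", "root")
    result.write.jdbc("jdbc:mysql://dbhost:3306/report", "event_counts", props)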
Compared with the row-based storage engines of traditional databases, a column-based storage engine offers a higher compression ratio and fewer I/O operations. This matters most when the data has many columns but each query only reads and computes over a few of them; in that case the column-based engine is the more cost-effective choice. Among current open-source implementations, the best-known columnar storage engines are Parquet and ORC, and both are Apache top-level projects.
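To make the "each query only touches a few columns" point concrete, here is a small spark-shell sketch; the table path and column names are invented:

    // Because Parquet stores each column separately, only the two columns named
    // in the select are actually read from disk; the remaining columns are skipped.
    val wide = sqlContext.read.parquet("hdfs:///warehouse/wide_table_parquet")
    wide.select("user_id", "revenue")
      .groupBy("user_id")
      .sum("revenue")
      .show()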
I. Environment description. Hadoop cluster: the test Hadoop cluster, with nodes hadoop230, hadoop231, hadoop232, and hadoop233. The machines share the same configuration; the main parameters are:
Number of CPUs: 2
Number of CPU threads: 32
Memory: 128 GB
Disk: 48 TB
All tests use the same queue on the test cluster, and all queries run non-concurrently, so each query has the entire cluster's resources to itself. Hive is the official 1.2.1 release, launched via HiveServer2, using ...
Brief introduction: Apache Parquet is a columnar storage format designed primarily for the Hadoop ecosystem. It is independent of any particular data processing framework, data model, or programming language. Cloudera uses it as the columnar store in Impala, its big-data online analytics (OLAP) project. Parquet began as Twitter's internal columnar storage format; it is now open source and the code is hosted on GitHub.
Compared with the row-oriented storage engine of a traditional database, a columnar storage engine has a higher compression ratio and fewer I/O operations, especially when the data has many columns but each query only reads and computes over a few of them; in that situation the columnar engine is the more cost-effective choice. Among current open-source implementations, the best-known columnar storage engines are Parquet and ORC; both are Apache top-level projects and play an important role in data storage.
In order to optimize the performance of MapReduce and the tools built on top of it, a number of storage formats beyond Hadoop's built-in ones have emerged, such as RCFile, which optimizes Hive's performance, and Parquet, which together with Impala implements the features of Google Dremel (or even a superset of them). Today, let's study the evolution of data storage formats in HDFS together.
1. Spark reads gzip-compressed (.gz) files from HDFS
Spark 1.5 and later versions support reading .gz files directly, no differently from reading other plain-text files. Start the spark-shell and read the .gz file the same way you would a plain-text file:
sc.textFile("/your/path/*.gz").map { ... }
The code above is all that is needed to read gzip-compressed files. 2. Spark reads Parquet format files
Spark natively supports files in Parquet format.
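For example, in the spark-shell (Spark 1.4+ API; the path is a placeholder):

    val parquetDF = sqlContext.read.parquet("/your/path/*.parquet")
    parquetDF.registerTempTable("events")   // expose the data to SQL
    sqlContext.sql("select count(*) from events").show()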
Reprinted from Dong's Blog: Compared with a traditional row-oriented storage engine, a columnar storage engine has a higher compression ratio and fewer I/O operations (note: columnar storage is not a cure-all; in many scenarios row storage is still more efficient), especially when the data has many columns but each operation only targets a few of them; there the columnar engine is more cost-effective. In Internet big-data application scenarios, the data volume is in most cases very large and the number of data fields is large, yet each query only reads a few of those fields.
There is a problem in Spark SQL 1.2.x: when a query accesses multiple Parquet files whose field names and types are identical but whose field order differs (for example, one file is name string, id int while another is id int, name string), the query fails and throws a metadata-merge exception. In 1.3 this problem has been solved.
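A sketch of the behaviour from 1.3 onward, using the Spark 1.4+ reader API and made-up paths: files whose columns agree in name and type are reconciled by column name rather than by position, and schema merging can also be requested explicitly for files that add new columns.

    // part1 was written as (name string, id int), part2 as (id int, name string);
    // Spark matches the columns by name when it builds the unified schema.
    val merged = sqlContext.read
      .option("mergeSchema", "true")   // also tolerates files that introduce extra columns
      .parquet("hdfs:///data/t/part1", "hdfs:///data/t/part2")
    merged.printSchema()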
spark.sql("show databases").show. After the database has been created successfully, switch to it: spark.sql("use spark"). Now start reading the remote MySQL data:

    val sql = """CREATE TABLE student USING org.apache.spark.sql.jdbc
      OPTIONS (url "jdbc:mysql://worker2:3306/spark", dbtable "student", user "root", password "root")"""

Execute it with spark.sql(sql). Once execution completes, the table data can be cached: spark.sql("cache table student"). At this point you can, for example, run: val studentDF = spark.sql("select id, ...
For a detailed introduction to Parquet, please refer to "Next-generation columnar storage format Parquet"; that article describes Parquet in detail and it is not repeated here. However, its treatment of definition level (DL) and repetition level (RL) is rather hard to follow, so here is an easier-to-understand summary. DL and RL are best understood with the help of a concrete example.
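As a quick illustration (the schema and records below are made up for this summary, not taken from the referenced article), consider the column owner.name in the schema:

    message Document {
      optional group owner {
        optional binary name (UTF8);
      }
    }

    record { owner: { name: "alice" } }  ->  value "alice", D = 2, R = 0   (both optional levels present)
    record { owner: { } }                ->  NULL,          D = 1, R = 0   (owner exists, name missing)
    record { }                           ->  NULL,          D = 0, R = 0   (owner itself missing)

The definition level counts how many of the optional or repeated ancestors on the column's path are actually defined, which is how Parquet distinguishes "owner is null" from "owner.name is null". The repetition level stays 0 here because nothing on the path is repeated; it becomes non-zero only for repeated fields, where it records at which repeated ancestor the current value continues within the same record.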
For mass data storage it is recommended to replace plain files on HDFS with Parquet columnar storage. The following two articles explain how to store data with Parquet columnar storage, mainly to improve query performance and storage compression: "Parquet in Spark SQL: best practices and code in action" (http://blog.csdn.net/sundujing/article/details/51438306) and "How-to: Convert te...".
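A minimal spark-shell sketch of such a conversion (the file layout, delimiter, and column names are invented for illustration):

    import sqlContext.implicits._

    // Choose the Parquet compression codec before writing (gzip is the Spark 1.x default).
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

    // Parse a comma-delimited text file and rewrite it as Parquet.
    val users = sc.textFile("hdfs:///data/raw/users.txt")
      .map(_.split(","))
      .map(a => (a(0).toInt, a(1), a(2)))
      .toDF("id", "name", "city")
    users.write.parquet("hdfs:///data/warehouse/users_parquet")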
From the DT Big Data Dream Factory (Dt_spark) blog. One: rethinking the meaning of Parquet under Spark SQL. Storage space includes both memory and disk; on the computation side, if HDFS is the de facto standard for distributed file system storage in the big data era, then Parquet is the de facto standard ...
Background: the data type of some fields in a Hive table was modified, for example from STRING to DOUBLE, and the table's underlying file format is Parquet. After the modification the Impala metadata was refreshed, and queries on the columns whose type changed then fail with a Parquet schema column data-type incompatibility error. The existing Parquet files still record the old physical type in their own embedded schema, so the file schema no longer matches the new table schema. For example, in Impala the query produces an error; the extracted results are as follows: ...
SQLContext sqlContext = new SQLContext(sc);
DataFrame usersDF = sqlContext.read().load("hdfs://spark1:9000/users.parquet");
usersDF.select("name", "favorite_color").write().save("hdfs://spark1:9000/namesAndFavColors.parquet");
Manually specifying a data source type: you can also specify the type of data source to use for the operation explicitly. The data source usually needs to be identified by its fully qualified name; for example, Parquet is org.apache.spark.sql.parquet.
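A short Scala sketch of naming the source format explicitly (the paths are placeholders; for the built-in sources the short names "json" and "parquet" resolve to their fully qualified implementations):

    // Read JSON by naming the format explicitly, then save the result as Parquet.
    val people = sqlContext.read.format("json").load("hdfs://spark1:9000/people.json")
    people.write.format("parquet").save("hdfs://spark1:9000/people.parquet")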
Considerations when using the Parquet storage type in Spark 1.2.0. SQL statement: select * from order_created_dynamic_partition_parquet; Running it in spark-sql returns the partition values (e.g. 2014-05...) but the string columns come back as raw byte-array object references (values such as [B@...) instead of readable text. Running the same statement in Beeline fails with an error: ...
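The excerpt does not show the resolution. A common workaround for this symptom (an assumption here, not taken from the original) is to tell Spark SQL to interpret Parquet BINARY columns as strings:

    // Not from the original article: treat Parquet BINARY columns written by other
    // tools (Hive, Impala, older Spark) as strings instead of raw byte arrays.
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
    // equivalently, inside spark-sql or Beeline:
    // SET spark.sql.parquet.binaryAsString=true;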