The Sunflower Treasure Book of Hadoop File Storage

Source: Internet
Author: User
Tags: hadoop ecosystem


File storage in Hadoop is divided into row storage and column storage, and each is further divided into different formats. Which should we use in practice, and how? Come and see!

So when do we specify the file storage format? For example, when creating a table in Hive or Impala, in addition to specifying the columns and delimiters, the command line ends with a STORED AS parameter. This parameter defaults to text format, but text is not suitable for every scenario, so this is where we can change the storage format.
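As a minimal sketch in Java (assuming a HiveServer2 instance at localhost:10000, the hive-jdbc driver on the classpath, and a hypothetical logs table), here is how the STORED AS clause might be passed when creating a table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateTableDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 endpoint; adjust host, port and credentials for your cluster
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = conn.createStatement()) {
                // STORED AS TEXTFILE is the default; replace it with SEQUENCEFILE,
                // AVRO or PARQUET to change the table's file storage format
                stmt.execute("CREATE TABLE logs (id INT, msg STRING) "
                           + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                           + "STORED AS TEXTFILE");
            }
        }
    }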


So which format should we choose? What are the characteristics of each format, and why would we pick one over another?

1. Text file

Text files are the most basic file type in Hadoop. They can be read and written from any programming language, and they are compatible with comma- and tab-delimited files and many other applications. Text files are also human-readable; because everything is a string, they are very useful when debugging. However, once data reaches a certain scale this format becomes very inefficient: (1) representing a numeric value as a string wastes storage space; (2) binary data, such as images, is hard to represent and usually relies on other techniques, such as Base64 encoding.

So the text file format can be summed up as: easy to work with, but low performance.
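A small illustration of both drawbacks, using only the JDK (the values here are made up):

    import java.util.Base64;

    public class TextCostDemo {
        public static void main(String[] args) {
            // (1) A numeric value as text: 123456789 costs 9 bytes as a string
            //     but only 4 bytes as a binary int
            int value = 123456789;
            String asText = Integer.toString(value);
            System.out.println(asText.length() + " bytes as text vs 4 bytes binary");

            // (2) Raw binary (e.g. image bytes) must be Base64-encoded to survive
            //     in a line-oriented text file, inflating it by about a third
            byte[] raw = {0x07, 0x5B, (byte) 0xCD, 0x15}; // the same int, big-endian
            System.out.println("base64: " + Base64.getEncoder().encodeToString(raw));
        }
    }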

2. Sequence file

The sequence file is essentially a binary container format based on key-value pairs. It is less redundant and more efficient than the text format, and it is suitable for storing binary data such as images. However, it is a Java-specific format, tightly integrated with Hadoop.

So the sequence file format can be summed up as: good performance, but hard to work with outside Hadoop.
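A sketch of writing such key-value pairs with the standard Hadoop API (the file name and record values are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Each record is a binary key-value pair; Writables avoid string overhead
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("demo.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class))) {
                for (int i = 0; i < 100; i++) {
                    writer.append(new Text("key-" + i), new IntWritable(i));
                }
            }
        }
    }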

3. Avro data file

Avro data files are binary encoded and store data efficiently. Not only are they widely supported across the Hadoop ecosystem, they can also be used outside of Hadoop. Readable and writable from many languages, they are an ideal choice for long-term storage of important data.

The schema is embedded in the file itself, and through it we can define the data layout much as we would a table, flexibly specifying fields and field types. Schema evolution lets the format adapt to all kinds of change: if we specify a schema today and later add some fields, delete some fields, or change a field's type or length, Avro can cope.

So the Avro data file format can be summed up as: excellent operability and performance; it is the best choice for general-purpose storage in Hadoop.
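A minimal write sketch with the Avro Java library (the User schema and its values are made up for illustration):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroWriteDemo {
        public static void main(String[] args) throws Exception {
            // Table-like schema; in practice it often lives in a separate .avsc file
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("age", 30);
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                // The schema is embedded in the file header, enabling schema evolution
                writer.create(schema, new File("users.avro"));
                writer.append(user);
            }
        }
    }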

The three formats described above are row-oriented, but Hadoop also has several column-oriented formats. A typical OLTP system stores data as rows: consecutive rows sit in contiguous blocks, so when we randomly access a value, usually with some filter condition, row storage can quickly locate the right block and fetch the whole row. Column storage, by contrast, stores data column by column. If we applied column storage to OLTP and wanted to locate a specific row, we would have to scan every column, which makes it a terrible fit for online transaction scenarios. The real significance of column storage is in big-data analysis scenarios, such as extracting feature values or filtering variables. In big-data applications we often work with very wide tables, and a given business analysis may need only one or a few dozen of the columns; with column storage we can scan just the selected columns instead of the whole table. Row and column storage are not absolutely good or bad; they simply suit different scenarios, as the sketch below illustrates.
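To make the access-pattern difference concrete, here is a toy in-memory illustration (not any real storage format): scanning one column out of fifty reads a single contiguous array in a columnar layout, but hops across every wide row in a row layout:

    public class RowVsColumnDemo {
        static final int ROWS = 100_000, COLS = 50; // a "wide table"

        public static void main(String[] args) {
            long[][] rowStore = new long[ROWS][COLS]; // row layout: a record's fields are adjacent
            long[][] colStore = new long[COLS][ROWS]; // column layout: a column's values are adjacent

            long sum = 0;
            // Analytic scan of one column out of 50:
            for (int r = 0; r < ROWS; r++) sum += rowStore[r][7]; // hops across 100,000 wide rows
            for (int r = 0; r < ROWS; r++) sum += colStore[7][r]; // reads one contiguous array
            System.out.println(sum);
        }
    }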


Let's now look at the most important of these column-oriented storage formats:

4. Parquet file

The Parquet file format is very important and will only become more widely used. If we call HDFS the de facto standard for big data storage, then Parquet is the de facto standard for the file storage format. Spark has already adopted it as its default file format, which says a lot about its importance. This open-source columnar format, originally developed by Cloudera and Twitter, is supported by MapReduce, Hive, Pig, Impala, Spark, Crunch, and other projects. Like Avro data files, Parquet files carry schema metadata; the difference is that Parquet files are column-oriented while Avro data files are row-oriented. It is also worth emphasizing that Parquet applies some additional encoding optimizations that reduce storage space and improve performance.

So the Parquet file format can be summed up as: excellent operability and performance; it is the best choice for column-oriented access patterns.
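A write sketch using the parquet-avro bindings (the Event schema and values are made up; shown is the Hadoop-Path variant of the builder API):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetWriteDemo {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"long\"},"
              + "{\"name\":\"category\",\"type\":\"string\"}]}");
            try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("events.parquet"))
                         .withSchema(schema) // schema metadata, as with Avro
                         .withCompressionCodec(CompressionCodecName.SNAPPY) // one of Parquet's extra encoding optimizations
                         .build()) {
                GenericRecord event = new GenericData.Record(schema);
                event.put("id", 1L);
                event.put("category", "click");
                writer.write(event); // values are grouped and encoded column by column on disk
            }
        }
    }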

File storage formats deserve focused study, especially the advantages and disadvantages of each format; only by knowing them well can we choose wisely. Beyond that, share what you learn with others in your daily work; it is a great way to round out your knowledge and raise your technical level. The "Big data CN" public account is a friendly recommendation, come and exchange ideas!


This article is from the "11872756" blog; please be sure to keep this source: http://11882756.blog.51cto.com/11872756/1887635

