Three Big Data Tools Compete: Which Is the Real King?

Last Update:2016-07-14 Source: Internet

Author: User

Tags hadoop mapreduce


A common claim in the industry is that SQL, although proven in data analysis, is helpless in the face of big data and is being superseded by Hadoop, the hottest technology of the moment. This is an exaggeration: many projects now use Hadoop for data storage and SQL for front-end queries. That means Hadoop needs the support of a high-level query language. Hadoop MapReduce can perform data analysis, but programming it directly is too complex. As a result, developers created the higher-level, SQL-inspired tools Pig and Hive.
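To give a sense of why hand-written MapReduce feels heavy, here is a minimal Python sketch of the map, shuffle, and reduce steps for a simple count per key. It is not Hadoop code: it runs in a single process, and the sample records and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `count_views`) are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample records: (user, page) pairs standing in for a click log.
RECORDS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
    ("carol", "/home"),
]


def map_phase(record):
    """Emit one (key, 1) pair per record; the key here is the page."""
    _user, page = record
    yield page, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def count_views(records):
    """Run the three phases in order and return {page: view_count}."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# The same question in a SQL-like language is a single statement, for example:
#   SELECT page, COUNT(*) FROM clicks GROUP BY page;
```

Even in this toy form, the analysis has to be split into three cooperating functions, whereas a SQL-like language expresses it in one line.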

In the big data age, we have many query tools to choose from. SQL still holds a dominant position, but as big data keeps heating up, Apache Pig and Hive have plenty of room to grow. As the Chinese proverb says, a craftsman who wants to do good work must first sharpen his tools: choosing the right platform and language lets you extract, process, and analyze data with half the effort. Data will keep getting bigger, so data analysis must become easier to perform. Fast processing and simple operation are bound to be major trends in big data analytics.

Apache Pig, Apache Hive, and SQL are today's leading big data tools. Each has its own advantages, so let's start by introducing them one by one.

　　SQL

Structured Query Language (SQL) is the programmer's best companion, used primarily to process and extract data. Big data is changing how data is processed and visualized, yet SQL's strict relational schemas and declarative nature remain the benchmark for data analysis. Although SQL is widely used, big data challenges both its functionality and its performance.

　　Pig

Apache Pig is suitable for programmers with SQL backgrounds and has the following two features:

1. Relaxed data storage requirements

2. The ability to manipulate large datasets

Apache Pig was developed in 2006. In addition to the features above, it offers good scalability and performance optimization. Apache Pig lets developers pursue several query approaches while reducing repeated data retrieval. It supports composite data types (map, tuple, bag) and common data operations such as filtering, sorting, and joins. These features have earned Apache Pig recognition from users around the world, and companies such as Yahoo and Twitter use it.
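The composite types and operations mentioned above can be approximated in plain Python. This is only an analogy, not Pig itself: a Pig tuple is modelled as a Python tuple, a bag as a list of tuples, and a map as a dict. The sample data and the helper names (`filter_adults`, `sort_by_age`, `join_on_name`) are hypothetical.

```python
# Python analogues of Pig's composite types (an approximation, not Pig itself).
user_tuple = ("alice", 30)                                # tuple: ordered fields
users_bag = [("alice", 30), ("bob", 17), ("carol", 45)]   # bag: collection of tuples
profile_map = {"plan": "free", "lang": "en"}              # map: key/value pairs

# A second hypothetical bag to join against: (name, item) tuples.
orders_bag = [("alice", "book"), ("carol", "lamp"), ("alice", "pen")]


def filter_adults(users, min_age=18):
    """Filtering: keep tuples whose age field meets the threshold."""
    return [user for user in users if user[1] >= min_age]


def sort_by_age(users):
    """Sorting: order tuples by their age field, ascending."""
    return sorted(users, key=lambda user: user[1])


def join_on_name(users, orders):
    """Join: inner join on the first field (name) of each tuple."""
    return [
        (name, age, item)
        for name, age in users
        for order_name, item in orders
        if name == order_name
    ]
```

For example, `join_on_name(filter_adults(users_bag), orders_bag)` chains a filter and a join, which is the kind of step-by-step data flow a Pig script describes.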

　　Hive

Although Apache Pig performs well, it requires programmers to learn things beyond SQL. Hive, by contrast, is very similar to SQL; the Hive Query Language (HQL) has some limitations, but it is still very useful. Hive provides a good open source implementation on top of MapReduce. It handles distributed data processing well and, unlike SQL, does not require strict adherence to a schema.

There is no one-size-fits-all approach to data extraction, processing, and analysis. You need to weigh several factors, such as how the data is stored, the structure of the programming language, and the results you expect. Let's compare Pig, Hive, and SQL to see which scenarios each of them fits.

　　Pig VS SQL

Inside a DBMS, SQL runs faster than MapReduce, the engine on which Pig Latin scripts run. However, loading data into an RDBMS is challenging, which makes it harder to set up. Pig Latin has advantages in declaring execution plans, building ETL processes, and modifying pipelines.

To a large extent, SQL is a declarative language and Pig Latin is a procedural one. SQL mainly specifies the desired result, that is, "what" should be done, while Pig mainly lays out the steps, that is, "how" to perform a task. Before execution, a Pig script is converted into MapReduce jobs. The Pig script is nevertheless much shorter than the equivalent hand-written MapReduce code, which significantly shortens development time.
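The "what" versus "how" contrast can be sketched in Python. The first function hands a declarative SQL statement to an in-memory SQLite database; the second spells out each step by hand, in the spirit of a data-flow script. This is an analogy under assumptions, not Pig Latin: the `staff` table, the sample rows, and both function names are made up for illustration.

```python
import sqlite3

# Hypothetical sample data: (name, department, salary).
ROWS = [("ann", "eng", 120), ("bob", "eng", 100), ("cat", "ops", 90), ("dan", "ops", 40)]


def declarative_totals(rows, min_salary):
    """State *what* is wanted; the SQL engine decides how to compute it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO staff VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT dept, SUM(salary) FROM staff "
        "WHERE salary >= ? GROUP BY dept ORDER BY dept",
        (min_salary,),
    ).fetchall()
    conn.close()
    return result


def stepwise_totals(rows, min_salary):
    """Spell out *how*: every step produces a named intermediate result."""
    # Step 1: filter out rows below the salary threshold.
    filtered = [row for row in rows if row[2] >= min_salary]
    # Step 2: group the remaining salaries by department.
    grouped = {}
    for _name, dept, salary in filtered:
        grouped.setdefault(dept, []).append(salary)
    # Step 3: aggregate each group.
    totals = [(dept, sum(salaries)) for dept, salaries in grouped.items()]
    # Step 4: order the output by department.
    return sorted(totals)
```

Both functions return the same totals; the difference lies in who decides the execution order, the engine or the script author.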

　　Hive VS SQL

SQL is a general-purpose database language that is widely used for both transactional and analytic queries. Hive was designed with data analysis in mind, so it traditionally lacks update and delete operations but is capable of reading and processing massive amounts of data. Apart from this main difference, Hive is very similar to SQL.

Although Hive differs from SQL, you can transition to it smoothly if you have a SQL background. Still, be aware of the differences in structure and syntax between the two.

With this introduction to Pig, Hive, and SQL in mind, let's look at their specific application scenarios.

　　Apache Pig Application Scenario

Apache Pig suits unstructured datasets and lets you put existing SQL skills to good use. With Pig you do not need to write MapReduce jobs by hand, and if you have a SQL background you can get started very quickly.

　　Apache Hive Application Scenario

Many businesses need to analyze historical data, and Hive is a powerful tool for that purpose. However, Hive only shows its full strength with structured data. Its weakness is real-time analysis; if you need real-time analysis, HBase can be used instead.

　　SQL Application Scenario

SQL is the oldest of the three data analysis tools. As user needs have changed, SQL has kept updating itself, and it remains a tool that keeps pace with the times. For a professional data analyst, SQL is without doubt better than Excel, but it still has shortcomings when data must be processed and analyzed quickly. If the data requirements are not very demanding, SQL is a good choice, and developers recognize its versatility and flexibility. Since the vast majority of developers are familiar with SQL, they can get started right away. SQL also provides extensions and optimizations that can be customized to suit the needs of a product.

No single tool suits all data. SQL, Pig, and Hive each have their own scenarios, so the best tool is the one that fits yours.

Original source: Http://www.hadoop360.com/blog/pig-vs-hive-vs-sql-difference-between-the-big-data-tools


