Building an Internet Data Warehouse and Business Intelligence System with SQL on Hadoop
Big data is a hot topic, and SQL on Hadoop is an important direction in current big data technology. To help readers get up to speed quickly, CSDN invited Liang to give this lecture, "Building an Internet Data Warehouse and Business Intelligence System with SQL on Hadoop". Starting from current business requirements and the state of SQL on Hadoop, the lecture explains the key technical points in detail and shares front-line experience, so that practitioners can master the technology faster and avoid detours.
For an engineer or analyst, querying and analyzing TB- to PB-scale data is an unavoidable problem in the big data era, and SQL on Hadoop has become an important tool for data analysis and mining. Two questions come up naturally. Why SQL? Because SQL is easy to use. Why Hadoop? Because the Hadoop architecture is robust and scalable.
Liang argues that every business now sees data as a valuable asset, and that the core of big data is extracting useful data from massive volumes and using it to create value. In Internet companies, and in traditional enterprises with large data processing needs, a Hadoop-based data warehouse draws mainly on three kinds of sources: data collected from Apache/nginx logs, user and business data stored in Oracle/MySQL, and data imported from other external DW sources through ETL tools. He also points out that each SQL on Hadoop product is really suited to one or a few specific areas, and that there is no one-size-fits-all product. In the big data era it is unrealistic to expect a single product to cover almost all enterprise applications the way Oracle/Teradata did, so every SQL on Hadoop product targets the characteristics of a particular type of application.
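As an illustration of the first kind of source mentioned above, web server logs, here is a minimal Python sketch that turns one access-log line into a tab-separated row of the sort a warehouse table could load. It assumes the common Apache/nginx access-log layout; the sample line, the column set, and the helper names `parse_line` and `to_tsv` are hypothetical and do not come from the lecture.

```python
import re

# Assumed layout: the common Apache/nginx access-log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical column order for the target table.
COLUMNS = ("ip", "time", "method", "path", "status", "size")


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    # A "-" size means no response body was sent.
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row


def to_tsv(row):
    """Serialize a parsed row as one tab-separated line."""
    return "\t".join(str(row[column]) for column in COLUMNS)


SAMPLE = '203.0.113.7 - - [10/Oct/2013:13:55:36 +0800] "GET /index.html HTTP/1.1" 200 2326'

if __name__ == "__main__":
    print(to_tsv(parse_line(SAMPLE)))
```

Lines that do not match return None rather than raising, so a batch job can count and skip malformed records instead of failing on them.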
Take Hive and Impala as examples. Hive is the most commonly used big data and data warehouse solution in Internet companies today; in many companies the Hadoop cluster is used mainly to run Hive SQL queries rather than native MapReduce programs.
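To make the contrast between a SQL query and a native MapReduce program concrete, the sketch below shows in plain Python how a query such as SELECT page, COUNT(*) FROM access_log GROUP BY page breaks down into map, shuffle, and reduce steps. It is an illustration of the idea only, not Hive's actual execution engine; the table name, columns, and sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of an access_log table: (user_id, page).
ROWS = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/cart"),
    ("u3", "/home"),
    ("u2", "/cart"),
    ("u3", "/pay"),
]


def map_phase(rows):
    """Emit (page, 1) for every row, like a mapper keyed on the GROUP BY column."""
    for _user, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_page(rows):
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(ROWS))
```

The one-line SQL statement replaces all three hand-written phases, which is a large part of why analysts prefer to submit SQL rather than write MapReduce jobs themselves.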
In companies with many data scientists and analysts, the same tables are queried over and over, and having everyone pull that data through Hive is slow and wasteful. It is much more efficient to keep frequently accessed data in a cluster that serves queries from memory. Facebook developed Presto to meet this need: a system that keeps hot data in memory for SQL queries, with a design philosophy very similar to that of Impala and Stinger. A simple query in Presto takes only a few hundred milliseconds, and even a very complex one takes only a few minutes; queries run entirely in memory and do not write to disk. More than 850 Facebook engineers use it every day to scan more than 320 TB of data, which covers 80% of ad hoc query requirements.
Impala can be seen as a hybrid of the Google Dremel architecture and the MPP (Massively Parallel Processing) architecture.
Cloudera currently leads the project. Baidu offers one example of how it has been extended: Baidu tried plugging MySQL into Impala as a storage engine and implemented the corresponding PlanFragment operations. A user query is still parsed into PlanFragments in the original way, and the fragments are then dispatched directly to the corresponding nodes (HDFS DataNode, HBase RegionServer, or MySQL). Some source or intermediate data is kept in MySQL, and queries that touch that data read it directly from MySQL.
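The routing idea in the Baidu example can be sketched as follows. This is a toy model in Python, not Impala's or Baidu's code: the `PlanFragment` class here, the `CATALOG` mapping, the table names, and the `dispatch` function are all hypothetical, and the sketch only shows fragments being grouped by the storage backend that holds their table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanFragment:
    """Hypothetical stand-in for one piece of a parsed query plan."""
    table: str
    operation: str  # for example "scan" or "aggregate"


# Hypothetical catalog: which storage backend holds each table.
CATALOG = {
    "raw_logs": "hdfs",
    "user_profile": "hbase",
    "daily_summary": "mysql",
}


def dispatch(fragments, catalog):
    """Group plan fragments by the backend that stores their table."""
    routed = defaultdict(list)
    for fragment in fragments:
        backend = catalog.get(fragment.table)
        if backend is None:
            raise KeyError(f"no storage backend registered for table {fragment.table!r}")
        routed[backend].append(fragment)
    return dict(routed)


if __name__ == "__main__":
    plan = [
        PlanFragment("raw_logs", "scan"),
        PlanFragment("daily_summary", "scan"),
        PlanFragment("user_profile", "scan"),
    ]
    for backend, fragments in dispatch(plan, CATALOG).items():
        print(backend, [fragment.table for fragment in fragments])
```

In this model, adding a new storage engine amounts to registering its tables in the catalog and teaching that backend to execute the fragment operations, which mirrors the approach described above at a very high level.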
Liang analyzes the strengths, weaknesses, and scope of application of the main SQL on Hadoop products from two angles, technical architecture and latest developments: Hive, Tez/Stinger, Impala, Shark/Spark, Phoenix, Hadapt/HadoopDB, and HAWQ/Greenplum. For each of these seven technologies he covers its principles, usage scenarios, architecture, advantages and disadvantages, and performance optimization in depth. For details, see the article "The latest development of SQL on Hadoop and 7 related technology sharing".
In the CSDN online training "Building an Internet Data Warehouse and Business Intelligence System with SQL on Hadoop", Liang describes the business requirements and solutions involved in building data warehouse and business intelligence systems in the Internet sector, and then covers the principles, usage scenarios, architecture, advantages and disadvantages, and performance optimization of SQL on Hadoop products. The training closes with several practical cases to help attendees understand Internet data warehouses and SQL on Hadoop products. The technologies covered are Hadoop, Hive, Impala, Shark, Flume, Oozie, Sqoop, ZooKeeper, HBase, Tableau, and MicroStrategy, with a comparison of their advantages and disadvantages, usage scenarios, and current enterprise use cases, plus several common solutions and how they compare.
This online training uses a three-pane screen format, and you can interact with the instructor during the lecture, much as in a real classroom. Still struggling to learn the craft of Hadoop? Still getting headaches over enterprise Hadoop applications? Come and have a look!