Wang Liang in Practice: Solving Tough SQL-on-Hadoop Problems

On March 13, 2014, the first session of CSDN online training, "Using SQL-on-Hadoop to Build an Internet Data Warehouse and Business Intelligence System," concluded successfully. The trainer was Liang from Meituan. In the training, Liang shared the current business requirements and solutions for data warehouse and business intelligence systems in the Internet domain, as well as the principles, usage scenarios, architectures, advantages and disadvantages, and performance optimization of SQL-on-Hadoop products.

CSDN online training is a real-time, interactive online technical training program designed for technical practitioners. It invites first-line engineers from across the industry to share the problems they encounter in their work and the solutions they adopt, while also introducing new technologies, ideas, and approaches.

CSDN online training builds its courses around the characteristics of being "classic, practical, systematic, forward-looking, and professional." Through video lectures, document sharing, whiteboard sharing, screen sharing, live instructor Q&A, and other formats, it helps first-line engineers make use of fragmented time to strengthen their hands-on ability, raise their practical level, and communicate and interact with technical experts.

Because the training time was limited, the lecturer had no time to answer many of the questions raised in the Q&A session. CSDN has therefore prepared this Q&A summary to help everyone review and consolidate the technical points covered in the training, master them faster, and take fewer detours. Below is a selection of the Q&A; for more questions, please join the discussion thread for this event: http://bbs.csdn.net/topics/390731622

Q: Is the SQL written on Impala standard SQL?

The SQL-on-Hadoop products we currently use, including Hive and Impala, do not use fully standard SQL. As we all know, executing standardized SQL in a distributed computing environment requires a new distributed query engine. Products like Greenplum support standard SQL very well, while Impala's support for standard SQL is not as good. In practice, the organizations using Impala are mainly Internet companies, because they have large IT teams that build their own business applications on top of Impala; in that situation they do not need fully standardized SQL, because what they care about is performance.

Impala currently supports most of the ANSI SQL-92 standard, but does not yet support subqueries, EXISTS, or set operations. These features are already on Impala's roadmap and are expected to be supported in Impala 2.0. Requirements such as subqueries can be met by rewriting the SQL as a join.
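As a rough illustration of such a rewrite (the table and column names below are hypothetical, not from the training), an IN subquery that early Impala releases could not execute can often be expressed as an equivalent join:

    -- Subquery form (not supported by early Impala releases):
    --   SELECT o.order_id, o.amount
    --   FROM orders o
    --   WHERE o.user_id IN (SELECT user_id FROM users WHERE level = 'VIP');

    -- Equivalent join rewrite that early Impala can execute:
    SELECT o.order_id, o.amount
    FROM orders o
    JOIN (SELECT DISTINCT user_id FROM users WHERE level = 'VIP') v
      ON o.user_id = v.user_id;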

Q: When using Hive and Impala together, Impala seems unable to pick up Hive metastore updates in real time; we have to restart impala-server.

This issue has been addressed in the latest Impala release, 1.2.3. Previously you had to run an explicit REFRESH or INVALIDATE METADATA command to update the metadata; the new version adds a catalogd service that pushes metadata updates to each node.
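For older versions, or for changes made through Hive rather than Impala, the explicit statements look like this (the table name is hypothetical):

    -- Reload metadata for one table whose data files changed outside Impala:
    REFRESH web_logs;

    -- Rebuild metadata for a table created or altered through Hive:
    INVALIDATE METADATA web_logs;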

Q: We have a large number of logs stored in HDFS, and parsing these logs requires some dimension tables that are stored in MySQL. How should we write the MapReduce program to parse these logs?

This is a typical SQL-on-Hadoop usage scenario, and I also covered the solution in the slides. In general, you use a Sqoop task to import the MySQL tables into HDFS, and then run your various queries over the imported tables and the logs on HDFS. You can use either MapReduce or Hive; I would recommend Hive, because the data imported from MySQL is structured.
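As a rough sketch of this flow (all table, column, and path names below are made up for illustration), once a Sqoop job has imported the MySQL dimension table into Hive, parsing the logs becomes an ordinary Hive join:

    -- Hypothetical external table over the raw logs already in HDFS.
    CREATE EXTERNAL TABLE raw_logs (
      user_id  BIGINT,
      page_id  BIGINT,
      event_ts STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/logs/';

    -- page_dim is assumed to be the dimension table imported from MySQL via Sqoop.
    SELECT d.page_name, COUNT(*) AS pv
    FROM raw_logs l
    JOIN page_dim d ON l.page_id = d.page_id
    GROUP BY d.page_name;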

Q: In practice, do most companies keep large amounts of historical data in HDFS, or do they choose some other lower-cost archiving method?

As far as I know, a lot of historical data is still stored in HDFS, and some companies use tape storage as a backup. Compared with backup storage products from EMC or NetApp, HDFS is relatively inexpensive. However, because HDFS typically stores data as three replicas, using it for backup wastes some resources. Measures worth considering are HDFS archives or an erasure-code-style approach to save storage; for example, three-way replication costs 3x the raw data size, whereas a (10, 4) Reed-Solomon code costs only about 1.4x. Both Facebook and Taobao use the erasure-code idea to store large amounts of rarely used historical data.

Q: Which companies in China are currently using Impala, and what are their usage scenarios?

As far as I know, Alibaba and Baidu in China are both using Impala. For concrete usage scenarios you can refer to the article "Practice of a Real-Time Big Data Query System Based on Impala," which I wrote together with Yang Zhuoluo from Alibaba; it gives a more detailed introduction.

Q: Can you talk about your understanding of data scientists? What kind of person can be called a data scientist? How do they differ from data engineers and from people who do data mining or machine learning?

I recommend a very short book, <<Building Data Science Teams>>. My understanding is that a data scientist is someone who has the methods and mindset to solve business problems, and who can implement mathematical and statistical methods through computer programming. Data mining and machine learning are basically among the fundamental skills that a data scientist must master.

Q: Flume keeps collecting logs, and over time there are more and more log files. Is there a way to produce, say, one file per day?

Flume collects the logs onto HDFS, and you then carry out ETL operations according to whatever logic you specify. Storage on the Flume agent itself is only temporary, so the number of files produced is not really a problem.
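If you do want the HDFS sink to roll files by day, one common approach (this is a generic Flume configuration sketch rather than something covered in the training; the agent name and path are made up) is to partition the output path by date and relax the size- and count-based rolling:

    # Hypothetical Flume agent 'a1' writing one directory of logs per day.
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode/logs/%Y-%m-%d
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    a1.sinks.k1.hdfs.fileType = DataStream
    # Roll a new file once a day; disable size- and event-count-based rolling.
    a1.sinks.k1.hdfs.rollInterval = 86400
    a1.sinks.k1.hdfs.rollSize = 0
    a1.sinks.k1.hdfs.rollCount = 0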

Q: What are the main concrete application scenarios in Internet companies? For example, what specific applications can be built from collected Nginx logs?

User behavior analysis, for example: which pages and products users browse, which channels they stay on longer, and so on. Logs are the foundation of all user analysis.
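As a small illustration (the schema is hypothetical, not from the training), a parsed Nginx access log loaded into Hive can directly answer questions like which pages get the most visits:

    -- nginx_access is assumed to hold one parsed request per row.
    SELECT request_path,
           COUNT(*)                AS pv,
           COUNT(DISTINCT user_id) AS uv
    FROM nginx_access
    WHERE dt = '2014-03-13'
    GROUP BY request_path
    ORDER BY pv DESC
    LIMIT 20;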

Q: May I ask whether Meituan uses HBase internally for OLAP, OLTP, or as a distributed MySQL?

Meituan uses HBase internally for online storage, which is the same usage scenario as at typical Internet companies. At present, not many companies use HBase for OLAP, and as far as I know the results are not especially good. The OLTP domain is still the world of Oracle and MySQL.

Q: Traditional data warehouses build different layers for different applications, such as OLAP and mining layers, to enable fast queries. Can Hadoop support real-time queries for this?

Building different layers for different applications is also no problem in the Hadoop ecosystem, because this is a question of data warehouse architecture and logical design; it should be driven mainly by the company's business logic and is not tightly coupled to the platform you use. Hadoop has plenty of real-time query tools, such as Tez, Impala, and Shark.

Q: How do Shark and Impala compare in real-time scenarios? Do you favor Shark more?

It is hard to draw that conclusion. Companies are still at the early trial stage with both; Impala and Shark each have their own advantages and disadvantages.

Q: What are the pros and cons of Impala versus Alibaba's Mdrill?

I have not used Alibaba's Mdrill myself; I have only read some material about it, so this is just my personal view. Impala is positioned as an interactive query engine, while Mdrill targets "high-dimensional + real-time" queries. Mdrill does a lot of preprocessing on the underlying data before queries run, so its support for dynamic data is not as good as Impala's, but the number of dimensions Mdrill can query is very, very high.

Q: For a data warehouse that needs to support drilling up and down, how should tables be designed in Impala to meet the need? Do you build a fact table for every possible combination of dimensions, or do you store only the finest-grained table and compute aggregate results at query time?

This is again a question of data warehouse design; it should be based on your business logic and access patterns and is not particularly tied to which platform you use. Building fact tables for all possible dimension combinations is obviously inappropriate, so as always it comes down to striking a balance.
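To make the trade-off concrete (the tables below are hypothetical), a common compromise is to keep the finest-grained fact table, aggregate at query time, and materialize only the few roll-ups that are queried most heavily:

    -- Aggregate on the fly from the finest-grained fact table.
    SELECT region, product_category, SUM(amount) AS total_amount
    FROM sales_fact
    WHERE dt BETWEEN '2014-03-01' AND '2014-03-13'
    GROUP BY region, product_category;

    -- Materialize only the heavily used daily roll-up instead of every combination.
    CREATE TABLE sales_daily_by_region AS
    SELECT dt, region, SUM(amount) AS total_amount
    FROM sales_fact
    GROUP BY dt, region;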

Q: When using MapReduce for data cleaning, how is the dimension table data loaded or stored? Is it kept in memory? If so, is it shared memory, or does every machine build its own cache?

This is a very good question. A new feature in HDFS 2.3 allows upper-layer applications to explicitly specify which machine's memory a piece of data should be kept in, and this feature is aimed precisely at applications such as Hive, HBase, and Impala. However, Hive, HBase, and Impala do not support it yet. If you write the MapReduce program yourself, you can explicitly load the dimension table into memory.
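If you stay at the SQL layer instead of hand-writing MapReduce, Hive's map join gives a similar effect by loading the small dimension table into each mapper's memory. This is a generic Hive technique rather than something prescribed in the training, and the table names are hypothetical:

    -- Let Hive convert the join automatically when the dimension table is small enough.
    SET hive.auto.convert.join = true;

    -- Or ask for it explicitly with a hint; city_dim is broadcast into mapper memory.
    SELECT /*+ MAPJOIN(city_dim) */
           c.city_name, COUNT(*) AS pv
    FROM raw_logs l
    JOIN city_dim c ON l.city_id = c.city_id
    GROUP BY c.city_name;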

Q: Our product now puts stability first; we cannot have constant exceptions and downtime. Which platform is the most stable? Hortonworks? CDH? Or some other platform?

To be honest, none of them is particularly stable, but in my experience CDH is relatively good: its supported functionality is more complete, and its response to bugs is faster.

Q: I work at a small research institute. We have a variety of servers, all minicomputers: IBM, HP, Lenovo, and Dell. How feasible would it be to build this kind of software and hardware architecture on them, in terms of the technical effort, the manpower required, and any additional hardware investment? And once it is built, would it integrate our various websites well? Is this software architecture suitable for integrating websites and systems of many different sizes?

This is a problem many traditional companies run into when migrating to a new platform like Hadoop. If you simply integrate your existing equipment, there should be little problem. The bigger question, though, is whether a platform like Hadoop can actually solve your problems: Hadoop is a platform for big data storage and analysis, not a website backend...

Q: Have you heard of the Data Vault database modeling method? Last year we studied it and tried to apply this modeling idea on an RDBMS. Our understanding is that DV modeling keeps all the information that has ever existed in the data warehouse, and when necessary distinguishes the same business entity at different points in time through the data load time (load date), the data effective time (start time), and the data expiration time (end time).

In theory, the DV modeling method is very well suited to distributed architectures like MapReduce and Impala, but how to load data, deduplicate it, and compute data lifetimes (effective/expiration times) efficiently and in a standardized way on Hadoop has been bothering us. When a large amount of data is updated each time, we can accept a full-volume operation; but when only a small amount of data is updated, the approach above feels expensive. Could you share some experience on this?

That's a good question, but it also comes down to the specific business. On the question of full versus incremental updates: when your data volume is not very large, a full update is simple and efficient; when your data volume is very large, you have to break the work into incremental updates. Load date / start time / end time is a common way of framing the solution, but exactly how to decompose an incremental update depends on what your business model is.
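One generic pattern for this on Hive (before ACID tables), sketched here with made-up table and column names rather than anything from the training, is to rebuild the target as the union of the existing rows, with changed rows closed out, plus the new open-ended versions from the increment:

    -- Build the merged history into a staging table, then swap it in.
    CREATE TABLE customer_sat_merged AS
    SELECT * FROM (
      SELECT t.customer_key, t.load_date, t.start_time,
             -- Close out open rows that have a newer version in the increment.
             CASE WHEN t.end_time IS NULL AND i.customer_key IS NOT NULL
                  THEN i.start_time
                  ELSE t.end_time END AS end_time
      FROM customer_sat t
      LEFT OUTER JOIN customer_sat_increment i
        ON t.customer_key = i.customer_key
      UNION ALL
      -- New versions from the increment stay open (no end_time yet).
      SELECT n.customer_key, n.load_date, n.start_time,
             CAST(NULL AS STRING) AS end_time
      FROM customer_sat_increment n
    ) merged;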

Q: Is it complicated to set up a SQL-on-Hadoop environment? How many servers are needed at a minimum? Is there a simpler way to set it up? Mainly, I am not very familiar with Linux.

The simplest approach is to use CDH and Cloudera Manager, both provided by Cloudera, to quickly stand up a Hadoop system; you can refer to Cloudera's documentation for details.

Missing the Internet wave and the era of e-commerce competition is no great pity, because we have caught up with the rise of cloud computing and big data. Faced with a huge talent gap in cloud computing and big data, Internet companies and traditional-industry companies are competing for people at any cost: equity incentives, double salaries, and year-end bonuses worth up to 60 months of pay have all been seen. Beyond these finite rewards, there is a boundless entrepreneurial boom built on technical strength as capital. Technical elites are facing a rare opportunity for transformation and growth in value. Stay tuned to CSDN online training!
