Latest Hadoop experiences

Source: Internet
Author: User
The difference between Apache and Cloudera, in short: Apache released Hadoop 2.0.4-alpha on April 25, 2013, which, being an alpha release, is still not suitable for production. Cloudera released CDH4, based on Hadoop 0.20, adding NameNode high availability and the new MR framework MR2 (also known as YARN), with support for switching between MR1 and MR2; Cloudera does not yet recommend MR2 for production.


Differences between Apache and Cloudera

Apache released Hadoop 2.0.4-alpha on April 25, 2013; as an alpha release, it is still not usable in production.
Cloudera released CDH4, based on Hadoop 0.20, adding NameNode high availability and the new MR framework MR2 (also known as YARN), with support for switching between MR1 and MR2; Cloudera does not recommend using MR2 in production yet. In MR2, a ResourceManager is responsible for cluster-wide resource management. Each slave node runs a NodeManager, which monitors that node's resources and reports them to the ResourceManager. Each new job is called an application, and every application is assigned an ApplicationMaster running on a slave node, which negotiates resources with the ResourceManager and manages the application's lifecycle. This removes the JobTracker bottleneck of MR1 and changes task execution from queued to concurrent, making better use of cluster resources.
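The YARN daemons described above can be inspected from the command line. This is a hedged sketch, assuming a running CDH4/YARN cluster with the standard `yarn` CLI on the PATH; the application ID shown is a made-up placeholder:

```shell
# List NodeManagers registered with the ResourceManager
yarn node -list

# List submitted applications and their states
yarn application -list

# Show the status of one application (replace with a real ID)
yarn application -status application_1365000000000_0001
```

Seeing every slave node in `yarn node -list` is a quick way to confirm that NodeManagers are reporting to the ResourceManager as described.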
Some large companies, such as Sina, have already switched their production Hadoop clusters to CDH4. CDH4 also provides guided installation and similar tooling, which greatly improves operability. However, it creates multiple users and directories, so if an unknown problem occurs and the CDH layout is not well understood, troubleshooting can be difficult.

How to manage permissions in Hive

Hadoop and Hive provide only limited permission-control features, which will not necessarily meet every company's requirements, so you may need to extend Hive's permission control. Currently, three approaches are available:
1) Hive 0.10 can control permissions through its metadata. Authorization is granted by users, groups, and roles: you can create users, groups, and roles in MySQL and grant permissions to them.
2) Isolate the metadata. For a specific Hive database, use a dedicated MySQL (or other) database to store its metadata. This completely isolates related operations and improves data security.
3) Extend the Hive source code. Build a permission-management project on top of Hive that creates users, assigns databases, tables, and partitions to them, limits the maximum number of MR tasks a job may use, and controls access to specific columns.
Sharing the same metadata still carries some risk. Hadoop clusters are usually maintained by a single operations team, so when configuring Hive's metadata connection you can define two kinds of users: one with read-write access and one with read-only access. This prevents data loss caused by user mistakes. At the same time, configure the trash mechanism on the Hadoop cluster (fs.trash.interval in core-site.xml) to reduce the impact of accidental deletion.
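As an illustration of the metadata-based approach, Hive's grant statements can be issued from the CLI. This is a minimal sketch, assuming Hive 0.10's legacy authorization is enabled and a live cluster; the role, table, user, and path names are hypothetical:

```shell
# Enable authorization for the session (normally set in hive-site.xml)
hive -e "set hive.security.authorization.enabled=true;
CREATE ROLE analyst;                          -- hypothetical role
GRANT SELECT ON TABLE sales TO ROLE analyst;  -- hypothetical table
GRANT ROLE analyst TO USER alice;             -- hypothetical user
SHOW GRANT USER alice ON TABLE sales;"

# With fs.trash.interval set (in minutes) in core-site.xml, deleted HDFS
# files land in /user/<name>/.Trash and stay recoverable until it expires:
hadoop fs -rm /user/alice/data.txt         # recoverable via trash
hadoop fs -rm -skipTrash /user/alice/tmp   # bypasses trash, gone for good
```

Pairing read-only metadata users with the trash interval gives two independent layers of protection against accidental deletion.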

How to optimize Hadoop performance

There are many articles about Hadoop performance optimization online, all describing how to tune clusters, but we cannot simply copy configurations found on the Internet: network environments, servers, and workloads differ, so cluster parameters must be set according to our own circumstances.
LZO compression can reduce the cluster's data-storage pressure and the volume of data transferred from mapper to reducer, improving job efficiency.
Hadoop natively supports gzip, and gzip's compression ratio is much higher than LZO's. However, a gzip file is not splittable, so it can only be processed by a single task, which significantly reduces cluster throughput. LZO's support for splittable blocks greatly improves MR execution while still saving disk space, so it can improve overall cluster performance. Gzip compression is still a good fit for very cold data: historical data that is rarely used can be compressed with gzip. Queries over such historical data are slow, but disk usage drops, and in practice cold data may not be touched even once a year.
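The two compression choices above can be selected per job at submission time. This is a hedged sketch, assuming a job driver that uses ToolRunner (so `-D` options are honored) and, for LZO, that the separate hadoop-lzo library is installed on every node; the jar names, class name, and paths are placeholders:

```shell
# Compress job output with LZO (splittable once indexed)
hadoop jar my-job.jar MyJob \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
  /input /output

# Index the .lzo files so later jobs can split them across map tasks
hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /output

# For cold historical data, gzip trades splittability for a better ratio
hadoop jar my-job.jar MyJob \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  /input /cold-output
```

The indexing step is what makes LZO output behave like splittable blocks; without the index, an .lzo file is read by a single mapper just like gzip.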

