12 technical pain points for Hadoop


The author, Andrew C. Oliver, is a professional software consultant and the president and founder of Open Software Integrators, a big-data consulting firm in Durham, North Carolina. After using Hadoop for a long time, he found 12 things that really hurt its ease of use.

Hadoop is a wonderful creation, but it is evolving quickly and shows its share of flaws. I love the elephant, and the elephant loves me. But nothing in this world is perfect, and sometimes even good friends clash. That's how it is between me and Hadoop. Here are the 12 pain points I've run into.

1. Pig vs. Hive

You can't use Hive UDFs in Pig. In Pig you have to go through HCatalog to access Hive tables. You can't use Pig UDFs in Hive. And no matter how small the extra feature I need in Hive, I never feel like writing a whole Pig script for it; likewise, when I'm writing a Pig script, I often think, "if only I could do this part easily in Hive." Whichever one I'm in, I keep wishing I could just step over the wall between them.
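For the record, the HCatalog detour from Pig looks roughly like this (a sketch; the table name `web_logs`, its columns, and the output table are made-up examples):

```pig
-- Start Pig with HCatalog support: pig -useHCatalog
-- Load a Hive table through HCatalog instead of reading its files directly.
raw = LOAD 'web_logs' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- A Hive UDF can't be called here; any shared logic has to be
-- reimplemented as a Pig UDF and registered separately.
errors = FILTER raw BY status >= 500;

STORE errors INTO 'error_logs' USING org.apache.hive.hcatalog.pig.HCatStorer();
```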

2. Being forced to store all my shared libraries in HDFS

This is a recurring pattern in Hadoop. If you store your Pig script on HDFS, Pig automatically assumes all the JAR files it needs will be there too, and Oozie and other tools behave the same way. Usually this doesn't matter, but having to maintain an organization-wide copy of every shared-library version on the cluster is painful. Besides, most of the time you install the same JARs on every client anyway, so why store them twice? This has been fixed in Pig. Anywhere else?
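The pattern in question, as it shows up in a Pig script (a sketch; the JAR name, path, and UDF class are hypothetical):

```pig
-- The script itself lives on HDFS, so Pig resolves this path on HDFS too;
-- the JAR has to be copied up first:
--   hdfs dfs -put my-udfs.jar /libs/my-udfs.jar
REGISTER 'hdfs:///libs/my-udfs.jar';
DEFINE CLEAN com.example.udf.CleanField();
```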

3. Oozie

Debugging Oozie is no fun, and the documentation is full of stale examples. When you hit an error, you often haven't done anything wrong: it may be a configuration error or a schema-validation failure, all of which surface as generic "protocol errors." To a large extent, Oozie is like Ant or Maven, except distributed, without tooling, and a bit buggy.

4. Error message

You're kidding, right? Speaking of error messages, my favorite is the one any number of Hadoop tools will print: "failure, no error returned," which translates to "something happened; good luck finding out what."

5. Kerberos Identity Authentication Protocol

If you want Hadoop to be even relatively secure, you have to use Kerberos. Remember Kerberos, and how old it is? So you have LDAP too, except nothing in Hadoop is integrated with it: no single sign-on, no SAML, no OAuth, and certificates aren't passed along (instead, every hop re-authenticates). Even better, much of the Hadoop ecosystem has written its own LDAP support, each piece inconsistent with the next.

6. Knox

Because apparently a proper LDAP connector needs to be written in Java at least 100 more times before we get it right. Gosh, look at that code: it doesn't really pool connections effectively. In fact, I think Knox was created out of enthusiasm for Java rather than any real need. You could do the same thing with a well-written Apache httpd config using mod_proxy and mod_rewrite; in fact, that's basically what Knox is, except in Java. For a start, after it authenticates you, it doesn't pass your identity along to Hive or WebHDFS or whatever you're accessing; it just stops there.
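For comparison, the mod_proxy equivalent of the gateway is only a few lines (a sketch assuming WebHDFS on `namenode:50070` and HTTP basic auth against an LDAP server at `ldap.example.com`; hosts, ports, and the directory layout are placeholders for your cluster):

```apache
# Reverse-proxy WebHDFS behind httpd with LDAP basic auth.
LoadModule proxy_module        modules/mod_proxy.so
LoadModule proxy_http_module   modules/mod_proxy_http.so
LoadModule authnz_ldap_module  modules/mod_authnz_ldap.so

<Location "/webhdfs/">
    AuthType Basic
    AuthName "Hadoop Gateway"
    AuthBasicProvider ldap
    AuthLDAPURL "ldap://ldap.example.com/ou=people,dc=example,dc=com?uid"
    Require valid-user
    ProxyPass        "http://namenode:50070/webhdfs/"
    ProxyPassReverse "http://namenode:50070/webhdfs/"
</Location>
```

Note that, like Knox, this authenticates at the gateway but does not forward the user's identity to the backend service.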

7. Hive won't let me have my external table and delete it too

If Hive manages a table, dropping the table deletes the underlying data as well. If the table is external, it doesn't. Why isn't there some kind of "drop the external table and its data too"? Why do I have to go delete the files myself? And while Hive keeps getting more RDBMS-like, why no UPDATE and DELETE?
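The asymmetry in question, as HiveQL (a sketch; the table names, columns, and HDFS path are made up):

```sql
-- Managed table: DROP TABLE removes both the metadata and the data.
CREATE TABLE clicks_managed (ts BIGINT, url STRING);
DROP TABLE clicks_managed;    -- files in the warehouse directory are gone

-- External table: DROP TABLE removes only the metadata.
CREATE EXTERNAL TABLE clicks_ext (ts BIGINT, url STRING)
  LOCATION '/data/clicks';
DROP TABLE clicks_ext;        -- /data/clicks is left behind; you must
                              -- clean it up yourself:
                              --   hdfs dfs -rm -r /data/clicks
```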

8. Namenode failure

Oozie, Knox, and other parts of Hadoop don't follow the new NameNode HA conventions. You can enable HA in Hadoop, as long as you don't use anything that depends on it.
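With HA enabled, clients are supposed to address a logical nameservice rather than a single NameNode host, which is exactly what those tools fail to do. A sketch of the relevant hdfs-site.xml (the nameservice and host names are made up):

```xml
<!-- Clients should use the logical URI hdfs://mycluster -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<!-- Client-side failover; a tool that hardcodes hdfs://namenode1:8020
     bypasses all of this and breaks when nn1 fails over. -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```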

9. Documentation

Complaining about documentation is a cliché, but check this out: line 37 is wrong, and worse, every article on the Web copies the same mistake. That proves nobody even bothered to run it before publishing. The Oozie documentation is even scarier: most of its examples won't pass the very schema validation Oozie ships with.

10. Ambari coverage

I've praised Ambari before; it's amazing how well Ambari works for standing up the Hadoop architectures I know. So where are Ambari's weak spots? For one, there are things Ambari can't install, or in some cases can't install properly: various HA setups, Knox, and so on. I'm sure it will improve, but "then install it manually" or "we wrote a wrapper script for that" shouldn't keep showing up in my email and in the documentation.

11. Repository management

Speaking of Ambari: have you ever used it to upgrade a stack? I have, and it didn't go well. Under the hood it just finds the fastest mirror, and Ambari doesn't check whether what it downloads is compatible with anything. You can configure that part yourself, but otherwise it will still happily roll out a broken install across hundreds of Hadoop nodes.

12. Null pointer exceptions

I constantly hit runtime errors, parse failures, type-conversion problems, and the like, that should be reported as real error messages but instead surface from Pig, Hive, and the other data query and processing tools as NullPointerExceptions. And for every such complaint there's a stock reply: "Patches welcome!" or "Hey, I'm working on it."

Hadoop has been around a long time, and it's still one of my favorite tools, but these maddening rough edges make me furious. I just hope the developers will tackle them with more care. If you have similar Hadoop gripes, share them, so we can all push Hadoop to be better.
