With big data taking off, many people have started to focus on Hadoop, data mining, and data visualization. Since starting my own business, I have come across many companies and individuals from traditional data industries trying to transition to Hadoop, and most of their questions are similar. So I want to sort out some of the issues that may concern a lot of people.
What about the Hadoop version?
For now, if you have only half a foot in the Hadoop door, I suggest you choose Hadoop 1.x. Many people will say: Hadoop is already at 2.4, why use 1.x? Anyone who says that has never really worked with Hadoop.
Reason one: Hadoop 1.x and 2.x are two entirely different things. This is not as simple as upgrading a standalone web server from 1.0 to 2.0, and it is not like running MySQL 5.0 and seamlessly migrating to 5.5 just by compiling the new version. From 1.0 to 2.0, the entire Hadoop architecture was torn down and rewritten; from the implementation to the user-facing interfaces, they are two different systems. Do not assume it is just nginx going from 0.8 to 1.4. So my suggestion is: run 1.x in production, and deploy 2.x in a test environment to get familiar with it.
Reason two: Again, Hadoop is not a web server. Even though Hadoop handles the distribution for you, it is still a very complex system. Take HDFS storage alone: to upgrade Hadoop 0.20.2 to 0.20.203, you first had to deploy the new version on every node, then stop all of the cluster's services, make a careful backup of the metadata, and only then run the HDFS upgrade, and even that did not guarantee the upgrade would succeed. The cost of one such upgrade is very high: beyond the downtime itself, if the upgrade fails, the integrity of your metadata cannot be guaranteed. It is far more trouble than you think. Do not assume that Cloudera Manager or other management software truly automates all of this; deploying Hadoop is only the first step of a long march.
Reason three: Hadoop 2.x is still quite unstable; it has more bugs and iterates too quickly. If you want to choose 2.x, think it through before you decide. Picking the newest version is not foolproof: OpenSSL had been around for many years and still produced the Heartbleed vulnerability, let alone Hadoop 2, which has been out for less than a year. Remember, it took Hadoop seven or eight years to reach 1.0, with countless large companies, including Yahoo, Facebook, and BAT, continuously updating and patching it before it stabilized. Hadoop 2 has been out for less than a year and has not been through a long period of stable testing and operation. Just look at the recent upgrade from 2.3 to 2.4: in a month and a half it fixed more than 400 bugs.
So I do not recommend putting 2.x on a production cluster right now. Wait and see; once it stabilizes, it will not be too late to move. If you browse Apache JIRA, you can see that issue tracking for Hadoop 3.0 is already under way.
What about Hadoop talent?
I think enterprises need to consider the Hadoop talent problem from two angles: development talent and operations talent.
Development talent is scarce right now and concentrated almost entirely in the Internet sector, but this is a problem that will solve itself relatively quickly: as Hadoop training spreads and Hadoop's own interfaces improve, there will be more and more such people.
Operations talent, on the other hand, is something industries outside the Internet sector should basically not count on finding for quite a while; it is not that there is little of it, there is almost none. Hadoop and cloud computing ultimately come down to a battle of operations, and people who can run large-scale distributed systems are extremely hard to train. DevOps engineers especially are very scarce, and even among that scarce pool, most are doing web operations with tools like Puppet and Fabric; moving from that to operating distributed systems is still a step up in difficulty. So this kind of talent is hard to recruit and hard to cultivate.
Beyond that, you need to be clear about what kind of developer you actually want. Hadoop is like a Windows or Linux operating system: on top of it you can paint with Photoshop, animate with 3ds Max, or work with spreadsheets in Office, but each application serves a different purpose. This requires the CTO or CIO to have at least a basic understanding of big data, Hadoop, and its surrounding ecosystem, rather than lumping Hadoop in with MySQL, PHP, or traditional Java EE and assuming the work can simply be outsourced. It cannot.
What about Hadoop training?
After running in-house Hadoop training at several enterprises, I found that companies in transition share a common problem: biting off more than they can chew. They want a single training course to cover Hadoop and everything around it. A typical example is a company I recently trained in Shanghai that wanted to hear about everything from Hadoop to HBase to Mahout to Spark and Storm. A training provider can only respond by finding several teachers to cover the different topics, and I think this kind of training does an enterprise little good; at best it gives the staff a chance to take a nap.
First, Hadoop is not something you can understand from one or two lectures; beyond the theory, it takes a great deal of hands-on experience to back it up.
Second, every component in the Hadoop ecosystem is a complex thing in its own right. Using each one is genuinely easy, but truly understanding each one is not, especially Mahout, Spark, and R, which involve a great deal of statistics and mathematical theory. If you send a crowd of product people with no programming or statistics background to those lectures, they really can only nap through them. Frankly, making them sit through a Hadoop course is cruel: they clearly cannot follow it, but with the boss sitting nearby, they still have to struggle to stay awake.
Third, everyone is good at different things; no single teacher can cover Windows Server operations, advanced Excel techniques, 3ds Max animation, and Photoshop drawing all at once. To win the contract, training providers often promise to pull several teachers together, and enterprises often feel that for the same price, hearing everything is a great deal. In reality, each teacher's lecture style, depth of knowledge, and course design are different; chicken, flour, and vegetables do not necessarily add up to a big bowl of chicken noodle soup, and quite possibly you end up with instant noodles, a meal with no taste at all. So when choosing training, an enterprise must be targeted: do not try to cover everything, which wastes resources and achieves nothing. Split the needs into several training directions and find a different, specialized, strong provider for each. Of course, this requires some judgment and vision of your own; as a leader, a CTO or CIO should know more than the staff, not necessarily about technical details, but with a firmer grasp of the technical direction.
What about connecting with traditional business systems?
This is another thing many people care about, especially traditional enterprises that have been running Oracle with huge amounts of data stored in it; replacing all of that with Hadoop in one stroke is impossible. This is a common misunderstanding: Hadoop is an offline analysis and processing tool, not a replacement for your database, and in fact it cannot replace a relational database. What it does is the dirty, heavy work that relational databases cannot do. It is a complement to your existing business architecture, not a substitute for it.
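To make "the dirty, heavy work" concrete, here is a minimal MapReduce sketch of the kind of job Hadoop is built for: counting request URLs across raw web-server logs far too large to load into Oracle just for one aggregation. The class names and the assumption that the URL is the seventh whitespace-separated field of each log line are mine, for illustration only; the Hadoop API calls themselves are the standard ones.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UrlHitCount {

    public static class HitMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed log layout: the request URL is the 7th whitespace-separated field.
            String[] fields = value.toString().split("\\s+");
            if (fields.length > 6) {
                url.set(fields[6]);
                context.write(url, ONE); // emit (url, 1) for every request line
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get(); // add up all the 1s for this URL
            }
            total.set(sum);
            context.write(key, total);
        }
    }

    public static void main(String[] args) throws Exception {
        // The Job(Configuration, String) constructor is the 1.x-era API,
        // in line with the versions discussed above.
        Job job = new Job(new Configuration(), "url hit count");
        job.setJarByClass(UrlHitCount.class);
        job.setMapperClass(HitMapper.class);
        job.setCombinerClass(SumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The point is not these fifty lines of Java but the execution model: the same job runs unchanged whether the logs are one gigabyte or a hundred terabytes, and that kind of full-scan batch workload is exactly what you do not want inside your relational database.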
This kind of complementing and replacing happens step by step, not overnight. As far as I know, no company has walked in and successfully said, "I'll drop MySQL entirely and go straight to Hadoop." When I meet someone like that, I first praise their determination, and then I decline to give them a plan, because I can tell them for certain: it is impossible.
Hadoop provides a variety of tools for connecting to traditional database systems. Besides Sqoop, you can also write your own: Hadoop's API is very simple, and so is JDBC.
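As a taste of how little code "write your own" can mean, here is a minimal sketch that pulls rows over plain JDBC and writes them into HDFS as tab-separated text. The MySQL connection string and the orders table are hypothetical placeholders; Sqoop does the same job far more robustly (splitting, parallelism, type mapping), so treat this purely as an illustration of the two simple APIs involved.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OrdersToHdfs {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and table -- replace with your own.
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://dbhost:3306/shop", "reader", "secret");

        // Picks up core-site.xml from the classpath to locate the cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/warehouse/orders/part-00000.tsv");

        try (Statement st = db.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT id, customer_id, amount FROM orders");
             BufferedWriter w = new BufferedWriter(
                     new OutputStreamWriter(fs.create(out, true)))) {
            // Dump each row as one tab-separated line.
            while (rs.next()) {
                w.write(rs.getLong("id") + "\t"
                        + rs.getLong("customer_id") + "\t"
                        + rs.getBigDecimal("amount"));
                w.newLine();
            }
        } finally {
            db.close();
        }
    }
}
```

Going the other direction, loading results from HDFS back into the database, is just as short with a PreparedStatement, and if you want the database read or write to happen inside a MapReduce job itself, Hadoop ships DBInputFormat and DBOutputFormat for exactly that.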