Editor's note: Wired writer Cade Metz recently wrote a commentary on MapR, arguing that the company has what it takes to build on Hadoop and succeed. The full text follows.
M.C. Srivas, founder of MapR, helped build the Google search engine, and Google search is remarkable.
Type "2005 Honda Accord" into the Google search box, and Google is considerate enough to know you are looking for a family sedan. It returns not only links about the Honda Accord but also similarly priced family sedans for reference, such as the Volkswagen Passat or Toyota Camry.
Google can intelligently distinguish between the words "apartment" and "house" in a search. And as the user types "new" into the search box, it suggests "New York" and "New York Times", much like the IntelliSense feature in Visual Studio.
But that does not mean Srivas is simply praising Google's famous search algorithms. What he finds most commendable is the infrastructure that backs those algorithms, such as the well-known GFS (Google File System) and Google MapReduce. MapReduce was one of the earliest software architectures Google proposed for parallel computation over large-scale datasets.
While Google search benefits from its algorithms, MapReduce plays an enormous role behind the scenes: it gathers related web pages from across the network and places them into a searchable index. "The work we did at Google was amazing; I was shocked at how efficiently it used data," Srivas said. Today these two technologies are widely used on servers to store and analyze massive amounts of data.
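The MapReduce programming model described above can be sketched in miniature. The following is a toy, single-process word count, not Google's or Hadoop's implementation; all function names here are illustrative. It shows the three phases the model is built on: map each input to key-value pairs, shuffle the pairs by key, and reduce each key's values to a result.

```python
from collections import defaultdict

# Toy single-process sketch of the MapReduce model. Real systems
# (Google MapReduce, Hadoop) run these phases in parallel across
# a cluster of machines; the structure of the computation is the same.

def map_phase(doc_id, text):
    # Emit a (word, 1) pair for every word in the document.
    # doc_id is unused here but mirrors the (key, value) map signature.
    for word in text.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Group all emitted values by key, as the framework would
    # when routing map output to reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for one key into a final count.
    return (key, sum(values))

docs = {1: "hadoop stores data", 2: "mapreduce processes data"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["data"])  # 2
```

Because each map call and each reduce call is independent, the framework can scatter them across thousands of machines, which is what made the model suitable for indexing the web.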
Srivas spent two years on Google's search infrastructure team. In the summer of 2009 he left Google to found a company, MapR. MapR likewise draws on the design ideas behind Google's infrastructure (GFS and MapReduce) to provide large-scale data processing. Like other companies in the space, Srivas would commercialize and sell products based on open-source Hadoop.
Unlike its competitors, however, MapR offers many features that differ from stock Hadoop, and the company claims its product is three times faster than the existing Hadoop Distributed File System. To perfect MapR, Srivas and his team spent two years refactoring Hadoop and eliminating its flaws as a large-scale data-processing platform. As Srivas told Wired: "Three years ago I gave public talks about Hadoop's problems, and three years later those problems still exist in the open-source version. At some point you have to say 'this can't be fixed', throw it out, and rebuild it. That's what we did over the past two years."
In the Internet age, ever more data pours into enterprises worldwide, and Hadoop has become the model by which Internet giants reinvent their hardware and software to handle day-to-day business. Hadoop uses clusters of inexpensive servers to analyze and process vast amounts of unstructured data.
Technology giants such as Microsoft, Oracle, and IBM now offer their own Hadoop-based products. MapR is just one of the startups; Cloudera and Hortonworks are equally compelling. Cloudera and Hortonworks also contribute to the open-source project today, but parts of their code are proprietary.
Each of these startups has its own way of improving Hadoop, and because their products face fierce competition in the market, criticism of rival products is inevitable. Srivas's transparent plan for MapR's development forcefully rebuts the criticism aimed at it. At the same time, he says that while Hadoop is strong, it still needs careful polishing.
MapR is similar to Google
In fact, Google itself does not use Hadoop. Google's cloud-computing infrastructure comprises four separate, tightly integrated systems: the cluster-based Google File System, the MapReduce programming model, the distributed lock service Chubby, and BigTable, a simplified large-scale distributed database Google developed around the characteristics of its own applications. The big-data platforms at Yahoo! and Facebook are based on the content of Google's research papers.
"Google, Facebook and Yahoo! have proven that the Hadoop platform is ready for prime time," Cloudera COO Kirk Dunn told Wired. Tens of thousands of nodes at Google, Facebook, and Yahoo! have been running for years. But although Yahoo! and Facebook use thousands of commodity servers running Hadoop to handle unprecedented volumes of data, most businesses do not operate at Yahoo! or Facebook scale; a smaller cluster platform is enough to meet most business needs.
At the same time, Srivas again stresses the inadequacies of the open-source version of Hadoop, such as the single point of failure that still plagues it: if the primary node fails, running tasks are lost and data may be corrupted. Yahoo! and Facebook employ 50 to 70 engineers to deal with such incidents; other companies do not have that kind of staff.
Srivas says that before founding MapR he met Cloudera's founders and considered joining them. But Cloudera wanted to profit by applying Red Hat's Linux strategy to Hadoop: selling support, services, and other software around the open-source platform. That did not fit Srivas's vision.
He became acquainted with John Schroeder, CEO of Calista Technologies (a virtualization software vendor acquired by Microsoft in early 2008), and together they founded MapR in 2009. Today MapR's technology powers storage giant EMC's Greenplum HD Enterprise Edition of Hadoop.
Schroeder and Srivas became friends while working together on perfecting MapReduce. The two hold the same view: the biggest reason for Google's success is its underlying infrastructure rather than its search algorithms. Schroeder says MapReduce, GFS, and BigTable keep Google at the industry's leading edge.
The future of Hadoop
According to Srivas and Schroeder, their Hadoop distribution leads other open-source Hadoop distributions on many features. Others may disagree, but it is an indisputable fact that MapR's product overcomes inherent flaws in the open-source versions of Hadoop.
Hadoop implements a distributed file system called HDFS (Hadoop Distributed File System) and a large-scale data computation platform called MapReduce, which relies on HDFS. Typically, the data to be processed is divided into small blocks; HDFS replicates each block to ensure the reliability of the system, placing the copies on different machines in the cluster according to certain rules, so that MapReduce can run its computation fastest on the machines that host the data.
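The block-splitting and replica-placement scheme described above can be sketched as follows. This is a simplified illustration under assumed parameters, not HDFS's actual code: the tiny block size, the node names, and the round-robin placement rule are all inventions for the example (real HDFS uses large blocks and rack-aware placement).

```python
# Simplified sketch of how a distributed file system might split a
# file into fixed-size blocks and place each block's replicas on
# distinct machines, so losing one machine loses no data.

BLOCK_SIZE = 8    # bytes per block; illustrative only (HDFS blocks are tens of MB)
REPLICATION = 3   # copies kept of each block

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Chop the byte string into consecutive fixed-size chunks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, machines, replication=REPLICATION):
    # Assign each block's replicas round-robin to distinct machines.
    # Real HDFS placement also considers racks and current load.
    placement = {}
    for b in range(num_blocks):
        placement[b] = [machines[(b + r) % len(machines)] for r in range(replication)]
    return placement

data = b"hello distributed file system"
blocks = split_into_blocks(data)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
# Every block now lives on 3 distinct nodes; a scheduler can run each
# map task on whichever of those nodes is least busy (data locality).
```

The placement map is exactly what lets MapReduce move computation to the data rather than the reverse: the scheduler consults it to pick a machine that already holds the block.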
Srivas says that during two years of development, MapR essentially rebuilt the file system. It also improved Hadoop's JobTracker, which distributes tasks across machines and manages their execution. The NameNode, a central server, manages the file-system namespace and client access to files; in the open-source version of Hadoop it remains a single point of failure, and it limits the number of files the system can handle.
Cloudera's Kirk Dunn concedes the drawbacks of open-source Hadoop that Srivas mentions, but says there are other factors to weigh when evaluating its merits. The open-source version, he argues, will eventually overcome these inherent defects, and its openness will ultimately become a unique advantage: as everyone knows, open source enjoys broad community support. Would you rather rely on hundreds of engineers to solve important problems, or on a company with only a handful of elite engineers?
In essence, Hadoop is primarily a batch system: it takes a while to process data before producing results, and today it cannot generate information in real time. As its search needs evolved, Google abandoned MapReduce for indexing and moved to a platform called Caffeine, which makes the search engine faster. John Schroeder hinted that MapR is also working in a similar direction, though its solution may look very different from Caffeine.
Srivas points out that today's Hadoop is quite different from what runs inside Google. Beyond GFS and MapReduce, Google runs a job-scheduling and monitoring system called Borg at its software layer, responsible for managing the server clusters in its data centers. Google has disclosed nothing about Borg, and like all former Google employees, Srivas cannot reveal its details. But, he says, you must not mistake Hadoop for Google's infrastructure; a company like Google surely has secret weapons it has not yet announced.
To succeed, Hadoop must continue to evolve. And MapR has everything it takes. (Compiled by Li)