Guide: Yahoo CTO Raymie Stata is a key figure in leading a massive data analysis engine. Vendors such as IBM and the Hadoop community are focusing more and more on massive data, and that data is subtly changing businesses and IT departments.
Massive data covers a growing number of large enterprise datasets and all the technologies needed to create, store, network, analyze, archive, and retrieve them. This flood of information directly drives the development of storage, servers, and security, and it also raises a series of problems that IT departments must address.
Gartner, the information technology research and analysis firm, describes massive data processing as moving large volumes of many different types of structured and unstructured data over the network into processors and storage devices, and then converting that data into business reports.
Massive data processing has three main factors: large volume, multiple formats, and speed.
Large volume (terabytes, petabytes, even exabytes): Ever more business data generated by people and machines challenges IT systems, and storing, securing, and later accessing and using that data becomes increasingly difficult.
Multiple formats: Massive data comes in a growing variety of formats, and each format requires its own processing methods. It ranges from simple emails, data logs, and credit card records to instrument-collected scientific research data, medical data, financial data, and rich media such as photos, music, and video.
Speed: the rate at which data moves from endpoints into processors and storage.
Dan Kusnetzky, an analyst at the Kusnetzky Group, wrote on his blog, "Simply put, big data means huge data sets and the storage facilities and tools that allow organizations to create, manipulate, and manage them." Does that mean data sets well beyond terabytes and petabytes will appear in the future? Vendors' answer is that they will.
Those vendors might say, "You need our products to manage and organize the use of large-scale data; just think of the headaches caused by the complexity of maintaining dynamic data sets." Another value of massive data is that it can help companies make the right decisions at the right time.
Traditional data analysis software is powerless in the face of today's massive data, but that situation is quietly changing. New massive data analysis engines have emerged, such as Apache Hadoop, the LexisNexis HPCC Systems platform, and the cloud-based analytics service from 1010data, a hosted massive data analysis platform provider.
Tim Negris, senior vice president at 1010data, says that collecting massive data and actually storing and using it are two different things. Having to do a great deal of work to prepare the data before anything else can happen is one of the challenges facing Oracle and most other database vendors. "We are trying to eliminate this problem and hand the data directly to analysts." Hadoop and the HPCC Systems platform take the same approach; all three platforms focus on massive data and provide support for it.
Open-source Hadoop has proven itself the most successful data-processing platform on the market over the past five years. Doug Cutting, Hadoop's creator, now at Cloudera and active in the Apache Software Foundation, did the original work while at Yahoo.
Hadoop breaks massive data sets into smaller, more manageable chunks and distributes them across multiple servers for analysis (agility is an important attribute here, much as food cut into small pieces is easier to digest), and then handles queries over the distributed data.
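To make that split-and-distribute idea concrete, the sketch below shows the classic MapReduce word-count job written against Hadoop's Java API: the framework splits the input across the cluster, each mapper emits (word, 1) pairs for its own chunk, and reducers merge the partial counts into totals. This is a minimal illustrative example, not code from any of the vendors discussed in this article; the class name and the input/output paths are assumptions.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each node processes its own chunk of the input,
    // emitting a (word, 1) pair for every word it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: partial counts from all mappers are grouped by word
    // and summed into a final total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on each node to cut network traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory (assumed)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory (assumed)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this is packaged into a jar and submitted to the cluster (for example with "hadoop jar wordcount.jar WordCount /input /output"), and Hadoop takes care of distributing the work across nodes and re-running tasks that fail.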
"Gartner and IDC analysts believe that the processing speed of massive data and the ability to process data are the places where Hadoop attracts people." Charles Zedlewski, vice president of Cloudera's products, said.
After Cutting and his Yahoo team launched the Hadoop project, it was tested and widely used in Yahoo's IT systems for many years. They then released Hadoop to the open-source community, where it gradually matured into a product.
While developing, testing, and running the code internally, Cutting and Yahoo learned that it was complicated to use. That led them to realize that there was money to be made in providing peripheral services, such as an intuitive user interface, custom deployment, and additional functional software.
Launched in 2009 as an independent company, Cloudera builds its products on the open-source Hadoop analytics engine and also offers Cloudera Enterprise, which integrates additional tools including Hive, HBase, Sqoop, Oozie, Flume, Avro, ZooKeeper, and Pig, along with Cloudera's own tooling.
Cloudera is backed by a number of prominent investors, including VMware co-founder and former CEO Diane Greene, Flickr co-founder Caterina Fake, former MySQL CEO Marten Mickos, LinkedIn president Jeff Weiner, and Facebook CFO Gideon Yu.
Since Cloudera was founded, only a handful of established companies and start-ups have offered their own free versions built on the open-source Hadoop architecture.
This is a genuine enterprise technology competition. As in a relay race, every player must carry the same kind of baton (the Hadoop code); the contest is about the speed, agility, and creativity of data processing. Competition of this sort is the most effective way to push companies to differentiate themselves in the massive data analysis market.
IBM offers InfoSphere BigInsights, which is based on Hadoop, in Basic and Enterprise editions (InfoSphere BigInsights is software and services for analyzing and visualizing massive amounts of data, built on Apache Hadoop). But the company has bigger plans.
IBM CEO Sam Palmisano says IBM has made a new generation of data analysis a research focus and has invested USD 100 million in the effort. Laura Haas, an IBM Fellow and director of computer science research at IBM, said that the company's laboratory research goes far beyond massive data and has already embarked on "exadata" (exabyte-scale) analysis. Watson is one result of IBM's massive data research, and it will be put to more uses, including health care and scientific research.
Other Hadoop versions
MapR has released its own distributed file system and MapReduce engine, and it has partnered with storage and security leader EMC to supply the Hadoop storage components of the Greenplum HD Enterprise Edition. Another distinctive feature of EMC's Hadoop offering is that it is not built on the official Apache code but on Facebook's Hadoop code, which has been optimized for scalability and multi-site deployments.
Another vendor, Platform Computing, provides a distributed analysis platform that is fully compatible with the Apache Hadoop MapReduce programming model and supports multiple distributed file systems.
SGI (Silicon Graphics International) provides Hadoop-optimized solutions and implementation services based on its Rackable and CloudRack server products.
Dell has also started selling servers pre-installed with the open-source data-processing platform. The price varies with the support options; a base configuration costs between USD 118,000 and USD 124,000 and includes one year of Cloudera support and updates, six PowerEdge C2100 servers (two management nodes, one edge node, and three worker nodes), and six Dell PowerConnect 6248 switches.
Alternatives to Hadoop are also surfacing, including 1010data's cloud service and the platform from LexisNexis, which has helped the company analyze huge volumes of customer data over the past ten years and is used in finance and other important industries. LexisNexis recently announced that it would share this core technology with the open-source community as an alternative to Hadoop; the open-source data processing platform it released is called HPCC Systems.
HPCC can manage, sort, and link billions of records in seconds. It provides two data processing and delivery modes: the Thor Data Refinery Cluster and the Roxie Rapid Data Delivery Cluster. LexisNexis's Armando Escalante said Thor is so named because, like the hammer-wielding god of Norse mythology, it can smash through difficult problems; Thor is mainly used, much as Hadoop is, to analyze and index large amounts of data. Roxie behaves more like a traditional relational database or data warehouse and can even serve a web front end.
LexisNexis CEO James Peck says the company believes this is the right move and that the HPCC Systems platform will take massive data analysis to a higher level.
In June 2011, Yahoo and the Silicon Valley venture capital firm Benchmark Capital announced that they would jointly set up a new company called Hortonworks to take over the development of the widely used data analysis software Hadoop.
According to some former Yahoo employees, Hortonworks will, from a business point of view, remain independent and develop its own commercial version.
During the transition, Yahoo CTO Raymie Stata has become a key figure; he is responsible for the development of all the company's IT projects. Stata said that Yahoo will put more effort than ever into Hadoop and related technologies and will invest further in Hadoop through Hortonworks. "We will assign some key personnel to Hortonworks, but this is neither a layoff nor a spin-off. It is an increase in our investment in Hadoop. Yahoo will continue to make an even greater contribution to Hadoop's development."
Stata explains that Yahoo has always dreamed of turning Hadoop into the industry standard for big data analysis software, but that requires commercializing Hadoop. He says the main reason for creating Hortonworks is that Yahoo, after six years with Hadoop, has seen the future of business analytics and knows how to get there. "We see that massive data analysis will soon become a very common need for enterprises."
"We have deployed Hadoop in the enterprise, and I don't think anyone is dismissing that solution. We want to create value for our shareholders through Hadoop. If Hadoop one day becomes the industry standard for massive data processing, that will be the best reward for us."