Major technology companies have been investing heavily in cloud computing of late, from cloud platform management and massive data analysis to a variety of emerging consumer-facing cloud platforms and services. Large-scale data processing (big data) technology, represented by Hadoop, is turning "business is king" into "data is king". The Hadoop community is visibly thriving: more and more companies at home and abroad are taking part in its development or open-sourcing the software they run in production.
Yahoo!, then in fierce competition with Google, recruited Doug Cutting (the founder of Hadoop) to build open source versions of the distributed file system (DFS) and MapReduce, the technologies that were Google's lifeblood. That marked the start of Hadoop's early years, and it was not until around 2008 that Hadoop matured. From its inception to the present, Hadoop has been building momentum for at least seven years, and it is no longer the exclusive property of second-place Yahoo. The long list of Hadoop users includes Facebook, LinkedIn and Amazon, as well as EMC, eBay, Twitter, IBM, Microsoft, Apple and HP. Domestic users include Taobao and Baidu.
The latest news shows that even Microsoft, the software giant, has recently opened its arms to Hadoop. At the SQL PASS Summit 2011 in Seattle on October 12, Microsoft announced that it would work with Hortonworks, a Yahoo spin-off, to bring Apache Hadoop to the Windows Server and Windows Azure platforms. As Microsoft's strategic partner, Hortonworks will contribute its expertise to integrate Hadoop as fully as possible into Microsoft's products.
Microsoft has said it expects to release a preview of Hadoop on Windows Azure by the end of this year, with the Hadoop-based offering for Windows Server to follow in 2012. The Windows Server version will also work together with Microsoft's existing BI tools. Microsoft officials also confirmed that SQL Server "Denali" will be officially named SQL Server 2012. At the same time, Microsoft will increase its investment in JavaScript and plans to use the language to implement high-performance map/reduce. The company has committed to working closely with the Hadoop community and contributing actively to the Apache Software Foundation's projects.
Ted Kummert, senior vice president of Microsoft's Business Platform Division, said in a statement that the move would help Microsoft's customers manage their big data better. More and more companies are looking for ways to collect and analyze unstructured data in order to gain insight into their business. Traditional relational databases, however, were designed primarily for structured data, and their inherent characteristics limit how well they scale. Hadoop, an open source framework for big data, is increasingly appealing to IT executives because it is well suited to unstructured data such as the content of e-mail messages, blogs, clickstream data, audio and video.
The other giants, of course, are not to be outdone and have made moves of their own. Oracle recently launched a big data appliance built on Hadoop, Oracle's own NoSQL database, and a distributed data analysis system based on the open source language R. Just a few days ago, IBM announced that it would buy Platform Computing, a privately held system software company. The acquisition is meant to help IBM serve its customers better by letting them manage and analyze large-scale data more effectively while reducing cost and system complexity.
Any discussion of Oracle's latest moves around Hadoop has to mention the R language. R, an open source language for statistical data analysis, is quietly expanding its influence in the enterprise. Freely available extensions allow the R engine to run on a Hadoop cluster.
R is a language and operating environment used mainly for statistical analysis and graphics. It was originally developed by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand (hence the name R) and is now maintained by the R Development Core Team. R is a GNU project based on the S language, so it can be regarded as an implementation of S, and most code written in S runs in the R environment without modification. R's semantics draw on Scheme.
Statisticians can now use R, which excels at this kind of work, to analyze unstructured data stored in the Hadoop Distributed File System (HDFS). R can also now run against HBase, a non-relational, column-oriented distributed data store modeled mainly on Google's Bigtable. HBase is a subproject of the Apache Software Foundation's Hadoop project, and it is essentially a database that uses Hadoop to hold structured data.
Revolution Analytics provides commercial software extensions and support for open source R, enabling statisticians and scientists to extract meaningful information from large volumes of data in a short time. David Champagne, chief technology officer at Revolution Analytics, says the R engine can be deployed on every node in a Hadoop cluster. Instead of programming map and reduce algorithms in Java, users can write them in R wherever R is deployed. The nodes then run the Hadoop map functions and perform the statistical analysis in parallel on the data stored in HDFS.
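The idea of running a map function on each node and then combining the results can be illustrated with a minimal sketch. It is written in plain Python rather than R, does not use any Hadoop or Revolution Analytics API, and runs in a single process. The function names and the sample data are hypothetical, chosen only to show how a statistic such as mean and variance can be computed from per-block partial summaries.

```python
from functools import reduce


def map_partial_stats(block):
    """Map step: summarize one block as (count, sum, sum of squares)."""
    return (len(block), sum(block), sum(x * x for x in block))


def combine(a, b):
    """Reduce step: merge two partial summaries element by element."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(blocks):
    """Map every block, reduce the summaries, then derive the global
    mean and population variance from the combined totals."""
    partials = map(map_partial_stats, blocks)
    n, total, total_sq = reduce(combine, partials, (0, 0.0, 0.0))
    if n == 0:
        raise ValueError("no data to analyze")
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance


# Hypothetical data, already split into blocks the way a distributed
# file system would split a large file across nodes.
blocks = [[2.0, 4.0], [4.0, 4.0], [5.0, 5.0, 7.0, 9.0]]
mean, variance = mean_and_variance(blocks)
print(mean, variance)
```

Because each block is summarized independently, the map step could run on whichever node holds that block, and only the three-number summaries would need to travel over the network. The sum-of-squares formula is kept for brevity; it can lose precision on large or badly scaled data.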
In addition, the server edition of the open source operating system Ubuntu 11.10 has begun to support Juju (formerly code-named Ensemble), which provides automated deployment for more than 30 kinds of cloud applications, including MySQL, Tomcat 6 and Hadoop, and helps enterprises speed up large-scale deployment of cloud applications.
To deploy Hadoop, IT staff previously had to install Java, install Hadoop on top of it, and then configure the cluster relationships between servers. Now they can install Juju on Ubuntu 11.10 and, by entering a few commands at the command line, have Java and Hadoop installed and a Hadoop cluster set up automatically, so that enterprises can build Hadoop applications quickly. When the cluster later needs to grow, IT staff only have to enter Juju commands to bring new servers into the Hadoop system.
Twitter also recently open-sourced Storm, a real-time computing system that has been described as a real-time Hadoop. Storm is distributed and fault-tolerant, is hosted on GitHub, and is released under the Eclipse Public License 1.0. It was developed by BackType, which is now part of Twitter. The latest version on GitHub is Storm 0.5.2, written mostly in Clojure.
Storm provides a general set of primitives for distributed real-time computing. It can be used for "stream processing", handling messages and updating databases in real time, as an alternative to managing your own cluster of queues and workers. It can also be used for "continuous computation", running a continuous query over data streams and delivering the results to users as a stream. A third use is "distributed RPC", running expensive operations in parallel. Nathan Marz, Storm's lead engineer, said that Storm makes it easy to write and scale complex real-time computations on a cluster of computers, and that Storm is to real-time processing what Hadoop is to batch processing. Storm guarantees that every message will be processed, and it is fast: a small cluster can handle millions of messages per second. Better still, you can develop in any programming language.
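The notion of continuous computation, a query that runs over a stream and emits its results as another stream, can be illustrated with a small single-process sketch. It is plain Python and does not use Storm's API. The function names and sample messages are hypothetical, and the sketch has none of Storm's distribution or fault tolerance.

```python
from collections import Counter


def message_source(messages):
    """Stand-in for an unbounded message stream; here it replays a list."""
    for message in messages:
        yield message


def running_word_counts(stream):
    """Continuous query: for each word in each incoming message, emit
    the word together with its updated running count."""
    counts = Counter()
    for message in stream:
        for word in message.lower().split():
            counts[word] += 1
            yield word, counts[word]


# Hypothetical input messages.
messages = ["storm is fast", "Hadoop is batch", "storm is realtime"]
results = list(running_word_counts(message_source(messages)))
for word, count in results:
    print(word, count)
```

Each result is produced as soon as its message arrives rather than after the whole input has been read, which is the difference between this style and a batch job. In a real deployment the source would never end and the work would be spread across many machines.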
(Responsible editor: Liu Fen)