Introduction: The open-source data-processing platform Hadoop, with its low cost, high scalability, and flexibility, has won recognition from most of the web giants, and it is now moving into more mainstream businesses. IBM will launch a flagship DB2 database management system with built-in NoSQL technology next year. Oracle and Microsoft also disclosed last month that they plan to release Hadoop-based products next year; both companies plan to offer deployment assistance and enterprise-level support. Oracle has pledged to preinstall Hadoop software in its big data appliance.
Apache Hadoop sits at the center of the Big Data revolution. Debate has surrounded the open-source distributed data-processing platform since its release five years ago, but over the past 18 months Hadoop has won customer endorsements, commercial support offerings, and integration from numerous database and data-integration software providers. The three most prominent database vendors among them are Oracle, IBM, and Microsoft.
Will Hadoop become a major technology for big data in the future?
Hadoop is a Java-based software framework for distributed, data-intensive processing and analysis. It is largely inspired by the MapReduce technology that Google described in a 2004 white paper. MapReduce works by breaking a job into hundreds or thousands of small tasks and distributing them across a cluster of computers. Each computer processes its own portion of the data, and MapReduce then quickly combines the partial results into a single answer.
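To make that split-and-aggregate flow concrete, here is a minimal sketch of the classic word-count job written against Hadoop's org.apache.hadoop.mapreduce API (an illustrative example, not any vendor's product code): each mapper emits a count of 1 for every word in its slice of the input, and the reducer sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each node scans its own block of the input and
  // emits (word, 1) for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the framework groups all counts for the same word,
  // and this reducer sums them into the final answer.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional: pre-sum on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The combiner line is the small optimization that makes the pattern scale: each node pre-aggregates its own counts before anything crosses the network.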
Hadoop is highly scalable, handling large datasets distributed across thousands of low-cost x86 server compute nodes. Thanks to Moore's law, memory and disk capacity keep growing, and Hadoop's hardware support is growing with them: nodes with 16 processor cores and 12TB or even 24TB of disk are now being deployed. Cloudera has revealed that its deployments cost about $4,000 per node; even at 12TB per node, that works out to a few hundred dollars per terabyte, a strong competitive advantage over relational database deployments that run $10,000 to $12,000 per terabyte.
This combination of high capacity and low cost is compelling, but Hadoop's greatest appeal is its ability to handle mixed data types.
Hadoop can manage structured data as well as semi-structured data such as server log files and web clickstreams. It can also manage unstructured text, such as posts from Facebook and Twitter. This ability to handle multiple data types is important; the same need spawned NoSQL platforms and products such as Cassandra, CouchDB, MongoDB, and Oracle's latest NoSQL database. Traditional relational databases such as Oracle, IBM DB2, Microsoft SQL Server, and MySQL cannot handle mixed and unstructured data well, and it is this demand for flexibility that has earned Hadoop the attention and support of most data-analysis vendors.
Hadoop is already widely used
Today, Hadoop is regarded as the technology of choice for unstructured data. Its low cost, high scalability, and flexibility have made it the first choice of web giants such as AOL and comScore, which deal with mass clickstream analysis and ad targeting.
AOL has been using Hadoop for more than three years. AOL's research team deployed a 300-node system in Kingsview, California, which stores billions of events per day and more than 500TB of clickstream data. Clickstream data is highly structured, but its sheer volume and variety made it nearly impossible to keep up with all the extract, transform, and load (ETL) work. To solve this, AOL decided to use Hadoop MapReduce to distribute the filtering and join tasks across hundreds of compute nodes. Because of the advantages Hadoop brought to the business, AOL's Hadoop research team deployed a 700-node system at its headquarters this April.
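AOL has not published its actual pipeline, but the filtering half of such a job can be sketched in a few lines. The mapper below is purely hypothetical: it assumes tab-separated clickstream records of the form (timestamp, userId, url), keeps only clicks on one target domain, and emits (url, 1), so each node filters its own share of the data before a summing reducer (like the one in the word-count sketch above) produces per-URL totals.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical clickstream filter: not AOL's code, just an
// illustration of pushing the filter out to the data.
public class ClickFilterMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final String TARGET_DOMAIN = "example.com"; // assumed filter criterion
  private static final IntWritable ONE = new IntWritable(1);
  private final Text url = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumed record layout: timestamp <TAB> userId <TAB> url
    String[] fields = value.toString().split("\t");
    if (fields.length < 3) {
      return; // skip malformed records instead of failing the whole task
    }
    if (fields[2].contains(TARGET_DOMAIN)) {
      url.set(fields[2]);
      context.write(url, ONE);
    }
  }
}

The point of the design is that the expensive step, discarding irrelevant records, happens in parallel on the nodes that already hold the data, so only the small filtered result ever moves across the network.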
Hadoop's suitability for all types of data destines it for use in a much wider range of areas. One example is SunGard, a hosting and application service provider for small and medium enterprises, which plans to launch a cloud-based hosting service built on Hadoop MapReduce to help financial services companies with their data processing.
Commercial software vendor Tidemark recently launched a SaaS product that uses MapReduce to turn mixed data sources into material for product and financial planning.
The three database giants push hard into big data
At the IOD (Information on Demand) annual conference held in Las Vegas last month, IBM Fellow and DB2 chief architect Curt Cotner announced that IBM will launch a flagship DB2 database management system with built-in NoSQL technology next year. He also said that NoSQL databases are the future direction of database development. Google's BigTable and Amazon's Dynamo are NoSQL databases already in production use, while traditional relational databases have proven powerless against ultra-large-scale, highly concurrent SNS and Web 2.0 sites. IBM has also released a series of data-analysis software, including a cloud computing edition of InfoSphere BigInsights. BigInsights is a data-analysis suite built on Hadoop, capable of processing the large volumes of unstructured data that enterprise users collect.
Microsoft announced at the SQL PASS Summit 2011 in Seattle on October 12 that it would work with Hortonworks, a spin-off from Yahoo, to bring Apache Hadoop to the Windows Azure and Windows Server platforms. The Hadoop-based Windows Server offering will also work alongside Microsoft's existing BI tools.
Oracle, the world's largest relational database provider, has acted as well: at Oracle OpenWorld it launched the Big Data Appliance, a system that integrates Hadoop, a NoSQL database, Oracle Database adapters for Hadoop, the Oracle Loader for Hadoop, and the R language.
The future of Hadoop
Based on the current momentum, Hadoop will continue to grow as a core technology of enterprise data warehouse architectures over the next few years. New Hadoop-related companies, including MapR, Zettaset, Cloudera, HStreaming, Hadapt, DataStax, and Datameer, have attracted investment and are known for bringing the latest technology to a variety of markets.
Meanwhile, the next generation of MapReduce will deliver many long-hoped-for improvements. First, cluster size will grow from the current 4,000 nodes to 6,000-10,000, and the number of concurrent tasks will rise from the current 40,000 to 100,000. Broader hardware support will continue, along with architectural changes, including support for more programming models.