It is estimated that by 2015, more than half of the world's data will touch Hadoop, and the ever-growing ecosystem around the open source platform is powerful confirmation of this striking figure.
However, some point out that while Hadoop is the hottest topic in the bustling big data field right now, it is certainly not a panacea for every data center and data management challenge. With that in mind, we won't speculate about what the platform will look like in the future, or about where this open source technology that is radically changing data-intensive solutions is headed; instead, we will focus on real cases in which Hadoop is gaining more and more traction.
There is no doubt that several outstanding examples show how Hadoop and related open source technologies (Hive and HBase) are reshaping the infrastructure of companies that depend on big data.
While we will continue a series of articles in the run-up to this year's Hadoop World conference, it is useful to highlight several high-profile, large-scale Hadoop deployments that are reshaping data-reliant companies in social media, travel, and general goods and services industries.
Let me introduce one of the first companies whose name you started hearing in the boom: eBay.
Case One: eBay's Hadoop environment
Anil Madan of eBay's analytics platform development group discussed how the auction giant is harnessing the power of the Hadoop platform to make full use of the 8 TB to 10 TB of data that floods in every day.
Although eBay moved to a production Hadoop environment only a few years ago, it was among the first big internet companies to experiment with Hadoop, starting in 2007 with a small cluster used for problems in machine learning and search.
Those were small amounts of data, Madan says, but useful for a pilot project; as data volumes grew and user activity became more frequent, eBay wanted to take full advantage of data from several departments and from its entire user base.
eBay's first large Hadoop cluster was the 500-node Athena, a purpose-built production platform serving several departments within the company. The cluster was built in less than three months, quickly began churning through predictive models and tackling problems in real time, and was later expanded to meet other requirements.
Madan said the cluster is now used by many groups at eBay, for both day-to-day production jobs and one-off jobs. The team uses Hadoop's Fair Scheduler to manage resource allocation: it defines job pools for each group, assigns weights, limits the number of concurrent jobs per user and per team, and sets preemption timeouts and delayed scheduling.
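As a rough illustration (not eBay's actual code), here is a minimal sketch of how a job can be routed into a named Fair Scheduler pool on a classic MR1 cluster; the pool name, paths, and job name are placeholder assumptions, and the pools, weights, and per-user limits themselves live in the scheduler's allocation file on the JobTracker.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class PooledJobSubmit {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PooledJobSubmit.class);
        conf.setJobName("daily-forecast-model");   // illustrative job name

        // Route this job into a named Fair Scheduler pool. The pools, their
        // weights, running-job limits and preemption timeouts are defined in the
        // allocation file referenced by mapred.fairscheduler.allocation.file.
        conf.set("mapred.fairscheduler.pool", "analytics");

        // Mapper/reducer omitted: the stock identity classes simply pass
        // input records through, which is enough for this sketch.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```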
Although Madan talks up the real value of Hadoop on stage, he also mentions several key challenges his team faces, and continues to grapple with, as it scales out eBay's infrastructure. The Hadoop-related challenges are listed below:
Scalability
With the current release, there are scalability issues with the master server, the NameNode. Because it keeps the entire file system metadata in memory, its memory footprint grows as the cluster's file system grows; roughly 1 PB of storage capacity requires about 1 GB of NameNode memory. Practical solutions under consideration include hierarchical namespace partitioning, or managing the metadata with a combination of ZooKeeper and HBase.
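To get a feel for that scaling pressure, the back-of-the-envelope estimate below uses the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block); the counts are illustrative, not eBay's.

```java
public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        long files = 65_000_000L;      // illustrative file count
        long blocksPerFile = 2;        // assumed average blocks per file
        long bytesPerObject = 150;     // rough heap cost per file/dir/block

        long objects = files + files * blocksPerFile;
        double heapGb = objects * bytesPerObject / (1024.0 * 1024 * 1024);
        System.out.printf("~%.1f GB of NameNode heap for %,d namespace objects%n",
                heapGb, objects);
    }
}
```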
Availability
NameNode availability is critical for production workloads. The open source community is working on cold, warm, and hot standby options, such as Checkpoint and Backup nodes, AvatarNode failover from a secondary NameNode, and journal metadata replication techniques. We are evaluating these options for building our production clusters.
Data discovery
Supporting data monitoring, discovery, and schema management on a system that does not inherently impose structure. A new project is under way to merge the Hive metadata store and Owl into a new system called Howl. We are working to connect it to our analytics platform so that our users can easily find data across the different data systems.
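The kind of discovery Howl aims at can be sketched, in a minimal and hypothetical way, against the plain Hive metastore client API; this assumes a reachable metastore configured through hive-site.xml and is not eBay's tooling.

```java
import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;

public class ListHiveTables {
    public static void main(String[] args) throws Exception {
        // Connects to the metastore described in hive-site.xml on the classpath.
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        for (String db : client.getAllDatabases()) {
            for (String table : client.getAllTables(db)) {
                List<FieldSchema> cols =
                        client.getTable(db, table).getSd().getCols();
                System.out.println(db + "." + table + " -> " + cols.size() + " columns");
            }
        }
        client.close();
    }
}
```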
Data movement
We are developing publish/subscribe data movement tools to support copying and reconciling data across our different subsystems, such as the data warehouse and the Hadoop Distributed File System (HDFS).
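eBay's publish/subscribe tooling is not public, but the basic copy step between systems can be sketched with the stock FileSystem API; the cluster URIs and paths below are illustrative placeholders, and large transfers would normally go through distcp instead.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyBetweenClusters {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode addresses for the two systems being reconciled.
        FileSystem src = FileSystem.get(URI.create("hdfs://warehouse-nn:8020"), conf);
        FileSystem dst = FileSystem.get(URI.create("hdfs://analytics-nn:8020"), conf);

        Path from = new Path("/warehouse/exports/2011-10-01/");
        Path to = new Path("/data/incoming/2011-10-01/");

        // Copy without deleting the source.
        FileUtil.copy(src, from, dst, to, false, conf);
    }
}
```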
Policy
Managing storage capacity through quotas (the current Hadoop quotas need improvement), and devising good retention, archiving, and backup strategies. We are trying to define these policies across our different clusters based on each cluster's workload and characteristics.
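A retention or capacity policy of this kind would typically start from the quota and usage numbers HDFS already exposes; the sketch below reads them for a hypothetical directory (the quotas themselves are set administratively, for example with hadoop dfsadmin -setSpaceQuota).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckQuota {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Illustrative directory; a policy tool would walk many such paths.
        ContentSummary summary = fs.getContentSummary(new Path("/user/analytics"));

        System.out.println("files+dirs  : "
                + (summary.getFileCount() + summary.getDirectoryCount()));
        System.out.println("name quota  : " + summary.getQuota());
        System.out.println("space used  : " + summary.getSpaceConsumed());
        System.out.println("space quota : " + summary.getSpaceQuota());
    }
}
```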
Metrics, metrics, metrics
We are building robust tools to generate metrics covering data sources, usage, budgeting, and utilization. The existing metrics exposed by some of the Hadoop enterprise servers are not comprehensive, and some are only transient, which makes it hard to see clear patterns in how the cluster is used.
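This is not eBay's internal tooling, but as a starting point, basic utilization numbers can be pulled from a classic MR1 JobTracker as sketched below.

```java
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterUtilization {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        ClusterStatus status = client.getClusterStatus();

        // Running tasks versus configured slot capacity across the cluster.
        System.out.println("task trackers : " + status.getTaskTrackers());
        System.out.println("map slots     : " + status.getMapTasks()
                + " / " + status.getMaxMapTasks());
        System.out.println("reduce slots  : " + status.getReduceTasks()
                + " / " + status.getMaxReduceTasks());
    }
}
```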
Case Two: GE uses Hadoop to analyze user sentiment
Sentiment analysis is tricky: it is as much a business challenge as a technical one, according to Linden Hillenbrand, a product manager on GE's Hadoop technology team.
At General Electric, the digital media group and the Hadoop group worked together to build an interactive application for the marketing department that relies heavily on advanced sentiment analysis.
The aim is to let the marketing team gauge external perceptions of GE (positive, neutral, or negative) around the campaigns the company runs. Hadoop supports the sentiment analysis portion of the application, a highly intensive text mining workload for the platform.
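GE's actual approach is proprietary, but the flavor of such a text mining workload can be sketched as a simple MapReduce job that scores mentions against small positive and negative word lists; the word lists and input format here are illustrative assumptions, not GE's method.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SentimentCount {
    public static class SentimentMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Toy word lists; a real system would use far richer lexicons and models.
        private static final Set<String> POSITIVE =
                new HashSet<String>(Arrays.asList("great", "love", "reliable"));
        private static final Set<String> NEGATIVE =
                new HashSet<String>(Arrays.asList("bad", "broken", "slow"));
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is assumed to be one mention of the brand.
            for (String token : value.toString().toLowerCase().split("\\W+")) {
                if (POSITIVE.contains(token)) {
                    context.write(new Text("positive"), ONE);
                } else if (NEGATIVE.contains(token)) {
                    context.write(new Text("negative"), ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```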
He claims that using Hadoop to address these challenges at the technical level has brought significant improvements.
To underscore this, Hillenbrand points to the company's unique NoSQL approach to sentiment analysis, which achieves roughly 80% accuracy and, built on a core Hadoop platform, secures the foundation for the company's future work in data mining. As the accompanying illustration showed, GE has made significant gains by leveraging data mining and the new platform to produce fresh insights.
Hillenbrand said the sentiment analysis project was doubly successful in terms of GE's internal vision for Hadoop's future: it not only delivers more accurate results for the Fortune 50 company's marketing team, but also lays the groundwork for its next generation of deep data mining, analysis, and visualization projects.
Case Three: A typical use case from the travel industry
Orbitz Worldwide's global consumer travel brands handle millions of searches and transactions every day.
Storing and processing the ever-growing volumes of data generated by that activity has become increasingly difficult with traditional systems such as relational databases, so the company has turned to Hadoop to help eliminate some of the complexity.
The company's lead software engineer, Jonathan Seidman, and fellow engineer Jairam Venkataramiah have been happy to discuss how the travel site's infrastructure is managed. In a recent talk to a sizable audience, they described the role of Hive, especially for key search functions.
Hadoop and Hive help the online travel hub with all sorts of tasks, from improvements that let visitors quickly filter and sort hotels to spotting broader internal trends. According to the two engineers, Orbitz's big data problem makes it a "typical" Hadoop use case. Handling millions of searches and transactions a day is no easy job, they say, when a distributed network of services generates hundreds of gigabytes of logs daily.
In the slides, they demonstrate how they use Hadoop and Hive to process this data and, perhaps more importantly, what makes the company's particular problems such a good fit for Hadoop (a reminder that Hadoop is not the right answer for every business).
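None of Orbitz's actual jobs appear here; purely as an illustration of the Hive-driven style of analysis described above, the sketch below runs a query over a hypothetical table of hotel bookings via JDBC, assuming a pre-HiveServer2 endpoint at an illustrative address.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopHotels {
    public static void main(String[] args) throws Exception {
        // Pre-HiveServer2 driver and URL form; endpoint address is illustrative.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://hive-server:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // Hypothetical schema: one row per booking, partitioned by date.
        ResultSet rs = stmt.executeQuery(
                "SELECT hotel_id, COUNT(*) AS bookings " +
                "FROM hotel_bookings WHERE dt = '2011-10-01' " +
                "GROUP BY hotel_id ORDER BY bookings DESC LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}
```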
Case Four: Facebook gives Hadoop a status update
While some companies and institutions are secretive about their huge Hadoop systems, Facebook's data warehouse Hadoop cluster has become the largest known Hadoop storage cluster in the world.
Here are some details about this single HDFS cluster:
21 PB of storage in a single HDFS cluster
2,000 machines
12 TB of storage per machine (a few machines have 24 TB each)
1,200 machines with 8 processor cores each and 800 machines with 16 cores each
32 GB of memory per machine
15 MapReduce tasks per machine
The total of more than 21 PB of configured storage capacity is larger than the previously famous Yahoo cluster (14 PB). Facebook, along with several other internet giants, had been leveraging the framework to manage its evolving business since Hadoop's early days.
With more than 400 million monthly active users, more than 500 billion page views per month, and as many as 25 billion pieces of content shared each month, Facebook is a perfect proving ground for any technology that claims to handle big data problems.
Facebook's engineers work closely with Yahoo's Hadoop engineering team to push Hadoop toward greater scalability and performance. Facebook runs many Hadoop clusters, the largest of which is used for its data warehouse. The following statistics describe several characteristics of that data warehouse cluster:
12 TB of compressed data added per day
800 TB of compressed data scanned per day
25,000 MapReduce jobs per day
65 million files in HDFS
30,000 clients simultaneously accessing the HDFS NameNode
Facebook software engineer and open source advocate Jonathan Gray has demonstrated how Facebook uses one part of the larger Hadoop platform architecture, HBase, to support both online and offline applications in production.
Although the slides are a bit esoteric and specific to Facebook's environment, they broadly describe the kinds of complex data environments where HBase is a good fit and, more importantly, some of the major tuning and expertise needed to manage such an environment. HBase is only one of the ways Facebook manages massive amounts of data and delivers uncannily smart services to its users.
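Facebook's internal setup is far more elaborate, but the basic shape of an online application talking to HBase, a write followed by a read, looks roughly like the sketch below, using the 0.90-era client API and an illustrative table and column layout.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "messages");   // illustrative table name

        // Write one cell: row "user123", family "msg", qualifier "last".
        Put put = new Put(Bytes.toBytes("user123"));
        put.add(Bytes.toBytes("msg"), Bytes.toBytes("last"), Bytes.toBytes("hello"));
        table.put(put);

        // Read it back.
        Get get = new Get(Bytes.toBytes("user123"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("msg"), Bytes.toBytes("last"))));

        table.close();
    }
}
```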
Case Five: Infochimps and the million-fold mashup
Ask Philip "Flip" Kromer where you can find almost any list, spreadsheet, or dataset, and he will be happy to point you to his company, Infochimps, which bills itself as "the world's data warehouse."
Every month, thousands of people visit the site in search of specific data. Lately, users have been querying Twitter and social networking data; its more traditional offerings include other popular datasets covering finance, sporting events, and stocks.
Kromer says users could, of course, find these datasets elsewhere, but they often come to Infochimps not because the data is missing or hard to reach, but because obtaining it elsewhere is prohibitively expensive, or because it comes in a format that is not usable, at least not for the developer customer base Infochimps targets.
The company is assembling a data repository of tens of thousands of public and commercial datasets, many of them terabyte-scale. Modern machine learning algorithms mine the data in depth by exploiting its general structure, even when that data is organically embedded within linked datasets. All of this, of course, creates a complex data environment that demands a platform able to serve multiple audiences, both internal (data collection and management) and the platform's users.
Infochimps lets users leverage Hadoop on Amazon's and Rackspace's cloud infrastructure to make the most of the data. As you can see below, the company makes heavy use of Flex Hadoop, drawing on Amazon Web Services (AWS) and Rackspace while running Hadoop on the back end to meet its needs.
The company lets users get the Hadoop resources they need at any time, whether scheduled, ephemeral, or dedicated. That flexibility supports nightly batch runs, compliance or test clusters, science systems, and production systems. Coupled with Ironfan (Infochimps' automated systems provisioning tool), Flex Hadoop lets users tailor resources to the job at hand; according to Infochimps, this simplifies spinning up specialized map or reduce machines, such as high-compute or high-memory machines, as needed.
Case Six: Hadoop's role in mining military intelligence
Digital Reasoning claims that in one of its core markets, the US government, it is leading the way in the "automated understanding of big data."
In pursuit of that goal, Digital Reasoning has lately been hard at work combing through vast amounts of unstructured text data from US intelligence agencies, looking for threats to national security. This purpose-built software for entity-oriented analytics has become the core of its Synthesys technology, the foundation of its business.
The company uses the Cloudera distribution, and its Synthesys platform supports HBase, the distributed, column-oriented open source database. According to Digital Reasoning, "This integration gives us access to hyper-scale processing and provides complex data analysis capabilities for government and business markets."
CEO Tim Estes outlines the company's infrastructure and use case in the following slides:
"Cludera and its Hadoop expert team work closely with us to make new breakthroughs in the complex analysis field. Cludera and Digital reasning together provide the most demanding customers with the ability to identify and correlate entities to a very large number of different datasets, "says Tim Reasning, chief executive of Digital Estes.
He went on to say that critical intelligence data that used to sit in isolated silos could previously only remain "orphaned," but with Synthesys integrating Cloudera's Distribution including Apache Hadoop (CDH3) and its HBase support, "we can combine algorithms that automatically understand the data with a platform that can handle the scale and complexity, and connect the pieces together in an unprecedented way."