When the conversation turns to "big data," Apache Hadoop usually comes up soon after. There is a good reason for this: Hadoop has a file system that is not afraid to take in differently structured data, and a massively parallel processing (MPP) system that can work through large datasets quickly. Moreover, because Hadoop is built on commodity hardware and open source software, it has the advantages of both low cost and scalability.
These features make the Hadoop architecture a very attractive technology for CIOs, especially those under pressure to deliver more differentiation, take in new kinds of data, and control costs. Brian Hopkins, an enterprise architecture analyst at Forrester, argues that business as usual is no longer enough to meet demand.
"The cost of scaling an on-premises enterprise data warehouse is prohibitively high," he said. "Massively parallel processing (MPP) data warehouse appliances bring that cost down through their parallel architecture. But even so, the savings come with a catch: the cost per terabyte of data is still quite high."
So while the price of Hadoop is tempting, it is not the best technology for every big data problem. The technology is relatively new and immature, which means it inevitably comes with sticking points and problems. So how do CIOs decide when to deploy the Hadoop framework? Here are three revelations, drawn from how the genealogy website Ancestry.com used Hadoop to get out of a bind, that signal it is time to meet Hadoop.
Revelation 1: You need better data processing performance but cannot pay "first class" prices
Until three years ago, Ancestry.com was still using an in-house data processing architecture, but as its family tree records, subscribers, and services grew, that architecture gradually reached the limit of how far it could scale. The Ancestry.com IT department, which had been trying to handle 4 PB of data, finally turned to Hadoop to solve its data processing problem. Even so, the company's genealogy website continues to rely on SQL Server-type databases. Much of the data lands in Hadoop first and is then transferred to the data warehouse for day-to-day analysis.
"We found that the best data architecture for us is one that lets us put a lot of data into Hadoop while storing only a small amount in the data warehouse," said Scott Sorensen, senior vice president of engineering at Ancestry.com.
Hopkins calls this "reasonable cost" performance, meaning that the data warehouse is used more efficiently and more cost-effectively. Many companies that evaluate their data warehouse usage find that a large portion of the data is never accessed or analyzed, a share that is sometimes as high as 60%, Hopkins says. A recent Gartner survey shows that as analytics and digital technology grow more important, IT budgets that stay flat can lead to inefficiencies that undermine competitive advantage.
"Companies use something like Hadoop to free up the space that cold data occupies in expensive data warehouses. The cold data is retained for archival and historical reasons, and it can still be pulled out for analysis with Hadoop tools such as Hive, but the advantage is that enterprises do not have to pay a high price for it."
Revelation 2: New revenue-generating products or services depend on big data
Today, Ancestry.com is launching a new service: autosomal DNA testing. Subscribers who take part will have the chance to discover potential extended-family relationships through genetic matching. Although DNA testing was not the main reason for the company's shift to Hadoop, the success of the service depends largely on it.
Hopkins said that the "desire to discover" is another driver pushing enterprises toward the open source framework, particularly for business needs that could not be met adequately before Hadoop.
"These are new applications to support new revenue, product innovation, or service innovation," Hopkins said, "and you'll see more of them in the area of marketing and customer intelligence."
One such use case is data aggregation for what is known as the "360-degree customer view," in which data from call centers, messages, and external social media is brought together in a single place to provide more meaningful customer profiles.
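The single customer view described above amounts to grouping records from several sources under one customer key. Below is a minimal Python sketch of that idea. It assumes every source already carries a shared `customer_id` field, which real call center, message, and social media feeds rarely do without an identity-matching step first. The function name, source names, and sample records are hypothetical.

```python
from collections import defaultdict


def build_customer_views(sources):
    """Group records from several sources into one profile per customer.

    `sources` maps a source name to a list of dict records, each with a
    'customer_id' key (an assumption made for this sketch).
    """
    views = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for record in records:
            customer_id = record["customer_id"]
            # Keep everything except the key itself under the source name.
            details = {k: v for k, v in record.items() if k != "customer_id"}
            views[customer_id][source_name].append(details)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {cid: dict(by_source) for cid, by_source in views.items()}


# Made-up sample data for illustration only.
sources = {
    "call_center": [{"customer_id": "c1", "topic": "billing"}],
    "messages": [
        {"customer_id": "c1", "subject": "renewal"},
        {"customer_id": "c2", "subject": "refund"},
    ],
    "social_media": [{"customer_id": "c2", "text": "great service"}],
}

views = build_customer_views(sources)
print(views["c1"])
# {'call_center': [{'topic': 'billing'}], 'messages': [{'subject': 'renewal'}]}
```

At Hadoop scale the same grouping would be expressed as a distributed job rather than an in-memory dictionary, but the shape of the result is the same: one profile per customer, with the contributions of each source kept separate.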
Not every data-dependent business initiative needs Hadoop. Jeff Kelly, chief researcher at Wikibon.org and an editor at SiliconANGLE, points out that the beauty of Hadoop is its ability to store and process large amounts of data of many different types. Whether the business needs to bring text, images, web logs, and other varieties of external data into its internal data management environment is a quick litmus test for deploying Hadoop. If the business does not have to integrate these types of data, the CIO need not bother with Hadoop.
"If your data is mostly structured and comes from within, there's really no reason to put that data into a Hadoop cluster," Kelly said. "Traditional technologies handle that well ... There's no reason to build another framework that you don't need."
Revelation 3: The business model needs to broaden
Ancestry.com's foray into autosomal DNA testing is not simply a new service; the genealogy research company is building a new arm of the business.
Ancestry.com's analysis of DNA sequences means it is entering the field of bioinformatics. The company now has a team of bioinformatics experts who are adapting academic algorithms to cope with the scale of Ancestry.com's own projects. This new arm of the business has the potential to push genealogy research to another level: connecting users with distant relatives they may never have thought of finding.
"We can get the DNA data, but we don't just use it for DNA matching," Ancestry.com's Sorensen said. "We can combine it with the 44 million family trees we have. When we can combine these two sets of data, that's really powerful."
Using data to help the enterprise does not necessarily mean entering a whole new field, as Ancestry.com did. A new application of data can also lead to a redefinition of what the business has been doing all along. This typically requires the business to dig deeper into more data, predictive analytics, or data mining, and deploying Hadoop can help.
Kelly shares that view. "If your business is looking to become more data-driven, but your infrastructure doesn't support the kind of analysis you want to do, or you can't integrate the data, those are signs that it's time to start looking at other approaches," he said.