Today, some of the most successful companies gain a strong business advantage by capturing, analyzing, and leveraging large volumes of fast-moving, highly varied "big data." This article describes three usage models that can help you implement a flexible, efficient big data infrastructure to gain a competitive advantage in your business. It also describes Intel's many innovations in chips, systems, and software that can help you deploy these and other big data solutions with optimal performance, cost, and energy efficiency.
Big Data Opportunities
Big data is often likened to a tsunami. Today, 5 billion mobile users worldwide and nearly 1 billion Facebook* and Skype* users are generating unprecedented amounts of data, and these users represent only a fraction of the global Internet population. Intel estimates that the total amount of data flowing through the cloud in 2012 exceeded 1,500 exabytes (EB), or 1,500 billion gigabytes. In stark contrast, the total of all words ever spoken in human history is estimated at only about 5 EB. Yet the real flood of big data is still to come. We are moving quickly toward the "Internet of Things," in which vast numbers of networked sensors in businesses, homes, cars, and public places generate almost immeasurable amounts of data (Figure 1). However, comparing big data to a tsunami misses the most important point.
Figure 1. Current and projected big data growth rates, showing an exponential trend through 2020. Source: Philippe Botteri, Accel Partners, February 2013.
But where a tsunami is purely destructive, big data holds enormous potential value. With the right tools and strategies, companies can learn from big data and gain a strong competitive advantage. Many public and private organizations are already doing so:
• Retailers analyze social media trends in real time with unprecedented scale and accuracy, so they can offer the hottest products to the buyers most likely to purchase them.
• Financial firms analyze credit card transactions, bill payments, and bank account activity to detect and prevent fraud in real time and reduce losses.
• Content providers analyze user choices in real time to increase revenue and customer satisfaction by recommending content and offering new products and services.
• Cities use big data to predict and mitigate traffic congestion and avoid costly road expansions.
• Utilities monitor energy use in real time to balance loads across their grids, so they can deliver power more efficiently and reliably while avoiding the construction of additional large-scale infrastructure.
New data management methods are needed to realize these advantages. Big data differs from traditional business information: although transaction data is part of it, big data is multi-structured and fast moving, and its volume may far exceed what your existing infrastructure can handle.
• Big data is measured in petabytes (PB) rather than terabytes (TB), far exceeding the volume of traditional business data.
• Big data contains structured and unstructured content in many different formats, such as e-mail, social media, video, images, blogs, sensor data, and "shadow data" (such as access logs and web search histories), among many other types.
• Big data is generated continuously, and instantaneous insights can improve outcomes in real-time business scenarios.
Although batch analysis still plays an important role, on-demand queries against live or streaming data have great potential.
Because the value of big data is embedded in large volumes of complex, fast-moving content, obtaining meaningful insights often requires extensive mining and deep analysis, with breadth and depth far beyond traditional business intelligence (BI) queries and reports. By applying machine learning, statistical modeling, graph algorithms, and other emerging techniques, it is possible to uncover important insights that would not otherwise be readily available, providing a significant competitive advantage.
Extracting business value from big data
This article discusses three usage models for extracting value from big data. These usage models apply to many different kinds of organizations, and each builds on the previous one to deliver greater value.
• Usage Model 1: Extract, transform, and load (ETL). Before data can be analyzed, it must be collected, preprocessed, and stored in a database. Traditional ETL tools and strategies, however, may not meet the demands of big data. Apache Hadoop* provides a cost-effective way to address this challenge.
• Usage Model 2: Interactive queries. Recent technology innovations greatly improve the performance and scalability of the traditional data warehouse model. With these improvements, real-time analytics can be run on larger, more diverse data sets, extending the value of existing BI investments and integrating more effectively with new big data solutions such as Hadoop.
• Usage Model 3: Predictive analytics. New analytical techniques go beyond data mining and visualization to analyze all available data, including real-time data streams from external sources, not only to determine what happened and why, but to anticipate what is likely to happen next. This final usage model builds on the first two and is designed to create a more unified, scalable analytics environment.
Usage Model 1: ETL with Apache Hadoop*
Like traditional data, big data must be extracted from external sources, transformed into structures that meet business requirements, and then loaded into a database for storage and management. Because traditional ETL solutions cannot handle multi-structured data at this scale, Hadoop software is an excellent platform for meeting this requirement (Figure 2).
The distributed storage and processing environment of a Hadoop cluster is ideal for big data ETL.
Hadoop first divides incoming data streams into fragments and then processes massive amounts of data quickly through simple, parallel operations. The software supports all types of data and can run across dozens, hundreds, or even thousands of servers to provide robust scalability. The Hadoop Distributed File System (HDFS) stores the results on low-cost storage devices (directly attached to each server in the cluster), ready to be loaded into an enterprise data warehouse or an unstructured data store.
Hadoop can prepare multi-structured data for analysis even when the data has no predefined schema. In other words, Hadoop supports a schema-on-read model, the opposite of the schema-on-write pattern used by traditional ETL processes. This allows Hadoop to load large volumes of data in a short time, ensuring the data can be used quickly for analysis, visualization, and other purposes.
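As a minimal, hedged sketch of this schema-on-read ETL pattern (not taken from the article), the example below uses Hadoop Streaming with Python: the mapper parses raw web access-log lines into structured key/value records only when they are read, and the reducer aggregates request counts per URL. The log layout, file names, and paths are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Illustrative Hadoop Streaming ETL job (etl.py is a hypothetical name).

Example invocation (paths are assumptions; the streaming JAR location varies):
  hadoop jar hadoop-streaming.jar \
      -files etl.py \
      -input /raw/access_logs -output /etl/requests_by_url \
      -mapper "etl.py map" -reducer "etl.py reduce"
"""
import sys

def run_mapper():
    # Schema on read: raw lines are parsed only now, and lines that do not
    # match the assumed space-separated log layout are simply skipped.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 7:
            continue
        url = fields[6]          # request path in a combined-log-style record
        print(f"{url}\t1")

def run_reducer():
    # Hadoop sorts mapper output by key, so all counts for a URL arrive together.
    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t")
        if url != current_url and current_url is not None:
            print(f"{current_url}\t{count}")
            count = 0
        current_url = url
        count += int(value)
    if current_url is not None:
        print(f"{current_url}\t{count}")

if __name__ == "__main__":
    run_mapper() if sys.argv[1:] == ["map"] else run_reducer()
```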
Infrastructure considerations
Dual-socket servers based on the Intel® Xeon® processor E5 family provide the best balance of capability and cost for most Hadoop deployments. These servers provide more cores, cache, and memory capacity than previous-generation servers. In addition, they increase I/O bandwidth by up to two times while reducing I/O latency by up to 30 percent.1 These resources provide the high throughput needed to perform more data-intensive tasks in parallel.
Lightweight, I/O-bound workloads, such as simple data sorting operations, may not require the full processing power of the Intel® Xeon® processor E5 family. Such workloads can run economically on high-density, low-power servers based on the Intel® Xeon® processor E3 family, or on a system-on-chip (SoC) based on the Intel® Atom™ processor. With power consumption as low as 6 watts, these 64-bit x86 SoCs bring unprecedented density and energy efficiency to server-class processors.
Offloading ETL with Hadoop
Figure 2. Apache Hadoop* can be used to extract, process, and export large volumes of diverse data at scale.
Every server in the Hadoop cluster requires a large amount of memory and a relatively large number of storage drives to meet the needs of data-intensive Hadoop workloads. Sufficient memory is needed to provide the throughput required for many concurrent operations. Multiple storage drives (two or more per core) provide the aggregate I/O throughput required to avoid storage bottlenecks. Equipping each server node with at least one Intel® Solid-State Drive (Intel® SSD) significantly improves storage performance.
By processing data close to where it is stored, Hadoop greatly reduces the need to move large volumes of data. However, fast data import and export still requires sufficient network bandwidth. In most cases, each server rack should use a Gigabit Ethernet (GbE) switch, and each rack-level switch should connect to a cluster-level GbE switch. As data volumes, workloads, and clusters grow, you may need to interconnect multiple cluster-level switches or even uplink to a higher-level switching infrastructure. A rough sizing sketch follows.
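The back-of-envelope calculation below illustrates why data-local processing and adequate uplinks matter; every number in it (rack density, drive count, per-drive throughput, uplink speed) is an illustrative assumption, not a figure from this article.

```python
# Back-of-envelope sizing sketch (illustrative assumptions, not Intel guidance):
# compare a rack's aggregate disk throughput with its network uplink to spot bottlenecks.

servers_per_rack = 16          # assumed rack density
drives_per_server = 12         # assumed spindle count per node
mb_per_s_per_drive = 100       # assumed sequential throughput per SATA drive

aggregate_disk_mb_s = servers_per_rack * drives_per_server * mb_per_s_per_drive

uplink_gbit_s = 2 * 10         # e.g., two bonded 10 GbE uplinks (assumption)
uplink_mb_s = uplink_gbit_s * 1000 / 8

print(f"Aggregate disk throughput per rack: {aggregate_disk_mb_s:,.0f} MB/s")
print(f"Uplink capacity per rack:           {uplink_mb_s:,.0f} MB/s")
if aggregate_disk_mb_s > uplink_mb_s:
    print("Cross-rack traffic can bottleneck on the uplink; keep processing data-local.")
```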
For more details, see the Intel white paper published in the Intel Developer Zone on the software.intel.com website: Extract, Transform, and Load (ETL) Big Data with Apache Hadoop*.
Usage Model 2: Interactive queries
The data warehouse provides a central repository for business data and for BI functions such as online analytical processing (OLAP) and data visualization. New and historical data are collected from different sources and used for interactive queries and other types of analysis.
Traditional data warehouse capabilities and resources may not meet the demands of big data, so vendors have improved performance and scalability to address the issue. For example:
• In-memory databases eliminate the latency and overhead of moving data back and forth between servers and storage systems. This approach shortens data access times from milliseconds to nanoseconds, removing a bottleneck that has constrained database performance for decades. Oracle TimesTen*, SAP HANA*, Microsoft imub*, IBM solidDB*, VMware vFabric SQLFire*, and many open source solutions use this strategy to speed the processing and management of incoming data streams.
• Data warehouse appliances consolidate servers, storage, operating systems, database management systems, and supporting components into pre-built, highly optimized turnkey systems that simplify integration and significantly improve performance. Many data warehouse appliances support in-memory databases, and some include proprietary data-filtering techniques to accelerate data flow. Most of these appliances are either large symmetric multiprocessing (SMP) systems or massively parallel processing (MPP) scale-out blade server systems. Turnkey examples include IBM Netezza*, the HP Enterprise Data Warehouse Appliance*, Oracle Exadata*, the Teradata Data Warehouse Appliance*, Dell Parallel Data Warehouse*, and the Pivotal (formerly EMC Greenplum) Data Computing Appliance*, among others.
Companies looking to deploy a cost-effective yet powerful big data platform should consider combining a massively scalable SQL data warehouse with a Hadoop cluster. The cluster can quickly collect and process large, diverse, fast-moving data streams, and the appropriate data sets can then be loaded into the data warehouse for specific SQL queries, analysis, and reporting. Users can also employ related software, such as Apache HBase*, Spark*, Shark*, SAP HANA*, Apache Cassandra*, MongoDB*, Tao*, Neo4j*, Apache Drill*, or Impala*, to query the multi-structured data sets that reside in the Hadoop cluster. This hybrid strategy provides faster, deeper insight than either solution can deliver on its own.
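For a concrete, hedged illustration of querying multi-structured data that resides in the Hadoop cluster, the sketch below uses Spark SQL, one of the engines named above; the HDFS path, JSON layout, and column names are illustrative assumptions, and a reasonably recent Spark version is assumed.

```python
# Hedged sketch: interactive SQL over semi-structured data already stored in HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hybrid-warehouse-query").getOrCreate()

# Register JSON click events stored in the Hadoop cluster (assumed path and schema).
clicks = spark.read.json("hdfs:///data/clickstream/2013/*.json")
clicks.createOrReplaceTempView("clicks")

# Ad hoc query: top pages by distinct visitors (illustrative column names).
top_pages = spark.sql("""
    SELECT page_url, COUNT(DISTINCT user_id) AS visitors
    FROM clicks
    GROUP BY page_url
    ORDER BY visitors DESC
    LIMIT 20
""")
top_pages.show()

# The small result set could then be handed off to the SQL data warehouse or a BI tool.
spark.stop()
```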
Whether you are currently using a traditional data warehouse or a more modern system designed for higher volumes and faster data flows, the process is similar: collect data from external sources, then clean and format it to conform to the warehouse data model. This can be done before the data is loaded into the warehouse, or on the fly as streaming data sources enter the warehouse.
Once the data has been loaded, analysis can begin. Modern data warehouses support ad hoc queries, allowing on-demand access to any meaningful combination of values. By contrast, more traditional data warehouses can only generate predefined reports based on known relationships.
Infrastructure considerations
Whether you are integrating your own SQL data warehouse solution or evaluating appliances, the following considerations can help you optimize scalability, reliability, and total cost of ownership (TCO).
The complex analyses performed in a SQL data warehouse typically do not scale well across large numbers of cluster nodes, so each data warehouse server must itself deliver high performance and scalability. Four-socket, eight-socket, and larger servers based on the Intel® Xeon® processor E7 family provide excellent scalability for demanding analytic workloads. For example, enterprise data warehouse appliances, including large SMP and blade-based MPP systems, use these processors to maximize overall performance and throughput. Each Intel® Xeon® E7 processor provides up to 10 cores, 20 threads, and 30 MB of L3 cache. These processors also support 32 GB DIMMs and up to 4 TB of memory in an eight-socket server, enabling very large in-memory databases.
Because the data warehouse usually runs on a single server, system uptime becomes especially important. The Intel® Xeon® processor E7 family includes advanced reliability, availability, and serviceability (RAS) features built into the silicon to support mission-critical reliability and protect data. All key interconnects, data storage, data paths, and subsystems integrate active and passive error-monitoring capabilities.
Self-healing capabilities proactively repair known errors, automatically correct errors based on configurable error thresholds, and reduce the likelihood of future errors. Intel works extensively with hardware, operating system, virtual machine monitor (VMM), and application vendors to help ensure tight integration of the hardware with the entire software stack.
As data volumes grow rapidly, a variety of new strategies can help enterprises extend data storage capacity more cost-effectively, both inside and outside the data warehouse. The following strategies can be combined to meet diverse needs at lower overall cost.
• Scale-out storage architectures provide affordable, very large capacity and support federation across private and hybrid clouds. These solutions can be scaled dynamically, so they can be deployed faster than traditional storage systems, and they can also help improve data management efficiency.
• Low-latency, proximity storage is ideal for data-intensive applications, which run best when co-located with their data. Examples include business processing, decision-support analytics, and high-performance computing workloads, as well as collaboration processes, applications, and web infrastructure running on virtualized servers.
• Centralized storage aggregated into logical pools in a storage area network (SAN) supports high-performance business databases. When optimized for low-cost capacity rather than high performance, centralized solutions provide efficient storage for backup, archive, and object storage requirements.
Greater storage efficiency helps control costs as data grows rapidly. Many storage vendors build the Intel® Xeon® processor into their storage solutions to support advanced data management capabilities that significantly increase efficiency. According to IDC's June 2013 Worldwide Storage and Virtualization x86 Environment 2013-2017 Forecast, around 80 percent of enterprise, cloud, and high-performance computing (HPC) storage solutions run on Intel architecture. These platforms support data-efficiency technologies such as:
• Data deduplication to conserve capacity.
• Data compression to increase throughput.
• Thin provisioning to increase utilization by supplying storage on demand as projects require it, rather than provisioning excess capacity up front.
• Intelligent tiering to optimize performance and cost by automatically migrating "hot" data to faster storage devices and "cold" data to higher-capacity, lower-cost drives. With this approach, a handful of high-speed drives, such as the Intel® SSD 710 Series (SATA), can deliver higher performance at relatively low cost. A simple sketch of such a tiering policy follows this list.
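The sketch below illustrates the hot/cold classification idea behind intelligent tiering; it is not from the article, and the paths, tier names, and 30-day threshold are assumptions. Real storage platforms implement tiering in their controllers or management software, so treat this only as a conceptual dry run.

```python
# Hedged sketch of an intelligent-tiering policy: classify files as "hot" or "cold"
# by last-access age and report where each should live (dry run only).
import os
import time

HOT_TIER = "ssd"        # e.g., a small pool of high-speed drives (assumed name)
COLD_TIER = "capacity"  # e.g., high-capacity, low-cost drives (assumed name)
COLD_AFTER_DAYS = 30    # assumed threshold separating hot from cold data

def classify(path: str) -> str:
    """Return the tier a file should live on, based on its last access time."""
    age_days = (time.time() - os.path.getatime(path)) / 86400
    return HOT_TIER if age_days < COLD_AFTER_DAYS else COLD_TIER

def plan_migrations(root: str) -> None:
    """Walk a directory tree and print the target tier for every file."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            print(f"{classify(path):>8}  {path}")

if __name__ == "__main__":
    plan_migrations("/data/warehouse")   # assumed data directory
```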
Loading data sets into the data warehouse quickly and efficiently helps analytic applications deliver business insights in a timely manner. The solution has two parts: efficient ETL processing, and a fast, efficient network that extends the growing value of analytics across the enterprise. Intel® Ethernet products integrate a number of technologies that address these requirements.
• Near-native performance in virtualized environments. Virtualization increases infrastructure flexibility and utilization, which is important for controlling costs as big data solutions grow. Intel® Virtualization Technology for Connectivity (Intel® VT-c) helps reduce I/O bottlenecks and improve overall server performance in virtualized environments. Its Virtual Machine Device Queues (VMDq) technology offloads traffic classification and routing to dedicated silicon on the network adapter. With PCI-SIG Single Root I/O Virtualization (SR-IOV), a single Intel® Ethernet server adapter port can support multiple independent connections to virtual machines.
• 10 GbE unified networking. Consolidating data center traffic onto a single high-bandwidth network helps reduce cost and complexity while delivering the performance and scalability needed to meet fast-growing requirements. Intel® Ethernet converged network adapters support Fibre Channel over Ethernet (FCoE) and iSCSI to simplify implementation and reduce costs when consolidating local area network (LAN) and storage area network (SAN) traffic.
• Simpler, faster connections to iSCSI SANs. Intel® Ethernet converged network adapters and Intel® Ethernet server adapters provide hardware-based iSCSI acceleration for improved performance. They also take full advantage of the native iSCSI initiators integrated into leading operating systems to simplify iSCSI deployment and configuration in both native and virtualized networks.
Usage Model 3: Predictive analytics on a Hadoop-based platform
Predictive analytics derives greater value from data by capturing relationships in past events and using those relationships to predict future outcomes (Figure 3). Using predictive analytics, retailers can present more appealing offers to individual customers, medical institutions can choose the most appropriate treatments, and financial services firms can increase return on investment and reduce risk. While predictive analytics can inform strategic business planning, the greatest value may come from tactical guidance during decision making and operational guidance as decisions are executed. Today, central teams of data scientists, database administrators, and software developers work together to provide custom solutions for the most critical business operations. As companies integrate this capability more broadly into their businesses, they must deliver it through optimized decision-making tools and a wider range of optimized and automated systems.
Predictive analytics falls into two main categories: regression analysis and machine learning.
• Regression analysis techniques predict the most likely outcomes by comparing current data against historical models (a minimal sketch follows this list).
• Machine learning applies artificial intelligence techniques that require little or no human intervention. By analyzing representative data sets, the system extracts relationships, generalizes them, and makes predictions on new data. Optical character recognition (OCR) is a classic example, but new applications are exploring the value of big data in far broader contexts.
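The sketch below, which is not from the article, shows the regression approach in miniature: fit a model on historical observations, then predict the most likely outcome for new data. The synthetic numbers and the use of scikit-learn are assumptions made purely for illustration.

```python
# Minimal regression sketch: learn from historical data, predict a future outcome.
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical observations: [advertising spend, average discount] -> weekly sales
# (synthetic, illustrative numbers only).
X_history = np.array([[10, 0.00], [12, 0.05], [15, 0.10], [18, 0.10], [20, 0.15]])
y_history = np.array([110, 125, 150, 165, 185])

model = LinearRegression().fit(X_history, y_history)

# Predict the most likely outcome for a planned campaign (new data).
X_new = np.array([[17, 0.12]])
print(f"Forecast weekly sales: {model.predict(X_new)[0]:.0f} units")
```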
Intel's IT department began pioneering work in big data analytics in 2010 and recommends using the two models discussed in this article to create a hybrid analytics infrastructure (Figure 4):
1. Deploy a data warehouse appliance based on an MPP architecture to quickly perform complex predictive analyses on large data sets. Many vendors integrate the Intel® Xeon® processor E7 family into blade-based appliances to deliver the required performance at relatively low cost. These systems work with existing enterprise BI solutions and provide integrated support for advanced analytics tools and applications, such as R, an open source statistical computing language popular among data scientists.
2. Add a Hadoop cluster alongside the data warehouse to enable fast, scalable, and affordable ETL. Hadoop also runs other data processing and analysis functions that work well in a distributed processing environment, and the Hadoop ecosystem provides an ever-expanding array of tools and components to meet these needs.
Infrastructure considerations
To provide maximum flexibility, the data warehouse and Hadoop cluster should be connected through high-speed data loaders over GbE or another high-bandwidth network technology. This lets you quickly migrate data between the two environments, so you can apply the most efficient analysis techniques to specific data types, workloads, and business requirements. A hedged sketch of such a data-movement job appears below.
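As one possible, hedged illustration of moving a curated data set between the SQL data warehouse and the Hadoop cluster, the sketch below uses Spark's generic JDBC reader as the data loader. The connection URL, table, credentials, and HDFS path are assumptions; production deployments commonly use vendor-specific loaders or dedicated transfer tools instead.

```python
# Hedged sketch of a bulk data-movement job from the warehouse into Hadoop.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warehouse-to-hadoop").getOrCreate()

# Pull a curated table out of the SQL data warehouse (assumed connection details).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://warehouse.example.com:5432/edw")
          .option("dbtable", "sales.orders")
          .option("user", "etl_user")
          .option("password", "***")
          .load())

# Land it in the Hadoop cluster as columnar files for large-scale analysis.
orders.write.mode("overwrite").parquet("hdfs:///analytics/orders")

spark.stop()
```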
Business value of predictive analytics
Figure 3. According to Gartner, as the focus of analysis shifts from hindsight to foresight, both its difficulty and its business value increase.
Intel IT's hybrid platform for big data analytics
Figure 4. The Intel IT department's big data platform provides a flexible foundation for analytics (including predictive analytics) by using high-speed data loaders to connect a massively parallel processing (MPP) data warehouse appliance to an industry-standard server cluster running Hadoop software.
Creating a better foundation for big data analytics
As big data technologies and solutions evolve, Intel products and technologies help accelerate innovation throughout the ecosystem. By working with hardware, software, and service providers to ensure broad support, Intel helps organizations integrate these new capabilities more easily and economically into architectures that are standards-based, interoperable, manageable, and secure.
Processor performance Improvement
The progress of Intel processors delivers increased performance and value for the next generation of large data solutions. Continuous improvements in per-threading performance, parallel execution, I/O throughput, memory capacity, and energy efficiency help companies leverage the economy and mainstream computing systems to meet fast-growing needs.
New tools and optimized software
Intel works both independently and with leading software vendors and open source communities to provide optimized software stacks and services for big data analytics. These efforts help deliver advanced new functionality across the big data ecosystem. They also ensure optimal performance for big data applications running on Intel architecture.
Intel also provides software products that address some of the most critical requirements of the big data ecosystem.
• Performance benchmarks for Hadoop clusters and applications. The Intel® HiBench suite includes 10 benchmarks that IT organizations and software vendors use to measure the performance of specific common tasks, such as sorting and word counting, as well as more complex, realistic functions such as web search, machine learning, and data analytics. Intel engineers use the HiBench suite to help drive upstream Hadoop optimizations for Intel® architecture and Java optimizations for Hadoop.
• Enterprise-ready Hadoop distribution. The Intel® Distribution for Apache Hadoop* software packages the latest Intel architecture optimizations to simplify deployment and support enterprise-class security and manageability requirements. Many optimizations appear first in the Intel® distribution and are then submitted to the open source Apache Hadoop project.
• A fast, massively scalable distributed file system. Intel's optimized distribution of the Lustre* distributed file system supports large-scale cluster computing. The software scales to support tens of thousands of client systems and tens of petabytes of storage, and delivers more than 1 TB per second of aggregate I/O throughput.
Implementing big data analytics
Intel is integrating predictive big data analytics into its existing business intelligence (BI) environment to help improve business efficiency and performance. The Intel IT department is conducting a number of big data proof-of-concept deployments in conjunction with business units. Current focus areas include malware detection, chip design validation, market intelligence, and recommendation systems.
Advanced power management to reduce operating costs
For most organizations, storing and analyzing big data requires substantial infrastructure, which makes managing energy consumption essential to controlling total cost. The energy-efficiency features of Intel® Xeon® processors and Intel® Atom™ processor-based SoCs can help address this challenge. Whichever you choose, software runs on both the Intel® Xeon® and Intel® Atom™ processor families without recompiling, helping you avoid the complexity of managing multiple architectures and code bases.
Intel offers a variety of tools to help you manage power consumption more efficiently.
• Efficient data center power management. Intel® Data Center Manager (Intel® DCM) can be deployed into existing management consoles and takes full advantage of the measurement technologies built into Intel processors to provide advanced power and thermal management, from individual servers and blades up to racks, rows of racks, and entire data centers.
• Integrated energy management in Linux* environments. The Running Average Power Limit (RAPL) Linux kernel driver developed by Intel provides strong support for power monitoring, management, and capping on the Intel® Xeon® processor E5 family. A small monitoring sketch based on the Linux powercap interface follows.
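For illustration only (not from the article), the sketch below reads the package energy counter that the RAPL driver exposes through the Linux powercap sysfs interface and converts it to average power. The domain path assumes a recent kernel with the intel_rapl driver loaded, exact domain names vary by platform, and reading the counter may require root privileges.

```python
# Hedged sketch: sample a RAPL package energy counter via the powercap sysfs
# interface and report average power over a short interval.
import time

RAPL_DOMAIN = "/sys/class/powercap/intel-rapl:0"   # package 0 (assumed present)

def read_energy_uj() -> int:
    """Return the cumulative energy counter for the domain, in microjoules."""
    with open(f"{RAPL_DOMAIN}/energy_uj") as f:
        return int(f.read())

def average_power_watts(interval_s: float = 1.0) -> float:
    """Sample the counter twice and convert the delta to average watts.
    (Counter wraparound is ignored for simplicity.)"""
    start = read_energy_uj()
    time.sleep(interval_s)
    end = read_energy_uj()
    return (end - start) / 1e6 / interval_s   # microjoules -> joules -> watts

if __name__ == "__main__":
    with open(f"{RAPL_DOMAIN}/name") as f:
        print(f"Domain: {f.read().strip()}")
    print(f"Average package power: {average_power_watts():.1f} W")
```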
Conclusion
Wherever your business data comes from, the ability to capture, store, and analyze data from all sources can deliver a strong competitive advantage, but the big data tsunami also brings complex new infrastructure challenges. The three usage models described in this article can guide enterprises in transforming big data into business value.
• Deploy Hadoop to ingest big data and prepare it for analysis.
• Connect your Hadoop cluster to a fast, scalable data warehouse to enable interactive queries across mixed data.
• Add predictive analytics and machine learning applications to make accurate predictions and take immediate action.
Intel's many innovations in silicon and software provide optimizations and targeted functionality to help you implement these and other big data usage models more easily and efficiently.