Mining business value from big data

Source: Internet
Author: User
Keywords: big data, business value

In both the public and private sectors, organizations are collecting and analyzing "big data" to forecast market trends more accurately and make smarter decisions to ensure success. They gather large amounts of data from a variety of sources, including weather forecasts, economic reports, forums, news sites, social networks, wikis, tweets, and blogs, and then analyze it further to understand their customers, operations, and competitors from a new perspective. Some companies even use predictive analytics to identify the opportunities and risks they may encounter over the coming months, the next year, or even the next five years.

However, big data brings challenges as well as opportunities. Traditional business intelligence (BI) infrastructures cannot handle today's large, diverse, and rapidly growing data streams. Apache Hadoop*, running on Intel® architecture, offers an affordable, powerful, and scalable infrastructure for importing, storing, and analyzing big data. This solution provides a solid foundation for realizing the value you are after and can scale virtually without limit to meet growth requirements.

Breaking through the limitations of traditional ETL

Today's business intelligence systems use a variety of sophisticated technologies to turn raw data into useful business information, such as online analytical processing (OLAP), data mining, process mining, complex event processing, enterprise performance management, predictive analytics, and prescriptive analytics. Before big data can be analyzed, however, it must be extracted from external sources, transformed to meet operational requirements, and then loaded into the appropriate analytics environment, a process known as extract, transform, and load (ETL).
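To make the three ETL steps concrete, the following is a minimal sketch in Python of a single extract-transform-load pass over a raw log file. The file name, field layout, and target table are hypothetical placeholders for whatever sources and analytics environment an organization actually uses.

```python
import csv
import sqlite3
from datetime import datetime

RAW_FILE = "web_clicks.log"   # hypothetical raw source exported from a web server
TARGET_DB = "analytics.db"    # stands in for the analytics environment (warehouse, data mart, ...)

def extract(path):
    """Extract: read raw, tab-separated click records from an external source."""
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield row

def transform(rows):
    """Transform: parse timestamps, drop malformed records, normalize fields."""
    for row in rows:
        try:
            ts, user_id, url = row
            yield (datetime.fromisoformat(ts).date().isoformat(),
                   user_id.strip().lower(),
                   url.strip())
        except ValueError:
            continue  # skip records that do not match the expected layout

def load(records, db_path):
    """Load: insert the cleaned records into the analytics store."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS clicks (day TEXT, user_id TEXT, url TEXT)")
    con.executemany("INSERT INTO clicks VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract(RAW_FILE)), TARGET_DB)
```

The sketch processes one file on one machine; the sections that follow describe what happens when the inbound data outgrows this single-node model.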

Big data typically overwhelms traditional ETL infrastructure. The inbound data streams are too large and growing too fast to be processed within an acceptable time frame. The variety of the data poses a further challenge: big data arrives through many channels, such as text documents, pictures, audio, video, application logs, and sensors, and these unstructured data types do not fit well into traditional relational databases.

Apache Hadoop can provide a solution to the ETL challenge. The underlying technology was originally developed by Google for its popular search engine; Apache Hadoop is an open-source implementation that runs on scalable clusters of industry-standard servers with commodity storage. With distributed storage and massively parallel processing, an Apache Hadoop cluster scales to handle many petabytes of aggregated data.
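As an illustration of that parallel processing model, the sketch below shows a map step and a reduce step written in Python in the style of Hadoop Streaming, where each step reads and writes plain text key/value lines. The log layout (timestamp, user, URL) is an assumption carried over from the earlier ETL sketch, and a small local driver is included so the logic can be checked without a cluster.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map: emit (url, 1) for every raw click record; runs in parallel across input splits."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:                 # assumed layout: timestamp, user_id, url
            yield fields[2], 1

def reducer(pairs):
    """Reduce: sum the counts per url; Hadoop delivers pairs grouped and sorted by key."""
    for url, group in groupby(pairs, key=lambda kv: kv[0]):
        yield url, sum(count for _, count in group)

if __name__ == "__main__":
    # Local check on a few sample records. On a cluster, mapper and reducer would run as
    # separate scripts reading sys.stdin and printing tab-separated key/value lines.
    sample = [
        "2013-06-01T10:00:00\talice\t/home",
        "2013-06-01T10:01:00\tbob\t/pricing",
        "2013-06-01T10:02:00\talice\t/home",
    ]
    mapped = sorted(mapper(sample))          # stands in for Hadoop's shuffle-and-sort phase
    for url, total in reducer(mapped):
        print(f"{url}\t{total}", file=sys.stdout)
```

Because each mapper works on its own slice of the input and each reducer on its own set of keys, adding servers to the cluster adds processing capacity almost linearly.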

Planning the ETL infrastructure for greater efficiency

ETL workloads vary constantly, so a well-designed Apache Hadoop cluster is critical to achieving performance goals in the most economical way. Intel architecture offers a variety of options to help you implement the most appropriate solution.

• Economical performance for mainstream ETL workloads. From a cost-benefit perspective, two-socket servers based on the Intel® Xeon® processor E5 family are the best choice for most Apache Hadoop workloads. These servers provide higher performance and efficiency in distributed computing environments than large multiprocessor platforms, and they offer better load balancing and parallel throughput than smaller single-socket servers.

• A better cost model for lightweight ETL workloads. Some ETL workloads (for example, simple data classification) do not take full advantage of an Intel Xeon processor's capabilities. You can generally run such lightweight workloads more efficiently on micro-servers based on the latest Intel® Atom™ processors. These server-class processors consume as little as 6 watts, bringing new levels of data center efficiency to less demanding applications.

Both Intel Xeon processors and Intel Atom processors support ECC memory, which automatically detects and corrects memory errors. Memory errors are a leading cause of data corruption and server downtime in the data center, and a well-designed Apache Hadoop cluster carries a large amount of memory (typically several gigabytes or more per server), which increases the exposure to such errors, so ECC memory is an essential feature.

Offloading ETL with Hadoop

With Apache Hadoop*, organizations can import, process, and export many kinds of data at large scale, submitting the transformation logic to the cluster as jobs like the one sketched below.
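As a rough sketch of how such an offloaded ETL step might be submitted, the Python snippet below wraps a Hadoop Streaming invocation in subprocess, reusing the mapper and reducer scripts from the earlier sketch. The streaming jar location and the HDFS paths are assumptions that depend on the particular installation and distribution.

```python
import subprocess

# Assumed locations; both depend on the Hadoop installation and the data layout in HDFS.
STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"
INPUT_PATH = "/data/raw/clicks"          # raw records already imported into HDFS
OUTPUT_PATH = "/data/etl/clicks_by_url"  # must not exist before the job runs

def submit_etl_job():
    """Submit the mapper/reducer scripts from the previous sketch as a Hadoop Streaming job."""
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-files", "mapper.py,reducer.py",   # ship the scripts to the worker nodes
        "-input", INPUT_PATH,
        "-output", OUTPUT_PATH,
        "-mapper", "python mapper.py",
        "-reducer", "python reducer.py",
    ]
    subprocess.run(cmd, check=True)         # raises if the job fails

if __name__ == "__main__":
    submit_etl_job()
```

The cluster handles the heavy transformation work, and only the cleaned, aggregated results need to be loaded into the downstream BI environment.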

In an Apache Hadoop cluster, storage performance is as important as processing power. Standard mechanical hard drives can meet the throughput demands of many workloads, provided enough of them are deployed. Intel® Solid-State Drives (Intel® SSDs) deliver higher throughput with lower latency; Intel testing shows that replacing mechanical hard drives with Intel SSDs can increase cluster performance by as much as 80 percent.

In addition, network performance is critical for importing, processing, and exporting large data sets efficiently. Intel offers affordable, high-bandwidth Gigabit Ethernet (GbE) server adapters that make it easy to scale as the cluster grows. As the cluster continues to expand, you can connect multiple GbE switches and uplink them to a faster network infrastructure.

Reduce your operating costs

According to CIO surveys conducted by Gartner in 2007, 2010, and 2013, more than 70 percent of CIOs consider growing power and cooling requirements to be the biggest data center challenge they face. The energy efficiency of Intel Xeon processors, Intel Atom processors, and Intel SSDs helps reduce data center load and budgets. In addition, Intel provides an advanced power and thermal management application, Intel® Data Center Manager (Intel® DCM). Intel DCM uses instrumentation built into Intel® processors, and you can use it to monitor and manage power consumption at every level, from individual servers to the entire facility, minimizing energy use without impacting performance.

Reduce your risk

Open-source Apache Hadoop software can be obtained free of charge from the Apache Software Foundation. In addition, enhanced software distributions are available from value-added vendors such as Intel. These enhanced distributions provide additional functionality, services, and support packages that help simplify implementation and reduce risk.

The Intel® distribution of Apache Hadoop software is an open-source product that includes Apache Hadoop and other components, along with enhancements and fixes provided by Intel. The software is highly optimized for the latest Intel Xeon processors, Intel SSD storage devices, and Intel® 10GbE network adapters. Testing shows that this combined platform delivers up to 30 times the performance of generic Apache Hadoop software running on less optimized hardware.

The Intel distribution provides integrated support for key enterprise requirements, including:

• Data confidentiality. Hardware-accelerated encryption and granular access controls let you integrate sensitive data types without compromising security, compliance, or performance.

• Scalability and availability. Multi-site scalability and adaptive data replication simplify integration and ensure that you always have access to critical data and insights.

• Advanced analytics. Intel® Graph Builder and integrated support for R (an open-source environment for statistical analysis) help data analysts and developers extract greater value from big data.

• Services, support, and training. Intel provides extensive online training resources as well as professional support for planning, implementing, and maintaining Apache Hadoop deployments based on the Intel distribution.

Conclusion

Big data brings new business opportunities and challenges to every industry. The data integration challenge (bringing social media and other loosely structured data into a traditional business intelligence environment) is among the most pressing problems facing CIOs and IT managers. Apache Hadoop provides an economical and scalable platform for importing and analyzing big data. Offloading traditional ETL processing to Hadoop can reduce analysis times by hours or even days.

Running a Hadoop cluster efficiently requires choosing the best server, storage, network, and software infrastructure. Intel provides hardware and software platform components to help you design and deploy efficient, high-performance Hadoop clusters optimized for big data ETL. In addition, Intel offers a wealth of reference architectures, training, professional services, and technical support to help you accelerate deployment and reduce risk.
