Hadoop: The Darling of Big Data Processing

Source: Internet
Author: User
Keywords: big data

Key drivers of big data growth include the dramatic rise of mobile devices and social networks, the digitization of paper documents, and the growth of scientific data. "Their archive files are growing at high speed because they don't know what needs to be saved and what doesn't," said Rob Peglar, Americas chief technology officer at EMC's Isilon division.

Companies across industries are investing heavily in big data analysis. One is the healthcare industry, which analyzes patient data for diagnosis and treatment. In financial services, Peglar said, qualitative analysis of data has also drawn attention, particularly the analysis of messy stock-trading data. Big data analysis is also growing in manufacturing and design, but the fastest-growing market is social media and e-commerce, where companies analyze user behavior data to extract market intelligence.

Hadoop: The Darling of Big Data Processing

Among the many big data analysis platforms, the best known is Apache Hadoop, an open-source software framework for distributed computing. Hadoop derives from Google's MapReduce programming model and Google's proprietary Google File System (GFS). Participants in the Hadoop project include Yahoo, LinkedIn, Facebook, Twitter, the University of California, Berkeley, and Last.fm.

The growth in demand for big data analysis is driven largely by messy, unstructured data, which is exactly what Hadoop is good at. "This is a supplement to the traditional structured database," Peglar said. "Structured data is growing, but messy data is growing much faster."
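To make Hadoop's programming model concrete, here is a minimal sketch of a MapReduce job, the classic word count over unstructured text files. It uses the standard Hadoop MapReduce API; the input and output paths passed on the command line are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: the "hello world" of MapReduce.
public class WordCount {

    // Map phase: emit (word, 1) for every word in every input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // local pre-aggregation on each mapper
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The point of the model is that the framework, not the programmer, handles splitting the input, scheduling map and reduce tasks across the cluster, and recovering from node failures.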

Unless a company buys a packaged big data appliance from a vendor such as Isilon, Teradata, or Oracle, it will most likely deploy a Hadoop-style analysis platform on commodity hardware. "Hadoop makes some assumptions about the infrastructure platform it runs on," said Rachel Chalmers, a senior analyst at the 451 Group.

"Because Hadoop is based on Google's MapReduce, it is designed to run on homogeneous commodity infrastructure similar to Google's. It also assumes that storage sits on hard disks local to the CPUs of the servers themselves," Chalmers explained.
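As an illustration of those assumptions, the sketch below shows two HDFS settings that exist precisely because the hardware is expected to be cheap and failure-prone: dfs.replication keeps multiple copies of every block so a dead disk or server loses no data, and a large dfs.blocksize favors the long sequential reads MapReduce performs against server-local disks. The values shown are the usual defaults, used here only as an example.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: HDFS settings that reflect Hadoop's commodity-hardware assumptions.
public class CommodityHdfsConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Every block lives on three separate nodes, so a single disk
        // or server failure costs no data (3 is the usual default).
        conf.setInt("dfs.replication", 3);
        // 128 MB blocks favor large sequential I/O over seeks, and let the
        // scheduler run each map task on a node that holds the block locally.
        conf.setLong("dfs.blocksize", 128L << 20);
        return conf;
    }

    public static void main(String[] args) {
        Configuration conf = build();
        System.out.println("replication = " + conf.get("dfs.replication"));
        System.out.println("block size  = " + conf.get("dfs.blocksize"));
    }
}
```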

No Common Platform for Big Data Processing

Which type of analysis system to deploy depends on the customer's specific needs. Jim Dietz, director of product marketing at data analysis leader Teradata, says some customers need processing speed more than sheer capacity; they buy ultra-high-performance appliances. Customers who want to store and analyze vast stores of user behavior data need solutions that combine broad data storage with high processing performance.

For professional storage vendors, "We cooperate more than we compete with big data vendors, because the product architectures are different." Hard disk manufacturers supply drives to professional storage manufacturers; those manufacturers assemble many drives into systems for IT storage vendors; and the IT storage vendor's task is to integrate these basic storage media into an application system that handles large-scale data storage, exchange, analysis, and protection at the upper layers.

In addition, the solution a company deploys must be flexible enough to accommodate strong future growth.

Big Data Means High Density

Peglar said the primary factor data center managers must weigh when deploying big data (especially messy, unstructured data) is storage. The size and power requirements of these storage arrays depend on how efficiently they use energy and available storage space.

Planning a Big Data Analysis Implementation

"For example, IT staff will need to consult with business units to see if these business units have the need to deploy large data applications, such as Hadoop, as a result of which no department is interested." "said Colt. "Without specific business requirements or applications, it becomes just a simple technology." ”

The companies that have launched big data analysis tend to be in financial services and healthcare, where large volumes of data can be distilled into trends and best practices, says Colt.

TheInfoPro conducts a popular technology index survey every year, asking hundreds of IT professionals about their technology planning. The company's latest survey ran from August 2011 to April of this year.

Unsurprisingly, respondents chose server virtualization as the main driver of enterprise capacity growth, and Fibre Channel SANs remain the primary destination for enterprise data storage: 67% of respondents said they connect 80% to 100% of their production servers to Fibre Channel SANs.

Drive Capacity Growth

Last year, however, the Fibre Channel hard drive market took a hit as SAS, SSD, and SATA drives gradually came to dominate the enterprise.

When asked what new disk storage devices they purchased in 2011, 48% of respondents said they had bought Fibre Channel drives, 31% SATA drives, 19% SAS drives, and 2% SSDs. When asked about purchases this year, however, 41% said they had bought SAS drives, 35% SSDs, 23% SATA drives, and only 11% Fibre Channel drives.

Drive types purchased by enterprises in 2011

The survey found that the share of companies planning to deploy SSD technology jumped from 7% last year to 37% this year.

"This is a very big leap, after all, these companies have not planned to use SSD solid-state drives before." This percentage fell from 42% to 23%. "Kurt said.

Hybrid Arrays and SSDs

Most enterprise data centers use hybrid arrays of SSDs and spinning disks, while newer deployments favor all-flash arrays and server-side SSDs. EMC is the top supplier of hybrid solid-state arrays, well ahead of competitors, followed by NetApp, Hitachi Data Systems, IBM, HP, Oracle, and Dell.

Which vendors' hybrid solid-state arrays are in use?

When asked which suppliers provide server-side solid-state storage, Fusion-io topped the list, followed by IBM, HP, Oracle, Dell, and Seagate. EMC is ninth on the list, behind NetApp.

The survey showed the top all-solid-state suppliers to be Fusion-io, Pure Storage, Nimbus Data, Nimble Storage, GridIron Systems, and Kove. In addition to producing all-DRAM equipment, Kove also sells flash cards and all-flash arrays or appliances.


When asked whether they would implement an all-flash array, 7% of respondents said they were already using the technology, while 86% said they had no plans to implement one. Another 4% said they planned to buy an all-flash array within the next six to 18 months, and 2% said they would do so after 18 months.

Matt Wotters, an infrastructure architect at Mitsubishi Power Systems Americas, did not take part in TheInfoPro's survey. But he said his company installed an all-flash array from Nimbus last December to address I/O slowdowns in its enterprise SAP environment.

Wotters first tried adding a second set of processors to all of his SAP servers and upgrading memory to its maximum capacity, but the problem persisted. He eventually traced the performance problem to the terabyte-scale SAP database and its primary storage load, which at the time resided on an HP EVA array. Adding SSDs to the EVA would have cost more than buying an all-flash array, Wotters said.

The Nimbus array provides 2 TB of storage capacity and cost about $40,000, he said.

Nimbus's fast flash storage array eliminated the database bottleneck and cut processing time from four hours on the EVA to 15 minutes on the new flash array.

"So far, I have been very satisfied with the performance." Not a single fault. "he said. I think of an array in the morning, the afternoon can be online. It's as simple as that. ”

Wotters' environment dedicates its flash storage to a single application: SAP. But TheInfoPro's survey found that automated data tiering, the ability to migrate data between the different drive types within an array, has become the hottest storage technology.

Automated tiering moves hot data onto high-performance solid-state drives, and migrates it from those drives to high-capacity, lower-performance disks as its access frequency falls.
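A minimal sketch of the idea follows; the class, thresholds, and extent granularity are hypothetical illustrations, not any vendor's actual policy engine. It counts accesses per extent within a monitoring window, promotes hot extents to SSD, demotes cold ones to capacity disk, and leaves everything in between where it is to avoid thrashing.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of automated storage tiering: promote frequently accessed
// extents to SSD, demote cold ones to high-capacity disk.
public class TieringPolicy {
    enum Tier { SSD, CAPACITY_DISK }

    private final Map<String, Integer> accessCounts = new ConcurrentHashMap<>();
    private final int promoteThreshold;  // accesses per window to earn SSD placement
    private final int demoteThreshold;   // below this, fall back to capacity disk

    TieringPolicy(int promoteThreshold, int demoteThreshold) {
        this.promoteThreshold = promoteThreshold;
        this.demoteThreshold = demoteThreshold;
    }

    // Called on every read or write of an extent.
    void recordAccess(String extentId) {
        accessCounts.merge(extentId, 1, Integer::sum);
    }

    // Called at the end of each monitoring window (e.g. hourly) to decide placement.
    Tier placement(String extentId, Tier current) {
        int count = accessCounts.getOrDefault(extentId, 0);
        if (count >= promoteThreshold) return Tier.SSD;
        if (count < demoteThreshold)   return Tier.CAPACITY_DISK;
        return current; // hysteresis band: leave it where it is to avoid thrashing
    }

    // Reset counters for the next window.
    void endWindow() {
        accessCounts.clear();
    }

    public static void main(String[] args) {
        TieringPolicy policy = new TieringPolicy(100, 10);
        for (int i = 0; i < 150; i++) policy.recordAccess("extent-42");
        policy.recordAccess("extent-7");
        System.out.println(policy.placement("extent-42", Tier.CAPACITY_DISK)); // SSD
        System.out.println(policy.placement("extent-7", Tier.SSD));            // CAPACITY_DISK
    }
}
```

Real arrays operate on sub-LUN extents and weigh recency as well as raw frequency; the hysteresis band here stands in for those refinements.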

"One of the basic components is the disk drive and the efficiency with which the selected platform is used for that disk space," Peglar said. At present, power density is a major focus of computing in the deployment of a large data analysis platform, and the power consumption of disk arrays is increasingly a focus of attention, he said.

That focus stems from the size of the storage arrays required and how fast those storage requirements grow. Peglar has seen some clients' clusters grow from one petabyte to five petabytes in just a few years.

Big data compute nodes can be deployed at very high density, with power densities reaching 3 kW or higher, Peglar said. This matters when data center managers size their power and cooling infrastructure.
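A back-of-the-envelope sizing sketch, assuming the 3 kW figure is per rack and using a hypothetical 20-rack cluster: total IT load is rack count times density, and the cooling plant must remove that heat, where one ton of refrigeration handles roughly 3.517 kW.

```java
// Rough sizing arithmetic for a big data cluster's power and cooling.
// The rack count is a hypothetical example; the per-rack density is the
// figure cited above, assumed here to be per rack.
public class FacilitySizing {
    // One ton of refrigeration removes about 3.517 kW of heat.
    private static final double KW_PER_COOLING_TON = 3.517;

    public static void main(String[] args) {
        int racks = 20;          // hypothetical cluster size
        double kwPerRack = 3.0;  // "3 kW or higher" density

        double itLoadKw = racks * kwPerRack;                 // 60 kW of IT load
        double coolingTons = itLoadKw / KW_PER_COOLING_TON;  // heat to remove

        System.out.printf("IT load: %.1f kW%n", itLoadKw);
        System.out.printf("Cooling required: ~%.1f tons%n", coolingTons);
    }
}
```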
