Hadoop is widely used in big data processing applications because of its natural advantages in extraction, transformation, and loading (ETL). Hadoop's distributed architecture, which places the processing engine as close to the storage as possible, is well suited to batch operations such as ETL, since batch results can go straight back to storage. Hadoop's MapReduce feature lets you break a single task into fragments, send the fragments to multiple nodes (Map), and then load the combined results (Reduce) into the data warehouse as a single dataset.
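To make the Map and Reduce steps concrete, here is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce API; the class name and the use of command-line paths are illustrative choices, not something taken from this article.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map step: each node tokenizes the input blocks stored locally.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: partial counts from all mappers are merged
        // into a single result dataset.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }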
But for Hadoop, and especially for the Hadoop Distributed File System (HDFS), big data processing requires at least three copies of the data to keep it highly available. HDFS may be workable at the terabyte scale, but at the petabyte level the storage cost becomes a real strain. Even scale-out storage cannot escape that pressure; some vendors resort to RAID for protection at the volume level plus replication at the system level. Object storage technology offers a way out of the data redundancy problem facing big data environments.
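A back-of-the-envelope calculation shows why replication hurts at the petabyte level; the 1 PB dataset size below is an assumed figure for illustration only.

    public class ReplicationOverhead {
        public static void main(String[] args) {
            double usablePB = 1.0; // assumed logical dataset size
            int replicas = 3;      // HDFS default replication factor
            double rawPB = usablePB * replicas;
            // Triple replication turns 1 PB of data into 3 PB of raw disk:
            System.out.printf("raw capacity: %.1f PB, overhead: %d%%%n",
                    rawPB, (replicas - 1) * 100);
        }
    }

In other words, every petabyte of useful data carries two more petabytes of pure redundancy before any RAID overhead is even counted.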
Object storage

Object-based storage architectures can greatly enhance the benefits of scale-out storage by replacing the hierarchical file system with a single flat index that maps to flexible data objects. This allows essentially unrestricted scaling and improves performance as well. Object storage systems also incorporate erasure coding, which eliminates the need for RAID or replication for data protection and greatly improves storage efficiency.
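The single-index idea can be pictured as one flat map from object ID to object, with no directory tree to traverse; the toy sketch below (all names invented here) shows the lookup model only, not any real product's API.

    import java.util.HashMap;
    import java.util.Map;

    public class FlatObjectStore {
        // One flat index: object ID -> object bytes, no hierarchy.
        private final Map<String, byte[]> index = new HashMap<>();

        public void put(String objectId, byte[] data) {
            index.put(objectId, data); // single index update
        }

        public byte[] get(String objectId) {
            return index.get(objectId); // no path traversal needed
        }

        public static void main(String[] args) {
            FlatObjectStore store = new FlatObjectStore();
            store.put("object-0001", "sensor reading".getBytes());
            System.out.println(new String(store.get("object-0001")));
        }
    }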
Unlike HDFS, which needs two or three redundant copies of the data plus an additional RAID mechanism, an object storage system's erasure coding can reach a higher level of data protection with only 50% to 60% extra capacity. At big data scale, the savings on storage alone are significant. There are many object storage systems to choose from, including Caringo, DataDirect Networks Web Object Scaler (WOS), NetApp StorageGRID, Quantum Lattus, and the open source OpenStack Swift and Ceph.
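The 50% to 60% figure follows directly from erasure-code geometry: with k data fragments and m parity fragments, the capacity overhead is m/k. The 10+6 layout below is an assumed example for illustration, not a scheme named by any of these vendors.

    public class ErasureVsReplication {
        public static void main(String[] args) {
            int k = 10, m = 6; // assumed layout: 10 data + 6 parity fragments
            double ecOverhead = 100.0 * m / k;     // 60% extra capacity
            double replOverhead = 100.0 * (3 - 1); // 200% for triple replication
            // The 10+6 code survives the loss of any 6 fragments, yet needs
            // far less raw capacity than keeping three full copies:
            System.out.printf("erasure coding: +%.0f%%, replication: +%.0f%%%n",
                    ecOverhead, replOverhead);
        }
    }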
Some object storage systems, such as Cleversafe, are even Hadoop-compatible. In such deployments, the Hadoop software components run on the CPUs of the object storage nodes, and the object storage system replaces HDFS as the storage layer.
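As a sketch of what replacing HDFS can look like in practice, Hadoop can be pointed at an S3-compatible object store through its s3a connector in core-site.xml; the bucket name and endpoint below are hypothetical, and a product such as Cleversafe would ship its own connector and property names.

    <configuration>
        <!-- Use the object store, not HDFS, as the default filesystem
             (bucket name is a placeholder). -->
        <property>
            <name>fs.defaultFS</name>
            <value>s3a://analytics-bucket</value>
        </property>
        <!-- Hypothetical endpoint of an on-premises object store. -->
        <property>
            <name>fs.s3a.endpoint</name>
            <value>http://objectstore.example.com:9000</value>
        </property>
        <property>
            <name>fs.s3a.path.style.access</name>
            <value>true</value>
        </property>
    </configuration>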
The bottom line for big data storage
Big data analytics has gradually become a hot topic in the IT industry, and more and more enterprises believe it will lead them to success. But everything has two sides, and the side to examine here is the existing storage technology. Traditional storage systems hit bottlenecks whether the workload is a real-time big data application that demands extremely low latency or a data mining application running against a massive data warehouse. To keep big data analytics running properly, the underlying storage system needs to be fast, scalable, and cost-effective.
Flash offers an alternative for high-performance, low-latency storage, whether as server-side flash cards or as all-flash arrays. For capacity, object-based scale-out architectures with erasure coding offer a more efficient and lower-cost option than storage built on traditional RAID and replication.