The average company spends $2.1 million a year on unstructured data processing, according to a survey of 94 large U.S. companies conducted by the Ponemon Institute for Novell; costs are highest in tightly regulated industries such as finance, pharmaceuticals, communications, and healthcare, and are expected to reach $2.5 million a year. Another survey, from Unisphere Research, found that 62% of respondents believed the growth of unstructured information was unavoidable and that it would surpass traditional data within the next 10 years; 35% said unstructured information would exceed traditional relational data within the next 36 months.
According to IDC, global data volumes now double every 18 months, and the amount of data produced worldwide has reached 40 EB a year (1 EB = 1,000 PB). This explosive growth comes mainly from unstructured data.
One sign that unstructured data has gone mainstream: according to Gartner, in 2008 file-based storage-system capacity shipments surpassed block-based capacity shipments for the first time, and the gap has widened in recent years; by 2012, file-based capacity is expected to account for 70% of total shipped capacity. IDC likewise predicts that, given the rapid growth of file-based unstructured data, file-level data will cover 80% of total shipments in the global storage market by 2012.
Clearly, enterprises that need to process unstructured data must face up to the problems it brings.
What is unstructured data?
Unstructured data is defined relative to structured data. Structured data is data that is numeric or can be represented in a uniform structure, such as data stored in a database, which is essentially accessed as blocks. Unstructured data is data that cannot be represented numerically or in a uniform structure, such as text, images, video, audio, reports, and Web pages, most of which exist in the form of files.
There are two main reasons for the proliferation of unstructured data. First, the arrival of the cloud era has shifted data creation toward individual users, and most of the data individuals produce consists of pictures, documents, videos, and other unstructured data. Second, the spread of information technology lets enterprises move more of their workflows online; paper forms, bills, and the like are now archived digitally, and the data produced this way is also mainly unstructured.
A Web page, for example, is often treated as typical unstructured data, even though nearly every page is written in HTML and is rich in structural definitions. Web pages also contain links and references to external content that is itself often unstructured, such as images, XML files, and animations.
Unstructured data is also common in customer relationship management (CRM) systems, especially for customer service representatives and call-center staff.
How, then, do we deal with "big data" that consists of both unstructured data and traditional structured data?
Obviously, integrating all this data will require innovation. Data management systems now need far more advanced capabilities than the designs of 40 years ago: they must manage all data types, both structured and unstructured, and meet the needs of data deployed anywhere on a distributed, global network.
Unstructured data--is the RAID model obsolete?
In traditional solutions, structured data access is small but intensive: a single database read or write moves only a few bytes or a few KB of data, yet such accesses are very frequent; for a large enterprise database, calls generally number in the dozens to hundreds per second. The key metric for database storage devices is therefore IOPS, the number of I/O operations completed in one second.
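To make the contrast concrete, here is a minimal sketch with hypothetical numbers: the database workload saturates IOPS while moving very little data, whereas a single unstructured file needs raw bandwidth rather than many small operations.

```python
# A minimal sketch (hypothetical numbers) contrasting an IOPS-bound database
# workload with a bandwidth-bound unstructured workload.

KB = 1024
GB = 1024 ** 3

# Structured workload: many tiny accesses -> IOPS is the bottleneck.
db_calls_per_second = 200          # dozens to hundreds of calls/s, per the text
bytes_per_call = 4 * KB            # a single row read/write is only a few KB
db_bandwidth = db_calls_per_second * bytes_per_call
print(f"database traffic: {db_bandwidth / KB:.0f} KB/s at {db_calls_per_second} IOPS")

# Unstructured workload: one huge access -> bandwidth is the bottleneck.
video_file = 2 * GB                # one large media file
stream_rate = 100 * 1024 * KB      # 100 MB/s sequential read
print(f"streaming one file needs {video_file / stream_rate:.0f} s of full "
      f"bandwidth but almost no IOPS")
```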
To get the fastest possible query speeds, enterprises therefore began deploying SSDs with far greater I/O throughput. But a new problem appeared: as SSD fabrication processes shrank (72nm -> 50nm -> 32nm -> 25nm), the number of program/erase cycles each cell can endure kept falling. For MLC flash, a 50nm cell endures about 10,000 cycles, a 32nm cell only about 5,000, and the latest 25nm cells fewer than 3,000.
Improved performance thus goes hand in hand with declining reliability, which is a dilemma.
Vendors have improved error-correction and wear-leveling algorithms in software, but this has not significantly extended the life of the current generation of SSDs. Process upgrades do greatly increase capacity, however, so users can set aside more spare space to extend endurance; even so, this is not the best answer for unstructured data.
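A back-of-the-envelope estimate shows the trade-off. All parameters below (drive size, daily write volume, write amplification, spare-area percentage) are assumptions chosen for illustration; the P/E cycle counts are the ones cited above.

```python
# Rough SSD lifetime model (hypothetical workload): total erasable bytes
# divided by daily wear. Spare area adds flash that absorbs wear.

def years_of_life(capacity_gb, pe_cycles, overprovision, writes_gb_per_day,
                  write_amplification=2.0):
    usable = capacity_gb * (1 + overprovision)   # advertised plus spare flash
    total_writes_gb = usable * pe_cycles / write_amplification
    return total_writes_gb / writes_gb_per_day / 365

# Same assumed 256 GB drive and 50 GB/day workload across process generations:
for node, cycles in [("50nm", 10_000), ("32nm", 5_000), ("25nm", 3_000)]:
    plain = years_of_life(256, cycles, overprovision=0.0, writes_gb_per_day=50)
    spare = years_of_life(256, cycles, overprovision=0.28, writes_gb_per_day=50)
    print(f"{node}: ~{plain:.0f} years, ~{spare:.0f} years with 28% spare area")
```

The point of the sketch is that over-provisioning only scales endurance linearly, while each process shrink roughly halves it, which is why spare capacity alone is not a cure.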
Alibaba senior DBA Rui put it this way: if a system is designed with RAID but a failed disk takes more than 10 hours to rebuild, with obvious degradation of overall system performance in the meantime, users simply cannot accept it. So when weighing a system architecture, do not always reason from the best case; reason from the worst.
In a sense, he says, the important thing is to think at design time about the impact of failed disks, failed nodes, and failed paths on the system, and about how to recover quickly after the damage occurs.
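The 10-hour figure follows directly from disk sizes: a rebuild must rewrite the entire replacement disk, and the achievable rate is throttled by live traffic. The numbers below are illustrative assumptions, not measurements.

```python
# Why a failed multi-terabyte disk can take 10+ hours to rebuild: the RAID
# controller re-reads the surviving disks and rewrites the whole replacement.

TB = 10 ** 12
MB = 10 ** 6

disk_capacity = 4 * TB        # assumed disk size
rebuild_rate = 80 * MB        # assumed MB/s achievable while serving live I/O

hours = disk_capacity / rebuild_rate / 3600
print(f"rebuild of a {disk_capacity // TB} TB disk at {rebuild_rate // MB} MB/s: "
      f"~{hours:.1f} hours")  # ~13.9 hours, consistent with '10+ hours' above
```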
Obviously, the best option for handling structured data is still RAID; after all, RAID is popular with users precisely because it turns cheap, stable hard drives into larger, protected capacity. But for unstructured data, as storage servers carry ever more disks and ever more capacity, current RAID-card technology may no longer fit: a good unstructured data storage architecture must deliver very large I/O throughput, that is, transmission bandwidth. The inevitable trend is that unstructured data processing will rely more and more on distributed computing.
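A quick worked comparison, with assumed figures, shows why bandwidth pushes toward scale-out: a single controller-bound array has a fixed ceiling, while aggregate cluster bandwidth grows with node count.

```python
# Sketch of the bandwidth argument for scale-out (all figures assumed).

GB = 10 ** 9
TB = 10 ** 12

dataset = 100 * TB
raid_array_bw = 2 * GB            # one array, capped by its controller
node_bw, nodes = 1 * GB, 40       # modest nodes, all reading in parallel

print(f"single array full scan: {dataset / raid_array_bw / 3600:.1f} h")
print(f"{nodes}-node cluster scan: {dataset / (node_bw * nodes) / 3600:.1f} h")
```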
RAID is not going to perish, but the demands of ever larger disks and next-generation storage are opening new ways to extend data protection beyond RAID. RAID may remain an important part of data protection, but as a complement to other technologies.
Future business intelligence--the need for mixed data
For an enterprise, using unstructured data in BI (business intelligence) is not just about analyzing that data on its own; more and more companies want to combine structured and unstructured data in a single analysis, across a wide variety of data streams: in other words, mixed data.
Traditional data warehouses, however, support unstructured data poorly. As a result, the emerging data-warehouse architecture stores unstructured data in a distributed platform such as Hadoop and performs basic analysis there; the summarized results are then delivered to the existing data warehouse for further analysis. Enterprises can also merge the two environments directly, or bridge them with federated queries across systems such as Hadoop.
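A toy illustration of that two-tier idea follows: raw unstructured files are summarized outside the warehouse (Hadoop's role is played here by a simple loop), and only the compact summary lands in a SQL warehouse (played by sqlite3). The directory name, log format, and table are all invented for the sketch.

```python
# Two-tier sketch: basic analysis over raw files, summary into a SQL store.

import sqlite3
from collections import Counter
from pathlib import Path

def summarize_logs(log_dir):
    """Basic analysis over raw text files: count events per category."""
    counts = Counter()
    for path in Path(log_dir).glob("*.log"):
        for line in path.read_text().splitlines():
            category = line.split(" ", 1)[0]     # e.g. "ERROR some message"
            counts[category] += 1
    return counts

warehouse = sqlite3.connect(":memory:")          # stand-in for the warehouse
warehouse.execute("CREATE TABLE log_summary (category TEXT, events INTEGER)")
warehouse.executemany("INSERT INTO log_summary VALUES (?, ?)",
                      summarize_logs("raw_logs").items())
# Further analysis now runs on the small summary, not the raw files.
for row in warehouse.execute("SELECT * FROM log_summary ORDER BY events DESC"):
    print(row)
```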
But the real problem is that traditional BI tools cannot parse and search structured and unstructured data in the same query. Instead, you must fall back on MapReduce or SQL-like tools built on top of it.
This does not mean, however, that no tools can handle structured and unstructured data together. Endeca Latitude and Connexica's CXAIR, for example, both support mixed queries over structured and unstructured data.
The two products take different approaches, but the basic idea is the same: extract structure from the unstructured data, then combine it directly with the structured data. Both products are easy to use, and they let users focus on the data itself rather than merely generating reports.
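Here is a minimal sketch of that shared idea, not of either product's actual internals: pull structure out of free text, then join it against structured records. The ticket format, the regex, and the customer table are invented for illustration.

```python
# Extract structure from free text, then combine it with structured data.

import re

tickets = [                                     # unstructured: free-text notes
    "Customer 1042 reports checkout error on mobile",
    "Customer 2011 asks about invoice totals",
]
customers = {1042: "gold", 2011: "standard"}    # structured: CRM table

pattern = re.compile(r"Customer (\d+) (?:reports|asks about) (.+)")
for text in tickets:
    m = pattern.match(text)
    if m:
        cust_id, topic = int(m.group(1)), m.group(2)
        # structure extracted from text, joined with the structured table:
        print(f"tier={customers[cust_id]:8s} id={cust_id} topic={topic}")
```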
For now, the two companies' market strategies differ. Latitude primarily targets analytical applications built around browsing mixed data, while CXAIR leans toward the traditional BI market.
Even so, neither vendor appears to have a complete solution for every mixed-data problem.
What they have in common is an explicit choice of a warehouse-style storage architecture. There is no doubt that the technologies built into Endeca and Connexica, and their ability to handle unstructured data, are essential to leadership in BI.
Distributed architecture will be the final choice
The ability to process unstructured data is certainly necessary for large organizations, but for smaller companies the potential problem is that these solutions cost too much.
Can cloud databases overcome the scalability and performance problems that have plagued traditional databases for years? As things stand, getting data out of a cloud database still relies on data-management technology that stores everything in a centralized location. There is another serious limitation as well: traditional data-management technology struggles to manage unstructured data at all.
An alternative is to store the data in a data warehouse such as Teradata's Aster Data or EMC's Greenplum, both of which support the full functionality of native MapReduce. But go down this path and you will run into scalability problems.
Distributed computing solves the scalability problem neatly, so today almost all data-warehouse and data-analysis vendors have begun announcing support for distributed technology, typified by Hadoop and MapReduce (though all of the commercial data-warehouse software is expensive).
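The canonical MapReduce example, word count, shows why the model scales: map and reduce calls are independent, so a framework like Hadoop can spread them across many machines. The in-process sketch below only illustrates the programming model, not a real distributed runtime.

```python
# Minimal map/shuffle/reduce word count: a sketch of the MapReduce model.

from collections import defaultdict

def map_phase(document):                 # runs independently per document
    return [(word, 1) for word in document.split()]

def shuffle(pairs):                      # group intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):           # runs independently per key
    return key, sum(values)

docs = ["big data big files", "files and more files"]
pairs = [p for d in docs for p in map_phase(d)]
print(dict(reduce_phase(k, v) for k, v in shuffle(pairs).items()))
```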
Of course, another challenge for businesses is the scale of change needed to meet these new demands, including the cost of deploying new architectures, greater demands on management capability, and increasingly complex IT infrastructure.
In a cloud computing architecture, servers and storage devices will inevitably be more dispersed than they are now, which raises challenges for data management, distributed design, and performance. For example, a database management system that can query data distributed across data centers in multiple geographic locations is a new problem companies face as cloud computing spreads.
Traditional database management systems cannot meet these needs. Most centralized architectures were designed 40 years ago, which keeps them from being effectively distributed across data centers. To deliver the most critical characteristics of a cloud database management system, a distributed peer-to-peer architecture is needed.
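One common pattern for such geo-distributed querying is scatter-gather: each region answers from its local data and only small results cross the network. The region names, data, and query below are invented for illustration; this is a sketch of the pattern, not of any particular product.

```python
# Scatter-gather sketch: fan a query out to regional nodes, merge the results.

from concurrent.futures import ThreadPoolExecutor

REGIONS = {                      # stand-ins for per-datacenter databases
    "us-east": {"orders": 1200},
    "eu-west": {"orders": 800},
    "ap-south": {"orders": 450},
}

def query_region(region):
    """Each node computes locally; only the aggregate leaves the region."""
    return REGIONS[region]["orders"]

with ThreadPoolExecutor() as pool:       # query all regions in parallel
    partials = list(pool.map(query_region, REGIONS))

print("global order count:", sum(partials))   # merge the partial results
```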
Enterprises need data-management technology that can efficiently reach data of any format, distributed anywhere on a global network, without shipping large volumes of data across the Internet. That will be a basic requirement of future cloud computing networks.