What are the strategies for streamlining cloud data?

Source: Internet
Author: User
Over the years, the unit cost of disk space has fallen steadily. Since a 1TB hard drive costs only around 50 dollars, talking about economizing on storage usually seems petty.

But in the cloud, things are completely different. If we keep too much worthless data or too many copies of the same document, the costs pile up in two ways. The first is the monthly storage charge; the second is the slower performance of searches, views, reports, and dashboard refreshes. In the cloud, trimming a dataset can bring tangible benefits.

The first order of business is to assess the problem: does our storage hold primarily documents or tabular data? The two typically bring different kinds of storage constraints, and the strategies and tools used to deal with them are also quite different.

Documents usually exist as attachments to records (for example, a signed contract in PDF form is often attached to the related deal), so users often find it hard to locate them quickly. As a result, the same document may end up attached to three or four different records at once. We also need to look for multiple versions of a document that was revised several times within a short period. The first step is to inventory every document in the system, export a list (including the ID of the record each document is attached to and the date of its last update), and use a spreadsheet filter to find duplicates. Specialized duplicate-file detection tools can help here (by examining file contents), but I haven't heard of a cloud tool that offers the same functionality. Unless you are willing to download all the files to a local server and analyze them there, expect a very heavy manual workload. Because optical storage media are very inexpensive, we might as well archive all the documents to optical media first and only then purge them from the cloud, in case someone complains later.
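For readers who do download the files, the content-based duplicate check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Attachment` structure and its field names are hypothetical, and the files are assumed to have already been exported from the cloud system together with their record IDs and last-update dates.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Attachment:
    # Hypothetical export format: one entry per document attachment.
    record_id: str       # ID of the record the document is attached to
    filename: str
    last_updated: date
    content: bytes       # raw file bytes, assumed already downloaded


def find_duplicate_documents(attachments):
    """Group attachments by the SHA-256 digest of their content.

    Returns only the groups that contain more than one attachment,
    i.e. byte-for-byte identical files attached to several records.
    """
    groups = defaultdict(list)
    for att in attachments:
        digest = hashlib.sha256(att.content).hexdigest()
        groups[digest].append(att)
    return {d: atts for d, atts in groups.items() if len(atts) > 1}


if __name__ == "__main__":
    sample = [
        Attachment("DEAL-1", "contract.pdf", date(2023, 1, 5), b"signed contract"),
        Attachment("DEAL-2", "contract_copy.pdf", date(2023, 2, 9), b"signed contract"),
        Attachment("DEAL-3", "proposal.pdf", date(2023, 3, 1), b"proposal text"),
    ]
    for digest, atts in find_duplicate_documents(sample).items():
        print(digest[:12], [(a.record_id, a.filename) for a in atts])
```

A hash comparison like this only catches exact copies. Multiple revisions of the same document have different contents, so they still need the list-and-filter review by record ID and last-update date.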

Tabular data is a completely different matter, because each type of cloud system uses its own system-specific methods and techniques for handling such data. That said, the general process is as follows:

• Determine which of your cloud systems actually have storage problems. Some systems, such as accounting systems, cannot be trimmed at all, because the staff involved need to review them regularly and keep every detail for the long term. Other systems, such as marketing automation or log analysis systems, often collect a great deal of detail as they run, and they are the culprits that slow the system down.

• Determine which tables consume more than 20% of your total storage. These are the focus of the trimming effort.

• Learn the value of the individual records in each table. Some tables (especially accounts or contracts) can hardly be touched, because their content is very important and purging it would have a significant impact (especially when the tables are integrated with external systems). Other tables, such as those holding anonymous visitor data in a marketing automation system, can often be trimmed freely.

• Before taking any further steps, make a full backup of the cloud data to disk or optical media. I say this in the most solemn terms: this step must not be skipped.

• For tables that can be trimmed at will, first evaluate the signal-to-noise ratio (the ratio of useful to unwanted information). Which information has become completely worthless because it is stale? For example, in a marketing automation or web monitoring cloud, who really cares about an anonymous visitor who hasn't returned in six months? Delete all of that noise. You will certainly want to analyze which users are affected first, but remember that the ultimate goal of trimming by signal-to-noise ratio is to clear out millions of records in a short time.

• Some tables have a good signal-to-noise ratio, but many of the details stored in them are unnecessary. For example, many marketing automation and e-mail delivery systems use activity tables to record important message and web interactions. These activity tables may consume half of the system's storage. But how much does it really matter that someone watched video A a year ago today and video B the day before? A useful test: if a particular detail doesn't change anyone's decision or behavior, it isn't "information." With that in mind, we recommend a compression approach: keep the summary information, but purge the details that are six months old or older. The history is usually stored in a custom table, a descriptive tag, a representative string, or even a bitmap, all of which take up far less storage. This kind of trimming requires careful thought, user input, and custom code development. The process is not easy, but in the end we get a continuous trimming mechanism driven by the value of the information.

• Some tables (especially leads and contacts) tend to accumulate duplicate records quickly, particularly since every company has systems dedicated to handling leads and contacts. If your cloud system supports duplicate-removal tools (typically from the main vendor or from third parties), buy one with a good reputation and actually use it. The ideal tool has a fuzzy-logic algorithm that helps us find and merge duplicates without moving the data out of the cloud. The merge process preserves as much data as possible, but if your cloud holds a large number of data conflicts (for example, two completely different phone numbers for the same contact), you will probably need to create shadow fields and populate them with the conflicting values before merging. For a number of reasons, the merge must be done in phases: it consumes a lot of CPU time and a lot of mental effort, but in the end it can clear out 100,000 duplicate records. Don't rush this kind of merging work, because there is no undo feature.
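To make the last step more concrete, the idea of fuzzy matching plus conflict detection can be sketched with Python's standard library. This is a minimal illustration, not how any particular deduplication product works: the contact fields (`id`, `name`, `email`, `phone`), the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_likely_duplicates(contacts, threshold=0.85):
    """Return (id_a, id_b, score, conflicts) for contact pairs with similar names.

    'conflicts' lists the fields where both records hold a value but the
    values differ; those need review (or a shadow field) before merging.
    """
    pairs = []
    for a, b in combinations(contacts, 2):
        score = similarity(a["name"], b["name"])
        if score >= threshold:
            conflicts = sorted(
                field
                for field in ("email", "phone")
                if a.get(field) and b.get(field) and a[field] != b[field]
            )
            pairs.append((a["id"], b["id"], round(score, 2), conflicts))
    return pairs


if __name__ == "__main__":
    contacts = [
        {"id": 1, "name": "Jon Smith", "email": "jon@example.com", "phone": "555-0100"},
        {"id": 2, "name": "John  Smith", "email": "jon@example.com", "phone": "555-0199"},
        {"id": 3, "name": "Maria Garcia", "email": "maria@example.com", "phone": "555-0142"},
    ]
    for pair in find_likely_duplicates(contacts):
        print(pair)
```

The sketch only reports candidate pairs and their conflicting fields; it deliberately merges nothing, since a merge cannot be undone. Comparing every pair is also quadratic in the number of contacts, so it suits small batches rather than a whole contact table at once.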

Much of what is described above is a one-time fix rather than a long-term mechanism that builds the change into day-to-day processes. If you don't want to invest in improving your data management process, be prepared to repeat the trimming steps above every quarter. And keep in mind that these chores will keep coming back to haunt you until you introduce a lasting mechanism.
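As one example of what a repeatable mechanism might look like, the compression approach for activity tables can be written as a small routine that runs on a schedule. The sketch below is an assumption-laden illustration: the event keys (`contact_id`, `kind`, `day`) are hypothetical, "six months" is approximated as 180 days, and per-contact counts stand in for the custom table, tag, string, or bitmap a real system would use for the summary.

```python
from collections import Counter
from datetime import date, timedelta


def compress_old_activity(events, today, keep_days=180):
    """Split activity events into recent detail rows and summaries of older rows.

    events: iterable of dicts with the hypothetical keys
            'contact_id', 'kind', and 'day' (a datetime.date).
    Returns (recent_events, summaries), where summaries maps each
    contact_id to a Counter of event kinds older than the cutoff.
    """
    cutoff = today - timedelta(days=keep_days)
    recent = []
    summaries = {}
    for event in events:
        if event["day"] >= cutoff:
            # Recent detail is kept as-is.
            recent.append(event)
        else:
            # Older detail is reduced to a count per contact and event kind.
            counter = summaries.setdefault(event["contact_id"], Counter())
            counter[event["kind"]] += 1
    return recent, summaries


if __name__ == "__main__":
    events = [
        {"contact_id": 1, "kind": "video_view", "day": date(2023, 6, 1)},
        {"contact_id": 1, "kind": "video_view", "day": date(2023, 6, 2)},
        {"contact_id": 1, "kind": "email_open", "day": date(2024, 6, 15)},
        {"contact_id": 2, "kind": "page_view", "day": date(2023, 1, 10)},
    ]
    recent, summaries = compress_old_activity(events, today=date(2024, 7, 1))
    print(len(recent), "recent events kept")
    for contact_id, counts in summaries.items():
        print(contact_id, dict(counts))
```

The routine only computes what would be kept and what would be summarized. Writing the summaries back and deleting the old detail rows depends on the cloud system's own API, and should come only after the full backup described earlier.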
