It is tempting to bet your business on big data technology, and Apache Hadoop makes that temptation even stronger. Hadoop is a massively scalable data storage platform that forms the foundation of most big data projects. Hadoop is powerful, but it requires a company to invest a great deal of learning effort and other resources.
If you apply it to the right problem, Hadoop can genuinely improve your company's business, but the path to applying Hadoop is full of thorns. On the other hand, many companies (unlike Google, Facebook, or Twitter, of course) do not have data volumes that require a giant Hadoop cluster to analyze; they are simply attracted by the buzzword "big data".
As David Wheeler said, "All problems in computer science can be solved by another level of indirection," and Hadoop is exactly that kind of indirection. When your boss is drawn in by buzzwords, making the right software architecture decisions becomes very difficult.
Here are some alternatives worth trying before you invest in Hadoop:
Know Your data
The total volume of your data
Hadoop is designed as an effective solution for large datasets, and its file system HDFS is optimized for large, gigabyte-scale files. So if your files are only a few megabytes, you are better off consolidating several of them (zip or tar) into hundreds of megabytes or a few gigabytes. HDFS splits files and stores them in blocks of 64 MB, 128 MB, or more.
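As a rough illustration of that consolidation step, here is a minimal Python sketch (the directory name, file pattern, and 128 MB target are assumptions for illustration) that bundles many small files into tar archives sized to roughly one HDFS block:

```python
import tarfile
from pathlib import Path

TARGET_SIZE = 128 * 1024 * 1024  # roughly one HDFS block

def bundle_small_files(src_dir: str, out_prefix: str) -> None:
    """Group small files into tar archives of about TARGET_SIZE bytes."""
    batch, batch_size, index = [], 0, 0
    for path in sorted(Path(src_dir).glob("*.log")):
        batch.append(path)
        batch_size += path.stat().st_size
        if batch_size >= TARGET_SIZE:
            _write_archive(batch, f"{out_prefix}-{index:04d}.tar")
            batch, batch_size, index = [], 0, index + 1
    if batch:  # flush the final, possibly smaller, batch
        _write_archive(batch, f"{out_prefix}-{index:04d}.tar")

def _write_archive(paths, archive_name: str) -> None:
    with tarfile.open(archive_name, "w") as tar:
        for p in paths:
            tar.add(p, arcname=p.name)

if __name__ == "__main__":
    bundle_small_files("raw_logs", "logs")  # hypothetical input directory
```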
If your dataset is very small, this giant ecosystem is not a good fit. You need a good understanding of your data: analyze what types of queries are required and whether your data is really big enough.
On the other hand, measuring only what sits in the database can be misleading, because your computations may inflate it. Sometimes a mathematical calculation or an analysis over a small dataset produces results far larger than the original data, so the key is to truly understand your data.
The rate of data growth
You may have terabytes of data in a data warehouse or other data sources, but one factor to consider before building a Hadoop cluster is how fast that data is growing.
Ask your analyst a few simple questions, such as:
How fast is the data growing? Is it growing at a very rapid rate? How large will the volume of data be in a few months or a few years?
For many companies, data grows on a yearly basis. If that is your situation, your data growth rate is not fast, so consider archiving and purging options instead of jumping straight to Hadoop.
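A quick back-of-the-envelope projection is often enough to answer these questions. The sketch below (the starting volume and growth rate are made-up figures) compounds an assumed annual growth rate to estimate future data volume:

```python
# Project future data volume from an assumed annual growth rate to judge
# whether a Hadoop-scale cluster will actually be needed.
def project_volume(current_tb: float, yearly_growth: float, years: int) -> float:
    """Compound growth: volume * (1 + rate) ** years."""
    return current_tb * (1 + yearly_growth) ** years

# Illustrative figures only: 2 TB today, growing 20% per year, over 3 years.
print(round(project_volume(current_tb=2.0, yearly_growth=0.20, years=3), 2))  # ~3.46 TB
```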
How to reduce the data that needs to be processed
If you do have a very large volume of data, consider reducing it to a manageable size with the following options, which the industry has tested for decades.
Consider archiving
Data archiving means storing outdated data separately; how long data must stay active is, of course, driven by actual requirements. It demands a very good understanding of the data and of how applications use it. For example, a large e-commerce company might keep only the last three months of orders in the active database and move older orders to separate storage.
The same approach applies to your data warehouse: keep recent data available for reporting and querying, and move infrequently used data to separate storage devices.
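As a rough sketch of what such archiving can look like, the snippet below (table and column names are assumptions, and SQLite is used only to keep the example self-contained) copies orders older than 90 days into an archive table and removes them from the active table:

```python
import sqlite3

def archive_old_orders(conn: sqlite3.Connection, days: int = 90) -> None:
    """Move orders older than `days` days out of the active table."""
    cutoff = f"-{days} days"
    with conn:  # one transaction: copy first, then delete
        conn.execute(
            """INSERT INTO orders_archive
               SELECT * FROM orders
               WHERE created_date < datetime('now', ?)""",
            (cutoff,),
        )
        conn.execute(
            "DELETE FROM orders WHERE created_date < datetime('now', ?)",
            (cutoff,),
        )
```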
Consider purging data
Sometimes we are busy collecting data without knowing exactly how much of it we need to keep, and storing a lot of data you never use will certainly slow down effective processing. Work out your business needs, review whether data can be deleted, and analyze which types of data really must be stored; doing so not only saves storage space but also speeds up analysis of the data that matters.
A common best practice is to add audit columns to warehouse tables, such as created_date, created_by, updated_date, and updated_by. These columns let you collect periodic access statistics so you can see how long data remains useful. Pay close attention to the logic of the removal itself: think first, then implement. If you use an archiving tool, purging data becomes very easy.
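The audit columns make it easy to measure how stale your data actually is. Here is a minimal sketch (again with assumed table and column names, using SQLite for self-containment) that counts rows not updated in the past year, as a first step before deciding what to purge:

```python
import sqlite3

def count_stale_rows(conn: sqlite3.Connection, table: str, days: int = 365) -> int:
    """Count rows whose updated_date audit column is older than `days` days."""
    # Note: the table name is interpolated directly; fine for an internal report,
    # but do not pass untrusted input here.
    cur = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE updated_date < datetime('now', ?)",
        (f"-{days} days",),
    )
    return cur.fetchone()[0]
```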
Not all data is important.
You may not be able to resist the temptation to store every piece of business-related data, and you may have many data sources: log files, marketing campaign data, ETL jobs, and so on. Understand that not all data is critical to the business, and keeping all of it in the data warehouse does not help. Filter out unwanted data at the source, before it ever reaches the data warehouse. Don't store everything; analyze only the data you need.
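Filtering at the source can be as simple as dropping records you will never analyze before they are loaded. The sketch below is illustrative only; the event names and the extract/load helpers mentioned in the comment are hypothetical:

```python
# Keep only the events the business actually analyzes; everything else is
# discarded before it reaches the warehouse.
NEEDED_EVENTS = {"purchase", "signup", "refund"}

def filter_events(records):
    """Yield only records whose event type the business cares about."""
    for record in records:
        if record.get("event") in NEEDED_EVENTS:
            yield record

# Usage (hypothetical helpers): load(filter_events(extract_from_logs("access.log")))
```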
Be careful about what data you collect
Take an online video editing business: do you need to save every action your users perform? Doing so can produce an enormous volume of data, and if you find your data warehouse cannot handle it, consider storing only the metadata. Video editing is an extreme example, but it should not stop us from considering the same point in other use cases.
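One way to store "only the metadata" is to collapse a stream of raw user actions into a small per-session summary. The sketch below is a rough illustration with made-up field names:

```python
from dataclasses import dataclass

@dataclass
class SessionMetadata:
    """Small summary record stored instead of every raw editing action."""
    user_id: str
    actions: int = 0
    seconds_active: float = 0.0

def summarize(events) -> SessionMetadata:
    """Collapse a non-empty list of raw UI events into one metadata record."""
    meta = SessionMetadata(user_id=events[0]["user_id"])
    for event in events:
        meta.actions += 1
        meta.seconds_active += event.get("duration", 0.0)
    return meta
```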
To sum up, collect only the data the business actually needs.
Intelligent analysis
Hire an analyst who understands the business
By now you should have a clear understanding of the importance of your data. Once you have done all of the above and still decide to use Hadoop, hiring an analyst who understands the business will be of great help.
If your data analysts do not know how to extract value from the data, Hadoop will have no effect. Don't skimp on investing in employees who deeply understand the business. Encourage them to experiment, to analyze the same data in new ways, and to find ways to profit from the infrastructure you already have.
Use statistical sampling for decision making
Statistical sampling is a very old technique that researchers and mathematicians use to draw reasonable conclusions from large volumes of data. Through sampling we can drastically reduce the data volume: instead of tracking billions or millions of data points, we track only thousands or hundreds. Although sampling does not give exact results, it provides a good high-level understanding of a large dataset.
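For a concrete feel, the sketch below estimates the mean of a large (synthetic) dataset from a random sample of 1,000 points instead of scanning all one million records:

```python
import random
import statistics

# Synthetic "large" dataset, for illustration only.
population = [random.gauss(100, 15) for _ in range(1_000_000)]

# Track a small random sample instead of every data point.
sample = random.sample(population, 1_000)
estimate = statistics.mean(sample)
stderr = statistics.stdev(sample) / (len(sample) ** 0.5)

print(f"estimated mean: {estimate:.2f} +/- {1.96 * stderr:.2f} (95% CI)")
```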
Improve your technology
Have you really reached the limits of your relational database?
Before exploring other options, check whether a relational database can keep handling the problem. Traditional relational databases have been around for a long time, and many organizations already use them to manage terabyte-scale data warehouses. So before moving to Hadoop, consider the following approaches.
Partition your data
Data partitioning is the logical or physical division of data into sections that are easier to maintain or access, and many popular open source relational databases support it (for example, MySQL partitioning and PostgreSQL partitioning).
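For example, PostgreSQL supports declarative range partitioning. The sketch below (assuming PostgreSQL 10+ and the psycopg2 driver; the table name and connection string are made up) creates an orders table partitioned by month:

```python
import psycopg2  # assumes a running PostgreSQL server and the psycopg2 package

# Declare a range-partitioned table so each month's orders live in their own
# partition; older partitions can later be detached or archived cheaply.
DDL = """
CREATE TABLE orders (
    id           bigint  NOT NULL,
    created_date date    NOT NULL,
    amount       numeric
) PARTITION BY RANGE (created_date);

CREATE TABLE orders_2023_01 PARTITION OF orders
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
"""

with psycopg2.connect("dbname=shop") as conn:  # hypothetical connection string
    with conn.cursor() as cur:
        cur.execute(DDL)
```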
Consider sharding on a traditional database
Database sharding is the last resort for pushing past the performance limits of a traditional relational database. It suits data that can be logically split across different nodes and that rarely needs cross-node joins. In web applications, sharding by user so that all of a user's information lives on the same node is a common way to improve performance.
Sharding has many limitations, so it is not suitable for every scenario; if a use case involves too many cross-node joins, sharding will not work.
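A common sharding pattern is to route each user to a node with a stable hash, so all of that user's data stays together. The following is a minimal illustrative sketch; the shard names are placeholders:

```python
import hashlib

# Placeholder shard identifiers (in practice these would be connection strings).
SHARDS = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]

def shard_for(user_id: str) -> str:
    """Map a user to a shard with a stable hash so routing never changes."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user-42"))  # the same user always lands on the same node
```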
Summary
Deploying Hadoop will cost a company an enormous amount of human and material resources. Achieving your goals by making better use of your existing infrastructure may be the wiser choice.