Data Analysis ≠ Hadoop + NoSQL


Hadoop has made big data analytics more popular, but deploying it still costs a great deal of manpower and resources. Have you pushed your existing technology to its limits before jumping straight to Hadoop? Here is a summary of ten alternatives to try before investing in Hadoop, which may save you time, money, and effort.

Betting your business on big data technology is tempting, and Apache Hadoop makes the temptation even stronger. Hadoop is a massively scalable data storage platform that forms the foundation of most big data projects. Yet for all its power, Hadoop demands a great deal of effort and resources from a company.

Applied in the right place, Hadoop really can transform your company's business, but the road to that point is full of thorns. On the other side, many businesses (not Google, Facebook, or Twitter, of course) simply do not have data volumes large enough to require a giant Hadoop cluster for analysis; they are merely attracted by the buzzword "big data".

As David Wheeler said, "All problems in computer science can be solved by another level of indirection," and Hadoop is exactly that kind of indirect solution. When your boss has been seduced by a few buzzwords, making the right software architecture decision can be tough.

Here are some alternatives to try before you invest in Hadoop:

1. Get to know your data

1) Total volume of data

Hadoop is an effective solution for large datasets.

    • HDFS stores files at the GB scale and above. If your files are only a few MB each, consider consolidating many of them (with zip or tar) into files of several hundred MB or a few GB, as sketched below.
    • HDFS splits files into blocks of 64 MB, 128 MB, or larger.
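
For illustration, here is a minimal Python sketch of the consolidation idea; the directory layout and file names are hypothetical:

    import tarfile
    from pathlib import Path

    SMALL_DIR = Path("logs/small")        # hypothetical directory full of small files
    ARCHIVE = Path("logs/bundle.tar.gz")  # one large archive instead of many tiny files

    # Bundle every small log file into a single compressed tar archive, so
    # storage such as HDFS sees one file of hundreds of MB rather than
    # thousands of files of a few KB each.
    with tarfile.open(ARCHIVE, "w:gz") as tar:
        for f in sorted(SMALL_DIR.glob("*.log")):
            tar.add(f, arcname=f.name)

    print(f"wrote {ARCHIVE} ({ARCHIVE.stat().st_size / 1e6:.1f} MB)")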

If your dataset is very small, adopting this giant ecosystem will not be a good fit. Knowing your data well enough means analyzing which types of queries you need and whether the data is really big enough to warrant it.

On the other hand, measuring volume by database size alone can mislead you, because computation can inflate the data: mathematical calculations or combinations over a small dataset can produce results far larger than the original data volume. The key, again, is a solid understanding of your data.
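
One simple way to ground that understanding is to actually measure the raw data. A minimal sketch, assuming the data lives under a local directory:

    import os

    def dataset_size_bytes(root: str) -> int:
        """Walk a directory tree and total the on-disk size of every file."""
        total = 0
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                total += os.path.getsize(os.path.join(dirpath, name))
        return total

    # "data/" is a hypothetical root; a result in the tens of GB rarely
    # justifies a dedicated Hadoop cluster on its own.
    print(f"raw dataset: {dataset_size_bytes('data/') / 1e9:.2f} GB")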

2) Growth rate of data

You may have terabytes of data in a data warehouse or other data sources, but before building a Hadoop cluster there is one more factor to take into account: the rate at which the data grows.

Ask your analyst a few simple questions, such as:

    • How fast is the data growing? Is the growth rate really that high?
    • What will the data volume be in a few months or a few years?

For many companies, data grows only modestly from year to year. In that case your data is not growing fast, so consider archiving and purging options instead of rushing straight to Hadoop.
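
The projection behind the second question is simple compound growth; a small sketch with hypothetical numbers:

    def projected_size_gb(current_gb: float, monthly_growth: float, months: int) -> float:
        """Compound growth: future = current * (1 + r) ** months."""
        return current_gb * (1 + monthly_growth) ** months

    # Hypothetical numbers: 500 GB today, growing 3% per month.
    for horizon in (6, 12, 24):
        print(horizon, "months:", round(projected_size_gb(500, 0.03, horizon), 1), "GB")

At 3% monthly growth, 500 GB becomes roughly 1 TB only after two years, which is still comfortably within relational-database territory.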

2. How to reduce the data to be processed

If you really do have a very large volume of data, consider cutting it down to a manageable size with the following options, each tested by industry for decades.

1) Consider archiving

Data archiving means moving outdated data into separate storage; how long data stays active should be driven by actual needs. This requires a very good understanding of the data and of how applications use it. For example, an e-commerce company might keep only the last three months of order data in the active database and store older orders separately.

The same approach works for your data warehouse: keep recent data available for reporting and querying, and move rarely accessed data to separate storage devices.
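
A minimal sketch of the idea using SQLite from Python; the table and column names (orders, created_date) are hypothetical:

    import sqlite3

    conn = sqlite3.connect("shop.db")  # hypothetical database file

    with conn:
        # Create an empty archive table with the same columns as the active one.
        conn.execute("CREATE TABLE IF NOT EXISTS orders_archive AS SELECT * FROM orders WHERE 0")
        # Copy orders older than three months into the archive...
        conn.execute("INSERT INTO orders_archive SELECT * FROM orders WHERE created_date < date('now', '-3 months')")
        # ...then remove them from the active table.
        conn.execute("DELETE FROM orders WHERE created_date < date('now', '-3 months')")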

2) Consider purging data

Sometimes we are so busy collecting data that we never ask how much of it we actually need to keep. Storing large amounts of unused data undoubtedly slows the processing of the data that matters. Work out your business requirements, examine which data can be deleted, and analyze which types of data you really need to store; this not only saves storage space but also speeds up analysis.

A frequently used best practice is to add audit columns to warehouse tables, such as created_date, created_by, updated_date, and updated_by. These columns make it possible to gather periodic access statistics, so you can tell how long each piece of data stays useful. The crux here is the data-cleanup logic itself, so remember to think before you act; with an archiving tool in place, the cleanup becomes very easy.
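
A hedged sketch of purge logic driven by such audit columns (the fact_events table is hypothetical); counting first and deleting second reflects the "think before you act" advice:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")  # hypothetical warehouse
    CUTOFF = "date('now', '-2 years')"      # retention window: a business decision

    with conn:
        # First, look before you leap: how many rows would the purge remove?
        stale = conn.execute(
            f"SELECT COUNT(*) FROM fact_events WHERE updated_date < {CUTOFF}"
        ).fetchone()[0]
        print(f"{stale} rows eligible for purge")
        # Only then delete the rows that have not been touched within the window.
        conn.execute(f"DELETE FROM fact_events WHERE updated_date < {CUTOFF}")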

3) Not all data is important

You may not be able to afford to store all of your business-related data. You probably have many data sources (log files, campaign data, ETL jobs, and so on), and not all of them are critical to the business; keeping everything in the data warehouse brings no benefit. Filter out unwanted data at the source, before it is ever stored in the warehouse. Don't store all of your data; analyze only the data you need.
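
As a sketch of source-side filtering (the file names and the event_type field are hypothetical), drop unwanted records before they are loaded:

    import csv

    WANTED_EVENTS = {"purchase", "signup"}  # the events the business reports on

    # Copy only the rows the business needs; debug and heartbeat noise never
    # reaches the warehouse load step.
    with open("raw_events.csv", newline="") as src, \
         open("clean_events.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["event_type"] in WANTED_EVENTS:
                writer.writerow(row)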

4) Be aware of what data you want to collect

If you run an online video editing service, do you really need to save every action your users take? Doing so can produce very large data volumes, and if you find that your data warehouse cannot handle them, consider storing only metadata. Video editing is an extreme example, but it does not stop us from weighing the same consideration in other use cases.

All in all, collect only the data the business actually needs.
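
To make the metadata idea concrete, a minimal sketch that collapses a raw action stream into a compact per-session summary (all field names are hypothetical):

    from collections import Counter

    def summarize_session(actions):
        """Collapse a session's raw action stream into a few metadata fields."""
        kinds = Counter(a["type"] for a in actions)
        return {
            "session_id": actions[0]["session_id"],
            "action_count": len(actions),
            "top_action": kinds.most_common(1)[0][0],
        }

    raw = [
        {"session_id": "s1", "type": "cut"},
        {"session_id": "s1", "type": "cut"},
        {"session_id": "s1", "type": "export"},
    ]
    print(summarize_session(raw))  # {'session_id': 's1', 'action_count': 3, 'top_action': 'cut'}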

3. Intelligent analysis

1) Hire analysts who understand the business

By now you should have a clear sense of how important it is to know your data. So once you have worked through all the steps above and decided that Hadoop really is the answer, hiring analysts who know the business will help you enormously.

If data analysts cannot derive value from the data, Hadoop will achieve nothing, so don't skimp on investing in employees with deep business knowledge. Encourage them to experiment and to analyze the same data in new ways, looking for opportunities to monetize the existing infrastructure.

2) Use statistical sampling for decision making

Statistical sampling is a very old technique that researchers and statisticians have long used to draw reasonable conclusions from large volumes of data. Sampling drastically reduces the volume we have to process: instead of tracking billions or millions of data points, we track only thousands or hundreds of them. The results are not perfectly accurate, but they provide a high-level understanding of a large dataset.
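
A minimal, self-contained sketch with a synthetic population; the sampled mean lands close to the full-scan mean at a tiny fraction of the work:

    import random

    random.seed(42)  # reproducible demo
    # Synthetic population standing in for a huge dataset.
    population = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

    sample = random.sample(population, 1_000)  # track thousands, not millions
    estimate = sum(sample) / len(sample)
    truth = sum(population) / len(population)

    print(f"sampled mean: {estimate:.2f}   full-scan mean: {truth:.2f}")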


4. Upgrading technology

Have you really reached the limits of relational database processing?

Before exploring other territory, check whether a relational database can keep handling the problem. Traditional relational databases have been around for a long time, and many organizations already use them to manage multi-terabyte data warehouses. So before moving to Hadoop, consider the following approaches.

1) Partition the data

Partitioning divides data logically or physically into parts that are easier to maintain and access, and many popular open-source relational databases support it natively (for example, MySQL partitioning and PostgreSQL partitioning).
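
For reference, a sketch of what range partitioning looks like in MySQL; the table and column names are hypothetical, and the DDL is held in a Python string rather than executed:

    # MySQL range-partitioning DDL (supported since MySQL 5.1).
    PARTITIONED_ORDERS_DDL = """
    CREATE TABLE orders (
        id INT NOT NULL,
        created_date DATE NOT NULL
    )
    PARTITION BY RANGE (YEAR(created_date)) (
        PARTITION p2011 VALUES LESS THAN (2012),
        PARTITION p2012 VALUES LESS THAN (2013),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );
    """
    print(PARTITIONED_ORDERS_DDL)

Queries that filter on created_date can then touch only the relevant partitions instead of scanning the whole table.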

2) Consider database sharding on a traditional database

Database sharding is the last trick for pushing the performance limits of a traditional relational database. It suits situations where data can be logically split across different nodes and cross-node joins are rare. In web applications, sharding by user, storing each user's related information on the same node, is a common way to improve performance.

Sharding comes with many constraints, so it is not suitable for every scenario; if a use case involves too many cross-node joins, sharding will not help at all.
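
A minimal sketch of the user-based routing idea (node names are hypothetical): hash the user id so that all of a user's rows land on the same node and single-user queries never cross shards:

    NODES = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]  # hypothetical shard nodes

    def shard_for(user_id: int) -> str:
        """Route a user to a shard; the mapping is stable for a fixed node count."""
        return NODES[user_id % len(NODES)]

    for uid in (7, 8, 9, 7):
        print(uid, "->", shard_for(uid))  # uid 7 maps to the same node every time

Note that simple modulo routing makes adding nodes painful, since most keys remap; production systems usually prefer consistent hashing or a lookup directory.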

Summary

Deploying Hadoop will cost a company a huge amount of manpower and resources. If you can reach your goal by upgrading the existing infrastructure instead, that is often the better idea.

Original: http://www.csdn.net/article/2013-07-19/2816277-hadoop-when-to-use
