Big data technology is an attractive proposition for businesses, and Hadoop, the massively scalable data storage platform at the heart of most big data projects, makes the temptation even stronger. But Hadoop is powerful, not turnkey: it requires companies to invest heavily in learning and other resources.
Properly implemented, Hadoop can radically improve your company's business, but the road to that point is full of thorns. And many businesses (unlike Google, Facebook, or Twitter, of course) simply do not have data at the scale that demands gigantic analytics clusters.
As David Wheeler put it, "All problems in computer science can be solved by another level of indirection," and Hadoop is exactly that kind of indirection; making the right software architecture decision will be very tough while your boss is being drawn in by the buzzwords.
Here are some alternatives to try before investing in Hadoop:
Know your data
The total volume of data
Hadoop is an effective solution for large data sets.
HDFS is built for large files: it splits each file into blocks of 64 MB, 128 MB, or more, so it works best with files at the GB level. If your files are only MB-sized, it is a good idea to consolidate several of them (via zip or tar) into files of hundreds of megabytes or a few gigabytes.
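As an illustration, here is a minimal Python sketch of that consolidation step, assuming a hypothetical directory of small files and a target archive size of about 512 MB:

```python
# A minimal sketch of consolidating many small files into large archives
# before loading them into HDFS. The paths and target size are assumptions.
import os
import tarfile

SOURCE_DIR = "/data/incoming"        # hypothetical directory of small files
TARGET_SIZE = 512 * 1024 * 1024      # aim for roughly 512 MB per archive

def write_archive(paths, archive_no: int) -> None:
    # One tar file per batch; HDFS will then split it into large blocks.
    with tarfile.open(f"batch_{archive_no:04d}.tar", "w") as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))

def consolidate(source_dir: str, target_size: int) -> None:
    batch, batch_bytes, archive_no = [], 0, 0
    for name in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, name)
        if not os.path.isfile(path):
            continue
        batch.append(path)
        batch_bytes += os.path.getsize(path)
        if batch_bytes >= target_size:
            write_archive(batch, archive_no)
            batch, batch_bytes, archive_no = [], 0, archive_no + 1
    if batch:
        write_archive(batch, archive_no)  # flush the final partial batch

if __name__ == "__main__":
    consolidate(SOURCE_DIR, TARGET_SIZE)
```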
If your data set is very small, this giant ecosystem is not a good fit. You need a solid understanding of your own data, of the kinds of queries you will run, and of whether your data is really that large.
On the other hand, measuring data volume through the database alone can be misleading, because your computations may be very demanding: sometimes calculations over, or combinations of, a small data set produce results far larger than the original data. So the point stands that you need a solid understanding of your data.
The speed of data growth
You may have several terabytes of data in a data warehouse or other data sources, but one factor to consider before building a Hadoop cluster is how fast that data grows.
Ask your analyst a few simple questions, such as:
How fast is the data growing? Is the growth rate really that high?
What will the data volume be a few months or years from now?
For many companies, data growth is measured in years. In that case the growth rate is not really fast, so it is recommended to consider archiving and purging options rather than jumping straight to Hadoop.
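A back-of-the-envelope projection is often enough to answer these questions. Below is a minimal Python sketch; the current size, the monthly growth rate, and the 50 TB pain threshold are all assumed numbers:

```python
# A minimal sketch of projecting data growth to decide whether Hadoop is
# warranted. All figures here are assumptions for illustration.
def months_until(current_gb: float, monthly_growth: float, threshold_gb: float) -> int:
    """Return the number of months until the data volume crosses the threshold."""
    months = 0
    while current_gb < threshold_gb:
        current_gb *= 1 + monthly_growth
        months += 1
    return months

# E.g. 2 TB today, growing 5% per month, with 50 TB as an assumed point
# where a single relational database starts to hurt:
print(months_until(2_000, 0.05, 50_000), "months")  # -> 66 months
```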
How to reduce the data to be processed
If you do have a very large volume of data, consider reducing it to a manageable size with the following options, which the industry has tested for decades.
Consider archiving
Data archiving means storing expired data separately, with the retention period set by actual demand. It requires a good understanding of the data and of how applications use it. For example, an e-commerce company might keep only the last three months of orders in the active database and move older orders to a separate store.
The same approach can be applied to your data warehouse: keep recent data available for reporting and queries, and move less frequently used data to a separate storage device.
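A minimal sketch of that archiving pattern, using SQLite for illustration; the orders and orders_archive tables and the three-month window are assumptions:

```python
# A minimal sketch of the archiving pattern described above. Orders older
# than three months are moved from the active table to an archive table.
import sqlite3

conn = sqlite3.connect("shop.db")
with conn:  # commits on success, rolls back on error
    conn.execute("""
        INSERT INTO orders_archive
        SELECT * FROM orders
        WHERE order_date < date('now', '-3 months')
    """)
    conn.execute(
        "DELETE FROM orders WHERE order_date < date('now', '-3 months')"
    )
conn.close()
```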
Consider clearing the data
Sometimes we busily save data without knowing exactly how much of it needs to be kept. Storing unusually large amounts of data will undoubtedly slow down the processing of your valid data. Understanding your business needs, examining whether data can be deleted, and analyzing what types of data you need to store will not only save storage space but also speed up analysis of the valid data.
An often-used best practice is to add extra columns to data warehouse tables, such as created_date, created_by, update_date, and updated_by. These columns let you gather access statistics over time, so you can purge data that has passed its validity period. Pay close attention to the purge logic: think it through before implementing it. If you use an archiving tool, removing the data becomes very easy.
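For instance, a purge keyed on the update_date column might look like the following minimal sketch; the facts table, the two-year retention window, and SQLite itself are assumptions:

```python
# A minimal sketch of a purge based on the audit columns mentioned above.
import sqlite3

RETENTION_DAYS = 730  # assumed two-year validity period

conn = sqlite3.connect("warehouse.db")
with conn:
    # Delete rows that have not been updated within the retention window.
    conn.execute(
        "DELETE FROM facts WHERE update_date < date('now', ?)",
        (f"-{RETENTION_DAYS} days",),
    )
conn.close()
```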
Not all the data is important
You may be unable to resist the temptation to store all business-related data, and you may have many data sources: log files, marketing campaign data, ETL jobs, and so on. Understand that not all data is critical to your business, and that storing everything in the data warehouse is not beneficial. Filter out unneeded data at the source, before it is ever stored in the warehouse. Do not store all the data; store only the data you need.
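A minimal sketch of filtering at the source, assuming log events arrive as one JSON object per line and that only a few event types matter to the business:

```python
# A minimal sketch of source-side filtering: keep only the events the
# business needs before anything reaches the warehouse. The event names
# and the one-JSON-object-per-line input format are assumptions.
import json
import sys

NEEDED_EVENTS = {"purchase", "signup", "refund"}  # assumed critical events

def filter_events(lines):
    for line in lines:
        event = json.loads(line)
        if event.get("type") in NEEDED_EVENTS:
            yield event  # everything else is dropped at the source

if __name__ == "__main__":
    for event in filter_events(sys.stdin):
        print(json.dumps(event))
```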
Pay attention to what data you collect
Take an online video editing business: do you really need to save every operation a user performs? That could produce an enormous data volume, and if you find that your data warehouse cannot cope with it, you might consider storing only metadata, as in the sketch below. Video editing is an extreme example, but it should not stop us from weighing the same trade-off in other use cases.
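A minimal sketch of the metadata-only idea; the fields here are invented for illustration:

```python
# A minimal sketch of storing per-session metadata instead of every raw
# user operation in the video editing example. All fields are assumptions.
from dataclasses import dataclass

@dataclass
class EditSessionMetadata:
    user_id: str
    video_id: str
    started_at: str       # ISO timestamp
    duration_sec: int
    operation_count: int  # how many edits, not the edits themselves

# One compact row per session, instead of millions of raw operation events:
session = EditSessionMetadata("user-42", "vid-9", "2024-01-15T10:00:00", 1800, 412)
print(session)
```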
All in all, collect only the data the business actually needs.
Insight
Hire analysts who understand the business
By now the importance of knowing your data should be clear. So if, after all the steps above, you still decide to use Hadoop, hiring analysts who understand the business will be of great help.
Hadoop will accomplish nothing if your data analysts do not know how to extract value from it, so invest in employees who have a deep understanding of the business. Encourage them to experiment more, to analyze the same data in new ways, and to find ways to monetize the existing infrastructure.
Use statistical sampling for decision making
Statistical sampling is a very old technique that researchers and mathematicians use to draw reasonable conclusions from large volumes of data. It can drastically reduce the data volume: instead of tracking billions or millions of data points, you track only thousands or hundreds. The results will not be exact, but they give a good high-level understanding of a large data set.
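A minimal sketch of the idea, using made-up data: estimate the mean of a million-point data set from a random sample of a thousand points:

```python
# A minimal sketch of statistical sampling: estimate the mean of a large
# data set from a small random sample. The data here is synthetic.
import random
import statistics

population = [random.gauss(100, 15) for _ in range(1_000_000)]  # stand-in for big data

sample = random.sample(population, 1_000)  # track thousands, not millions
estimate = statistics.mean(sample)
true_mean = statistics.mean(population)

print(f"sample estimate: {estimate:.2f}, true mean: {true_mean:.2f}")
```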
Upgrade your existing technology
Have you really reached the limits of relational databases?
Before exploring other options, you should examine whether a relational database can still handle the problem. Traditional relational databases have been in use for a long time, and many organizations already use them to manage multi-terabyte data warehouses. So before moving to Hadoop, consider the following methods.
Partition the data
Partitioning divides data, logically or physically, into parts that are easier to maintain and access, and many popular open source relational databases support it (such as MySQL Partitioning and Postgres Partitioning).
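As an illustration, a minimal sketch of declarative range partitioning in Postgres syntax (Postgres 10 and later); the orders table layout is an assumption, and the DDL is printed rather than executed:

```python
# A minimal sketch of Postgres declarative range partitioning. The table
# and its columns are invented for illustration; run the printed DDL
# against a Postgres 10+ server with e.g. psql.
PARTITIONED_ORDERS = """
CREATE TABLE orders (
    id         bigint,
    order_date date NOT NULL,
    total      numeric
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2023 PARTITION OF orders
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
"""

if __name__ == "__main__":
    print(PARTITIONED_ORDERS)
```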
Consider sharding a traditional database
Database sharding is the last resort for squeezing performance out of a traditional relational database. It suits situations where the data can be logically sharded across different nodes and cross-node joins are rare. In web applications, sharding by user and storing all of a user's related information on the same node is a common way to improve performance.
Sharding has many limitations, so it is not suitable for every scenario; in use cases with too many cross-node joins, sharding will not help at all.
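A minimal sketch of user-based shard routing; the node addresses are hypothetical:

```python
# A minimal sketch of user-based shard routing: hash the user id to pick
# a node so that all of a user's rows live together. Nodes are assumptions.
import hashlib

SHARDS = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]  # hypothetical nodes

def shard_for(user_id: str) -> str:
    # A stable hash keeps the same user on the same node across restarts
    # (unlike Python's built-in hash(), which is salted per process).
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user-42"))    # every query for user-42 goes to this node
print(shard_for("user-1337"))
```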
To sum up
Deploying Hadoop will cost a company enormous manpower and material resources; if you can reach your goal by upgrading existing infrastructure instead, that is also a sound strategy.