Big Data Streaming: A Problem That Can't Be Ignored

My last post, "Two Modes of Big Data Processing", discussed at length the memory-based streaming model and the disk-based storage model. Comparing the two: since memory performance exceeds the disk's by orders of magnitude, stream processing is far more efficient than storage processing. But streaming carries a drawback, a worry really, that I left out last time and will cover today.
It all comes back to the fundamentals of data processing: memory, storage, and the data itself.
As we all know, a big data cluster is a group of computers connected by a network. Each computer has a CPU, memory, and hard disks, and the computers exchange data over the network to carry out distributed computing. Following its scheduling rules, the cluster runs batches of distributed computing tasks concurrently, and each task handles a different volume of data: a few hundred megabytes, several hundred gigabytes, sometimes terabytes (the Laxcus clusters we deploy routinely process terabyte-scale data). So what happens if, at some moment, the cluster is hit with a computing task so large that the data, scattered across different computers for processing, exceeds the cluster's total memory capacity?
In storage mode, this problem is easy to solve: use the hard disk as a cache. When data arrives, check its size; if it is too large, or its turn for processing has not yet come, park it on disk first. After all, hard disks now reach terabytes at little cost, and one computer can mount several of them, so the available storage space is far larger than memory.
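To make the disk-cache idea concrete, here is a minimal Java sketch. Every name in it is an illustrative assumption, not Laxcus code: chunks that fit under a memory budget stay in memory, and anything beyond the budget is appended to a temporary spill file to be processed later.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Illustrative only: incoming chunks stay in memory while they fit the
    // budget; the overflow is cached on disk, as storage mode does.
    public class SpillBuffer {
        private final long memoryBudget;      // bytes allowed in memory
        private long inMemoryBytes = 0;
        private final ByteArrayOutputStream memory = new ByteArrayOutputStream();
        private final Path spillFile;         // disk cache for the overflow

        public SpillBuffer(long memoryBudget) throws IOException {
            this.memoryBudget = memoryBudget;
            this.spillFile = Files.createTempFile("spill-", ".dat");
        }

        // Keep the chunk in memory if it fits; otherwise append it to disk.
        public void accept(byte[] chunk) throws IOException {
            if (inMemoryBytes + chunk.length <= memoryBudget) {
                memory.write(chunk);
                inMemoryBytes += chunk.length;
            } else {
                try (OutputStream out = Files.newOutputStream(
                        spillFile, StandardOpenOption.APPEND)) {
                    out.write(chunk);
                }
            }
        }
    }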
In streaming mode, the problem is a real bind. If data is parked on disk and processed later, streaming is no different from storage mode. If it is not, then once the data outgrows the available memory, memory overflows and data is lost. If any one computer in the cluster fails this way, the entire distributed computing task fails with it.
One way to alleviate the problem is to upgrade the computers: switch to 64-bit CPUs and install more memory. The reasoning: a 32-bit computer can address at most 4 GB of memory, so how many 32-bit machines would a cluster need for a task of several terabytes? A 64-bit computer can hold far more memory, so far fewer machines are needed. A side note: memory is much cheaper than it used to be, but per unit of capacity it is still far more expensive than disk, a cost that most operators will care about. And this is only a stopgap anyway; nobody knows when the next huge computing task will arrive, or how many such oversized tasks there will be.
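A quick back-of-the-envelope calculation shows the scale of the difference. The figures below are assumptions for illustration: 4 GB is the 32-bit addressing ceiling, while the task size and the 128 GB per 64-bit node are simply plausible numbers.

    // Illustrative node-count arithmetic; all figures are assumptions.
    public class NodeEstimate {
        public static void main(String[] args) {
            long taskBytes = 4L << 40;    // assume a 4 TB computing task
            long mem32 = 4L << 30;        // 32-bit node: 4 GB addressable memory
            long mem64 = 128L << 30;      // 64-bit node: e.g. 128 GB installed

            System.out.println("32-bit nodes: " + ceilDiv(taskBytes, mem32));  // 1024
            System.out.println("64-bit nodes: " + ceilDiv(taskBytes, mem64));  // 32
        }

        static long ceilDiv(long a, long b) { return (a + b - 1) / b; }
    }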
A more reliable solution is to weigh the data volume against cluster memory before the task runs. When a computing task arrives, determine the maximum amount of data it carries. If the cluster has enough memory, "pre-allocate" that memory to the task (apportioned across the individual computers). If not, make the task wait: once other tasks finish and their memory is recycled, and the freed memory is sufficient, let it run. A second approach borrows from storage mode: stage the data on disk first, then do the work once memory is available. Both methods, of course, reduce streaming's computational efficiency, but there is no way around that; it still beats a memory overflow and a failed task.
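As a rough sketch of this "assess first" idea, the cluster-wide memory budget can be modeled as a counting semaphore: a task must reserve its declared maximum before it starts and blocks until earlier tasks release enough memory. The class and method names and the megabyte-granularity budget are my assumptions for illustration, not a Laxcus API.

    import java.util.concurrent.Semaphore;

    // Illustrative admission control: the semaphore's permits stand in
    // for the cluster's free memory, counted in megabytes.
    public class MemoryAdmission {
        private final Semaphore budgetMb;

        public MemoryAdmission(int clusterMemoryMb) {
            budgetMb = new Semaphore(clusterMemoryMb, true);  // fair: wait in order
        }

        // Pre-allocate the task's declared maximum, waiting if memory is short.
        public void admit(int taskMaxMb) throws InterruptedException {
            budgetMb.acquire(taskMaxMb);
        }

        // Non-blocking variant: on failure the caller can fall back to the
        // storage-mode path and stage the task's data on disk instead.
        public boolean tryAdmit(int taskMaxMb) {
            return budgetMb.tryAcquire(taskMaxMb);
        }

        // Recycle the memory when the task finishes.
        public void finish(int taskMaxMb) {
            budgetMb.release(taskMaxMb);
        }
    }

The fair semaphore makes waiting tasks run in arrival order, matching the first approach above; tryAdmit corresponds to the second approach, where the scheduler parks data on disk rather than blocking.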
In summary, streaming is a high-cost, high-efficiency computing model. If you are a big spender like BAT, with silver enough to chase raw data-processing performance and no qualms about pouring money into infrastructure, then outfit your machines with powerful CPUs, plenty of memory, hard disks or SSDs, and 10-gigabit fiber networking, and streaming is the choice for you. If you are strapped for cash, your computers are weak, you are holding a pile of old 32-bit boxes (one of our Laxcus clusters still runs Pentium III Tualatin chips, because they sip electricity and are old but still going strong), memory is tight, the network is mediocre, the power bill matters, and you don't much care whether the computation is fast or slow, then settle down and consider storage mode instead.
