Big Data Streaming: A Problem That Can't Be Ignored

My last post, "Two Modes of Big Data Processing", discussed at length the memory-based streaming model and the disk-based storage model. Comparing the two: since memory performance exceeds the disk's by orders of magnitude, stream processing is far more efficient than storage processing. But streaming carries a drawback, a worry really, that I left out last time and will cover today.
It all comes back to the fundamentals of data processing: memory, storage, and the data itself.
As we all know, a big data cluster is a group of computers connected by a network. Each computer has a CPU, memory, and hard disks, and the computers exchange data over the network to carry out distributed computing. Following its scheduling rules, the cluster runs batches of distributed computing tasks concurrently, and each task handles a different volume of data: a few hundred megabytes, several hundred gigabytes, sometimes terabytes (the Laxcus clusters we deploy routinely process terabyte-scale data). So what happens if, at some moment, the cluster is hit with a computing task so large that the data, scattered across different computers for processing, exceeds the cluster's total memory capacity?
In storage mode, this problem is easy to solve: use the hard disk as a cache. When data arrives, check its size; if it is too large, or its turn for processing has not yet come, park it on disk first. After all, hard disks now reach terabytes at little cost, and one computer can mount several of them, so the available storage space is far larger than memory.
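To make the disk-cache idea concrete, here is a minimal Java sketch. Every name in it is an illustrative assumption, not Laxcus code: chunks that fit under a memory budget stay in memory, and anything beyond the budget is appended to a temporary spill file to be processed later.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Illustrative only: incoming chunks stay in memory while they fit the
    // budget; the overflow is cached on disk, as storage mode does.
    public class SpillBuffer {
        private final long memoryBudget;      // bytes allowed in memory
        private long inMemoryBytes = 0;
        private final ByteArrayOutputStream memory = new ByteArrayOutputStream();
        private final Path spillFile;         // disk cache for the overflow

        public SpillBuffer(long memoryBudget) throws IOException {
            this.memoryBudget = memoryBudget;
            this.spillFile = Files.createTempFile("spill-", ".dat");
        }

        // Keep the chunk in memory if it fits; otherwise append it to disk.
        public void accept(byte[] chunk) throws IOException {
            if (inMemoryBytes + chunk.length <= memoryBudget) {
                memory.write(chunk);
                inMemoryBytes += chunk.length;
            } else {
                try (OutputStream out = Files.newOutputStream(
                        spillFile, StandardOpenOption.APPEND)) {
                    out.write(chunk);
                }
            }
        }
    }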
In streaming mode, the problem is a real bind. If data is parked on disk and processed later, streaming is no different from storage mode. If it is not, then once the data outgrows the available memory, memory overflows and data is lost. If any one computer in the cluster fails this way, the entire distributed computing task fails with it.
One way to alleviate the problem is to upgrade the computers: switch to 64-bit CPUs and install more memory. The reasoning: a 32-bit computer can address at most 4 GB of memory, so how many 32-bit machines would a cluster need for a task of several terabytes? A 64-bit computer can hold far more memory, so far fewer machines are needed. A side note: memory is much cheaper than it used to be, but per unit of capacity it is still far more expensive than disk, a cost that most operators will care about. And this is only a stopgap anyway; nobody knows when the next huge computing task will arrive, or how many such oversized tasks there will be.
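A quick back-of-the-envelope calculation shows the scale of the difference. The figures below are assumptions for illustration: 4 GB is the 32-bit addressing ceiling, while the task size and the 128 GB per 64-bit node are simply plausible numbers.

    // Illustrative node-count arithmetic; all figures are assumptions.
    public class NodeEstimate {
        public static void main(String[] args) {
            long taskBytes = 4L << 40;    // assume a 4 TB computing task
            long mem32 = 4L << 30;        // 32-bit node: 4 GB addressable memory
            long mem64 = 128L << 30;      // 64-bit node: e.g. 128 GB installed

            System.out.println("32-bit nodes: " + ceilDiv(taskBytes, mem32));  // 1024
            System.out.println("64-bit nodes: " + ceilDiv(taskBytes, mem64));  // 32
        }

        static long ceilDiv(long a, long b) { return (a + b - 1) / b; }
    }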
A more reliable solution is to weigh the data volume against cluster memory before the task runs. When a computing task arrives, determine the maximum amount of data it carries. If the cluster has enough memory, "pre-allocate" that memory to the task (apportioned across the individual computers). If not, make the task wait: once other tasks finish and their memory is recycled, and the freed memory is sufficient, let it run. A second approach borrows from storage mode: stage the data on disk first, then do the work once memory is available. Both methods, of course, reduce streaming's computational efficiency, but there is no way around that; it still beats a memory overflow and a failed task.
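As a rough sketch of this "assess first" idea, the cluster-wide memory budget can be modeled as a counting semaphore: a task must reserve its declared maximum before it starts and blocks until earlier tasks release enough memory. The class and method names and the megabyte-granularity budget are my assumptions for illustration, not a Laxcus API.

    import java.util.concurrent.Semaphore;

    // Illustrative admission control: the semaphore's permits stand in
    // for the cluster's free memory, counted in megabytes.
    public class MemoryAdmission {
        private final Semaphore budgetMb;

        public MemoryAdmission(int clusterMemoryMb) {
            budgetMb = new Semaphore(clusterMemoryMb, true);  // fair: wait in order
        }

        // Pre-allocate the task's declared maximum, waiting if memory is short.
        public void admit(int taskMaxMb) throws InterruptedException {
            budgetMb.acquire(taskMaxMb);
        }

        // Non-blocking variant: on failure the caller can fall back to the
        // storage-mode path and stage the task's data on disk instead.
        public boolean tryAdmit(int taskMaxMb) {
            return budgetMb.tryAcquire(taskMaxMb);
        }

        // Recycle the memory when the task finishes.
        public void finish(int taskMaxMb) {
            budgetMb.release(taskMaxMb);
        }
    }

The fair semaphore makes waiting tasks run in arrival order, matching the first approach above; tryAdmit corresponds to the second approach, where the scheduler parks data on disk rather than blocking.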
In summary, streaming is a high-cost, high-efficiency computing model. If you are a big spender like BAT, with silver enough to chase raw data-processing performance and no qualms about pouring money into infrastructure, then outfit your machines with powerful CPUs, plenty of memory, hard disks or SSDs, and 10-gigabit fiber networking, and streaming is the choice for you. If you are strapped for cash, your computers are weak, you are holding a pile of old 32-bit boxes (one of our Laxcus clusters still runs Pentium III Tualatin chips, because they sip electricity and are old but still going strong), memory is tight, the network is mediocre, the power bill matters, and you don't much care whether the computation is fast or slow, then settle down and consider storage mode instead.
