Stop talking about Hadoop, plus 4 practices for building data pipelines

Source: Internet
Author: User

Today, the concept of big data has flooded the entire IT community. Products touting big data technology are everywhere, and tools for processing big data are springing up like bamboo shoots after rain. At the same time, a product that does not ride the coattails of big data, or an organization that has not yet adopted Hadoop, Spark, Impala, Storm, and other high-profile tools, risks being written off as yesterday's news. But do you really need Hadoop to process your data? Does the data your business handles really call for big data technology?

Since we are talking about big data, let's start with "big", that is, the volume of data. On CSDN there is an article shared by chief editor Liu: "Don't be ridiculous, your data isn't big enough". The article comes from Chris Stucchio, a data scientist with years of experience who was a postdoctoral fellow at New York University's Courant Institute, worked on a high-frequency trading platform, served as CTO of a start-up, and prefers to call himself a statistician. Let's look at his view:

Hadoop is simply a tool for running general-purpose computations, and it constrains how you write them: every computation must be expressed as a map, a group by, and an aggregate, or as a sequence of such steps. It's like wearing a straitjacket, but because Hadoop and big data are buzzwords, half the world wants to put one on even when they don't need it. So does your data really need Hadoop?
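To make the map, group by, aggregate constraint concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names and the purchase records are hypothetical and only illustrate the shape every computation has to take.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for user, amount in records:
        yield user, amount


def group_by_key(pairs):
    """Group by: collect every value emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def aggregate(groups):
    """Aggregate: reduce each group's values to a single result."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


# Hypothetical purchase records: (user, amount).
purchases = [("alice", 30), ("bob", 12), ("alice", 5), ("carol", 7), ("bob", 8)]
totals = aggregate(group_by_key(map_phase(purchases)))
print(totals)  # {'alice': 35, 'bob': 20, 'carol': 7}
```

Anything that does not fit this three-step shape has to be rewritten as a chain of such steps, which is the restriction the author is complaining about.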

1. Hundreds of megabytes of data, too much for Excel! This is nowhere near "big". A tool like pandas handles it easily: it can load hundreds of megabytes into memory, and NumPy can finish hundreds of millions of floating-point operations in the blink of an eye.

2. Data volume up to 10 GB! This is still not big data. A current laptop can be upgraded to 16 GB of memory, and many tools do not need to load all the data into memory at once.

3. 100 GB / 500 GB / 1 TB! A 2 TB hard drive is cheap: buy one, swap it in, and then simply install PostgreSQL or a similar database.
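The second item above notes that many tools never load a whole dataset into memory. A minimal sketch of that idea, using only Python's standard csv module, is shown below. The column name and the file name events.csv are hypothetical, and the tiny in-memory sample stands in for a multi-gigabyte file.

```python
import csv
import io


def mean_by_streaming(lines, column):
    """Compute the mean of one numeric column, holding one row in memory at a time."""
    reader = csv.DictReader(lines)
    count = 0
    total = 0.0
    for row in reader:  # rows are parsed lazily, one per iteration
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


# Stand-in for a large file; with real data you would pass
# open("events.csv", newline="") instead (hypothetical file name).
sample = io.StringIO("user,latency_ms\nalice,120\nbob,80\ncarol,100\n")
print(mean_by_streaming(sample, "latency_ms"))  # 100.0
```

Memory use stays constant no matter how long the file is, which is why a 10 GB file does not require 10 GB of RAM for this kind of computation.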

Compared with scripts written in a language like Python, Hadoop offers no advantage in programming, and it is typically slower than other technologies because of the overhead of moving data across nodes. You only really need to wrestle with Hadoop when your data exceeds 5 TB.
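The rule of thumb above can be written down as a small lookup. This is only a sketch: the source gives no guidance for data between 1 TB and 5 TB, so folding that range into the PostgreSQL tier is an assumption, as are the function name and the exact cut-offs.

```python
GB = 10**9
TB = 10**12


def suggest_tool(size_bytes):
    """Map a dataset size to the kind of tool the list above suggests."""
    if size_bytes < 1 * GB:
        return "pandas / NumPy in memory"
    if size_bytes <= 10 * GB:
        return "laptop with more RAM, or streaming tools"
    if size_bytes <= 5 * TB:
        # Assumption: the 1 TB to 5 TB gap is treated as single-machine territory.
        return "single machine with a large disk and PostgreSQL"
    return "Hadoop"


for size in (300 * 10**6, 8 * GB, 500 * GB, 6 * TB):
    print(size, "->", suggest_tool(size))
```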

Chris judges whether you have big data, and whether you really need big data technology, by the volume of your data. But big data is also characterized by velocity, variety, and value, so next let's look at MongoHQ's article "Big data beyond the big stuff":

MongoHQ: Don't dismiss other approaches because of the benefits of big data

To borrow the classic quote from The Hitchhiker's Guide to the Galaxy: "Big data is big. You just won't believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think there's a lot of data in Wikipedia, but that's just peanuts to big data." This is also the misconception many people fall into with big data: they assume from the start that they must use big data technology, when their data is in fact far from big. So how did we arrive at big data?

Back in the 1990s, people realized that storing data digitally was much cheaper than storing it on paper, and once something is cheap enough, it becomes the inevitable choice. People instinctively keep all their data, because "we may need it in the future", and when storage is so cheap, why not?

Ever since the 1990 article "Saving All the Bits" in American Scientist, scientists have had to face the challenge of saving all their data. In it, Peter Denning described the challenge NASA faced in saving all the data from the Hubble Space Telescope: the instrument generated enough data to fill 2,500 CDs per day, which not only overwhelmed the capacity of networks and storage devices but also exceeded "human understanding". But let's not lose sight of the fact that, with today's storage technology and economics, those 2,500 CDs amount to no more than a hard drive that costs about 100 dollars, and most of us do not seem to need to store as much data as a space telescope produces.
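The comparison between 2,500 CDs and one inexpensive hard drive is easy to check with back-of-the-envelope arithmetic. The sketch below assumes 700 MB per CD, which is a typical capacity but not a figure given in the source.

```python
CD_CAPACITY_MB = 700  # assumption: a typical CD; the source states no capacity
CDS_PER_DAY = 2500    # figure quoted in the text

total_mb = CDS_PER_DAY * CD_CAPACITY_MB
total_tb = total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB
print(f"{total_tb:.2f} TB per day")  # 1.75 TB per day
```

Under that assumption, a day's worth of data fits on a single 2 TB drive.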

The limited value of big data

Today we can store virtually any data that has an obvious business purpose, such as credit card sales and surveys. We can also store all the data whose business purpose is not obvious, such as user behavior on a web page, the TV channels that cable set-top box users watch, or the use of network-connected light switches and doors. But the value of this latter kind of data is undoubtedly very low.

A credit card transaction carries a lot of data, such as information about the person, the location, and the amount, and you capture it naturally during the sales cycle. The behavior of a user on a website is clearly less valuable: you may collect the URLs users visit and the time they spend reading a page, but these records are not as rich as credit card transactions. Of course, if you want to classify your users, these records do have some value.

However, the cost of storage keeps falling, and the more data you have, the more value you can derive by analyzing trends in it. A single channel change by a TV viewer does not matter, but if you combine this data with ad scheduling data into an aggregated dataset, you can clearly understand viewer behavior, which provides valuable insight to advertisers and program planners.

Similarly, the information collected by a smart home system is of even lower value. You may only get some event and state information, and the system may produce a lot of data whose value emerges only after extensive filtering, screening, and other processing. The biggest challenge of big data is extracting information from a mass of fragments, or working through a large volume of data, peeling away layer after layer to find the truth. Note that this is not finding a needle in a haystack, but drawing qualitative conclusions from a pile of needles.

Hot data vs. Big Data

The usual argument for needing big data technology is that you have not only a lot of data but also a large number of requests to access it, and big data seems to meet that demand.

Big data is more likely to be cold data, that is, data you do not access frequently and may never use again except for analysis. It may soon be replaced by fresh cold data, which in turn generates new analyses. But big data needs to be kept separate from hot data, because mixing the two requirements inevitably gives worse results than expected, and the handling of both cold and hot data ends up merely passable. In any case, it is a good idea to distinguish hot and cold data; storage and applications should treat them differently. Yet there are always people who offer users big data as a "cure-all".
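One simple way to separate hot from cold data is by how recently each record was accessed. The sketch below is only an illustration: the seven-day window, the record ids, and the function name are all hypothetical, and a real system would pick its own criterion.

```python
from datetime import datetime, timedelta


def split_hot_cold(last_access, now, hot_window=timedelta(days=7)):
    """Partition record ids into hot (recently accessed) and cold lists."""
    hot, cold = [], []
    for record_id, accessed_at in last_access.items():
        if now - accessed_at <= hot_window:
            hot.append(record_id)
        else:
            cold.append(record_id)
    return hot, cold


now = datetime(2014, 3, 28)
last_access = {  # hypothetical records and access times
    "session:42": now - timedelta(hours=2),
    "order:2013-11": now - timedelta(days=120),
    "cart:7": now - timedelta(days=3),
}
hot, cold = split_hot_cold(last_access, now)
print(hot)   # ['session:42', 'cart:7']
print(cold)  # ['order:2013-11']
```

The hot list would stay in a store tuned for frequent requests, while the cold list could move to cheaper storage that is only read for analysis.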

So pay attention to your data and its types, let business needs drive your choices, and do not mix all your data together just to create one big dataset.

Original link: http://www.csdn.net/article/2014-03-28/2819018-bigdata-debate-and-4-practices
