Big data is different from what you think.

Source: Internet
Author: User
Tags: stream api

1. Yes, those of us in big data also write ordinary Java code and ordinary SQL.

For example, a Spark program written with the Java API looks almost identical to code written with the Java 8 Stream API.

// sc is an existing JavaSparkContext
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());  // length of each line
int totalLength = lineLengths.reduce((a, b) -> a + b);      // sum of all line lengths
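
For comparison, here is a minimal sketch of the same computation using only the Java 8 Stream API, with no Spark involved. The class name StreamLineLengths and the sample lines are made up for illustration; the point is only that the map and reduce calls have the same shape as in the Spark snippet.

```java
import java.util.Arrays;
import java.util.List;

class StreamLineLengths {
    // Sums the length of every line, mirroring the map/reduce shape of the Spark snippet.
    static int totalLength(List<String> lines) {
        return lines.stream()
                .map(s -> s.length())         // same lambda as lines.map(...) in Spark
                .reduce(0, (a, b) -> a + b);  // this Stream.reduce overload takes an identity value
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello", "big", "data");
        System.out.println(totalLength(lines)); // prints 12
    }
}
```

The visible difference is where the work happens: the stream runs inside one JVM, while the RDD version is distributed across a cluster.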

Another example: dropping a Hive table.

DROP TABLE pokes;

2. Yes, starting up and operating Hadoop, Spark, and Hive is not much different from running an ordinary Java application or database.

For example, starting HDFS:

./start-dfs.sh

Starting YARN:

./start-yarn.sh

Starting Hive:

./hive

That's it. Where is the mystery? Isn't it just a pile of configuration, the kind every system has?

3. Sorry, there is no technology called "data warehouse".

A data warehouse is the place where all the cleansed, unified data within some scope is gathered for storage and analysis. There is no single technology called "data warehouse".

In practice, we generally use Hive as the carrier of the data warehouse. Companies without big data infrastructure use various traditional databases for the same purpose. So please stop saying you want to "learn data warehousing". If you want to learn Hive, learn Hive; if you want to learn data governance, learn data governance.

4. Yes, we big data people grind away at SQL too, but our thought process is not the same as yours.

When you write SQL, you think first about functionality. When we write SQL, we think first about whether this thing will ever finish running.

You can tune your SQL whenever you like. We have to think for a long time before tuning, down to working out exactly how the machines will execute it.

You write SQL without caring about data distribution. The first thing we ask is whether the data is skewed.

You can just write your SQL. Before we write ours, we write ten thousand SQL statements to cleanse the data.
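
To make the data skew concern concrete, here is a minimal sketch that is not taken from the original article. It hash-partitions keys roughly the way a shuffle would and shows how one hot key overloads a single partition. The class SkewDemo, the key names, and the partition and salt counts are all hypothetical. Appending a random salt to the key is one commonly used mitigation, shown purely as an illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class SkewDemo {
    // Counts how many records land in each partition when partitioning by key hash.
    static int[] partitionSizes(List<String> keys, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (String key : keys) {
            sizes[Math.floorMod(key.hashCode(), numPartitions)]++;
        }
        return sizes;
    }

    // Appends a salt in [0, saltBuckets) to every key so a hot key spreads over several partitions.
    static List<String> salt(List<String> keys, int saltBuckets, Random random) {
        List<String> salted = new ArrayList<>();
        for (String key : keys) {
            salted.add(key + "#" + random.nextInt(saltBuckets));
        }
        return salted;
    }

    // Returns the size of the largest partition, i.e. the work of the slowest task.
    static int max(int[] values) {
        int best = 0;
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 9000; i++) keys.add("hot_user");   // one key dominates the data
        for (int i = 0; i < 1000; i++) keys.add("user_" + i);  // long tail of other keys

        int partitions = 8;
        int plainMax = max(partitionSizes(keys, partitions));
        int saltedMax = max(partitionSizes(salt(keys, 16, new Random(42)), partitions));
        System.out.println("largest partition, plain keys:  " + plainMax);
        System.out.println("largest partition, salted keys: " + saltedMax);
    }
}
```

With plain keys, at least 9,000 of the 10,000 records end up in one partition, so one task does nearly all the work. Salting evens this out, at the cost of a second aggregation step that merges the partial results for each original key.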

5. Yes, when data grows 10x, 100x, or 1,000,000x, we have to change the plan, and then change it again and again.

Your SQL may still run at 10x the data. At 1,000,000x, you may need a great deal of thought and effort before it finishes at all, even for something as simple as a deduplicated count.

With your SQL, a count(1) with a group by and you are done.

If I wrote it the way you do, I doubt it would finish in my lifetime.

I won't explain here; see the big data counting series.

Big data counting principles: 1+0=1, and you can't even work that out (10), No. 77
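
As a rough illustration of why a deduplicated count calls for a different plan at scale, here is a minimal sketch of one common divide-and-conquer approach. It is not necessarily the method the counting series describes. Values are routed to buckets by hash so that equal values always share a bucket, each bucket is deduplicated on its own, and the per-bucket counts are summed. The class PartitionedDistinctCount and its parameters are hypothetical. In this single-process sketch all buckets live in memory together; in a cluster, each bucket would be handled by a separate task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PartitionedDistinctCount {
    // Exact distinct count computed bucket by bucket instead of in one giant set.
    static long countDistinct(List<String> values, int numBuckets) {
        // Stage 1: route each value to a bucket by hash; equal values always share a bucket.
        List<Set<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new HashSet<>());
        }
        for (String value : values) {
            buckets.get(Math.floorMod(value.hashCode(), numBuckets)).add(value);
        }
        // Stage 2: buckets are disjoint, so the global answer is the sum of per-bucket counts.
        long total = 0;
        for (Set<String> bucket : buckets) {
            total += bucket.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a", "b", "a", "c", "b");
        System.out.println(countDistinct(values, 4)); // prints 3
    }
}
```

Because no value can appear in two buckets, no single worker ever has to hold the full set of distinct values, which is what breaks the naive version at very large scale.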

6. Spark is fast, but Spark is also slow.

Spark computes in memory, but it is also a batch computation engine. Think for yourself about the drawbacks that implies, and compare it with the pure stream computation of Flink.

7. Even if you have 100 TB of data, that does not make you big data.

First, storage volume alone does not mean big data. Second, even with enough data, if your way of thinking is wrong, you are still not doing big data.

8. Big data and machine learning are one family; neither can do without the other.

You may never realize how unified and how important divide and conquer, statistics, and probability theory are across these two disciplines.

9. Sorry, but do not think big data is only Hadoop. The big data technology stack is wider than you can imagine.

You think you have finished learning it, when you have not even touched its edge.

https://mp.weixin.qq.com/s/ynz-mLlyO052LxyhbyovAw

Big data means extracting irreplaceable value from data that is high in volume, fast-changing, varied, and low in value density. That brings its own challenges and difficulties: much of the data processing work is repetitive, so it takes tools to automate it and a change in thinking.

Big data is different from what you think.
