1. Yes, we big data people also write ordinary Java code and ordinary SQL.
For example, a Spark program written with the Java API looks much like code written with the Java 8 Stream API.
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
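For comparison, the same computation with the plain Java 8 Stream API might look like the sketch below. The class name LineLengths and the in-memory sample list are made up for illustration; the Spark version above reads its lines from data.txt instead.

```java
import java.util.Arrays;
import java.util.List;

public class LineLengths {
    // Sums the lengths of all lines, mirroring the map + reduce steps of the Spark snippet.
    static int totalLength(List<String> lines) {
        return lines.stream()
                .map(String::length)      // same role as lines.map(s -> s.length())
                .reduce(0, Integer::sum); // same role as reduce((a, b) -> a + b)
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the contents of data.txt.
        List<String> lines = Arrays.asList("hello", "big", "data");
        System.out.println(totalLength(lines)); // prints 12
    }
}
```

The shape of the code is the same; what differs is that Spark distributes the map and reduce steps across a cluster.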
Another example: dropping a Hive table.
DROP TABLE pokes;
2. Yes, starting and operating Hadoop, Spark, and Hive is much the same as for an ordinary Java application or database.
Like starting HDFS:
bash ./start-dfs.sh
Like starting YARN:
bash ./start-yarn.sh
Like starting Hive:
bash ./hive
That's it. Where is the mystery? Isn't it just the pile of configuration that every system has?
3. Sorry, there is no technology called "data warehouse".
A data warehouse is the place where all cleansed, unified data within a given scope is gathered for storage and analysis. There is no single technology called data warehousing.
In practice we generally use Hive as the carrier for the data warehouse, and companies without big data infrastructure use various traditional databases instead. So please don't say you want to "learn data warehouse", OK? If you want to learn Hive, learn Hive; if you want to learn data governance, learn data governance.
4. Yes, we big data people also just grind out SQL, but our train of thought is not the same as yours.
When you write SQL, you first think about functionality; when we write SQL, we first wonder whether the thing will ever finish running.
You can tune your SQL whenever you like; we have to think for a long time before tuning ours, right down to which machines it will run on.
You write SQL without caring about data distribution; the first thing we ask is whether the data is skewed.
You can write your SQL directly; before we write ours, we write 10,000 SQL statements to clean the data.
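To make the data distribution worry concrete: when a few keys carry most of the rows (data skew), hash partitioning sends all of them to the same worker, and that worker decides how long the whole job takes. The sketch below is a toy illustration in plain Java, not Hive or Spark code; the class SkewDemo, the key names, and the partition count of 4 are all made up.

```java
import java.util.ArrayList;
import java.util.List;

public class SkewDemo {
    // Counts how many records land in each partition under simple hash partitioning.
    static int[] partitionSizes(List<String> keys, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (String key : keys) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int p = Math.floorMod(key.hashCode(), numPartitions);
            sizes[p]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Hypothetical data: one hot key with 90 rows, ten other keys with 1 row each.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 90; i++) keys.add("hot");
        for (int i = 0; i < 10; i++) keys.add("k" + i);

        int[] sizes = partitionSizes(keys, 4);
        for (int p = 0; p < sizes.length; p++) {
            System.out.println("partition " + p + ": " + sizes[p] + " rows");
        }
    }
}
```

Whichever partition receives the hot key ends up with at least 90 of the 100 rows, while the other three share the rest.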
5. Right: when the data grows 10 times, 100 times, 1 million times, we have to change the design, and then change it again and again.
Your SQL may still run at 10 times the data. At 1 million times, it can take a very long time and a lot of thought and effort just to get it to finish at all, for example for a simple deduplicated count.
With your SQL, a count(1) plus group by gives you the answer.
If I wrote it the way you do, I reckon it would not finish in my lifetime.
No explanation here; read the Big Data counting series to find out.
Big Data Counting Principles: 1+0=1, and You Can't Even Count That (10), No.77
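The counting series itself is not reproduced here. As a rough illustration of why deduplicated counting needs a different plan at scale, the sketch below assumes the values are non-negative integer IDs and uses java.util.BitSet: each partition builds a small bitmap, and the partial bitmaps are merged with OR before counting. This is one common approach, not necessarily the one the series describes; the class DistinctCount and the sample IDs are hypothetical.

```java
import java.util.BitSet;

public class DistinctCount {
    // Builds a bitmap for one partition: bit i is set if ID i was seen (IDs must be >= 0).
    static BitSet partial(int[] ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    // Merges per-partition bitmaps with OR and counts the set bits.
    static int mergeAndCount(BitSet... partials) {
        BitSet merged = new BitSet();
        for (BitSet p : partials) {
            merged.or(p);
        }
        return merged.cardinality();
    }

    public static void main(String[] args) {
        // Two hypothetical partitions with overlapping IDs.
        BitSet a = partial(new int[] {1, 2, 3, 3});
        BitSet b = partial(new int[] {3, 4, 4, 5});
        System.out.println(mergeAndCount(a, b)); // prints 5
    }
}
```

Because the partial results merge without double counting, each partition can be processed independently, which a plain per-partition count cannot offer.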
6. Spark is fast, but Spark is also slow.
Spark computes in memory, but it is still batch processing. Think about where that falls short, and compare it with Flink's pure stream processing.
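A toy way to see the difference, in plain Java rather than Spark or Flink code: a batch-style job only produces a result once a whole batch has been collected, while a stream-style job updates its result after every record. The class BatchVsStream, the event values, and the batch size of 2 are made up for illustration and say nothing about how either engine is implemented.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchVsStream {
    // Batch style: emit the running total only when a batch is full (or input ends).
    static List<Integer> batchTotals(int[] events, int batchSize) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        for (int i = 0; i < events.length; i++) {
            total += events[i];
            boolean batchFull = (i + 1) % batchSize == 0;
            boolean lastEvent = i == events.length - 1;
            if (batchFull || lastEvent) {
                out.add(total);
            }
        }
        return out;
    }

    // Stream style: emit an updated running total after every single record.
    static List<Integer> streamTotals(int[] events) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        for (int e : events) {
            total += e;
            out.add(total);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] events = {1, 2, 3, 4, 5};
        System.out.println(batchTotals(events, 2)); // prints [3, 10, 15]
        System.out.println(streamTotals(events));   // prints [1, 3, 6, 10, 15]
    }
}
```

Both reach the same final total; the batch version simply makes you wait for the batch boundary before you see anything.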
7. Even if you have 100 TB of data, that does not make you big data.
First, storage volume alone does not mean big data. Second, even with enough data, if you think about it the wrong way, you are still not doing big data.
8. Big data and machine learning are one family; neither can do without the other.
You may never realize how unified, and how important, divide and conquer, statistics, and probability theory are across these two disciplines.
9. Sorry, but don't think big data is only Hadoop. The big data technology stack is wider than you can imagine.
You think you have finished learning it, but you have not even touched the edge.
https://mp.weixin.qq.com/s/ynz-mLlyO052LxyhbyovAw
Big data means extracting irreplaceable value from data that is large in volume, fast-changing, varied, and low in value density. That is where its challenges and difficulties lie. Much of the data processing work is repetitive, so it takes tools to automate it and a change in thinking.
Big data is different from what you think.