1. Big data is now closely associated with Spark, a fast cluster computing system. One of its components is Spark Streaming, which supports real-time data flows: it divides the live data stream into a discretized stream (DStream), where each discrete batch is an RDD (Resilient Distributed Dataset).
2. Common transformation functions include: flatMap (one-to-many), map (one-to-one), and reduceByKey (merge the values that share the same key).
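The three transformations above can be illustrated in plain Python (this mimics the semantics of the Spark operators on ordinary lists; it is not the Spark API itself):

```python
from itertools import chain

lines = ["to be", "or not to be"]

# flatMap: one-to-many -- each line yields several words
words = list(chain.from_iterable(line.split() for line in lines))

# map: one-to-one -- each word becomes a (word, 1) pair
pairs = [(w, 1) for w in words]

# reduceByKey: merge values that share a key
counts = {}
for key, value in pairs:
    counts[key] = counts.get(key, 0) + value

print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same pipeline would chain `flatMap`, `map`, and `reduceByKey` on a DStream or RDD, with the reduce step running in parallel per key.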
3. A Spark Streaming program only builds the computation plan; nothing is executed until the StreamingContext is started.
4. An open question for DStreams is how to ensure that the underlying structured data flow is not fragmented, for example when each RDD undergoes a transform.
5. A Spark worker/executor runs tasks in threads, each of which occupies a core; the total number of such threads must not exceed the number of available cores.
6. Each DStream corresponds to a receiver, and each Spark receiver needs its own thread.
7. With a source like Kafka, you can split the data across multiple topics so that multiple DStreams receive the stream in parallel, increasing the degree of concurrency.
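The multiple-receiver idea in note 7 can be simulated with plain Python threads (a conceptual sketch, not Kafka or Spark code; the topic names and the `receiver` helper are made up for illustration): each "receiver" drains its own topic independently, and their output is merged into one shared stream.

```python
import queue
import threading

# Two hypothetical topics, each backing its own receiver thread,
# mirroring one DStream-plus-receiver per Kafka topic.
topics = {"clicks": ["a", "b"], "views": ["c", "d", "e"]}
out = queue.Queue()  # the merged stream

def receiver(name, records):
    # Each receiver ingests its topic independently of the others.
    for r in records:
        out.put((name, r))

threads = [threading.Thread(target=receiver, args=(t, recs))
           for t, recs in topics.items()]
for th in threads:
    th.start()
for th in threads:
    th.join()

received = [out.get() for _ in range(out.qsize())]
print(len(received))  # all five records arrived via two parallel receivers
```

Because each receiver holds a thread (and hence a core, per notes 5 and 6), adding receivers raises ingestion parallelism only while spare cores remain.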
Big Data Notes