Spark Asia Pacific Research Institute Stage 1 Public Welfare Lecture Hall in the Age of Cloud Computing and Big Data [Stage 1 Interactive Q&A Sharing]
Q1: Can Spark Streaming join different data streams?
Yes, different Spark Streaming data streams can be joined;
Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, Twitter, ZeroMQ, or plain TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
join(otherStream, [numTasks]): when called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
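Spark aside, the per-key semantics of that join can be sketched in plain Python (the stream contents below are illustrative, not from the source):

```python
from collections import defaultdict

def stream_join(pairs_v, pairs_w):
    """Illustrative per-key join: (K, V) x (K, W) -> (K, (V, W)).

    Mimics what DStream.join does within one micro-batch: every value
    for a key in the first stream is paired with every value for the
    same key in the second stream; keys missing from either side
    (an inner join) produce no output.
    """
    by_key = defaultdict(list)
    for k, w in pairs_w:
        by_key[k].append(w)
    return [(k, (v, w)) for k, v in pairs_v for w in by_key[k]]

# One micro-batch from each (hypothetical) stream:
clicks = [("user1", "pageA"), ("user2", "pageB")]
profiles = [("user1", "premium"), ("user2", "free"), ("user3", "free")]
print(stream_join(clicks, profiles))
# [('user1', ('pageA', 'premium')), ('user2', ('pageB', 'free'))]
```

Note that "user3" appears only in the second stream and is dropped, matching inner-join semantics; in real Spark Streaming the same pairing happens batch by batch across the two DStreams.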
Q2: Are Flume and Spark Streaming applicable to cluster mode?
Flume and Spark Streaming are designed for clusters;
For input streams that receive data over the network (such as Kafka, Flume, or sockets), the default persistence level replicates the data to two nodes for fault tolerance;
For any input source that receives data over the network, including Kafka and Flume, the received input data is replicated in memory across the nodes of the cluster (the default replication factor is 2);
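The idea behind that replication factor of 2 can be sketched in plain Python (a toy placement scheme, not Spark's actual block manager; node and block names are invented for illustration):

```python
import itertools

def replicate_blocks(block_ids, nodes, factor=2):
    """Toy sketch of replicated block storage.

    Assigns each received block to `factor` distinct nodes in a
    round-robin fashion, mirroring the idea behind the default
    persistence level for network input streams: each block is
    kept on two nodes, so a single node failure loses no data.
    """
    placement = {}
    ring = itertools.cycle(range(len(nodes)))
    for block in block_ids:
        start = next(ring)
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(factor)]
    return placement

nodes = ["worker-1", "worker-2", "worker-3"]
placement = replicate_blocks(["blk-0", "blk-1", "blk-2"], nodes)
for block, replicas in placement.items():
    print(block, "->", replicas)
# Every block lives on 2 distinct nodes, so losing any single
# node still leaves at least one copy of each block.
```

With replication factor 2, stream processing can continue after one worker fails because every block survives on its second replica; that is the fault-tolerance trade-off behind the extra memory use.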
Q3: Does Spark have flaws?
The core disadvantage of Spark is that it occupies a large amount of memory;
In earlier versions, Spark mainly processed data in a coarse-grained manner, making it difficult to control the data precisely;
After the fair scheduling mode was added, finer-grained processing became possible;
Q4: Is Spark Streaming currently used in production?
Spark Streaming is very easy to use in a production environment;
No separate deployment is required: once Spark is installed, Spark Streaming is ready to use;
Spark Streaming is already used in production in China, for example by Pipi network;