Spark is a distributed computing framework developed by the UC Berkeley AMPLab, based on the MapReduce model. Intermediate output and results are kept in memory, so there is no need to read and write HDFS repeatedly, which makes data processing much more efficient.
Spark targets near-line or quasi-real-time scenarios such as data mining and machine learning.
Spark and Hadoop
- Spark is a low-latency clustered distributed computing system for very large data sets, reportedly about 40 times faster than MapReduce.
- Spark can be seen as an evolution of Hadoop: HDFS-based MapReduce was the first-generation product; the second generation added a cache to hold intermediate results and could proactively push map/reduce tasks; the third generation is the streaming model that Spark advocates.
- Spark provides Hadoop-compatible APIs and can read and write Hadoop's HDFS, HBase, and SequenceFiles; see the sketch after this list.
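As an illustration of the Hadoop-compatible API, here is a minimal word-count sketch. It is only a sketch under assumptions: the class name and the HDFS paths (a `namenode:9000` cluster, `/data/input.txt`) are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HdfsWordCount"))

    // Read a text file through the Hadoop-compatible input API
    val lines = sc.textFile("hdfs://namenode:9000/data/input.txt") // hypothetical path

    // Intermediate results stay in memory between these transformations;
    // nothing is written back to HDFS until the final action
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs://namenode:9000/data/output") // hypothetical path
    sc.stop()
  }
}
```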
Fault Tolerance
– Lineage-based fault tolerance and data recovery (see the Spark sketch after the checkpoint discussion below)
– Checkpoint
A checkpoint is an internal event that, when triggered, causes the database writer process (DBWR) to write the dirty blocks in the data buffer cache to the data files.
In a database system, writing the log and writing the data files are the two most IO-intensive operations. Data-file writes are scattered (random) writes, while log writes are sequential, so to preserve performance the database typically guarantees only that the log has been written to the log file before a commit completes; dirty blocks remain in the buffer cache and are flushed to the data files periodically. In other words, log writes are synchronous with the commit, while data writes are not. This raises a problem: when the database crashes, there is no guarantee that all the dirty data in the cache has reached the data files, so when the instance restarts, the log files are replayed to restore the database to its pre-crash state and ensure consistency. Checkpoints are an important mechanism in this process: during recovery they determine which redo logs need to be scanned and applied.
Generally speaking, a checkpoint is a database event emitted by the checkpoint process (the LGWR/CKPT process); when it occurs, DBWn writes the dirty blocks to disk, and the headers of both the data files and the control file are updated to record the checkpoint information.
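The explanation above is phrased in database terms (DBWR, buffer cache, redo logs); in Spark itself, RDD checkpointing plays the analogous role of bounding recovery work by persisting an RDD reliably and truncating its lineage chain. Below is a minimal sketch, assuming a local master; the checkpoint directory and class name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CheckpointDemo").setMaster("local[2]"))
    sc.setCheckpointDir("hdfs://namenode:9000/checkpoints") // hypothetical directory

    val base = sc.parallelize(1 to 1000000)
    val derived = base.map(_ * 2).filter(_ % 3 == 0) // lineage: parallelize -> map -> filter

    // Truncate the lineage: persist the RDD's data reliably so that recovery
    // reads the checkpoint instead of recomputing the whole chain
    derived.cache()
    derived.checkpoint()

    println(derived.count())        // the first action triggers the checkpoint job
    println(derived.toDebugString)  // the lineage now shows the checkpointed RDD
    sc.stop()
  }
}
```

Calling `cache()` before `checkpoint()` is a common idiom: the checkpoint is written by a separate job after the first action, so caching avoids computing the RDD twice.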
Spark Streaming
What is Spark Streaming:
Spark is a MapReduce-like distributed computing framework in the Hadoop family. Its core is the resilient distributed dataset (RDD), a collection of data held in memory, which provides a richer programming model than MapReduce and can iterate quickly over in-memory data sets to support complex data mining and graph computing algorithms. Spark retains the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate output and results of compute tasks can be kept in memory, eliminating the need to read and write HDFS and saving disk IO; its performance is claimed to be up to 100 times faster than Hadoop. Spark Streaming is a real-time computing framework built on Spark that extends Spark's ability to handle large-scale streaming data. That is, Spark Streaming is a streaming computation framework based on Spark.
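As a small illustration of iterating over an in-memory data set, here is a hedged sketch; the data and the loop are hypothetical stand-ins for a real iterative algorithm such as gradient descent.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeSketch").setMaster("local[2]"))

    // Cache the data set in memory once; each subsequent pass reads it
    // from memory instead of re-reading HDFS
    val data = sc.parallelize(1 to 100000).map(_.toDouble).cache()

    var result = 0.0
    for (i <- 1 to 10) {
      // Every iteration is a full scan of the cached RDD, not of disk
      result = data.map(x => x * i).sum()
    }
    println(s"result of the 10th pass: $result")
    sc.stop()
  }
}
```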
The advantages of Spark Streaming are:
1. It can run on 100+ nodes and achieve second-level latency.
2. It uses memory-based Spark as the execution engine, which is efficient and fault tolerant.
3. It can integrate with Spark's batch processing and interactive queries.
4. It provides a simple interface, similar to batch processing, for implementing complex algorithms.
Spark Streaming principle
Spark Streaming decomposes a streaming computation into a series of short batch jobs. The batch engine here is Spark itself: Spark Streaming divides its input data into segments (a discretized stream, or DStream) according to the batch size (for example, 1 second). Each segment of data is converted into an RDD (resilient distributed dataset) in Spark, and each transformation on a DStream in Spark Streaming becomes a transformation on the underlying RDDs in Spark.
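A minimal Spark Streaming word count shows the micro-batch model described above. The socket source on `localhost:9999` is a hypothetical input (it can be fed with `nc -lk 9999` for testing).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    // Batch size of 1 second: the input is cut into 1-second DStream batches
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical text source: one line per message on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each DStream transformation is applied as an RDD transformation per batch
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each 1-second batch of input becomes one RDD, and the `flatMap`/`map`/`reduceByKey` chain on the DStream is applied to every such RDD in turn.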