Big Data Architecture: Spark

Tags: hadoop, mapreduce

Spark is a distributed computing framework developed by the UC Berkeley AMP Lab and based on the MapReduce model. Intermediate output and results are kept in memory rather than being read from and written to HDFS repeatedly, which makes data processing much more efficient.
Spark is well suited to near-line or quasi-real-time processing, data mining, and machine learning scenarios.
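
As a concrete illustration, here is the classic word count expressed as Spark RDD transformations in Scala. This is a minimal sketch: the application name and input path are hypothetical, and local[*] stands in for a real cluster master URL.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // local[*] runs Spark on all local cores; on a cluster this would be the master URL
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // hypothetical input path
        val lines = sc.textFile("input.txt")

        // the map/reduce pipeline, expressed as chained RDD transformations
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        sc.stop()
      }
    }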

Spark and Hadoop

    • Spark is a low-latency, clustered, distributed computing system for very large data sets, reportedly about 40 times faster than MapReduce.
    • Spark can be viewed as an upgraded generation of Hadoop: the first generation was HDFS plus MapReduce, the second generation added a cache to keep intermediate results and could schedule map/reduce tasks proactively, and the third generation is the streaming model that Spark advocates.
    • Spark is API-compatible with Hadoop and can read and write HDFS, HBase, and Hadoop sequence files, as sketched below.
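
For example, reading and writing HDFS paths and Hadoop sequence files from Spark looks like the sketch below, assuming an existing SparkContext sc; the namenode address and paths are hypothetical. (HBase tables can be read similarly through sc.newAPIHadoopRDD with HBase's TableInputFormat.)

    // read a plain text file from HDFS
    val text = sc.textFile("hdfs://namenode:8020/data/input.txt")

    // read a Hadoop SequenceFile of (Text, IntWritable) pairs;
    // Spark's implicit converters expose them as (String, Int)
    val pairs = sc.sequenceFile[String, Int]("hdfs://namenode:8020/data/counts.seq")

    // write the pairs back out as a SequenceFile
    pairs.saveAsSequenceFile("hdfs://namenode:8020/data/counts-copy")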

Fault Tolerance

– Lineage-based fault tolerance and data recovery: lost RDD partitions can be recomputed from the lineage of transformations that produced them.

– Checkpointing

A checkpoint is an internal event that, when triggered, causes the database writer process (DBWR) to write the dirty blocks in the data buffer cache out to the data files.

In a database system, writing logs and writing data files are the two most IO-intensive operations. Data-file writes are scattered (random) writes, while log-file writes are sequential, so to preserve performance the database only guarantees at commit time that the log has been written to the log file; the dirty blocks stay in the buffer cache and are written to the data files periodically. In other words, log writes are synchronous with commits while data writes are not. This creates a problem: if the database crashes, there is no guarantee that all dirty data in the cache has reached the data files, so on restart the instance replays the log files to restore the database to its pre-crash state and ensure consistency. The checkpoint is the key mechanism in this process: during recovery it determines which redo logs need to be scanned and applied.

Generally speaking, a checkpoint is a database event. The checkpoint event is emitted by the checkpoint process (LGWR/CKPT); when it occurs, DBWn writes the dirty blocks to disk, and the headers of the data files and the control file are updated to record the checkpoint information.
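
Spark's own checkpoint mechanism plays an analogous role for RDDs: rather than replaying a long lineage after a failure, a checkpointed RDD is rebuilt from a copy saved to reliable storage. A minimal sketch, assuming an existing SparkContext sc and hypothetical HDFS paths:

    // checkpoint files should live on reliable storage such as HDFS
    sc.setCheckpointDir("hdfs://namenode:8020/spark-checkpoints")

    val events = sc.textFile("hdfs://namenode:8020/data/events.log")
    val cleaned = events.filter(_.nonEmpty).map(_.toLowerCase)

    // caching first avoids computing the RDD twice (once for the job, once for the checkpoint)
    cleaned.cache()
    cleaned.checkpoint() // marks the RDD; its lineage is truncated once the checkpoint is written

    // the first action materializes both the cache and the checkpoint
    println(cleaned.count())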

Spark Streaming

What is Spark Streaming?

Spark is a MapReduce-like distributed computing framework in the Hadoop ecosystem. Its core abstraction is the resilient distributed dataset (RDD), a collection of data held in memory, which provides a richer model than MapReduce and can iterate quickly over data sets in memory, supporting complex data mining and graph computation algorithms. Spark keeps the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate output and results of compute tasks can be kept in memory, eliminating repeated reads and writes to HDFS and saving disk IO; performance is claimed to be up to 100 times faster than Hadoop. Spark Streaming is a real-time computing framework built on Spark that extends Spark's ability to handle large-scale streaming data. In other words, Spark Streaming is a stream computing framework based on Spark.
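
The fast in-memory iteration described above is what makes Spark attractive for machine learning workloads. The sketch below runs a toy one-dimensional gradient descent over a cached RDD: the data is parsed from disk once, and every later pass scans the cached partitions in memory. The file path, learning rate, and iteration count are all hypothetical, and sc is an existing SparkContext.

    // parse (x, y) points once and keep them in memory
    val points = sc.textFile("points.csv")
      .map { line => val a = line.split(","); (a(0).toDouble, a(1).toDouble) }
      .cache()
    val n = points.count()

    // toy linear regression via gradient descent; each pass reads from memory, not HDFS
    var w = 0.0
    for (_ <- 1 to 10) {
      val gradient = points.map { case (x, y) => (w * x - y) * x }.sum() / n
      w -= 0.1 * gradient
    }
    println(s"fitted weight: $w")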

The advantages of Spark Streaming are:

1. It can run on 100+ nodes and achieve second-level latency.

2. It uses memory-based Spark as the execution engine, which is efficient and fault tolerant.

3. It can integrate with Spark's batch processing and interactive queries.

4. It provides a simple interface, similar to batch processing, for implementing complex algorithms.

How Spark Streaming Works

Spark Streaming decomposes streaming computation into a series of short batch jobs. The batch engine is Spark itself: Spark Streaming divides its input data into segments according to the batch size (for example, 1 second), forming a discretized stream (DStream). Each segment of data is converted into an RDD (Resilient Distributed Dataset) in Spark, and transformation operations on the DStream become transformation operations on the underlying RDDs.
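
A minimal Spark Streaming word count makes the micro-batch model concrete. The 1-second batch size matches the example above; the socket source on localhost:9999 is purely illustrative (it can be fed with a tool such as nc -lk 9999).

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        // at least two local threads: one to receive data, one to process batches
        val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")

        // batch size of 1 second: the input stream is cut into 1-second RDDs
        val ssc = new StreamingContext(conf, Seconds(1))

        // hypothetical source: a TCP socket on localhost:9999
        val lines = ssc.socketTextStream("localhost", 9999)

        // DStream transformations become transformations on each batch's RDD
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }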
