Dr. Matei Zaharia, AMP Lab, University of California, Berkeley: Spark's status and future----(Matei Zaharia is a PhD student at the AMP Lab at the University of California, Berkeley, and co-founder and current CTO of Databricks. Zaharia works on systems and algorithms for large-scale data-intensive computing. His research projects include Spark, Shark, multi-resource fairness, MapReduce scheduling, and the SNAP sequence aligner)

Spark is a cluster computing platform that originated at the AMPLab at the University of California, Berkeley. Built on in-memory computing, it started from multi-iteration batch processing and has gone on to embrace data warehousing, stream processing, graph computing, and other computational paradigms, which makes it a rare all-round player.

Project History:

Spark started as a research project in 2009

Open sourced in 2010

Growing community since

Entered Apache Incubator in June 2013

Release Growth:

Spark 0.6----Java API, Maven, standalone mode, contributors

Spark 0.7----Python API, Spark Streaming, contributors

Spark 0.8----YARN, MLlib, monitoring UI, contributors----high availability for standalone mode (0.8.1)

Spark 0.9----Scala 2.10 support, configuration system, Spark Streaming improvements

Projects built on Spark:

Shark (SQL), Spark Streaming (real-time), GraphX (graph), MLbase (machine learning)

Databricks CEO Ion Stoica: Turning data into value----(Ion Stoica is a UC Berkeley computer science professor and AMPLab co-founder; the peer-to-peer protocol Chord, the in-memory cluster computing framework Spark, and the cluster resource management platform Mesos all came out of his work)

Turning Data into Value

What do we need?

Interactive queries----enable faster decisions

Queries on streaming data----enable decisions on real-time data----e.g., fraud detection, detecting DDoS attacks

Sophisticated data processing----enable "better" decisions

Our Goal:

Support batch, streaming, and interactive computation... in a unified framework

Easy to develop sophisticated algorithms (e.g., graph, ML algorithms)

Big Data Challenge: time, money, answer quality

The tradeoff between processing speed and accuracy: the two are inversely related, so a faster answer is generally a less accurate one
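One common way to trade accuracy for speed is to answer a query from a random sample instead of the full data set. The sketch below is only an illustration of that idea, not something shown in the talk; the data set, the `approximate_mean` helper, and the sampling fractions are all hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
data = list(range(1, 100_001))
true_mean = sum(data) / len(data)


def approximate_mean(values, fraction):
    """Estimate the mean from a random sample of the given fraction."""
    sample_size = max(1, round(len(values) * fraction))
    sample = random.sample(values, sample_size)
    return sum(sample) / len(sample), sample_size


# Reading fewer rows is cheaper, but the estimate drifts from the true mean.
for fraction in (0.01, 0.1, 1.0):
    estimate, rows_read = approximate_mean(data, fraction)
    print(f"rows read: {rows_read:>6}  estimate: {estimate:.1f}  "
          f"error: {abs(estimate - true_mean):.1f}")
```

Reading 1% of the rows does roughly 1% of the work, at the cost of an estimate that is only close to the exact answer.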

Tim Tully: Integrating Spark/Shark into Yahoo's data analytics platform

Sharethrough data expert Ryan Weald: Productionizing Spark Streaming

Keys to fault tolerance:

Receiver fault tolerance----use actors with supervisors, use self-healing connection pools

Monitoring Job Progress
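The supervision idea behind receiver fault tolerance can be sketched without an actor framework. The talk's version uses actors with supervisors; the plain Python loop below is a swapped-in simplification, and `supervise`, `flaky_receiver`, and `max_restarts` are hypothetical names used only for illustration.

```python
def supervise(receiver, max_restarts=3):
    """Run `receiver`; restart it when it raises, up to `max_restarts` times."""
    restarts = 0
    while True:
        try:
            return receiver(), restarts
        except ConnectionError:
            if restarts == max_restarts:
                raise  # give up and let the failure propagate
            restarts += 1


attempts = {"count": 0}


def flaky_receiver():
    # Hypothetical receiver: fails twice, then delivers a batch.
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ConnectionError("connection dropped")
    return ["event-a", "event-b"]


batch, restarts = supervise(flaky_receiver)
print(batch, restarts)
```

The supervisor owns the restart policy, so the receiver itself stays simple and can fail fast.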

RDDs: Resilient Distributed Datasets

Low latency & scale

Iterative and interactive computation

Databricks co-founder Patrick Wendell: Understanding Spark Application Performance----(Wendell focuses on large-scale data-intensive computing, works on Spark's performance benchmarks, and is a co-author of spark-perf. At the summit he covered a deep dive into Spark, an overview of the UI and instrumentation, and common performance problems and mistakes)

Summary of components:

Task: fundamental unit of work

Stage: set of tasks that run in parallel

DAG: logical graph of RDD operations

RDD: parallel dataset with partitions
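To make the relationship between these four terms concrete, here is a toy, pure-Python model. It is not Spark's API: `ToyRDD` and its `map` and `collect` methods are hypothetical stand-ins, and the "DAG" is reduced to a linear chain of operations forming a single stage.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """A toy stand-in for an RDD: partitioned data plus a chain of operations."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one per partition
        self.ops = tuple(ops)         # logical chain of per-element functions

    def map(self, fn):
        # Transformations are lazy: they only extend the logical graph.
        return ToyRDD(self.partitions, self.ops + (fn,))

    def _run_task(self, partition):
        # A task: apply every operation of the stage to one partition.
        out = partition
        for fn in self.ops:
            out = [fn(x) for x in out]
        return out

    def collect(self):
        # A stage: one task per partition, run in parallel.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(self._run_task, self.partitions))
        # Collecting results: merge the per-partition outputs in order.
        return [x for part in results for x in part]


rdd = ToyRDD([[1, 2], [3, 4], [5]])
doubled_plus_one = rdd.map(lambda x: x * 2).map(lambda x: x + 1)
result = doubled_plus_one.collect()
print(result)
```

Nothing runs until `collect` is called; at that point three tasks (one per partition) execute the two chained operations.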

Demo of perf UI----problems:

Scheduling and launching tasks

Execution of tasks

Writing data between stages

Collecting results

Databricks Director of Client Solutions Pat McDonough: Parallel programming with Spark----(a comprehensive overview of Spark's performance, components, and more)

UC Berkeley Dr. Tathagata Das: Real-time big data processing with Spark Streaming----(what Spark Streaming is, why to use it, and its performance and fault-tolerance mechanisms)



Batches of input data are replicated in memory for fault tolerance

Data lost due to worker failure can be recomputed from replicated input data

All transformations are fault-tolerant and exactly-once
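The recovery path described above can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not Spark Streaming's implementation: the `Batch` class, the two worker names, and the transformation are hypothetical.

```python
class Batch:
    """One batch of streaming input, with its input replicated on two workers."""

    def __init__(self, records, transform):
        self.replicas = {"worker-1": list(records), "worker-2": list(records)}
        self.transform = transform  # deterministic function of the input
        self.results = {}           # worker name -> computed output

    def compute(self, worker):
        self.results[worker] = [self.transform(r) for r in self.replicas[worker]]
        return self.results[worker]

    def fail(self, worker):
        # Simulate a worker failure: its input copy and its output are lost.
        self.replicas.pop(worker, None)
        self.results.pop(worker, None)

    def recover(self):
        # Recompute the lost output from any surviving input replica.
        survivor = next(iter(self.replicas))
        return self.compute(survivor)


batch = Batch([1, 2, 3], lambda x: x * 10)
before = batch.compute("worker-1")
batch.fail("worker-1")
after = batch.recover()
print(before == after)
```

Because the transformation is deterministic, recomputing it from the replica yields the same output, so each input record is reflected exactly once in the result.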

Higher throughput than Storm:

Spark Streaming: 670k records/sec/node

Storm: 115k records/sec/node
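As a quick check of what these per-node figures imply, the snippet below computes the ratio between them. The two rates are the ones quoted above; the variable names are just for illustration.

```python
spark_streaming_rate = 670_000  # records/sec/node, as quoted in the talk
storm_rate = 115_000            # records/sec/node, as quoted in the talk

# How many times more records per node per second Spark Streaming handles.
ratio = spark_streaming_rate / storm_rate
print(f"Spark Streaming / Storm throughput: {ratio:.1f}x")
```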

Fast Fault Recovery:

  Recovers from faults/stragglers within 1 sec

Spark 0.9 in Jan----out of alpha

  Automated Master Fault Recovery

  Performance optimizations

  Web UI, and better monitoring capabilities

Cluster Manager UI----Standalone mode: <master>:8080

Executor Logs----Stored by Cluster Manager on each worker

Spark Driver Logs----Spark initializes log4j when created; include a log4j.properties file on the classpath

Application Web UI----http://spark-application-host:4040----for executor/task/stage/memory status, etc.
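For the driver-log item above, a minimal log4j.properties might look like the fragment below. It assumes log4j 1.x property syntax and a console appender; the log level and message pattern are illustrative choices, not settings taken from the talk.

```properties
# Send everything at INFO and above to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

Placing this file on the driver's classpath lets the driver pick it up when log4j is initialized.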

