CSDN Big Data Technology:
10 frontline experts share Spark's status and future (I)
10 frontline experts share Spark's status and future (II)
10 frontline experts share Spark's status and future (III)
Some excerpts:
Dr. Matei Zaharia (AMPLab, UC Berkeley): Spark's status and future----(Matei Zaharia is a PhD student at the AMPLab at the University of California, Berkeley, and a co-founder and the current CTO of Databricks. He works on systems and algorithms for large-scale data-intensive computing; his research projects include Spark, Shark, multi-resource fairness, MapReduce scheduling, and the SNAP sequence aligner.)
Spark is a cluster computing platform from the AMPLab at UC Berkeley. Built around in-memory computing, it spans several computational paradigms----multi-pass batch processing, data warehousing, stream processing, and graph computing----making it a rare all-round player.
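To make the programming model concrete, here is a minimal batch word-count sketch against the RDD API (the input path, master URL, and app name are illustrative, not from the talks):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Illustrative master/app name; adjust for your cluster.
        val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
        val sc = new SparkContext(conf)

        val lines = sc.textFile("hdfs:///data/input.txt")   // hypothetical input path
        val counts = lines
          .flatMap(_.split("\\s+"))       // split each line into words
          .map(word => (word, 1))         // pair each word with a count of 1
          .reduceByKey(_ + _)             // sum counts per word across the cluster

        counts.take(10).foreach(println)  // pull a small sample back to the driver
        sc.stop()
      }
    }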
Project History:
Spark started as a research project in 2009
Open sourced in 2010
Growing community since
Entered the Apache Incubator in June 2013
Release Growth:
Spark 0.6----Java API, Maven, standalone mode, contributors
Spark 0.7----Python API, Spark Streaming, contributors
Spark 0.8----YARN, MLlib, monitoring UI, contributors----high availability for standalone mode (0.8.1)
Spark 0.9----Scala 2.10 support, configuration system, Spark Streaming improvements
Projects built on Spark:
Shark (SQL), Spark Streaming (real-time), GraphX (graph), MLbase (machine learning)
Databricks CEO Ion Stoica: Transforming data into value----(Ion Stoica is a UC Berkeley computer science professor and AMPLab co-founder; the peer-to-peer lookup protocol Chord, the in-memory cluster computing framework Spark, and the cluster resource management platform Mesos all come from his work.)
Turning Data into Value
What do we need?
Interactive queries----enable faster decisions
Queries on streaming data----enable decisions on real-time data----e.g. fraud detection, detecting DDoS attacks
Sophisticated data processing (complex analytics)----enable "better" decisions
Our goal:
Support batch, streaming, and interactive computation... in a unified framework
Make it easy to develop sophisticated algorithms (e.g., graph and ML algorithms)
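As a rough illustration of the interactive side of that goal (not code from the talk): caching a dataset in cluster memory lets repeated ad hoc queries run without rereading from disk.

    // Assumes an existing SparkContext `sc`; the log path and filter terms are illustrative.
    val events = sc.textFile("hdfs:///logs/events")
                   .cache()                           // keep the dataset in cluster memory

    // Each subsequent query reuses the in-memory data, so it returns quickly.
    val errors   = events.filter(_.contains("ERROR")).count()
    val timeouts = events.filter(_.contains("timeout")).count()
    println(s"errors=$errors, timeouts=$timeouts")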
Big data challenges: time, money, answer quality
There is a tradeoff between processing speed and accuracy: the faster an answer is required, the more approximate it tends to be.
Tim Tully: Integrating Spark/Shark into Yahoo's data analytics platform
Sharethrough data expert Ryan Weald: Productionizing Spark Streaming
Keys to fault tolerance:
Receiver fault tolerance----use actors with a supervisor, use self-healing connection pools
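A minimal sketch of the supervised-actor idea using plain Akka (class names and the retry policy are hypothetical, not from the talk): a supervisor restarts a connection-handling actor whenever its connection fails, which is what gives the receiver its self-healing behavior.

    import akka.actor.{Actor, ActorSystem, OneForOneStrategy, Props}
    import akka.actor.SupervisorStrategy.Restart
    import scala.concurrent.duration._

    // Hypothetical worker that owns a network connection and forwards records downstream.
    class ConnectionWorker extends Actor {
      def receive = {
        case record: String => ()   // push the record into the streaming pipeline
      }
    }

    // Supervisor: if the worker dies on an I/O error, restart it (reopening the connection).
    class ConnectionSupervisor extends Actor {
      override val supervisorStrategy =
        OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
          case _: java.io.IOException => Restart
        }
      private val worker = context.actorOf(Props[ConnectionWorker], "connection-worker")
      def receive = { case msg => worker.forward(msg) }
    }

    object ReceiverSupervisionDemo extends App {
      val system = ActorSystem("receiver-system")
      system.actorOf(Props[ConnectionSupervisor], "supervisor")
    }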
Monitoring Job Progress
RDDs: Resilient Distributed Datasets
Low latency & scale
Iterative and interactive computation
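A toy sketch of the iterative pattern (the data and update rule are made up for illustration): caching the input once lets every pass of the loop run over in-memory data instead of rereading from storage.

    // Assumes an existing SparkContext `sc`; the file path is illustrative.
    val points = sc.textFile("hdfs:///data/points.txt")
                   .map(_.toDouble)
                   .cache()                 // keep the dataset in memory across iterations

    var estimate = 0.0
    for (i <- 1 to 10) {
      // Each pass reuses the cached RDD.
      val meanError = points.map(p => p - estimate).reduce(_ + _) / points.count()
      estimate += 0.5 * meanError
    }
    println(s"converged estimate: $estimate")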
Databricks founder Patrick Wendell: Understanding Spark application performance----(Wendell focuses on large-scale data-intensive computing and on Spark performance benchmarking, and is a co-author of spark-perf. At the summit he covered a deep dive into Spark, an overview of the UI and instrumentation, and common performance problems and errors.)
Summary of components:
Task: fundamental unit of work
Stage: set of tasks that run in parallel
DAG: logical graph of RDD operations
RDD: parallel dataset with partitions
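As a rough illustration of how these pieces map onto a job (not taken from the talk): a shuffle operation such as reduceByKey cuts the DAG into stages, and each stage runs one task per partition.

    // Assumes an existing SparkContext `sc`; the path and field layout are illustrative.
    // RDD: a parallel dataset split into partitions (here, 8 of them).
    val purchases = sc.textFile("hdfs:///data/purchases.csv", 8)

    // Stage 1: narrow transformations (map) run as one task per partition.
    val amounts = purchases
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))   // (customerId, amount)

    // reduceByKey requires a shuffle, so the DAG is cut into a second stage here.
    val totals = amounts.reduceByKey(_ + _)

    // An action submits the whole DAG as a job; each stage runs its tasks in parallel.
    totals.saveAsTextFile("hdfs:///data/totals")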
Demo of the perf UI----common problem areas:
Scheduling and launching tasks
Execution of tasks
Writing data between stages
Collecting results
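These costs show up directly in application code. As a hedged illustration (not from the demo), two common adjustments are controlling the number of tasks and avoiding pulling large results back to the driver:

    // Assumes an existing SparkContext `sc`; paths and partition counts are illustrative.
    val records = sc.textFile("hdfs:///data/many-small-files/*")

    // Too many tiny tasks inflate scheduling and launch overhead; coalesce reduces the task count.
    val compacted = records.coalesce(64)

    // Collecting a huge result makes "collecting results" the bottleneck; sample or aggregate instead.
    val preview = compacted.take(20)     // small sample back to the driver
    val total   = compacted.count()      // aggregate computed on the cluster
    println(s"rows=$total, preview=${preview.mkString(", ")}")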
Databricks Director of Client Solutions Pat McDonough: Parallel programming with Spark----(a comprehensive overview of Spark programming, performance, components, and more)
UC Berkeley PhD student Tathagata Das: Real-time big data processing with Spark Streaming----(what Spark Streaming is, why Spark Streaming, its performance, and its fault tolerance mechanism)
DStreams + RDDs = Power
Fault tolerance:
Batches of input data are replicated in memory for fault tolerance
Data lost due to a worker failure can be recomputed from the replicated input data
All transformations are fault-tolerant and provide exactly-once semantics
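A minimal DStream sketch of the model above (host, port, and batch interval are illustrative): each batch of the stream is an RDD, so the same transformations apply.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
        val ssc = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches

        // Each batch of lines arrives as an RDD inside the DStream.
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.print()          // print a few counts from every batch
        ssc.start()
        ssc.awaitTermination()
      }
    }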
Higher throughput than Storm:
Spark Streaming: 670k records/sec/node
Storm: 115k records/sec/node
Fast fault recovery:
Recovers from faults/stragglers within 1 second
Spark 0.9 in Jan----out of alpha
Automated Master Fault Recovery
Performance optimizations
Web UI and better monitoring capabilities
Cluster Manager UI----standalone mode: <master>:8080
Executor logs----stored by the cluster manager on each worker
Spark driver logs----Spark initializes log4j when the SparkContext is created; include a log4j.properties file on the classpath
Application web UI----http://spark-application-host:4040----shows executor/task/stage/memory status, etc.
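As a small hedged sketch of how these endpoints relate to an application (the config keys are standard Spark properties; the values are illustrative): the application web UI comes up on spark.ui.port when the SparkContext starts.

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values; in standalone mode the master UI is at <master>:8080 by default.
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")
      .setAppName("monitoring-demo")
      .set("spark.ui.port", "4040")        // per-application web UI port (default 4040)

    // Creating the SparkContext initializes log4j (configured via log4j.properties on the
    // classpath) and starts the application web UI at http://<driver-host>:4040.
    val sc = new SparkContext(conf)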