10 frontline experts share Spark's status and future----summit excerpts


CSDN Big Data Technology:

10 frontline experts share Spark's status and future (I)

10 frontline experts share Spark's status and future (II)

10 frontline experts share Spark's status and future (III)

Some excerpts:

Dr. Matei Zaharia, AMPLab, University of California, Berkeley: Spark's status and future----(Matei Zaharia is a PhD student at the AMPLab at the University of California, Berkeley, and co-founder and current CTO of Databricks. He works on systems and algorithms for large-scale data-intensive computing. His research projects include Spark, Shark, multi-resource fairness, MapReduce scheduling, and the SNAP sequence aligner.)

Spark is a cluster computing platform from the AMPLab at the University of California, Berkeley. Built on in-memory computing, it started from multi-iteration batch processing and has since taken in data warehousing, stream processing, graph computing, and other computational paradigms, which makes it a rare all-round player.

Project History:

Spark started as a research project in 2009

Open sourced in 2010

Growing community since

Entered Apache Incubator in June 2013

Release Growth:

Spark 0.6----Java API, Maven, standalone mode, contributors

Spark 0.7----Python API, Spark Streaming, contributors

Spark 0.8----YARN, MLlib, monitoring UI, contributors----high availability for standalone mode (0.8.1)

Spark 0.9----Scala 2.10 support, configuration system, Spark Streaming improvements

Projects built on Spark:

Shark (SQL), Spark Streaming (real-time), GraphX (graph), MLbase (machine learning)

Databricks CEO Ion Stoica: Turning data into value----(Ion Stoica is a computer science professor at UC Berkeley and a co-founder of the AMPLab. The peer-to-peer protocol Chord, the cluster in-memory computing framework Spark, and the cluster resource management platform Mesos all came out of his work.)

Turning Data into Value

What do we need?

Interactive queries----enable faster decisions

Queries on streaming data----enable decisions on real-time data----e.g., fraud detection, detecting DDoS attacks

Sophisticated data processing----enable "better" decisions

Our Goal:

Support batch, streaming, and interactive computation... in a unified framework

Easy to develop sophisticated algorithms (e.g., graph, ML algorithms)

Big Data Challenge: time, money, answer quality

The tradeoff between processing speed and accuracy: a faster answer is generally a less accurate one, and a more accurate answer takes longer
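The speed/accuracy tradeoff can be illustrated with sampling. The following is a minimal sketch in plain Python, not something shown at the summit: the function `approximate_mean`, the synthetic data, and the sample fractions are all assumptions made for illustration. It estimates a mean from a fraction of the rows, so reading fewer rows is cheaper but the estimate tends to drift further from the exact answer.

```python
import random
import statistics


def approximate_mean(values, sample_fraction, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Hypothetical helper for illustration; returns (estimate, rows_read).
    """
    rng = random.Random(seed)
    sample_size = max(1, round(len(values) * sample_fraction))
    sample = rng.sample(values, sample_size)
    return statistics.fmean(sample), sample_size


# Synthetic data set (an assumption): 100,000 normally distributed values.
data_rng = random.Random(42)
data = [data_rng.gauss(100.0, 15.0) for _ in range(100_000)]
exact = statistics.fmean(data)

# Reading more rows costs more time but usually brings the estimate closer.
for fraction in (0.001, 0.01, 0.1, 1.0):
    estimate, rows_read = approximate_mean(data, fraction)
    error = abs(estimate - exact)
    print(f"rows read={rows_read:>6}  estimate={estimate:8.3f}  abs error={error:.3f}")
```

For a sampled mean the typical error shrinks roughly with the square root of the number of rows read, so each further gain in accuracy costs disproportionately more processing.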

Tim Tully: Integrating Spark/Shark into Yahoo's data analytics platform

Sharethrough data expert Ryan Weald: Putting Spark Streaming into production

Keys to fault tolerance:

Receiver fault tolerance----use actors with supervisors, use self-healing connection pools

Monitoring job progress

RDDs: Resilient Distributed Datasets

Low latency & scale

Iterative and interactive computation

Databricks founder Patrick Wendell: Understanding Spark application performance----(Patrick Wendell focuses on large-scale data-intensive computing. He works on Spark's performance benchmarking and is a co-author of spark-perf. At the summit he gave a deep dive into Spark, an overview of the UI and instrumentation, and a review of common performance mistakes.)

Summary of components:

Task: fundamental unit of work

Stage: set of tasks that run in parallel

DAG: logical graph of RDD operations

RDD: parallel dataset with partitions
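How these four components relate can be sketched with a toy model. The code below is plain Python, not Spark's API: `ToyRDD`, `lineage`, and `plan_stages` are hypothetical names, the lineage is a simple chain rather than a general DAG, and the rule that a stage ends wherever data must be regrouped across partitions is a simplification of what Spark's scheduler does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRDD:
    """A parallel dataset with partitions, plus the operation that derives it."""
    num_partitions: int
    parent: Optional["ToyRDD"] = None
    op: str = "source"
    needs_shuffle: bool = False  # True when the op regroups data across partitions

    def map(self, name="map"):
        # Per-partition operation: partition count stays the same, no shuffle.
        return ToyRDD(self.num_partitions, self, name)

    def reduce_by_key(self, num_partitions, name="reduceByKey"):
        # Regroups records by key, so it starts a new stage.
        return ToyRDD(num_partitions, self, name, needs_shuffle=True)


def lineage(rdd):
    """Follow parent links to recover the logical chain of RDD operations."""
    chain = []
    while rdd is not None:
        chain.append(rdd)
        rdd = rdd.parent
    return chain[::-1]


def plan_stages(rdd):
    """Cut the lineage at shuffle boundaries; each stage runs one task per partition."""
    stages, current = [], []
    for node in lineage(rdd):
        if node.needs_shuffle and current:
            stages.append(current)
            current = []
        current.append(node)
    stages.append(current)
    return [{"ops": [n.op for n in s], "tasks": s[-1].num_partitions} for s in stages]


source = ToyRDD(num_partitions=4)
counts = source.map("flatMap").map("map").reduce_by_key(2)
for stage in plan_stages(counts):
    print(stage)
```

In this toy plan the first stage groups the source and the two per-partition operations into 4 tasks, and the second stage runs the regrouping step as 2 tasks.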

Demo of perf UI----problems:

Scheduling and launching tasks

Execution of tasks

Writing data between stages

Collecting results

Databricks Client Solution Director Pat McDonough: Parallel programming with Spark----(a comprehensive overview of Spark's performance, components, and more)

UC Berkeley Dr. Tathagata Das: Real-time big data processing with Spark Streaming----(what Spark Streaming is, why to use it, and its performance and fault-tolerance mechanisms)



Batches of input data are replicated in memory for fault tolerance

Data lost due to worker failure can be recomputed from the replicated input data

All transformations are fault-tolerant, with exactly-once semantics
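The recovery idea in these three points can be sketched in a few lines. This is a toy model in plain Python, not Spark Streaming's implementation: `ToyBatchStore`, `word_counts`, the worker names, and the two-copy round-robin placement are all assumptions for illustration. Each input batch is held on two workers, and because the transformation is deterministic, recomputing a lost result from the surviving copy yields exactly the result that was lost.

```python
class ToyBatchStore:
    """Holds every input batch in memory on two workers (needs >= 2 workers)."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.replicas = {w: {} for w in self.workers}

    def put(self, batch_id, records):
        # Place copies on two distinct workers, chosen round-robin.
        first = self.workers[batch_id % len(self.workers)]
        second = self.workers[(batch_id + 1) % len(self.workers)]
        for worker in (first, second):
            self.replicas[worker][batch_id] = list(records)

    def fail(self, worker):
        # A failed worker loses everything it held in memory.
        self.replicas[worker] = {}

    def get(self, batch_id):
        for worker in self.workers:
            if batch_id in self.replicas[worker]:
                return self.replicas[worker][batch_id]
        raise KeyError(f"batch {batch_id} lost on all replicas")


def word_counts(records):
    """Deterministic transformation: same input batch, same output."""
    counts = {}
    for line in records:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


store = ToyBatchStore(["w0", "w1", "w2"])
store.put(0, ["a b a"])
store.put(1, ["b c"])
results = {bid: word_counts(store.get(bid)) for bid in (0, 1)}
before_failure = dict(results)

# Simulate worker w0 dying: its input copy and the result for batch 0 are gone.
store.fail("w0")
del results[0]

# Recompute the lost result from the surviving replica of the input batch.
results[0] = word_counts(store.get(0))
print(results == before_failure)
```

The recomputed result replaces the lost one rather than being added to it, so each batch still contributes to the output once.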

Higher throughput than Storm:

  Spark Streaming: 670k records/sec/node

  Storm: 115k records/sec/node

Fast fault recovery:

  Recovers from faults/stragglers within 1 sec

Spark 0.9 in January----Spark Streaming out of alpha

  Automated master fault recovery

  Performance optimizations

  Web UI and better monitoring capabilities

Cluster manager UI----standalone mode: <master>:8080

Executor logs----stored by the cluster manager on each worker

Spark driver logs----Spark initializes log4j when it is created; include a log4j.properties file on the classpath

Application web UI----http://spark-application-host:4040----for executor/task/stage/memory status, etc.
