Yahoo's Spark practice
Yahoo is one of the big data giants with a particular passion for Spark. At this summit, Yahoo contributed three talks; let's look at them one by one.
Andy Feng, a distinguished architect at Yahoo and a Zhejiang University alumnus, tried to answer two questions in his keynote.
The first question: why did Yahoo fall in love with Spark? As Yahoo's content moved from editor-curated pages to data-driven, context-aware, personalized pages, machine learning and data science became the engine under the hood. The technical team struggled to find a platform that could support large-scale, real-time machine-learning exploration: content has a short life cycle and bursts have to be reacted to quickly, so models need to be retrained in an hour or less, while training data may come from a massive 150PB store and computation may span 35,000 servers. Yahoo's answer is Hadoop + Spark.
This leads to the second question: how can Hadoop and Spark work together? In short, the former handles batch computation and the latter handles iterative computation; the two coexist on YARN while sharing HDFS, HBase, and other data stores. Yahoo's first pilot project came from e-commerce at Yahoo! Japan. The first Spark program was collaborative filtering: 30 lines of code that take 10 minutes on 10 machines, versus 106 minutes for the Hadoop-based implementation. The second pilot project was stream advertising, a logistic regression based on Vowpal Wabbit: 120 lines of code, 100 million samples, 13,000 features, 30 iterations, finishing in 30 minutes. Most interesting of all, the algorithm was completed just 2 hours after Spark-on-YARN was announced.
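To give a concrete sense of what a roughly 30-line Spark collaborative-filtering program can look like, here is a minimal sketch built on MLlib's ALS; the input path, record format, and parameters are illustrative assumptions, not Yahoo's actual code.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object CollaborativeFilteringSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "cf-sketch")

    // Hypothetical input: one "userId,itemId,rating" triple per line
    val ratings = sc.textFile("hdfs:///data/ratings").map { line =>
      val Array(user, item, score) = line.split(',')
      Rating(user.toInt, item.toInt, score.toDouble)
    }

    // Train a low-rank matrix-factorization model
    // (rank, iterations, and lambda are illustrative values)
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Predict the score of one (user, item) pair
    println(model.predict(42, 17))
  }
}
```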
Currently, Yahoo has four Spark committers, making significant contributions to Spark-on-YARN, Shark, security, scalability, and operability.
The next keynote speaker, Tim Tully, also a distinguished architect at Yahoo, detailed the applications of Spark and Shark in Yahoo's data and analytics platform.
From 1999 to 2007, Yahoo's data processing platform stored data on NFS, implemented map/reduce in hand-written C++, and used Perl scripts as glue; the drawback of this architecture was that data had to be moved to where the computation ran. It then evolved into a Hadoop-centric architecture: logs land on NFS first and are moved into HDFS; Pig or MapReduce performs ETL and massive joins; the results are loaded into the data warehouse; and Pig, MapReduce, or Hive handles aggregation and report generation. Reports are stored in Oracle/MySQL, some commercial BI tools sit on top, and Storm-on-YARN does stream processing. The problem with this architecture is that it is too slow: report generation is delayed by 2-6 hours, a huge join takes even longer, and interactive queries are nearly impossible. Workarounds such as pre-computing results and storing them for later queries were imperfect, since the results do not reflect real-time changes.
Yahoo considered Pig on Tez and Hive on Tez. Then Spark/Shark appeared and became Yahoo's holy grail. Thus the Hadoop + Spark architecture was born. The basic design is that Hadoop and Spark coexist on YARN, and because some SQL workloads require predictable quality of service, a number of large-memory "satellite" clusters are dedicated to running Shark. In this architecture Spark replaces Hadoop MapReduce for ETL, while Shark replaces the commercial BI/OLAP tools, taking on reports/dashboards and interactive/ad hoc queries and interfacing with desktop BI tools such as Tableau. The Spark cluster currently deployed at Yahoo has 112 nodes and 9.2TB of memory, and adding SSDs is under consideration.
Tim also outlined the future work. First, Pig/MapReduce will be replaced entirely, with all ETL tasks performed by Spark. Second, although the satellite clusters will remain, Shark-on-Spark-on-YARN will be deployed in 2014.
The third speaker was Yahoo engineer Gavin Li, who introduced the application of Spark to audience expansion. Audience expansion is a way of finding targeted users for ads: advertisers provide sample customers who watched the ad and bought the product; the system learns from them, finds more potential users, and targets the ads to them. The algorithm Yahoo uses is logistic regression, with input and intermediate data both at the terabyte scale. The original system was built on Hadoop Streaming: 2 million lines of code running 30,000+ mappers, 2,000 reducers, and 20+ jobs, taking 16 hours. Porting it directly to Spark would have required 6 engineers working for 3 quarters. Yahoo's approach was instead to build a transition layer that automatically translates Hadoop Streaming jobs into Spark jobs, which took only 2 quarters. The next step was to analyze and optimize performance.
Initially, the Spark version was only about twice as fast as the Hadoop Streaming version, far short of expectations. The scalability analysis and optimization that followed is not specific to audience expansion and is of general interest. The main factor affecting scalability is the shuffle, which works as follows: the mapper side writes intermediate results (uncompressed data) to files, one file per reducer partition; the reducer side reads a file into memory to compute, so the memory of the reducer machine determines the partition size; and after all mappers finish, the reducers begin pulling all the shuffle files and computing.
A closer look at the problems: because partition size is limited by memory, when the data volume is large the number of partitions, and therefore of shuffle files, becomes very large. In Yahoo's case, 3TB of compressed data (roughly 90TB uncompressed) requires 46,080 partitions/shuffle files. The first problem is on the mapper side: each mapper has to write 46,080 files concurrently, each with a 164KB I/O buffer; with 16 mappers per server, that demands 115GB of memory. The fix was to shrink the buffer to 12KB, cutting memory consumption to 10GB. The second problem is that huge numbers of small files make disk reads and writes inefficient and file system metadata expensive (deleting the files alone took 2 hours). The direct remedy is to compress data in memory on the reducer side, making memory effectively 10-100 times "larger", so the effective partition size grows and the number of shuffle files drops to about 1,600. Yahoo's patch even lets the reducer spill to disk when memory runs out, which solves the scalability problem completely.
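For reference, the knobs discussed above roughly correspond to Spark configuration properties; the sketch below uses 0.9-era property names, and the values are illustrative rather than Yahoo's exact settings (their changes were partly patches rather than pure configuration).

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-tuning-sketch")
  // Shrink the per-shuffle-file write buffer so that tens of thousands of
  // concurrently open files do not exhaust mapper-side memory.
  .set("spark.shuffle.file.buffer.kb", "12")
  // Compress shuffle output so that more logical data fits in a partition,
  // reducing the number of partitions (and hence shuffle files) needed.
  .set("spark.shuffle.compress", "true")
  // Let the reduce side spill to disk when aggregation no longer fits in
  // memory -- the behavior Yahoo's patch introduced.
  .set("spark.shuffle.spill", "true")

val sc = new SparkContext(conf)
```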
Gavin also described a scenario in which Spark's input data comes from Hadoop: if Spark uses the same hash function as Hadoop, repeated partitioning can be eliminated and the number of shuffle files greatly reduced. Finally, since reducers must wait for all mappers to finish before they can start pulling shuffle files, Yahoo raised maxBytesInFlight to improve network efficiency and allocated a number of threads equal to twice the number of physical cores to improve core utilization.
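The same principle applies within Spark itself: when two keyed datasets share the same Partitioner, a subsequent join can reuse that partitioning instead of shuffling again. A minimal sketch with toy data (not Yahoo's pipeline):

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._

object PartitionerReuseSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "partitioner-reuse-sketch")
    val partitioner = new HashPartitioner(4)

    // Two keyed datasets, each partitioned once with the same partitioner
    val events   = sc.parallelize(Seq((1, "click"), (2, "view")))
                     .partitionBy(partitioner).cache()
    val profiles = sc.parallelize(Seq((1, "US"), (2, "JP")))
                     .partitionBy(partitioner).cache()

    // Because both sides already share the partitioner,
    // this join needs no additional shuffle.
    events.join(profiles).collect().foreach(println)
  }
}
```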
Yahoo's solutions are general, and its contributions have entered Spark's codebase. This is also why many companies embrace Spark: on the one hand it is young and there are opportunities to contribute; on the other hand it is active and maturing quickly.
Adatao CEO Christopher Nguyen: A full-featured, enterprise-class big data analytics solution
Apart from Matei's keynote, this was one of the talks with the most hits on YouTube, and it came from Adatao. Adatao is one of the early contributors to the Spark community, and this presentation not only painted a vision of building data intelligence on top of Spark but also brought a wonderful demo. Christopher opened with the Titanic, which sank tragically while crossing the Atlantic; similarly, when crossing the ocean of big data, the yellow elephant (Hadoop) can only make it halfway.
Then came the substance. Christopher showed the first demo: Adatao pInsight is a narrative BI tool presented as a browser-based word-processing tool. Adatao's co-founder Michael typed natural-language-like action statements into the document; they were sent to the pAnalytics backend server in the cloud (EC2), where pAnalytics, running on top of Spark, did the data processing and mining and returned the results, which pInsight visualized and displayed in the document. Michael demonstrated grabbing flight data from data.gov, visualizing it on a map of the United States, viewing the data schema, and performing a variety of aggregations, analyses, and interactive visualizations, all very simply.
In addition to the business view, pInsight also provides a data science view. This view is based on an integrated development environment; interactive development can be done in R, making good use of R's plotting capabilities. Michael demonstrated how simple it is to analyze and predict delays using the flight data.
Interestingly, the business view also supports data science work, and its interactive language supports R and Python as well.
Christopher then introduced several business cases: an Internet service provider (ISP) moved from Hive + Tableau to Adatao for interactive, ad hoc queries; a customer-service provider does multi-channel (mobile, web, etc.) sales and product recommendations; a heavy-machinery manufacturer analyzes sensor data for predictive maintenance, having previously used MongoDB, which was hard to analyze, and has now moved to Spark; and a mobile advertising platform does ad targeting and conversion-rate prediction.
pAnalytics can complete all of these analyses within 10 seconds. Christopher argued that the gap between 10 seconds and 10 minutes is not just a factor of 60: once latency exceeds a certain threshold, data scientists change their behavior and lose some of their creativity. In addition, for linear modeling pAnalytics achieves a throughput of 1GB/second, which is quite impressive.
Christopher noted appreciatively that the strength of the Spark community is what allowed Adatao to achieve so much in such a short time, and he promised to contribute code back to the community in the future.
Databricks co-founder Patrick Wendell: Understanding the performance of Spark applications
For Spark programmers, this talk is a must-see. Patrick started from a simple example to illustrate the importance of understanding how Spark works: a groupByKey-based implementation can be 10-100 times slower than the reduceByKey-based version. He then introduced the concepts of RDD, DAG, stage, and task, and in particular the working mechanism inside a task, which was new even to many Spark engineers.
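A word-count-style sketch of that comparison (illustrative, not Patrick's exact code): reduceByKey combines values on the map side before the shuffle, so far less data crosses the network than with groupByKey.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local[2]", "wordcount-sketch")
val words = sc.textFile("hdfs:///data/text").flatMap(_.split(" ")).map((_, 1))

// Slow: ships every (word, 1) pair across the network, then sums on the reduce side.
val slow = words.groupByKey().mapValues(_.sum)

// Fast: pre-aggregates counts within each map-side partition before the shuffle.
val fast = words.reduceByKey(_ + _)
```

The same principle applies to any commutative, associative aggregation: push as much of the combining as possible to the map side.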
He demonstrated the web UI shipping with Spark 0.9, which exposes richer performance data for understanding the low-level details of a Spark application. Finally, he analyzed several common performance problems and their solutions:
- If the closure of a map operator captures a large data structure as a free variable, serialization is very expensive; use a broadcast variable instead.
- Filter-like operators can leave the resulting RDD sparse, producing many empty tasks (or tasks that execute in under 20ms); re-partition the RDD with coalesce or repartition.
- If the closure of a map operator needs heavy setup and teardown (such as opening a MongoDB connection and closing it after counting), that cost is paid for every record; with mapPartitions or mapWith the work can be done once per partition instead of once per record (see the sketch after this list).
- Data skew caused by a partition key that does not suit the data requires rethinking the key; for worker skew caused by stragglers, turn on the spark.speculation switch or manually take the problematic node offline.
- The shuffle between stages writes large amounts of data to files and relies on the OS buffer cache, so do not let the JVM heap consume all the memory; leave roughly 20% for the OS buffer cache. Do not write local shuffle files to /tmp; use spark.local.dir to configure multiple disks for high throughput.
- Control the number of reducers: too many means excessive task-launch overhead, too few means insufficient parallelism.
- Users often use the collect operator to drop back into Scala space and run logic serially on the driver, which is slow and prone to memory problems; prefer to compute and store data in parallel with Spark operators.
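Here is the mapPartitions pattern referenced in the list, as a minimal sketch; the JDBC URL, table, and query are hypothetical stand-ins for whatever per-partition setup (a MongoDB connection in the talk's example) is actually needed.

```scala
import java.sql.DriverManager
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "mappartitions-sketch")
val ids = sc.parallelize(1 to 100000)

val scores = ids.mapPartitions { iter =>
  // Pay the expensive setup once per partition, not once per record.
  val conn = DriverManager.getConnection("jdbc:postgresql://db.example.com/ads")
  val stmt = conn.prepareStatement("SELECT score FROM scores WHERE id = ?")
  val result = iter.map { id =>
    stmt.setInt(1, id)
    val rs = stmt.executeQuery()
    (id, if (rs.next()) rs.getDouble(1) else 0.0)
  }.toArray                      // materialize before closing the connection
  conn.close()
  result.iterator
}
```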
UC Berkeley AMPLab's Kay Ousterhout: The next-generation Spark scheduler, Sparrow
Kay is the daughter of Stanford professor John Ousterhout, creator of Tcl/Tk and co-inventor of the log-structured file system, and has a naturally deep grounding in systems. Spark's current scheduling is centralized: the Spark context on one node dispatches a large number of tasks to many worker nodes, and multiple users may share the same Spark context at the same time, creating a bottleneck. The trend is that jobs keep getting shorter, from the 10-minute jobs of early MapReduce, to second-level Spark jobs, to sub-second (<100ms) Spark Streaming batches; and as clusters grow beyond hundreds of machines, scheduling becomes the major bottleneck. As of version 0.8, Spark's scheduling throughput is about 1,500 tasks/second, which limits task duration and cluster size: with 10-second tasks it can support 1,000 16-core nodes, with second-level tasks at most 100 nodes, and with 100ms tasks at most 10 nodes. The marginal benefit of optimizing the existing scheduler is limited.
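A rough reconstruction of the arithmetic behind those limits (my own back-of-the-envelope, not from the talk): a scheduler that dispatches at most 1,500 tasks per second can keep roughly throughput × task duration tasks in flight, and each node needs one task per core, so

```latex
\text{max nodes} \approx \frac{1500\,\text{tasks/s} \times 10\,\text{s}}{16\,\text{cores/node}} \approx 940
```

which matches the quoted 1,000-node figure; at 1 second per task the bound drops to roughly 100 nodes, and at 100ms to roughly 10.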
Sparrow was born for this. First, it gives each user a separate scheduler; moreover, to support a larger cluster or shorter tasks, you simply add schedulers, improving both overall throughput and fault tolerance. Sparrow's design lets a scheduler pick the right worker for a task through a probing protocol across many workers; the details are in the video, slides, and the SOSP '13 paper. When tasks run for less than 3 seconds, Sparrow beats the current Spark scheduler, and when they run for less than 1.5 seconds it is far superior. Running TPC-H, Sparrow comes within 12% of an idealized optimal scheduler.
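For intuition only, here is a loose sketch of the "power of two choices" sampling idea that underlies Sparrow's probing; the real system adds batch sampling and late binding, and queueLength here is a hypothetical stand-in for an actual probe of the worker.

```scala
import scala.util.Random

// A toy worker with a locally tracked queue length (in Sparrow the scheduler
// learns this by sending probes to the sampled workers).
case class Worker(id: Int, var queueLength: Int)

def schedule(task: String, workers: Vector[Worker], probes: Int = 2): Worker = {
  val sampled = Random.shuffle(workers).take(probes)   // probe a random subset
  val target  = sampled.minBy(_.queueLength)           // pick the shortest queue
  target.queueLength += 1                              // place the task there
  target
}
```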
Intel Software and Services Group chief engineer Jason Dai (Dai Jinquan): Real-time analytics processing
Jason's team has been trying to build streaming, interactive queries, and multi-iteration computation (graph computing and machine learning) on a single Spark stack. He discussed three cases. The first is real-time log aggregation and analysis, using a Kafka + Spark mini-batch approach, with a planned migration to Spark Streaming. The second is interactive query based on Spark/Shark; he specifically described time-series analysis that counts unique event occurrences (such as unique views) using two-level aggregation. The third is complex machine learning and graph computation; he introduced an N-degree association problem used to cluster a video-similarity graph for video recommendation. These cases come from real production needs, such as those of Youku Tudou.
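As a rough illustration of one way to read the "two-level aggregation" idea for unique-event counting (my sketch, not Intel's code; the input layout and field names are assumptions): first deduplicate (key, user) pairs, then count the survivors per key.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local[2]", "unique-view-sketch")

// Hypothetical input: tab-separated (videoId, hour, userId) records
val events = sc.textFile("hdfs:///logs/views").map { line =>
  val Array(video, hour, user) = line.split('\t')
  ((video, hour), user)
}

// Level 1: deduplicate users within each (video, hour) key.
val distinctViews = events.distinct()

// Level 2: count the remaining (key, user) pairs per key.
val uniqueViews = distinctViews.mapValues(_ => 1L).reduceByKey(_ + _)
```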