Strata + Hadoop World 2016 has just wrapped up in San Jose. For big data practitioners, it is an event that demands attention. One highlight was the keynote by Michael Franklin of UC Berkeley on the future development of BDAS. Why is it so noteworthy? BDAS is the open-source software stack for big data analytics from Berkeley's AMPLab. It includes Spark, which has been red-hot for the past two years, the rising distributed memory system Alluxio (formerly Tachyon), and of course the well-known open-source resource manager Mesos. It is fair to say that AMPLab has led a wave of technical innovation in big data in recent years, so how could we not pay attention to where BDAS is headed and to the technologies it introduces?
The new BDAS
What the keynote covered can also be found on the AMPLab website. The BDAS technology stack is laid out as follows:
At the bottom of the stack is the resource management layer, built on two technologies that most big data practitioners already know: Mesos, whose development AMPLab led, and YARN from the Hadoop community. Each has its strengths and weaknesses. I covered them on my WeChat public account last year, so I will not repeat the details here.
Above the resource management layer sits the storage layer, which includes HDFS, S3, Ceph, and other widely known technologies. BDAS relies on these familiar distributed file systems for storage. On top of them, AMPLab built the distributed memory system Alluxio (formerly known as Tachyon). Big data practitioners in China already know Alluxio well: Baidu has achieved very good performance gains with it, and TalkingData is also testing it and expects to adopt it in our technology stack in the near future.
Succinct may be unfamiliar to many readers. It is AMPLab's open-source solution for efficient retrieval over compressed data. Its basic idea is to store data in compressed suffix arrays, which gives both compact storage and efficient retrieval. I will cover the technical details in a separate article.
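To make the suffix-array idea concrete, here is a minimal Python sketch. It is not Succinct's API and it does no compression at all: it builds a plain, uncompressed suffix array with a naive sort and uses binary search to locate a pattern. The function names and the sample text are made up for illustration. Succinct's contribution is doing this kind of lookup over a compressed representation, which this sketch does not attempt.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start offsets of all suffixes, sorted lexicographically.

    Naive construction for illustration only; real systems use
    linear-time algorithms and compressed representations.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return every offset in text where pattern occurs, in ascending order."""
    m = len(pattern)

    # First binary search: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Second binary search: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the pattern.
    return sorted(sa[start:lo])


text = "mesos spark alluxio spark"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "spark"))  # [6, 20]
```

Because all suffixes that start with the pattern sit next to each other in the sorted array, a lookup costs two binary searches instead of a scan of the whole text.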
The processing engine is Spark Core, which needs little introduction from me. Articles about Spark are already plentiful in China, and the technical principles of RDDs have become a staple interview topic.
In the access and interface layer, Spark SQL has been the Spark community's focus for the past two years, and there is a great deal of technical material on it, including the DataFrame and Dataset concepts. Spark Streaming has drawn some criticism. According to recent Spark news, Spark 2.0 will improve Spark Streaming substantially, so let's wait and see when Spark 2.0 is released.
I was following BlinkDB last year. Its idea is to process large data sets by sampling, but the project does not seem to be active: it has stayed at the alpha 0.2.0 release for two years without change.
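The sampling idea can be sketched in a few lines of plain Python. This is not BlinkDB code or its query interface. It only shows the general principle of answering an aggregate from a random sample and reporting an error margin alongside the estimate. The function name, the synthetic data, and the sample size are assumptions chosen for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a uniform random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval based on the normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * std_error


# Synthetic "large" data set: 100,000 values drawn uniformly from [0, 100].
data_rng = random.Random(42)
population = [data_rng.uniform(0, 100) for _ in range(100_000)]

estimate, margin = approximate_mean(population, sample_size=1_000)
exact = statistics.fmean(population)
print(f"approximate mean: {estimate:.2f} +/- {margin:.2f}")
print(f"exact mean:       {exact:.2f}")
```

The sample here is 1% of the data, so the aggregate touches far fewer rows. The price is an answer that comes with an error bar instead of an exact value.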
SampleClean and AMPCrowd are open-source toolkits for data cleaning, somewhat similar to the Dayu system that we are building at TalkingData. I will introduce them separately.
SparkR also needs little introduction: it supports running R on Spark. GraphX is the graph algorithm package on Spark, and I believe more and more people will pay attention to graph algorithms in the future.
Splash is a parallel computing framework for stochastic learning algorithms on Spark, supporting SGD, SDCA, and others.
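For readers who have not met SGD, the following is a minimal single-machine sketch in plain Python. It is not Splash's API and does not reproduce its parallelization scheme. It only shows the kind of one-example-at-a-time update that such a framework distributes. The function name, the toy data, and the hyperparameters are assumptions for illustration.

```python
import random


def sgd_linear_regression(points, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w * x + b with stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    data = list(points)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a fresh random order
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Noise-free toy data generated from y = 3x + 1.
points = [(i / 10.0, 3.0 * (i / 10.0) + 1.0) for i in range(11)]
w, b = sgd_linear_regression(points)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Each update depends on the current values of w and b, which is what makes this family of algorithms awkward to parallelize and why a dedicated framework is useful.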
Velox is a model system that AMPLab is developing to support real-time personalized predictions. Michael Franklin gave Velox particular attention in this keynote, which shows how much importance AMPLab attaches to it. According to the description in its source code, it supports real-time personalized prediction, integrates with Spark and KeystoneML, and supports both offline batch and online model training. I will cover the details in a dedicated article.
KeystoneML is a system developed by AMPLab to simplify the construction of machine learning pipelines, and it is still under development. With KeystoneML, it is easy to define a machine learning pipeline and convenient to parallelize it on Spark. I will also write a separate introduction to KeystoneML later.
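The pipeline idea can be illustrated with a small Python sketch. KeystoneML itself is a Scala library, and nothing below is its API. The Stage class, the and_then method, and the three stages are hypothetical names invented here to show how small transformations chain into one pipeline.

```python
class Stage:
    """A pipeline stage wrapping a single-argument function (hypothetical)."""

    def __init__(self, fn):
        self.fn = fn

    def and_then(self, next_stage):
        """Return a new stage that runs this stage, then next_stage."""
        return Stage(lambda value: next_stage.fn(self.fn(value)))

    def __call__(self, value):
        return self.fn(value)


# Three tiny stages of a text featurization pipeline.
lowercase = Stage(str.lower)
tokenize = Stage(str.split)
count_terms = Stage(lambda tokens: {t: tokens.count(t) for t in set(tokens)})

featurize = lowercase.and_then(tokenize).and_then(count_terms)

documents = ["Spark spark Mesos", "Alluxio on Mesos"]
features = [featurize(doc) for doc in documents]
print(features[0] == {"spark": 2, "mesos": 1})  # True
```

Because the composed pipeline is a single function applied to each record independently, mapping it over a distributed collection is a natural way to parallelize it. The list comprehension above is the single-machine stand-in for that map.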
MLlib needs no lengthy introduction: it is the machine learning library on Spark, and many companies already use it for a variety of machine learning work.
On BDAS (Berkeley Data Analytics Stack)