http://blog.csdn.net/pipisorry/article/details/43235263
Introduction
DPark is a Mesos-based open-source distributed computing framework developed by Douban. It is a Python clone of Spark, written by Davies Liu, the author of BeansDB. Like MapReduce, it is a cluster computing framework, but it is more flexible: jobs can be written and distributed easily in Python, and it provides richer functionality with better support for iterative computation. DPark's computational model rests on two central ideas: parallel operations on distributed datasets, and a limited set of shared variables that can be accessed from different machines during a computation. DPark has one especially important feature: a distributed dataset can be reused across many different parallel loops. This distinguishes it from acyclic data-flow frameworks such as Hadoop and Dryad.
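The reuse of a distributed dataset across parallel loops is the key idea here. The toy Python sketch below (this is not the DPark API, just an illustration with a hypothetical `CachedDataset` class) shows why caching matters for iterative jobs: the source data is loaded once and reused by every pass, instead of being re-read each time as in a pure data-flow model.

```python
# Toy stand-in (NOT the DPark API) for a distributed dataset that can
# be cached and reused across many iterative passes.

class CachedDataset:
    def __init__(self, load):
        self._load = load        # function that produces the records
        self._cached = None      # filled on first use after cache()
        self._use_cache = False
        self.load_count = 0      # how many times the source was read

    def cache(self):
        self._use_cache = True
        return self

    def _records(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.load_count += 1
        records = list(self._load())
        if self._use_cache:
            self._cached = records
        return records

    def map_sum(self, fn):
        # Stand-in for a parallel map + reduce over the dataset.
        return sum(fn(x) for x in self._records())

data = CachedDataset(lambda: range(1000)).cache()
# Ten "iterations" over the same dataset: the source is read only once.
totals = [data.map_sum(lambda x, i=i: x * i) for i in range(10)]
print(data.load_count)  # 1
```

In a MapReduce-style framework each of the ten passes would re-read the input; with a reusable cached dataset, only the first pass touches the source.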
(The logo is a piranha from the Amazon river basin: a school of piranhas can devour a cow within minutes, which nicely captures the efficiency of splitting a simple task among many cooperating workers, as well as the cruelty of the world.)
Official Chinese wiki: https://github.com/jackfengji/test_pro/wiki
Google Group: https://groups.google.com/forum/#!forum/dpark-users
Slides ("A MapReduce framework that supports iterative computation", PDF): http://velocity.oreilly.com.cn/2011/ppts/dpark.pdf
Project address: https://github.com/douban/dpark/
Differences from Spark
Spark runs each task in a thread, but DPark, constrained by Python's GIL, runs each task in a separate process. Spark supports Hadoop's file-system interface, while DPark supports only POSIX file interfaces.
Because of the differences between Python and Scala, the two frameworks differ in several ways:
- The most important difference is threads versus processes. Spark runs each task in a thread, while DPark uses a process. The reason is Python's GIL: even on a multi-core machine, multiple threads within one Python process cannot truly execute concurrently. Cluster machines today are mostly multi-core, and the master assigns one task per CPU on a compute node in order to use each node fully. If every task ran in a thread, the GIL would allow at most one thread per node to run at a time, greatly reducing computational speed, so DPark has to run each task in its own process. The trade-off is that sharing cached data in memory between tasks on the same compute node becomes relatively complex and carries some additional overhead, which DPark tries to keep as low as possible.
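The process-per-task choice described above can be sketched with Python's standard `multiprocessing` module (a minimal illustration, not DPark's scheduler): a CPU-bound function is fanned out to a pool of worker processes, one per core, so the parent's GIL does not serialize the work.

```python
# Because of the GIL, CPU-bound Python threads cannot run in parallel,
# so DPark-style frameworks run one worker *process* per CPU instead.
# Sketch: fan a CPU-bound function out to a pool of processes.
import math
from multiprocessing import Pool, cpu_count

def run():
    # One worker process per core, like one task per CPU on a node.
    with Pool(processes=min(4, cpu_count())) as pool:
        # math.factorial is CPU-bound; each call executes in a worker
        # process, so the parent's GIL does not serialize them.
        return pool.map(math.factorial, range(10))

if __name__ == "__main__":
    print(run())  # [1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880]
```

The cost this section mentions is visible here too: the results must be pickled and sent back from the workers, which is the kind of inter-process overhead a thread-based runtime like Spark's avoids.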
- The supported file systems are different. Spark uses the file-system interface provided by the Hadoop framework, so it supports any file system and file format that Hadoop supports. DPark cannot use Hadoop's code and interfaces directly, so it can only use POSIX file systems, or implement a specific interface for a given file system (see textFile). Currently DPark supports any file system that can be mounted via FUSE or accessed in a similar way, including MooseFS (MFS), NFS, and similar systems; HDFS can also be used through its FUSE interface. DPark additionally implements a dedicated RDD for MooseFS that bypasses FUSE and obtains the file-distribution information directly, which makes local I/O optimization easier.
Ref:
- DPark installation and related resources
- Beyond MapReduce: an overview of graph computation frameworks
- DPark: Spark's Python clone