Spark's Python clone

Source: Internet
Author: User
Tags: posix

http://blog.csdn.net/pipisorry/article/details/43235263

Introduction

Dpark is a Mesos-based open source distributed computing framework developed by Douban. It is a Python clone of Spark, written by Davies, the author of BeansDB. Like Spark, it is a cluster computing framework that resembles MapReduce but is more flexible: distributed computations can be written easily in Python, and it provides additional functionality that better supports iterative computation. Dpark's computational model is based on two central ideas: parallel computation over distributed datasets, and a limited set of shared variable types that can be accessed from different machines during a computation. A very important feature of Dpark is that a distributed dataset can be reused across many different parallel loops. This distinguishes it from dataflow-style frameworks such as Hadoop and Dryad.
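The idea of reusing one dataset across many parallel loops can be illustrated with a minimal in-memory sketch. The class `ToyDataset` and the function `fit_mean` are hypothetical names invented for this illustration; they are not Dpark's API, and the "partitions" here are plain Python lists rather than data spread over a cluster.

```python
from functools import reduce


class ToyDataset:
    """Hypothetical in-memory stand-in for a distributed dataset (not Dpark's API)."""

    def __init__(self, partitions):
        # Each inner list plays the role of one partition held by one worker.
        self.partitions = [list(p) for p in partitions]

    def map(self, fn):
        # Apply fn to every element, partition by partition.
        return ToyDataset([[fn(x) for x in part] for part in self.partitions])

    def reduce(self, fn):
        # Reduce inside each partition first, then combine the partial results.
        partials = [reduce(fn, part) for part in self.partitions if part]
        return reduce(fn, partials)


def fit_mean(points, steps=100, rate=0.1):
    """Iteratively estimate the mean, reusing the same dataset in every step."""
    n = sum(len(p) for p in points.partitions)
    estimate = 0.0  # value every task reads during one iteration
    for _ in range(steps):
        # One "parallel loop": a map over all partitions followed by a reduce.
        grad = points.map(lambda x: estimate - x).reduce(lambda a, b: a + b) / n
        estimate -= rate * grad
    return estimate


data = ToyDataset([[1.0, 2.0], [3.0, 4.0], [5.0]])
result = fit_mean(data)
```

The dataset `data` is built once and then consumed by every iteration of the loop, which is the pattern that iterative algorithms rely on. In a real framework the partitions would live on different machines and could be kept in memory between iterations.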

{The logo is a piranha from the Amazon basin. A school of piranhas can strip a cow in a minute, which fully embodies both the efficiency of a group collaborating on simple, divided tasks and the cruelty of the world.}


Official Chinese Wiki: https://github.com/jackfengji/test_pro/wiki

Google Group: https://groups.google.com/forum/#!forum/dpark-users

Slides (PDF), "A MapReduce framework that supports iterative computation": http://velocity.oreilly.com.cn/2011/ppts/dpark.pdf

Project Address: https://github.com/douban/dpark/


Differences from Spark

Spark runs each task in a thread, but Dpark, constrained by Python's GIL, runs each task in a process. Spark supports Hadoop's file system interface, while Dpark supports only the POSIX file interface.
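The choice of one process per task can be sketched with Python's standard `concurrent.futures` module. This is only an illustration of the general technique, not Dpark's scheduler; `count_primes` and `run_tasks` are hypothetical names, and the prime-counting loop is just a stand-in for a CPU-bound task.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(bounds):
    """CPU-bound stand-in for one task: count primes in [lo, hi)."""
    lo, hi = bounds
    total = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            total += 1
    return total


def run_tasks(chunks, pool_cls=ProcessPoolExecutor, workers=4):
    """Run one task per chunk and return the results in input order.

    With ProcessPoolExecutor each worker is a separate interpreter with its
    own GIL, so CPU-bound tasks can use several cores. With
    ThreadPoolExecutor all tasks share one GIL in CPython builds that have it.
    """
    with pool_cls(max_workers=workers) as pool:
        return list(pool.map(count_primes, chunks))


# Four independent tasks, one per range of integers.
chunks = [(0, 25), (25, 50), (50, 75), (75, 100)]
```

Calling `run_tasks(chunks)` from under an `if __name__ == "__main__":` guard runs the chunks in worker processes, while `run_tasks(chunks, pool_cls=ThreadPoolExecutor)` runs the same tasks in threads. Both return the same results; only the process version can keep several cores busy with pure Python code, at the cost of pickling arguments and results between processes.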

Because Python and Scala have different characteristics, the two frameworks differ in several ways:

    1. The most important difference is threads versus processes. Spark runs each task in a thread, while Dpark runs each task in a process. The reason is Python's GIL (global interpreter lock): even on a multi-core machine, multiple threads in one Python process cannot actually execute in parallel. Machines in today's clusters are mostly multi-core, and the master assigns each task to one CPU on a compute node in order to make full use of every node. If each task ran in a thread, the GIL would allow at most one thread per compute node to run at any time, greatly reducing computation speed, so we have to run each task in a process. This makes sharing cached memory between tasks on the same compute node relatively complex and adds some overhead, which we try to keep as low as possible.
    2. The supported file systems differ. Spark uses the file system interface provided by the Hadoop framework, so it supports any file system and file format that Hadoop supports. Dpark cannot use Hadoop's code and interfaces directly, so it can only use POSIX file systems, or implement a specific interface for a particular file system, using textFile as a reference. Currently Dpark supports every file system that can be accessed through FUSE or a similar mechanism, including MFS, NFS and similar systems; HDFS also has a FUSE interface that can be used. Dpark implements an RDD specifically for the MFS file system, which bypasses FUSE and obtains the file's distribution information, making it easier to optimize for I/O locality.
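To make the POSIX-only approach concrete, here is a minimal sketch of how a text file on any mounted file system could be cut into byte-range splits and read with nothing but `open`, `seek` and `readline`. It is not taken from Dpark's source; `split_ranges` and `read_split` are hypothetical helpers, and the rule that a line belongs to the split containing its first byte is an assumption borrowed from the common Hadoop-style convention.

```python
import os
import tempfile


def split_ranges(path, split_size):
    """Cut a file into consecutive (start, end) byte ranges of at most split_size."""
    size = os.path.getsize(path)
    return [(s, min(s + split_size, size)) for s in range(0, size, split_size)]


def read_split(path, start, end):
    """Return the lines whose first byte lies in [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard up to the next newline, so that
            # reading resumes at the first line beginning at or after start.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break  # end of file
            lines.append(line.rstrip(b"\n").decode("utf-8"))
    return lines


# Demo on a small temporary file: every line is recovered exactly once,
# even though the 7-byte splits cut through the middle of lines.
with tempfile.TemporaryDirectory() as tmpdir:
    sample = os.path.join(tmpdir, "sample.txt")
    with open(sample, "wb") as f:
        f.write(b"alpha\nbeta\ngamma delta\nepsilon\n")
    recovered = [line
                 for start, end in split_ranges(sample, 7)
                 for line in read_split(sample, start, end)]
```

Because each split is described only by a path and two offsets, independent worker processes can read their own splits without coordination, as long as they all see the same file through a shared mount such as NFS or a FUSE file system.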




