http://blog.csdn.net/pipisorry/article/details/43235263
Introduction
DPark is a Mesos-based open-source distributed computing framework developed by Douban. It is a Python clone of Spark, written by Davies Liu, the author of BeansDB. Like MapReduce, it is a cluster computing framework, but it is more flexible: jobs can be written and distributed easily in Python, and it provides richer functionality with better support for iterative computation. DPark's computational model rests on two central ideas: parallel operations on distributed datasets, and a limited set of shared variables that can be accessed from different machines during a computation. DPark has one especially important feature: a distributed dataset can be reused across many different parallel loops. This distinguishes it from acyclic data-flow frameworks such as Hadoop and Dryad.
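The reuse of a distributed dataset across parallel loops is the key idea here. The toy Python sketch below (this is not the DPark API, just an illustration with a hypothetical `CachedDataset` class) shows why caching matters for iterative jobs: the source data is loaded once and reused by every pass, instead of being re-read each time as in a pure data-flow model.

```python
# Toy stand-in (NOT the DPark API) for a distributed dataset that can
# be cached and reused across many iterative passes.

class CachedDataset:
    def __init__(self, load):
        self._load = load        # function that produces the records
        self._cached = None      # filled on first use after cache()
        self._use_cache = False
        self.load_count = 0      # how many times the source was read

    def cache(self):
        self._use_cache = True
        return self

    def _records(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.load_count += 1
        records = list(self._load())
        if self._use_cache:
            self._cached = records
        return records

    def map_sum(self, fn):
        # Stand-in for a parallel map + reduce over the dataset.
        return sum(fn(x) for x in self._records())

data = CachedDataset(lambda: range(1000)).cache()
# Ten "iterations" over the same dataset: the source is read only once.
totals = [data.map_sum(lambda x, i=i: x * i) for i in range(10)]
print(data.load_count)  # 1
```

In a MapReduce-style framework each of the ten passes would re-read the input; with a reusable cached dataset, only the first pass touches the source.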
(The logo is a piranha from the Amazon river basin: a school of piranhas can devour a cow within minutes, which nicely captures the efficiency of splitting a simple task among many cooperating workers, as well as the cruelty of the world.)
Official Chinese wiki: https://github.com/jackfengji/test_pro/wiki
Google Group: https://groups.google.com/forum/#!forum/dpark-users
Slides ("A MapReduce framework that supports iterative computation", PDF): http://velocity.oreilly.com.cn/2011/ppts/dpark.pdf
Project address: https://github.com/douban/dpark/
Differences from Spark
Spark runs each task in a thread, but DPark, constrained by Python's GIL, runs each task in a separate process. Spark supports Hadoop's file-system interface, while DPark supports only POSIX file interfaces.
Because of the differences between Python and Scala, the two frameworks differ in several ways:
- The most important difference is threads versus processes. Spark runs each task in a thread, while DPark uses a process. The reason is Python's GIL: even on a multi-core machine, multiple threads within one Python process cannot truly execute concurrently. Cluster machines today are mostly multi-core, and the master assigns one task per CPU on a compute node in order to use each node fully. If every task ran in a thread, the GIL would allow at most one thread per node to run at a time, greatly reducing computational speed, so DPark has to run each task in its own process. The trade-off is that sharing cached data in memory between tasks on the same compute node becomes relatively complex and carries some additional overhead, which DPark tries to keep as low as possible.
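The process-per-task choice described above can be sketched with Python's standard `multiprocessing` module (a minimal illustration, not DPark's scheduler): a CPU-bound function is fanned out to a pool of worker processes, one per core, so the parent's GIL does not serialize the work.

```python
# Because of the GIL, CPU-bound Python threads cannot run in parallel,
# so DPark-style frameworks run one worker *process* per CPU instead.
# Sketch: fan a CPU-bound function out to a pool of processes.
import math
from multiprocessing import Pool, cpu_count

def run():
    # One worker process per core, like one task per CPU on a node.
    with Pool(processes=min(4, cpu_count())) as pool:
        # math.factorial is CPU-bound; each call executes in a worker
        # process, so the parent's GIL does not serialize them.
        return pool.map(math.factorial, range(10))

if __name__ == "__main__":
    print(run())  # [1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880]
```

The cost this section mentions is visible here too: the results must be pickled and sent back from the workers, which is the kind of inter-process overhead a thread-based runtime like Spark's avoids.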
- The supported file systems are different. Spark uses the file-system interface provided by the Hadoop framework, so it supports any file system and file format that Hadoop supports. DPark cannot use Hadoop's code and interfaces directly, so it can only use POSIX file systems, or implement a specific interface for a given file system (see textFile). Currently DPark supports any file system that can be mounted via FUSE or accessed in a similar way, including MooseFS (MFS), NFS, and similar systems; HDFS can also be used through its FUSE interface. DPark additionally implements a dedicated RDD for MooseFS that bypasses FUSE and obtains the file-distribution information directly, which makes local I/O optimization easier.
Ref:
- DPark installation and related resources
- Beyond MapReduce: an overview of graph computation frameworks
- DPark: Spark's Python clone