1. Background
The Spark platform achieves high computational performance through its distributed in-memory computing model. However, this model is a double-edged sword: while it improves performance, it must also confront the problems of distributed data storage. The main problems are as follows:
1) When two Spark jobs need to share data, the data must be written to disk. For example, job 1 first writes its generated data to HDFS, and job 2 then reads that data back from HDFS. These disk reads and writes can become a performance bottleneck.
2) Since Spark caches data in its own JVM, when a Spark program crashes the JVM process exits and the cached data is lost, so the data must be read again from HDFS when the job is restarted.
3) When two Spark jobs need to operate on the same data, the JVM of each job must cache its own copy, which not only wastes resources but can also trigger frequent garbage collection, degrading performance.
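The first problem above can be sketched in plain Python, with two functions standing in for two Spark jobs and a local file standing in for HDFS (the function names and file layout are invented for illustration):

```python
import json
import os
import tempfile

def job1_produce(path):
    # Job 1: compute a result and materialize it to shared storage
    # (in a real cluster this would be a write to HDFS).
    data = [x * x for x in range(5)]
    with open(path, "w") as f:
        json.dump(data, f)

def job2_consume(path):
    # Job 2: it cannot see job 1's in-memory data, so it must
    # re-read the materialized result from shared storage.
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "job1_output.json")
job1_produce(path)
print(job2_consume(path))  # the disk round-trip is the handoff cost
```

The point of the sketch is that the only channel between the two jobs is the storage layer, so every handoff pays the full write-then-read cost.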
Careful analysis of these issues shows that the root cause is data storage: because the computing platform tries to manage its own storage, Spark cannot focus on the computation itself, and overall execution efficiency decreases. A dedicated distributed in-memory file system is therefore needed to relieve Spark's memory pressure while still letting Spark read and write data quickly and easily. By separating storage and data access from Spark, Spark can concentrate on computation itself, and this finer division of labour yields higher execution efficiency.
2. Research
The industry's most popular distributed in-memory file system, Tachyon, was deployed and configured in a series of environments and tested with Spark SQL, but it showed no increase in speed.
Tachyon is also Spark's default off-heap memory scheme, so Spark's installation package has the Tachyon client built in. Configuring Spark to use it for testing showed a significant increase in speed, but with limited functionality.
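As a rough illustration of what that configuration looked like (the property names below are from memory of the Spark 1.5-era documentation and changed across 1.x releases, so treat them as assumptions to verify against the version in use):

```
# spark-defaults.conf (Spark 1.5.x era; property names are version-dependent)
spark.externalBlockStore.url        tachyon://tachyon-master:19998
spark.externalBlockStore.baseDir    /spark_offheap
```

With this in place, RDDs persisted with `StorageLevel.OFF_HEAP` are stored in Tachyon rather than in the executor JVM heap, which is what decouples the cached data's lifetime from the Spark process.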
At the same time, another distributed in-memory file system, the Ignite File System (IGFS), was investigated. IGFS provides functionality similar to Hadoop HDFS, but purely in memory. In fact, in addition to its own API, IGFS implements Hadoop's FileSystem API, so it can be added transparently to a Hadoop or Spark environment.
IGFS splits each file's data into separate blocks and saves them in a distributed in-memory cache. However, unlike Hadoop HDFS, IGFS does not need a name node: it determines the location of file data automatically through a hash function.
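IGFS's actual placement logic is internal to Ignite; the following is only a minimal sketch of the idea (node names and the hashing scheme are invented), showing how hashing a (file, block) pair lets any client locate a block deterministically without a central name node:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def block_owner(file_id: str, block_index: int, nodes=NODES) -> str:
    # Hash the (file, block) pair to a node deterministically: every
    # client computes the same placement independently, so no name
    # node lookup is needed to find a block.
    key = f"{file_id}:{block_index}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# Two independent clients agree on the owner of the same block.
print(block_owner("/data/log.txt", 0) == block_owner("/data/log.txt", 0))
```

The trade-off versus a name node is that placement is fixed by the hash, so rebalancing requires a more elaborate scheme (real systems typically use consistent or rendezvous hashing for this reason).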
IGFS can be deployed independently or on top of HDFS, in which case it becomes a transparent caching layer for the files stored in HDFS.
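In the layered deployment, IGFS is pointed at HDFS as a "secondary file system" in the Ignite configuration. The bean and property names below are recalled from the Ignite 1.x documentation and should be checked against the version actually deployed; the namenode address is a placeholder:

```xml
<!-- Sketch of an IGFS-over-HDFS configuration (verify names against your Ignite version) -->
<bean class="org.apache.ignite.configuration.FileSystemConfiguration">
    <property name="name" value="igfs"/>
    <property name="secondaryFileSystem">
        <bean class="org.apache.ignite.hadoop.fs.IgniteHadoopIgfsSecondaryFileSystem">
            <constructor-arg value="hdfs://namenode:9000/"/>
        </bean>
    </property>
</bean>
```

Reads that miss the in-memory cache fall through to HDFS, and writes are propagated to it, which is what makes the cache layer transparent to Hadoop and Spark clients.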
3. Test results
1) Ignite Test
The chauffeur-drive (designated-driver) test case was used to compare the time spent under different data volumes and Spark configuration environments, running a COUNT(*) statement; the execution results are shown below:
Based on the test results, Ignite does not significantly improve computation speed when storage and computation reside in the same cluster. Considering the actual application, the volume of business data computed across clusters is too large to fit in memory storage.
2) Spark built-in Tachyon client test
4. Change the table structure to a non-map structure
Observing the application, a time-consuming statement queries a column inside a map-structured table. Extracting that column into a table with a non-map structure and running the same query yields a significant increase in speed:
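A hedged HiveQL sketch of this change (the table and column names are invented for illustration): extract the frequently queried map key into a flat column once, so later queries avoid map deserialization entirely.

```sql
-- Original table: `props` is a MAP<STRING,STRING>; every query pays
-- the cost of deserializing the whole map:
--   SELECT props['city'] FROM events WHERE dt = '2016-01-01';

-- One-off extraction into a non-map table:
CREATE TABLE events_flat AS
SELECT event_id, props['city'] AS city
FROM events;

-- Subsequent queries read a plain STRING column instead:
SELECT city, COUNT(*) FROM events_flat GROUP BY city;
```

The trade-off is that the flat table must be kept in sync with the source table, so this is most useful for a small number of hot keys queried repeatedly.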