Distributed Memory File System

Source: Internet
Author: User

1. Background

The Spark platform achieves high computational performance through its distributed in-memory computing model. That model is a double-edged sword, however: while it improves performance, it also has to deal with the problem of distributed data storage. The main problems are as follows:

1) When two Spark jobs need to share data, the data must be written to disk. For example, job 1 first writes the data it generates to HDFS, and job 2 then reads that data from HDFS. These disk reads and writes can become a performance bottleneck.

2) Because Spark caches data inside its own JVM, a crash of the Spark program terminates the JVM process and the cached data is lost. When the job is restarted, the data has to be read from HDFS again.

3) When two Spark jobs operate on the same data, the JVM of each job has to cache its own copy of the data. This not only wastes resources but can also easily trigger frequent garbage collection, which degrades performance.

A careful analysis of these issues shows that the root cause is data storage: because the computing platform tries to manage its own storage, Spark cannot focus on the computation itself, and overall execution efficiency drops. A dedicated distributed memory file system is therefore needed. It would reduce Spark's memory pressure while still giving Spark fast, convenient reads and writes of in-memory data. Separating storage and data reading and writing from Spark lets Spark focus on computation, so that higher execution efficiency is achieved through a finer division of labour.

2. Research

Tachyon, the most popular distributed memory file system in the industry, was deployed and configured in a series of environments and tested with Spark SQL. No increase in speed was observed.

Tachyon is also the default off-heap memory scheme for Spark, so the Spark installation package has a built-in Tachyon client. Testing with Spark configured to use this client showed a significant increase in speed, but with limited functionality.

At the same time, another distributed memory file system, the Ignite File System (IGFS), was investigated. IGFS provides functionality similar to Hadoop HDFS, but only in memory. In addition to its own API, IGFS implements Hadoop's FileSystem API, so it can be added transparently to a Hadoop or Spark environment.
IGFS splits the data of each file into separate blocks and stores them in a distributed in-memory cache. Unlike Hadoop HDFS, however, IGFS does not need a name node; it determines the location of file data automatically through a hash function.
IGFS can be deployed independently or on top of HDFS, in which case it becomes a transparent caching layer for files stored in HDFS.
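The idea of locating file blocks with a hash function instead of a name node can be sketched as follows. This is a simplified illustration, not IGFS's actual placement algorithm: the block size, the node names, and the helper functions `split_into_blocks`, `node_for_block`, and `place_file` are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # bytes per block; unrealistically small, for illustration only
NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut the file contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def node_for_block(path: str, block_index: int, nodes: List[str] = NODES) -> str:
    """Pick the node for a block by hashing (path, block index).

    Any client can recompute this locally, so no central name node
    has to be asked where a block lives.
    """
    key = f"{path}#{block_index}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def place_file(path: str, data: bytes) -> Dict[Tuple[str, int], Tuple[str, bytes]]:
    """Map every (path, block index) to its (node, block contents)."""
    return {
        (path, i): (node_for_block(path, i), block)
        for i, block in enumerate(split_into_blocks(data))
    }


if __name__ == "__main__":
    for (path, index), (node, block) in place_file("/data/orders.csv", b"abcdefghij").items():
        print(path, index, node, block)
```

The simple modulo mapping used here moves most blocks whenever the node list changes; it is chosen only to keep the sketch short.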

3. Test results

1) Ignite Test

The chauffeur-drive test case, plus a COUNT(*) statement, was used to compare the time spent under different data volumes and Spark configurations, as shown in the execution result:

Based on the test results, Ignite does not significantly improve computation speed when storage and computation are in the same cluster. In the actual application, moreover, the volume of business data computed across clusters is too large to be held in memory storage.

2) Spark built-in Tachyon client test

4. Change the table structure to a non-map structure

Observation of the application showed that one time-consuming statement queries a column in a table with a map structure. That column was extracted into a new table with a non-map structure, and the same query then ran significantly faster:
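The restructuring itself can be illustrated with a small sketch in plain Python. It shows only the shape of the transformation, not the original tables or the measured timings; the field names `order_id`, `attrs`, and `city` and the helper `extract_column` are hypothetical.

```python
from typing import Any, Dict, List


def extract_column(
    rows: List[Dict[str, Any]], map_field: str, key: str, new_column: str
) -> List[Dict[str, Any]]:
    """Build a non-map table: drop the map field and keep one of its keys
    as an ordinary top-level column (None when the key is absent)."""
    flat = []
    for row in rows:
        new_row = {name: value for name, value in row.items() if name != map_field}
        new_row[new_column] = row.get(map_field, {}).get(key)
        flat.append(new_row)
    return flat


# Hypothetical rows of a table whose "attrs" column is a map.
rows = [
    {"order_id": 1, "attrs": {"city": "Beijing", "fare": "35"}},
    {"order_id": 2, "attrs": {"fare": "18"}},
]

flat_rows = extract_column(rows, "attrs", "city", "city")

# The same query now reads a plain column instead of looking inside a map.
cities = [row["city"] for row in flat_rows]
```

Only the queried key is carried over in this sketch; a real migration would decide separately what to do with the remaining map entries.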
