1. Background
The Spark platform achieves high computational performance through its distributed in-memory computing model. However, this model is a double-edged sword: while it improves performance, it must also confront the problems of distributed data storage. The main problems are as follows:
1) When two Spark jobs need to share data, the data must be written to disk. For example, job 1 first writes its generated data to HDFS, and job 2 then reads that data back from HDFS. These disk reads and writes can become a performance bottleneck.
2) Since Spark caches data in its own JVM, when a Spark program crashes the JVM process exits and the cached data is lost, so the data must be read again from HDFS when the job is restarted.
3) When two Spark jobs need to operate on the same data, the JVM of each job must cache its own copy, which not only wastes resources but can also trigger frequent garbage collection, degrading performance.
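The first problem above can be sketched in plain Python, with two functions standing in for two Spark jobs and a local file standing in for HDFS (the function names and file layout are invented for illustration):

```python
import json
import os
import tempfile

def job1_produce(path):
    # Job 1: compute a result and materialize it to shared storage
    # (in a real cluster this would be a write to HDFS).
    data = [x * x for x in range(5)]
    with open(path, "w") as f:
        json.dump(data, f)

def job2_consume(path):
    # Job 2: it cannot see job 1's in-memory data, so it must
    # re-read the materialized result from shared storage.
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "job1_output.json")
job1_produce(path)
print(job2_consume(path))  # the disk round-trip is the handoff cost
```

The point of the sketch is that the only channel between the two jobs is the storage layer, so every handoff pays the full write-then-read cost.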
Careful analysis of these issues shows that the root cause is data storage: because the computing platform tries to manage its own storage, Spark cannot focus on the computation itself, and overall execution efficiency decreases. A dedicated distributed in-memory file system is therefore needed to relieve Spark's memory pressure while still letting Spark read and write data quickly and easily. By separating storage and data access from Spark, Spark can concentrate on computation itself, and this finer division of labour yields higher execution efficiency.
2. Research
The industry's most popular distributed in-memory file system, Tachyon, was deployed and configured in a series of environments and tested with Spark SQL, but it showed no increase in speed.
Tachyon is also Spark's default off-heap memory scheme, so Spark's installation package has the Tachyon client built in. Configuring Spark to use it for testing showed a significant increase in speed, but with limited functionality.
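As a rough illustration of what that configuration looked like (the property names below are from memory of the Spark 1.5-era documentation and changed across 1.x releases, so treat them as assumptions to verify against the version in use):

```
# spark-defaults.conf (Spark 1.5.x era; property names are version-dependent)
spark.externalBlockStore.url        tachyon://tachyon-master:19998
spark.externalBlockStore.baseDir    /spark_offheap
```

With this in place, RDDs persisted with `StorageLevel.OFF_HEAP` are stored in Tachyon rather than in the executor JVM heap, which is what decouples the cached data's lifetime from the Spark process.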
At the same time, another distributed in-memory file system, the Ignite File System (IGFS), was investigated. IGFS provides functionality similar to Hadoop HDFS, but purely in memory. In fact, in addition to its own API, IGFS implements Hadoop's FileSystem API, so it can be added transparently to a Hadoop or Spark environment.
IGFS splits each file's data into separate blocks and saves them in a distributed in-memory cache. However, unlike Hadoop HDFS, IGFS does not need a name node: it determines the location of file data automatically through a hash function.
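IGFS's actual placement logic is internal to Ignite; the following is only a minimal sketch of the idea (node names and the hashing scheme are invented), showing how hashing a (file, block) pair lets any client locate a block deterministically without a central name node:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def block_owner(file_id: str, block_index: int, nodes=NODES) -> str:
    # Hash the (file, block) pair to a node deterministically: every
    # client computes the same placement independently, so no name
    # node lookup is needed to find a block.
    key = f"{file_id}:{block_index}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# Two independent clients agree on the owner of the same block.
print(block_owner("/data/log.txt", 0) == block_owner("/data/log.txt", 0))
```

The trade-off versus a name node is that placement is fixed by the hash, so rebalancing requires a more elaborate scheme (real systems typically use consistent or rendezvous hashing for this reason).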
IGFS can be deployed independently or on top of HDFS, in which case it becomes a transparent caching layer for the files stored in HDFS.
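In the layered deployment, IGFS is pointed at HDFS as a "secondary file system" in the Ignite configuration. The bean and property names below are recalled from the Ignite 1.x documentation and should be checked against the version actually deployed; the namenode address is a placeholder:

```xml
<!-- Sketch of an IGFS-over-HDFS configuration (verify names against your Ignite version) -->
<bean class="org.apache.ignite.configuration.FileSystemConfiguration">
    <property name="name" value="igfs"/>
    <property name="secondaryFileSystem">
        <bean class="org.apache.ignite.hadoop.fs.IgniteHadoopIgfsSecondaryFileSystem">
            <constructor-arg value="hdfs://namenode:9000/"/>
        </bean>
    </property>
</bean>
```

Reads that miss the in-memory cache fall through to HDFS, and writes are propagated to it, which is what makes the cache layer transparent to Hadoop and Spark clients.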
3. Test results
1) Ignite Test
The chauffeur-drive (designated-driver) test case was used to compare the time spent under different data volumes and Spark configuration environments, running a COUNT(*) statement; the execution results are shown below:
Based on the test results, Ignite does not significantly improve computation speed when storage and computation reside in the same cluster. Considering the actual application, the volume of business data computed across clusters is too large to fit in memory storage.
2) Spark built-in Tachyon client test
4. Change the table structure to a non-map structure
Observing the application, a time-consuming statement queries a column inside a map-structured table. Extracting that column into a table with a non-map structure and running the same query yields a significant increase in speed:
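A hedged HiveQL sketch of this change (the table and column names are invented for illustration): extract the frequently queried map key into a flat column once, so later queries avoid map deserialization entirely.

```sql
-- Original table: `props` is a MAP<STRING,STRING>; every query pays
-- the cost of deserializing the whole map:
--   SELECT props['city'] FROM events WHERE dt = '2016-01-01';

-- One-off extraction into a non-map table:
CREATE TABLE events_flat AS
SELECT event_id, props['city'] AS city
FROM events;

-- Subsequent queries read a plain STRING column instead:
SELECT city, COUNT(*) FROM events_flat GROUP BY city;
```

The trade-off is that the flat table must be kept in sync with the source table, so this is most useful for a small number of hot keys queried repeatedly.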