Suppose we have the following code.
x = sc.textFile(...)
y = x.map(...)
z = x.map(...)
Is it essential to cache x here? Does not caching x cause Spark to read the input file twice?
Solution
Not caching x does not necessarily make Spark read the input file twice. Let's go through all the possible scenarios:
Example 1: The file will not be read even once
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
In this case, Spark does nothing: transformations are lazy, and since no action is called, the file is never read.
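You can observe this laziness directly. Below is a minimal PySpark sketch (the path, app name, and lambdas are placeholders for illustration, not the code from the question): defining the RDD and its transformations does not touch the file at all.
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-demo")       # placeholder local setup
x = sc.textFile("/tmp/input.txt")                # placeholder path; the file is not opened here
y = x.map(lambda line: line.upper())             # transformation only, evaluated lazily
z = x.map(lambda line: len(line))                # transformation only, evaluated lazily
# No action has been called, so Spark never reads the file.
sc.stop()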
Example 2: File read once
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
The file is read only once, when the action forces y's map to be computed.
Example 3: File read twice
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
println(z.count()) #Action of RDD
Now that an action is called on each of the two RDDs derived from x, and x is not cached, each count recomputes its lineage from the source, so the input file is read twice.
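If you want to see why this costs two reads, you can print each RDD's lineage with toDebugString(); both y and z trace back to the text file independently, so each action goes all the way to the source. A small sketch using the names above (the exact output format varies between Spark versions):
# toDebugString() shows the chain of dependencies for an RDD
# (in PySpark it may come back as bytes, so it is printed raw here).
print(y.toDebugString())
print(z.toDebugString())
# Both lineages end at the same textFile source, so each count re-reads the input.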
Example 4: File read once
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = y.map(...) #Transformation of RDD
println(z.count()) #Action of RDD
The file is read only once: z is built on top of y, so a single lineage from the source computes both maps.
Example 5: File read twice
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = y.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
println(z.count()) #Action of RDD
Since actions are now called on two different RDDs and nothing is cached, each action recomputes the whole lineage from the source, so the file is read twice.
Example 6: File read once
x = sc.textFile(...) #creation of RDD
y = x.map(...).cache() #Transformation of RDD
z = y.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
println(z.count()) #Action of RDD
Even though two actions are called, the file is read only once: the first action computes y and stores it in memory because of cache(), and the second action computes z from the cached y instead of going back to the source.
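Once the cached RDD is no longer needed, it is worth releasing it so the executor memory can be reused. A small follow-up sketch with the names from this example:
print(y.is_cached)   # True, since y was marked with cache()
y.unpersist()        # remove y's cached partitions from memory
print(y.is_cached)   # False; further actions on z would recompute y from the source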
Edit: Additional information
This raises the question: what should be cached and what should not?
Answer: cache the RDDs that you will use repeatedly.
Example 7:
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
In this case, x is used over and over again, so it is recommended to cache x: Spark then does not have to read x from the source for every action. If you are dealing with a large amount of data, this saves a lot of time.
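Concretely, the recommendation looks something like the sketch below (the path and the map functions are placeholders, not the asker's actual code): the first action reads the file once and fills the cache, and the second action reuses the cached x.
x = sc.textFile("/tmp/input.txt").cache()    # placeholder path; mark x to be kept in memory
y = x.map(lambda line: line.split(","))      # placeholder transformation
z = x.map(lambda line: line.strip())         # placeholder transformation
print(y.count())   # first action: reads the file once and materializes x in the cache
print(z.count())   # second action: computed from the cached x, no second read of the file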
Suppose you persist all of your RDDs in memory and/or on disk, with or without serialization. If Spark does not have enough memory for its tasks, it will start evicting old RDD partitions using an LRU (Least Recently Used) policy. Whenever an evicted RDD is needed again, Spark recomputes it by re-running all the steps from the source up to that RDD.
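For completeness, cache() is just persist() with the default storage level; if eviction and recomputation are a concern, you can choose a level that spills to disk instead. A hedged PySpark sketch (y stands for any RDD you want to keep around):
from pyspark import StorageLevel

# For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
y.persist(StorageLevel.MEMORY_AND_DISK)   # partitions that do not fit in memory are written to disk
                                          # and read back later, instead of being recomputed from source
# Other levels (e.g. DISK_ONLY, MEMORY_AND_DISK_2) trade memory, disk, and replication differently.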