Spark computes in memory, which is generally much faster than Hadoop MapReduce. So why does it still need tuning?
By default, when you apply multiple operators (function operations) to the same RDD, Spark works like this: each time an action runs on top of that RDD, the RDD is recomputed from its source, and only then is your operator applied to it. The performance of this approach is very poor.
Therefore, for this situation, our recommendation is to persist any RDD that is used multiple times.
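The recomputation described above can be observed with a small experiment. The following is a minimal sketch, not a benchmark: the object name RecomputeDemo, the method countMapCalls, and the accumulator name are made up for illustration, and it assumes a local Spark installation with no task retries (accumulator updates inside transformations can be applied more than once if a task is re-executed).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object: counts how often the map function runs
// when two actions are executed on the same RDD.
object RecomputeDemo {
  def countMapCalls(useCache: Boolean): Long = {
    val conf = new SparkConf().setAppName("recompute-demo").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      // Incremented once per element each time the map function is evaluated.
      val calls = sc.longAccumulator("mapCalls")
      val mapped = sc.parallelize(1 to 10, 1).map { x => calls.add(1); x * 2 }
      if (useCache) mapped.cache() // mark for in-memory reuse (lazy until an action runs)
      mapped.count()   // first action: computes the RDD
      mapped.collect() // second action: recomputes unless the RDD was cached
      calls.value
    } finally {
      sc.stop()
    }
  }
}
```

Under these assumptions, countMapCalls(false) should report 20 map calls (10 elements, computed twice), while countMapCalls(true) should report 10, because the second action reads the cached partitions.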
The first thing to realize is that Spark itself performs memory-based iterative computation. If the program has only one action from beginning to end and each child RDD depends on only one parent RDD, there is no need to use the cache mechanism: the RDDs are computed from start to finish in memory, and the result is either returned according to your action or written to storage. Caching is worthwhile when there are multiple actions, or when several RDDs depend on the same RDD; in that case, cache the shared RDD before it is reused.
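// rdd is the shared parent: both rdd1 and rdd2 are derived from it
val rdd = sc.textFile("path/to/file").map(...).filter(...)
val rdd1 = rdd.map(x => x + 1)
val rdd2 = rdd.map(x => x + 100)
// join is only defined on key-value RDDs; the map bodies here are placeholders
val rdd3 = rdd1.join(rdd2)
rdd3.count()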
Two RDDs (rdd1 and rdd2) depend on rdd, so you can call cache on rdd right after it is defined; the second branch then reuses the cached data instead of recomputing rdd from scratch. Note that cache is lazy: the data is actually stored the first time an action computes it. Besides cache, you can also use persist. cache uses the default storage level, MEMORY_ONLY (in-memory only), whereas persist lets you choose any storage level. Internally, cache simply calls persist with the default level.
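The difference between cache and persist can be sketched as follows. This is an illustrative example under stated assumptions, not the original author's code: the object name PersistDemo, the sample key-value data, and the use of mapValues (so that join is well defined) are all assumptions, and it expects a local Spark installation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object: one shared parent RDD feeding two branches of a join.
object PersistDemo {
  def run(): (StorageLevel, StorageLevel, Long) = {
    val conf = new SparkConf().setAppName("persist-demo").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      // Assumed sample data; a real job would read from a file instead.
      val base = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
      base.cache() // shorthand for persist(StorageLevel.MEMORY_ONLY)

      val plusOne = base.mapValues(_ + 1)
      val plusHundred = base.mapValues(_ + 100)
      // persist accepts an explicit storage level, here spilling to disk if needed.
      plusHundred.persist(StorageLevel.MEMORY_AND_DISK)

      val joined = plusOne.join(plusHundred) // both sides derive from base
      val n = joined.count()                 // the action that triggers computation

      val result = (base.getStorageLevel, plusHundred.getStorageLevel, n)
      // Release the cached blocks once they are no longer needed.
      base.unpersist()
      plusHundred.unpersist()
      result
    } finally {
      sc.stop()
    }
  }
}
```

Calling PersistDemo.run() should return MEMORY_ONLY for the cached parent, MEMORY_AND_DISK for the explicitly persisted branch, and a join count of 3 for this sample data. Choosing a level other than MEMORY_ONLY is mainly useful when the RDD may not fit in memory and recomputing it would be expensive.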