Conceptual understanding: A function can read variables defined outside it, but changes it makes to those variables are not visible outside the function.
RDD operations take a custom closure function as an argument. If that function needs to access external variables, it must follow certain rules; otherwise a run-time exception is thrown. When a closure is passed to a node, the following steps take place: the driver finds, through reflection, all the variables that the closure accesses, marshals them into an object, and serializes that object; the serialized object is transferred over the network to the worker node; the worker node deserializes the closure object; and the worker node executes the closure function.
Note: Changes made to external variables inside a closure are not propagated back to the driver.
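The copy semantics described above can be imitated in plain Python. The following is a minimal sketch, not Spark code: it assumes `pickle` as a stand-in for Spark's serializer, and the names `add_to_counter`, `run_on_worker`, `driver_env`, and `worker_env` are hypothetical, chosen only for illustration.

```python
import pickle

def add_to_counter(env, x):
    # Hypothetical closure body: updates a captured variable.
    env["counter"] += x
    return env["counter"]

def run_on_worker(payload, func, partition):
    # Simulated worker: deserialize its own copy of the captured
    # variables, then execute the function on each record.
    env = pickle.loads(payload)
    for x in partition:
        func(env, x)
    return env

# Simulated driver: the captured variables are serialized before "shipping".
driver_env = {"counter": 0}
payload = pickle.dumps(driver_env)

worker_env = run_on_worker(payload, add_to_counter, [1, 2, 3])

print(worker_env["counter"])  # 6: the worker's copy was updated
print(driver_env["counter"])  # 0: the driver's original is untouched
```

Because the worker only ever sees a deserialized copy, nothing it does to that copy can reach the driver's variable, which is the behavior the note above warns about.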
In short, the function is passed over the network to the worker node and then executed there. The variables passed along with it must therefore be serializable; otherwise the transfer fails. Even when the job runs locally, the same four steps are still performed.
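The serializability requirement can be illustrated the same way. This is again a plain-Python sketch under the assumption that `pickle` stands in for Spark's serializer; `can_ship` is a hypothetical helper, and a thread lock is used only as a convenient example of an object that cannot be serialized.

```python
import pickle
import threading

def can_ship(captured):
    # Return True if the captured variables survive serialization,
    # which is the precondition for sending them to a worker.
    try:
        pickle.dumps(captured)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False

print(can_ship({"threshold": 10}))           # True: plain data serializes
print(can_ship({"lock": threading.Lock()}))  # False: a lock cannot be pickled
```

In Spark itself, the analogous failure typically surfaces as a "Task not serializable" exception when the closure is prepared for shipping.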
Broadcast variables can also make external data available to closures, but using them frequently makes the code less concise. Broadcasts are designed to cache large data on each node, avoiding repeated transfers and improving computational efficiency, not to serve as a general way of accessing external variables.