PySpark
Java objects are used throughout Spark, and PySpark is built on top of the Java API: the Python driver uses Py4j to create a JavaSparkContext and talks to the JVM through it.
Here are a few things to be aware of.
1. Py4j only runs on the driver

This means that no third-party jar can be introduced on the workers this way: the PySpark process on a worker node is not the one that starts the Py4j gateway, so it never loads the corresponding jar. Before reading this part of the documentation carefully, the design here was to have the worker nodes connect directly to HBase (client mode) and fetch only the slice of data they need, avoiding a join against the whole table; for Python, that kind of operation can only be done by pulling in a jar (the Thrift route was not considered). Only after the jar was written and the test failed did the plan get revised and the official documentation get checked; the error-usage snippet in the code example section below shows exactly this mistake.
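To make the driver-only point concrete, here is a minimal sketch that touches the JVM through the Py4j gateway from the driver (the java.lang.System call is just a stand-in; any class on the driver's classpath can be reached the same way):

from pyspark import SparkContext

sc = SparkContext(appName="py4jDriverOnly")

# sc._jvm is the Py4j gateway view that lives only in the driver process;
# any JVM class on the driver's classpath can be reached through it.
print(sc._jvm.java.lang.System.currentTimeMillis())

# Inside a function shipped to the workers there is no such gateway,
# because the worker-side Python processes never start Py4j.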
2. PythonRDD's prototype is JavaRDD[String]

All of the data passed through PythonRDD is Base64-encoded.
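Just to illustrate the shape of that encoding, here is a rough sketch of a pickle-plus-Base64 round trip (this is not the actual Spark source, only the idea behind it):

import base64
import pickle

def encode_record(obj):
    # Python objects are pickled on the Python side and Base64-encoded
    # so they can travel through the JVM as plain strings.
    return base64.b64encode(pickle.dumps(obj))

def decode_record(data):
    # The reverse step when records come back to the Python side.
    return pickle.loads(base64.b64decode(data))

print(decode_record(encode_record({"answer": 42})))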
3. Methods and anonymous functions in PySpark are serialized with cloudpickle

Why does a function need to be serialized at all? Because the function or lambda expression handed to map or flatMap has to be shipped to every worker. If the function closes over outside variables, cloudpickle is smart enough to serialize the closure as well. However, do not use the self keyword inside a function that has to be shipped, because once it is sent over, what self refers to is no longer clear.

The documentation also mentions that PythonRDD serialization is customizable, but there was no such requirement at the time, so none of that was tested.
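A small sketch of the self pitfall and the usual workaround (the Scaler class and its factor field are made up for illustration; copying the needed field into a local variable before the map is the standard advice):

from pyspark import SparkContext

sc = SparkContext(appName="cloudpickleSelfDemo")

class Scaler(object):
    def __init__(self, factor):
        self.factor = factor

    def scale_bad(self, rdd):
        # Referencing self inside the shipped lambda drags the whole
        # Scaler instance into the serialized closure.
        return rdd.map(lambda x: x * self.factor)

    def scale_good(self, rdd):
        # Copy the field into a local variable first, so only the number
        # travels to the workers with the closure.
        factor = self.factor
        return rdd.map(lambda x: x * factor)

print(Scaler(3).scale_good(sc.parallelize([1, 2, 3])).collect())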
Code example
Java test code, compiled into pyspark-test.jar:

package org.valux.py4j;

public class Calculate {
    public int sqAdd(int x) {
        return x * x + 1;
    }
}
Python test code, saved in the file driver.py:

from pyspark import SparkContext
from py4j.java_gateway import java_import

sc = SparkContext(appName="py4jTesting")
java_import(sc._jvm, "org.valux.py4j.Calculate")
func = sc._jvm.Calculate()
print(func.sqAdd(5))
# output: 26
!!! [ERROR usage] Here the custom method is meant to run on every worker, which, as mentioned above, PySpark currently does not support:

rdd = sc.parallelize([1, 2, 3])

def foo(x):
    java_import(sc._jvm, "org.valux.py4j.Calculate")
    func = sc._jvm.Calculate()
    return func.sqAdd(x)

rdd = rdd.map(foo)
When testing, remember to submit the program together with the jar:

> bin/spark-submit --driver-class-path pyspark-test.jar driver.py
There is another pitfall here. For convenience, the --jars parameter had always been used when submitting, but:

--driver-class-path only adds the extra jars on the driver;
--jars adds the extra jars on all the workers.
The help text also mentions:

--jars    Comma-separated list of local jars to include on the driver and executor classpaths.
So, to be lazy, only --jars was used, and this error kept coming back:

py4j.protocol.Py4JError: Trying to call a package.

It took quite a long time of testing to track that down.
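One way to check from the driver whether the class is actually visible (a small diagnostic sketch, assuming the same sc as in driver.py above; it relies on Py4j resolving names it cannot find as a class to a JavaPackage, which is exactly what triggers "Trying to call a package"):

from py4j.java_gateway import JavaClass, JavaPackage

handle = sc._jvm.org.valux.py4j.Calculate
if isinstance(handle, JavaClass):
    print("Calculate is on the driver classpath; constructor calls should work")
elif isinstance(handle, JavaPackage):
    # The driver JVM cannot resolve the class, so calling handle() would
    # raise py4j.protocol.Py4JError: Trying to call a package.
    print("Calculate is NOT on the driver classpath")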
Reference documents

https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
PySpark calling a custom jar package