PySpark invoking a custom jar package


Java objects are often needed when developing a PySpark program: PySpark is built on top of Spark's Java API, and the JavaSparkContext is created through Py4J.

Here are a few things to be aware of.

1. Py4J only runs on the driver

This means that no third-party jar packages can be used on the workers. The pyspark processes on the worker nodes are not the ones that start the Py4J communication, so they naturally never load the corresponding jar packages. Before reading this part of the documentation carefully, my design had the worker nodes connect directly to HBase with the client API to fetch just the data they needed, avoiding a join over the whole table; for Python, such an operation can only be implemented by introducing a jar package (leaving the Thrift approach aside). But after the test jar was written it simply did not work, and only then did I revise the plan and consult the official documentation.

2. The prototype of PythonRDD is JavaRDD[String]

All data passed through PythonRDD is Base64-encoded.
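As a minimal illustration of that boundary (the payload below is a stand-in for a serialized record, my assumption for demonstration, not actual PySpark internals):

import base64

# Sketch: records crossing the PythonRDD boundary travel as Base64 strings
# inside a JavaRDD[String]. The payload is a made-up stand-in for a
# pickled record.
payload = b"a pickled record"
encoded = base64.b64encode(payload)   # what the JVM side holds as a String
decoded = base64.b64decode(encoded)   # what the Python side recovers
assert decoded == payload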

3. Methods and anonymous functions in PySpark are serialized with cloudpickle

Why does a function need to be serialized at all? Because a function or lambda expression passed to map or flatMap has to be shipped to every worker. Even if the function closes over outside variables, cloudpickle can serialize it intelligently. However, do not use the self keyword in a function that needs to be shipped, because after it is transferred the reference relationship of self is no longer clear.
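A minimal sketch of the self pitfall and the usual way around it (the Processor class and its methods are made-up examples, not from the original; rdd stands for an ordinary PySpark RDD):

class Processor(object):
    def __init__(self, factor):
        self.factor = factor

    def scale_bad(self, rdd):
        # Risky: the lambda references self, so cloudpickle has to
        # serialize the whole Processor instance along with it.
        return rdd.map(lambda x: x * self.factor)

    def scale_good(self, rdd):
        # Safer: copy the attribute into a local variable first, so the
        # closure captures only a plain value.
        factor = self.factor
        return rdd.map(lambda x: x * factor)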

The documentation also mentions that PythonRDD serialization is customizable, but I had no such requirement at the time, so none of that has been tested.

Code example

Java test code, compiled and packaged into pyspark-test.jar:

package org.valux.py4j;

public class Calculate {
    public int sqadd(int x) {
        return x * x + 1;
    }
}
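For reference, one way to produce that jar with plain JDK tools (a sketch; the src/ and classes/ directory layout is my assumption):

> javac -d classes src/org/valux/py4j/Calculate.java
> jar cf pyspark-test.jar -C classes .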

Python test code, saved in the file driver.py:

from pyspark import SparkContext
from py4j.java_gateway import java_import

sc = SparkContext(appName="py4jtesting")
java_import(sc._jvm, "org.valux.py4j.Calculate")
func = sc._jvm.Calculate()
print(func.sqadd(5))
# [Output] 26

 

!!! [Wrong usage] Here the idea was to run the custom method on each worker; as mentioned earlier, PySpark does not currently support this:

def foo(x):
    java_import(sc._jvm, "org.valux.py4j.Calculate")
    func = sc._jvm.Calculate()
    return func.sqadd(x)

rdd = sc.parallelize([1, 2, 3]).map(foo)
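Given the driver-only restriction from point 1, one pattern that does stay within the rules is to bring the data back and call the JVM method on the driver. A sketch, continuing from driver.py above:

# Stay on the driver, where the Py4J gateway actually lives.
# collect() is only sensible for small data sets.
java_import(sc._jvm, "org.valux.py4j.Calculate")
func = sc._jvm.Calculate()
results = [func.sqadd(x) for x in rdd.collect()]
# results == [2, 5, 10]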

When testing, remember to include the jar package when submitting the program:
> bin/spark-submit --driver-class-path pyspark-test.jar driver.py

There is another pitfall here. Before this, for convenience, I had always been submitting with the --jars parameter.

Jars added with --driver-class-path are introduced only on the driver, while jars added with --jars are introduced on all the workers.
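So if the jar were needed both for Py4J calls on the driver and by code on the executors, a plausible submit line (untested here, following the flag semantics above) combines the two flags:

> bin/spark-submit --driver-class-path pyspark-test.jar --jars pyspark-test.jar driver.py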

The help documentation also mentions:

--jars  Comma-separated list of local jars to include on the driver and executor classpaths.

So, to be lazy, I kept using --jars, and kept getting the following error:

py4j.protocol.Py4JError: Trying to call a package.

I spent a long time testing before figuring this out.

Reference documents

https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

PySpark calling the custom jar package
