Recently I wrote a number of streaming computation handlers on Spark Streaming; the program architecture is shown in the diagram below.
The program runs on Spark Streaming, and my goal was to have the Kafka and Redis parameters supplied at startup.
It is implemented in Java.
Previously the Redis server address was hard-coded, so every time the program moved to a new environment the code had to be changed and recompiled.
Working this out took some time, so I am writing it up here to save effort for those who come after.
As the diagram above shows, Spark is a distributed engine: a Redis pool created in the driver must be recreated on each worker. A reference article I found defines a Redis connection pool management class in which the pool is a static member of the class, created automatically when the JVM loads the class. That falls short of what I need, because the Redis address is fixed at class-load time rather than passed in at startup.
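For contrast, the reference article's pattern looks roughly like this (a sketch; the class name and address are illustrative, not the article's actual code):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

// Sketch of the static-pool approach: the pool is built when the JVM
// loads the class, so the address cannot be a runtime parameter.
public class StaticRedisPool {
    private static final JedisPool POOL =
            new JedisPool(new GenericObjectPoolConfig(), "127.0.0.1");

    public static Jedis getResource() {
        return POOL.getResource();
    }
}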
My solution: create the Redis management object in the driver, broadcast it, and fetch the broadcast object on the workers. This makes the Redis address a runtime parameter, while the management object is still instantiated only once per worker.
Driver
The driver specifies the serialization method. Spark supports two serializers, Java and Kryo; Kryo is the more efficient of the two.
The documentation says that the Kryo serializer requires registering classes, but my program ran successfully without registering anything.
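For reference, if registration did turn out to be necessary, a registrator would look roughly like this (a sketch, matching the MyRegistrator name in the commented-out line of the driver code):

import com.esotericsoftware.kryo.Kryo;
import org.apache.spark.serializer.KryoRegistrator;

// Sketch of a Kryo registrator; only needed if class registration is enforced
public class MyRegistrator implements KryoRegistrator {
    @Override
    public void registerClasses(Kryo kryo) {
        kryo.register(RedisClient.class); // the broadcast payload defined later in this post
    }
}

The driver code follows: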
public static void main(String[] args) {
    if (args.length < 3) {
        System.err.println("Usage: kafka_spark_redis <brokers> <topics> <redisServer>\n" +
                "  <brokers>     Kafka broker list\n" +
                "  <topics>      list of topics to consume\n" +
                "  <redisServer> Redis server address\n\n");
        System.exit(1);
    }

    /* Parse parameters */
    String brokers = args[0];
    String topics = args[1];
    String redisServer = args[2];

    // Create the streaming context; data is processed in two-second batches
    SparkConf sparkConf = new SparkConf().setAppName("kafka_spark_redis");
    // sparkConf.set("spark.serializer", "org.apache.spark.serializer.JavaSerializer"); // Java serialization is slower than Kryo
    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    // sparkConf.set("spark.kryo.registrator", "MyRegistrator");
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
    JavaSparkContext sc = jssc.sparkContext();

    HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topics.split(",")));
    HashMap<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", brokers);
    kafkaParams.put("group.id", "kakou-test");

    // Redis connection pool management class
    RedisClient redisClient = new RedisClient(redisServer); // create the Redis connection pool manager

    // Broadcast the Redis connection pool management object
    final Broadcast<RedisClient> broadcastRedis = sc.broadcast(redisClient);

    // Create the stream processing object
    JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
            jssc,
            String.class,        /* Kafka key class */
            String.class,        /* Kafka value class */
            StringDecoder.class, /* key decoder class */
            StringDecoder.class, /* value decoder class */
            kafkaParams,         /* Kafka parameters, e.g. the broker list */
            topicsSet            /* topics to consume */
    );

    // Extract the value from each Kafka key-value pair
    JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
        @Override
        public String call(Tuple2<String, String> tuple2) {
            // take the value
            return tuple2._2();
        }
    });

    /* much code omitted */
    ......
}
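With this in place, the Kafka and Redis addresses are ordinary command-line arguments. A launch might look like the following (the jar name, class name, and addresses are illustrative):

spark-submit --master yarn --class com.example.KafkaSparkRedis kafka-spark-redis.jar \
    kafkahost1:9092,kafkahost2:9092 topic1,topic2 redishost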
RedisClient
RedisClient is a class you implement yourself. It overrides the write/read pair of serialization and deserialization functions; note that if you use the Java serializer you need to implement a different interface instead.
The write serialization function is triggered when the driver broadcasts the object.
public class RedisClient implements KryoSerializable {
    public static JedisPool jedisPool;
    public String host;

    public RedisClient() {
        Runtime.getRuntime().addShutdownHook(new CleanWorkThread());
    }

    public RedisClient(String host) {
        this.host = host;
        Runtime.getRuntime().addShutdownHook(new CleanWorkThread());
        jedisPool = new JedisPool(new GenericObjectPoolConfig(), host);
    }

    static class CleanWorkThread extends Thread {
        @Override
        public void run() {
            System.out.println("Destroy jedis pool");
            if (null != jedisPool) {
                jedisPool.destroy();
                jedisPool = null;
            }
        }
    }

    public Jedis getResource() {
        return jedisPool.getResource();
    }

    public void returnResource(Jedis jedis) {
        jedisPool.returnResource(jedis);
    }

    // Called when the driver broadcasts the object: only the host is serialized
    public void write(Kryo kryo, Output output) {
        kryo.writeObject(output, host);
    }

    // Called when a worker deserializes the broadcast: recreate the pool locally
    public void read(Kryo kryo, Input input) {
        host = kryo.readObject(input, String.class);
        jedisPool = new JedisPool(new GenericObjectPoolConfig(), host);
    }
}
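As noted above, the Java serializer would require a different interface. For comparison, a minimal sketch of the same trick with java.io.Serializable (the class name is illustrative, and I assume the same Jedis/commons-pool versions as the code above):

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;
import redis.clients.jedis.JedisPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

public class RedisClientJava implements Serializable {
    public static JedisPool jedisPool; // static fields are not serialized
    public String host;

    public RedisClientJava(String host) {
        this.host = host;
        jedisPool = new JedisPool(new GenericObjectPoolConfig(), host);
    }

    // Hook invoked during deserialization on the worker: recreate the pool there
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        jedisPool = new JedisPool(new GenericObjectPoolConfig(), host);
    }
}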
Worker
On the worker side, fetch the broadcast variable inside foreachRDD. The first fetch triggers RedisClient's no-argument constructor and then its read deserialization function, which is why we create the Redis pool inside that deserialization function.
The job matches vehicle license plates against a blacklist and saves successful matches to Redis.
pairCar.foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
    @Override
    public Void call(JavaPairRDD<String, String> rdd, Time time) {
        rdd.foreachPartition(new VoidFunction<Iterator<Tuple2<String, String>>>() {
            @Override
            public void call(Iterator<Tuple2<String, String>> iterator) {
                String tmp1;
                String tmp2;
                Date now = new Date();
                // Fetching the broadcast variable deserializes RedisClient
                // on the worker (once) and creates the pool there
                RedisClient redisClient = broadcastRedis.getValue();
                Jedis jedis = redisClient.getResource();
                ......
                redisClient.returnResource(jedis);
            }
        });
        return null;
    }
});
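The elided section is the business logic. Purely to illustrate the pattern (the Redis key names and field layout here are invented, not the original code), the match-and-save step might look like:

// Hypothetical matching logic; "blacklist" and "hit:<plate>" are invented names
while (iterator.hasNext()) {
    Tuple2<String, String> record = iterator.next();
    String plate = record._1();
    if (jedis.sismember("blacklist", plate)) {
        // the plate is blacklisted: record the hit in Redis
        jedis.hset("hit:" + plate, "lastSeen", String.valueOf(now.getTime()));
    }
}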
Conclusion
Spark encapsulates the details of distributed computation, but many scenarios still require an understanding of its working mechanism; many problems, and most performance optimizations, are closely tied to how Spark works.