Flink provides a distributed cache, similar to Hadoop's, that makes files locally accessible to parallel instances of user functions. This feature can be used to share static external data, such as dictionaries or machine-learned regression models.
The cache works as follows: the program registers a file or directory from a local or remote file system (such as HDFS or S3) with the ExecutionEnvironment under a chosen name. When the program executes, Flink automatically copies the file or directory to the local file system of every worker node. A user function can then look it up by that name and access it from the worker node's local file system.
The following example uses distributed caching:
Java code:
// Register files or directories with the ExecutionEnvironment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Register a file from HDFS
env.registerCachedFile("hdfs:///path/to/your/file", "hdfsFile");

// Register a locally executable script file
env.registerCachedFile("file:///path/to/exec/file", "localExecFile", true);

// Define program code and execute
...
DataSet<String> input = ...
DataSet<Integer> result = input.map(new MyMapper());
...
env.execute();
Access the cached file or directory in a user function (here, a map function). The function must extend a rich function class, because reading the file requires access to the RuntimeContext:
// Extend RichMapFunction in order to get the RuntimeContext
public final class MyMapper extends RichMapFunction<String, Integer> {

    @Override
    public void open(Configuration config) {
        // Access the cached file via RuntimeContext and DistributedCache
        File myFile = getRuntimeContext().getDistributedCache().getFile("hdfsFile");
        // Read the file (or local directory) ...
    }

    @Override
    public Integer map(String value) throws Exception {
        // Use the contents of the cached file to do some processing
        ...
    }
}
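The comment "Read the file" above leaves the actual parsing open. Below is a minimal, Flink-independent sketch of how open() might load such a cached file, assuming a tab-separated dictionary format; the DictLoader class name, the file layout, and the word-to-id mapping are all illustrative assumptions, not part of the Flink API:

```java
import java.io.File;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: parse a tab-separated dictionary file (word<TAB>id)
// into a Map, as one might do inside open() after getFile("hdfsFile").
public class DictLoader {
    public static Map<String, Integer> load(File file) throws Exception {
        Map<String, Integer> dict = new HashMap<>();
        List<String> lines = Files.readAllLines(file.toPath());
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2) {
                dict.put(parts[0], Integer.parseInt(parts[1]));
            }
        }
        return dict;
    }

    public static void main(String[] args) throws Exception {
        // Simulate a cached file with a temporary one for demonstration
        File tmp = File.createTempFile("dict", ".tsv");
        Files.write(tmp.toPath(), List.of("foo\t1", "bar\t2"));
        Map<String, Integer> dict = load(tmp);
        System.out.println(dict);
    }
}
```

The map built this way would typically be stored in a field of MyMapper during open() and consulted in map() for each record.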
Scala code:
// Register files or directories with the ExecutionEnvironment
val env = ExecutionEnvironment.getExecutionEnvironment

// Register a file from HDFS
env.registerCachedFile("hdfs:///path/to/your/file", "hdfsFile")

// Register a locally executable script file
env.registerCachedFile("file:///path/to/exec/file", "localExecFile", true)

// Define program code and execute
...
val input: DataSet[String] = ...
val result: DataSet[Int] = input.map(new MyMapper())
...
env.execute()
Access the cached file or directory in a user function (here, a map function). The function must extend a rich function class, because reading the file requires access to the RuntimeContext:
// Extend RichMapFunction in order to get the RuntimeContext
class MyMapper extends RichMapFunction[String, Int] {

  override def open(config: Configuration): Unit = {
    // Access the cached file via RuntimeContext and DistributedCache
    val myFile: File = getRuntimeContext.getDistributedCache.getFile("hdfsFile")
    // Read the file (or local directory) ...
  }

  override def map(value: String): Int = {
    // Use the contents of the cached file to do some processing
    ...
  }
}