First, the goal is to write a Python script that runs a Spark program to count some data in HDFS. I referred to other people's code, which is why the Luigi framework is used.
As for the principles behind Luigi, Google covers that well. This article focuses on getting it working quickly: knowing how, not necessarily why.
There are other ways to write Spark or MapReduce jobs in Python, and Google turns up plenty; Luigi is used here simply because the reference code used it and it is easy to understand, so I went with it.
Here is the code:
import luigi
import sys
from datetime import datetime, timedelta
from luigi.contrib.spark import PySparkTask


class LuigiBase(PySparkTask):
    date = luigi.DateParameter(default=datetime.now())

    def main(self, sc, *args):
        log_rdd = sc.textFile(self.input()[0].path)
        # the Spark operations to perform
        log_rdd.repartition(1).saveAsTextFile(self.output().path)

    @property
    def name(self):
        return "luigi_test_{}_username".format(format_date(self.date))

    def requires(self):
        return [HdfsFiles(date=self.date)]

    def output(self):
        return luigi.hdfs.HdfsTarget(Files().path, format=luigi.hdfs.PlainDir)


class LuigiStats(luigi.Task):
    now = datetime.now()
    date = luigi.DateParameter(default=datetime(now.year, now.month, now.day))

    def requires(self):
        return LuigiBase(self.date)


if __name__ == '__main__':
    luigi.run(main_task_cls=LuigiStats)
1. For an ordinary Luigi task, the key is to implement the requires, output, and run functions as needed; for a Luigi-wrapped Spark task, the key is to implement the requires, output, and main functions as needed.
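For comparison, here is a minimal sketch of an ordinary (non-Spark) Luigi task built from those three functions; the task names and file paths are invented for illustration.

import luigi


class InputFile(luigi.ExternalTask):
    # hypothetical: stands for a file produced outside this pipeline
    def output(self):
        return luigi.LocalTarget("input.txt")


class CountLines(luigi.Task):
    # hypothetical example task: counts the lines of its input file

    def requires(self):
        return InputFile()

    def output(self):
        return luigi.LocalTarget("line_count.txt")

    def run(self):
        # for ordinary tasks the actual work happens in run()
        with self.input().open("r") as fin, self.output().open("w") as fout:
            fout.write(str(sum(1 for _ in fin)))


# run with: python count_lines.py CountLines --local-scheduler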
2. The base class inherits from PySparkTask, which has many parameters that can be set, but as the simplest possible Luigi example only the essentials are shown: you just need to care about the requires, output, and main functions. requires can be understood as the input, output as the output, and main as the logic to implement. The name property is also written because once the code goes online every job gets a name, and the company has rules for job names; if the name does not end with your username, the Spark program errors out, i.e. it refuses to let you run.
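The format_date helper used in the name property is not shown in the original code; a plausible (hypothetical) implementation would simply render the date parameter as a string, for example:

def format_date(d):
    # hypothetical helper matching the usage in the name property above
    return d.strftime("%Y-%m-%d")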
3. The code has two classes, the base class and the stats class. The execution logic is this: the main entry point calls stats, Luigi then sees that the stats class requires (depends on) the base class and checks whether that dependency's output exists. If it exists, it is used as the stats class's own input and the stats class's own code runs; if it does not exist, the base class is executed first. In the code above my stats class has nothing of its own to run and no main is written; it just checks whether base has been executed and runs it if it has not.
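Roughly speaking, that check is the task's complete() method: by default Luigi considers a task done when all of its declared outputs exist, along the lines of this simplified sketch.

import luigi.task

def is_done(task):
    # simplified version of what luigi.Task.complete() does by default:
    # a task counts as finished when every target returned by output() exists
    outputs = luigi.task.flatten(task.output())
    return all(target.exists() for target in outputs)

# So for the stats task: if LuigiBase's HDFS output directory already exists,
# LuigiBase is skipped and its output is used as input; otherwise its main()
# runs on the cluster first.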
4. requires and output in the base class are HDFS files, and the same logic applies to the stats class. The base class needs to inherit from PySparkTask, while the class passed to luigi.run() needs to inherit from luigi.Task, which is why it is written as two classes, as I understand it.
5. The return value of the requires function cannot be a Target object. My concrete understanding here is that it cannot be a directly read HDFS file; instead the file can be wrapped in a class that has a path property, which is used to return the address of the HDFS file. The dependency is not limited to one; there can be several, returned together as a list.
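The HdfsFiles class referenced in the base class's requires() is not shown in the original post; a plausible sketch is an external task whose output is an HdfsTarget, so that self.input()[0].path resolves to the HDFS address. The path pattern below is invented for illustration.

import luigi
import luigi.hdfs  # in newer Luigi versions this module is luigi.contrib.hdfs


class HdfsFiles(luigi.ExternalTask):
    # hypothetical stand-in for the dependency used in LuigiBase.requires();
    # an ExternalTask only declares an output that must already exist
    date = luigi.DateParameter()

    def output(self):
        return luigi.hdfs.HdfsTarget("/data/logs/" + self.date.strftime("%Y-%m-%d"))


# LuigiBase.requires() returns [HdfsFiles(date=self.date)], so
# self.input()[0].path is this target's HDFS path; returning several
# tasks in the list expresses multiple dependencies.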
6. If Spark is not installed on your own machine, be aware that the Spark cluster called by PySparkTask is not local, so it does not seem to support operating on local files. At the beginning I wanted to write the results to a local file and could not find any output.
7. Companies generally have an internal page, with the appropriate permissions, for viewing how Spark and Hadoop programs are running, checking logs, and so on.
8. In the base class you can also set the queue parameter to choose the queue your program runs in; sometimes the default queue seems particularly slow, so you can pick a different one. A minimal sketch follows.
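This sketch assumes the installed Luigi version's PySparkTask (via SparkSubmitTask) exposes a queue property that is forwarded to spark-submit as --queue; the queue name is made up.

from luigi.contrib.spark import PySparkTask


class LuigiBaseOnQueue(PySparkTask):
    # hypothetical variant of the base class that targets a specific YARN queue

    @property
    def queue(self):
        # run on this queue instead of the (possibly slow) default one
        return "my_team_queue"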