Luigi Framework: Running a Spark Program from Python

Last Update:2017-06-12 Source: Internet

Author: User

Tags python script


First, the goal is to write a Python script that runs a Spark program to compute some statistics over data in HDFS. I was working from someone else's code as a reference, which is why the Luigi framework is used.

As for how Luigi works under the hood, a Google search will serve you well. This article focuses on getting started quickly: knowing how to use it without knowing why it works.

There are other ways to write Spark or MapReduce jobs in Python, and a search turns up plenty of them. Luigi is used here only because the reference code used it and it is simple to understand.

Here is the code:

    import sys
    from datetime import datetime, timedelta

    import luigi
    from luigi.contrib.spark import PySparkTask

    # HdfsFiles, Files and format_date are helpers defined elsewhere;
    # they are not shown in this listing.


    class LuigiBase(PySparkTask):
        date = luigi.DateParameter(default=datetime.now())

        def main(self, sc, *args):
            log_rdd = sc.textFile(self.input()[0].path)
            # The Spark operations you need go here
            log_rdd.repartition(1).saveAsTextFile(self.output().path)

        @property
        def name(self):
            return "Luigi_test_{}_username".format(format_date(self.date))

        def requires(self):
            return [HdfsFiles(date=self.date)]

        def output(self):
            # Older Luigi module path, as in the original; recent releases
            # provide HdfsTarget and PlainDir under luigi.contrib.hdfs.
            return luigi.hdfs.HdfsTarget(Files().path, format=luigi.hdfs.PlainDir)


    class LuigiStats(luigi.Task):
        now = datetime.now()
        date = luigi.DateParameter(default=datetime(now.year, now.month, now.day))

        def requires(self):
            return LuigiBase(self.date)


    if __name__ == '__main__':
        luigi.run(main_task_cls=LuigiStats)
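
The listing calls format_date without showing it, and it builds the job name and the default date inline. The following is only a minimal sketch of those pieces in plain Python, under the assumption that format_date renders the date as YYYYMMDD; the real helper and its format are not given in the source, and job_name and today_at_midnight are hypothetical names introduced here for illustration.

```python
from datetime import date, datetime


def format_date(d):
    # Hypothetical implementation: render a date as YYYYMMDD.
    return d.strftime("%Y%m%d")


def job_name(d, username):
    # Mirrors the name property in the listing, with the user name as suffix.
    return "Luigi_test_{}_{}".format(format_date(d), username)


def today_at_midnight(now=None):
    # Same idea as datetime(now.year, now.month, now.day) in LuigiStats:
    # keep the calendar day and drop the time of day.
    now = now or datetime.now()
    return datetime(now.year, now.month, now.day)


print(job_name(date(2017, 6, 12), "username"))  # Luigi_test_20170612_username
```

Truncating the default to midnight means every run on the same day gets the same parameter value, which is presumably why the stats class does it that way.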

1. For ordinary Luigi tasks, the key is to implement three methods as needed: requires, output, and run. For the Spark tasks that Luigi packages, the three methods to implement are requires, output, and main.

2. The base class inherits from PySparkTask, which has many settings that can be configured. Since this is the simplest possible Luigi example they are all omitted; you only need to care about requires, output, and main. requires can be understood as the input, output as the output, and main as the logic to be carried out. The name property is also written here because, once the code is pushed online, every job gets a name, and my company enforces a naming rule for jobs: if the name does not end with your user name, the Spark program reports an error, which means you are not allowed to run it.

3. The code has two classes, base and stats. The execution logic is as follows: the main block calls stats, and Luigi then finds that the stats class requires (depends on) the base class. It checks whether the output of that dependency exists. If it exists, that output is used as the input of stats and the code in stats itself is executed; if it does not exist, the base class is executed first. In the code above, my stats class has nothing of its own to execute, so it defines neither run nor main; it is there only to check whether base has been executed, and to execute base if it has not.

The requires and output of the base class are both HDFS files, and the same logic applies to the stats class. The base class needs to inherit from PySparkTask, while the class passed to luigi.run() needs to inherit from luigi.Task, so the code was written as two classes. That, at least, is my own understanding.

4. The return value of requires cannot be a target object. Concretely, as I understand it, it cannot be an HDFS file to be read directly; the file has to be wrapped in a class that has a path property returning the address of the HDFS file. A task is not limited to one dependency: there can be several, returned as a list.

5. If Spark is not installed on your own computer, be aware that the Spark cluster called by PySparkTask is not local, so some operations on local files do not seem to be supported. At first I tried to write the results to a local file, and I could never find the output.

6. Companies generally have a web page, available to those with the relevant permissions, for checking on running Spark and Hadoop programs and viewing their logs.

7. In the base class you can also set the queue parameter to choose the queue your program runs in. The default queue sometimes seems particularly slow, so you can set a different one.
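
The execution order described in note 3 and the wrapper class from note 4 can be modelled in a few lines of plain Python. This is a toy model under stated assumptions, not Luigi's actual implementation: FileTarget, ToyTask, build, ExistingFile, Base and Stats are all hypothetical names, local temporary files stand in for HDFS, and the "Spark job" just counts lines. It is only meant to show the rule "skip a task whose output exists, otherwise run its dependencies first".

```python
import os
import tempfile


class FileTarget:
    """Stand-in for a target: a path that either exists or does not."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class ToyTask:
    """Stand-in for a task with the requires/output/run hooks."""

    def requires(self):
        return []  # upstream tasks, never bare targets

    def output(self):
        return None  # no output declared

    def run(self):
        pass  # nothing to do by default

    def complete(self):
        target = self.output()
        return target is not None and target.exists()


def build(task, log):
    """Run task after visiting its dependencies; finished tasks are skipped."""
    if task.complete():
        log.append("skip " + type(task).__name__)
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append("run " + type(task).__name__)


WORK_DIR = tempfile.mkdtemp()  # scratch directory standing in for HDFS


class ExistingFile(ToyTask):
    """Wraps a file that already exists; it only exposes the path."""

    def output(self):
        return FileTarget(os.path.join(WORK_DIR, "input.txt"))


class Base(ToyTask):
    def requires(self):
        return [ExistingFile()]

    def output(self):
        return FileTarget(os.path.join(WORK_DIR, "base.txt"))

    def run(self):
        # Count the input lines and write the count as this task's output.
        with open(self.requires()[0].output().path) as src:
            count = sum(1 for _ in src)
        with open(self.output().path, "w") as out:
            out.write(str(count))


class Stats(ToyTask):
    """No work of its own; it exists only to trigger Base when needed."""

    def requires(self):
        return [Base()]


# Create the pre-existing input file that ExistingFile wraps.
with open(ExistingFile().output().path, "w") as f:
    f.write("a\nb\nc\n")

first_log, second_log = [], []
build(Stats(), first_log)   # base.txt is missing, so Base runs
build(Stats(), second_log)  # base.txt now exists, so Base is skipped
print(first_log)
print(second_log)
```

On the second call Base is skipped because its output file is already there, which matches the behaviour described for the stats and base classes above.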


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
