I. Introduction to Luigi
Luigi is a Python package (not a language) for building complex pipelines of batch jobs and managing their workflow. These batch jobs are typically Hadoop jobs, database imports and exports, machine learning algorithms, and so on.
Luigi's GitHub: https://github.com/spotify/luigi
There are already data processing tools such as Hive, Pig, and Cascading, which work at a lower level of abstraction. Luigi is not meant to replace them but to help you tie them together: a Luigi task can be a Hive query, a Hadoop job written in Java, a Spark job written in Scala, or a Python program. Luigi provides workflow management for large numbers of interdependent jobs, so programmers can put their energy into the jobs themselves.
There are some similar projects, such as Oozie and Azkaban. One important difference is that Luigi is not just for Hadoop jobs; it can easily be extended to other types of tasks.
II. The Hello World example from Luigi's official website
2.1 Purpose of the top artists example
The purpose of this example is to aggregate some generated stream data, find the top 10 artists, and save the final result to a database.
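As a rough illustration of the aggregate-then-rank step described above, the following standalone sketch counts plays per artist and picks the most frequent ones. It uses only the standard library rather than Luigi; the function name top_artists and the sample lines are hypothetical, and saving to a database is left out.

```python
from collections import Counter


def top_artists(lines, n=10):
    """Return the n most-played artists as (artist, count) pairs.

    Each line is assumed to hold "timestamp artist track" separated by
    whitespace, the layout used by the stream files in this example.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _timestamp, artist, _track = parts
        counts[artist] += 1
    # most_common sorts by count, highest first
    return counts.most_common(n)


# Hypothetical sample data: artist "7" appears twice, "3" once.
sample = ["100 7 42", "101 3 42", "102 7 9"]
print(top_artists(sample, n=2))  # [('7', 2), ('3', 1)]
```

In a real pipeline this ranking would live in its own task that depends on the aggregation task, so that each step has a separate output.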
2.2 Aggregate Artist Streams
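    from collections import defaultdict

    import luigi


    class AggregateArtists(luigi.Task):
        # Task parameter, declared at the class level
        date_interval = luigi.DateIntervalParameter()

        def output(self):
            # Where the aggregated result is written
            return luigi.LocalTarget("data/artist_streams_%s.tsv" % self.date_interval)

        def requires(self):
            # One Streams task per date in the interval
            return [Streams(date) for date in self.date_interval]

        def run(self):
            artist_count = defaultdict(int)

            # Count how many times each artist appears across all inputs
            for input_target in self.input():
                with input_target.open('r') as in_file:
                    for line in in_file:
                        timestamp, artist, track = line.strip().split()
                        artist_count[artist] += 1

            # Python 3 syntax; the original used iteritems() and print >>
            with self.output().open('w') as out_file:
                for artist, count in artist_count.items():
                    print(artist, count, file=out_file)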
Explanation of this class:
requires method: Specifies the dependencies of this task. In this case, AggregateArtists depends on Streams tasks, each of which takes a date as a parameter.
Parameters: Each task can define one or more parameters, which must be declared at the class level. For example, the class above has one parameter, date_interval.
output method: Defines where the task's result is saved.
run method: For a plain Task, you need to implement the run method. It can do anything: create subprocesses, perform long-running computations, and so on. Some Task subclasses do not need a run method; JobTask, for example, requires you to implement mapper and reducer methods instead.
LocalTarget: A built-in class that makes it easy to read and write files on the local disk, and ensures that writes to disk are atomic.
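To make the idea of an atomic write more concrete, here is a minimal sketch of the general write-to-a-temporary-file-then-rename technique. It is not Luigi's actual implementation; atomic_write is a hypothetical helper, and the file name in the demo is made up.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so readers never see a half-written file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Create the temporary file in the same directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(text)
        os.replace(tmp_path, path)  # rename over the target in one step
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demo in a throwaway directory (hypothetical file name).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "artist_streams_demo.tsv")
atomic_write(demo_path, "7 2\n")
with open(demo_path) as demo_file:
    print(demo_file.read(), end="")
```

The point of the design is that the final path either does not exist or holds a complete file, which matters when a scheduler decides whether a task is done by checking for its output.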
2.3 Streams
    import random

    import luigi


    class Streams(luigi.Task):
        date = luigi.DateParameter()

        def run(self):
            # Write 1000 lines of fake "timestamp artist track" data
            with self.output().open('w') as output:
                for _ in range(1000):
                    output.write('{} {} {}\n'.format(
                        random.randint(0, 999),
                        random.randint(0, 999),
                        random.randint(0, 999)))

        def output(self):
            # One file per date, e.g. data/streams_2012_06_01_faked.tsv
            return luigi.LocalTarget(self.date.strftime('data/streams_%Y_%m_%d_faked.tsv'))
This class has no dependencies; its only effect is to produce one result file on the local file system for the given date.
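The sketch below shows, using only the standard library, how one date per day turns into one output path per day. The helper names dates_in_month and streams_path are hypothetical; the strftime pattern is the one used by the Streams task, and June 2012 is the month that appears in the file listing in section 2.4.

```python
import datetime


def dates_in_month(year, month):
    """Yield every date in the given month, in order."""
    day = datetime.date(year, month, 1)
    while day.month == month:
        yield day
        day += datetime.timedelta(days=1)


def streams_path(date):
    """Build the output path using the pattern from the Streams task."""
    return date.strftime('data/streams_%Y_%m_%d_faked.tsv')


paths = [streams_path(d) for d in dates_in_month(2012, 6)]
print(len(paths))   # 30
print(paths[0])     # data/streams_2012_06_01_faked.tsv
print(paths[-1])    # data/streams_2012_06_30_faked.tsv
```

Note that the format codes are case-sensitive: %m and %d are month and day, whereas %M would be minutes.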
2.4 Running locally
    PYTHONPATH='.' luigi --module top_artists AggregateArtists --local-scheduler --date-interval ...
After execution, a data directory is generated under the current directory, and the contents of the data directory are as follows:
    (my_python_env) [data]# ls
    artist_streams_2012-[...].tsv     streams_2012_06_06_faked.tsv  streams_2012_06_12_faked.tsv  streams_2012_06_18_faked.tsv  streams_2012_06_24_faked.tsv  streams_2012_06_30_faked.tsv
    streams_2012_06_01_faked.tsv      streams_2012_06_07_faked.tsv  streams_2012_06_13_faked.tsv  streams_2012_06_19_faked.tsv  streams_2012_06_25_faked.tsv
    streams_2012_06_02_faked.tsv      streams_2012_06_08_faked.tsv  streams_2012_06_14_faked.tsv  streams_2012_06_20_faked.tsv  streams_2012_06_26_faked.tsv
    streams_2012_06_03_faked.tsv      streams_2012_06_09_faked.tsv  streams_2012_06_15_faked.tsv  streams_2012_06_21_faked.tsv  streams_2012_06_27_faked.tsv
    streams_2012_06_04_faked.tsv      streams_2012_06_10_faked.tsv  streams_2012_06_16_faked.tsv  streams_2012_06_22_faked.tsv  streams_2012_06_28_faked.tsv
    streams_2012_06_05_faked.tsv      streams_2012_06_11_faked.tsv  streams_2012_06_17_faked.tsv  streams_2012_06_23_faked.tsv  streams_2012_06_29_faked.tsv
streams_*: generated by the Streams tasks, one file per date.
artist_streams_*: generated by AggregateArtists; there is only one such file.
2.5 Extensions
Running the command above a second time does nothing, because the output of every task already exists. This means that Luigi tasks should be idempotent: the output of a job should be the same no matter how many times it is executed.
--local-scheduler tells Luigi not to connect to a central scheduler server. This is not the recommended way to run in production; it is meant for the testing phase.
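The skip-if-output-exists behaviour can be sketched with a toy model. This is not Luigi's scheduler; ToyTask and build are hypothetical names, and the only assumption borrowed from Luigi is that a task counts as complete when its output exists.

```python
import os
import tempfile


class ToyTask:
    """Hypothetical stand-in for a task: a name, an output path, dependencies."""

    def __init__(self, name, output_path, requires=()):
        self.name = name
        self.output_path = output_path
        self.requires = list(requires)

    def complete(self):
        # A task counts as done when its output file exists.
        return os.path.exists(self.output_path)

    def run(self):
        with open(self.output_path, "w") as out_file:
            out_file.write(self.name + "\n")


def build(task, executed):
    """Run dependencies first, then the task, skipping anything complete."""
    if task.complete():
        return
    for dependency in task.requires:
        build(dependency, executed)
    task.run()
    executed.append(task.name)


workdir = tempfile.mkdtemp()
streams = ToyTask("Streams", os.path.join(workdir, "streams.tsv"))
aggregate = ToyTask("AggregateArtists",
                    os.path.join(workdir, "artists.tsv"),
                    requires=[streams])

first_run, second_run = [], []
build(aggregate, first_run)
build(aggregate, second_run)
print(first_run)   # ['Streams', 'AggregateArtists']
print(second_run)  # []
```

Deleting an output file and building again would rerun only the tasks whose outputs are missing, which is why each task should produce the same output every time it runs.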
Luigi Study 1