Luigi Study 1

Source: Internet
Author: User

I. Introduction to Luigi

Luigi is a Python package that helps you build complex pipelines of batch jobs. These jobs are typically Hadoop jobs, imports and exports of database data, machine learning algorithms, and so on.

Luigi on GitHub: https://github.com/spotify/luigi

Existing data processing tools such as Hive, Pig, and Cascading work at a lower level of abstraction. Luigi does not replace them; it helps you tie them together. A Luigi task can be a Hive query, a Hadoop job written in Java, a Spark job written in Scala, or a Python program. Luigi provides workflow management for large numbers of interdependent jobs, so programmers can focus on the jobs themselves.
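The core idea of managing interdependent jobs, running each job only after the jobs it requires, can be sketched as a depth-first walk over a dependency graph. The graph and job names below are hypothetical, and this is not how Luigi's scheduler is implemented; it only illustrates the ordering problem that a workflow manager solves.

```python
def execution_order(requires, goal):
    """Return jobs in an order where every job follows its dependencies.

    requires maps a job name to the list of job names it depends on.
    """
    order, done, in_progress = [], set(), set()

    def visit(job):
        if job in done:
            return
        if job in in_progress:
            raise ValueError("dependency cycle at %r" % job)
        in_progress.add(job)
        for dep in requires.get(job, []):
            visit(dep)
        in_progress.discard(job)
        done.add(job)
        order.append(job)

    visit(goal)
    return order


# Hypothetical pipeline: a report needs an aggregate, which needs two imports.
graph = {
    "report": ["aggregate"],
    "aggregate": ["import_a", "import_b"],
}
print(execution_order(graph, "report"))
# ['import_a', 'import_b', 'aggregate', 'report']
```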

There are similar projects, such as Oozie and Azkaban. One important difference is that Luigi is not limited to Hadoop jobs; it can easily be extended to other types of tasks.

II. The Hello World example from Luigi's official documentation

2.1 Purpose of the top artists example

This example aggregates generated stream data, finds the top 10 artists, and saves the final result to a database.
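A minimal sketch of the "find the top 10 artists" step, assuming the aggregated data is already available as a mapping from artist to stream count. The function name `top_artists` and the sample data are hypothetical and for illustration only; the tutorial's own top-10 task is not shown in this article.

```python
import heapq


def top_artists(artist_counts, n=10):
    """Return the n (artist, count) pairs with the highest counts.

    artist_counts is a dict mapping artist id -> number of streams.
    Ties on count go to the larger artist id, so the result is deterministic.
    """
    return heapq.nlargest(n, artist_counts.items(),
                          key=lambda item: (item[1], item[0]))


# Hypothetical sample data: artist id -> stream count.
sample = {"a": 5, "b": 12, "c": 7, "d": 1}
print(top_artists(sample, n=2))  # [('b', 12), ('c', 7)]
```

Using `heapq.nlargest` avoids sorting the whole mapping when only a few entries are needed.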

2.2 Aggregate Artist Streams

    from collections import defaultdict

    import luigi


    class AggregateArtists(luigi.Task):
        # Parameters are declared at class level.
        date_interval = luigi.DateIntervalParameter()

        def output(self):
            # Where the aggregated result is written.
            return luigi.LocalTarget("data/artist_streams_%s.tsv" % self.date_interval)

        def requires(self):
            # One Streams task per date in the interval.
            return [Streams(date) for date in self.date_interval]

        def run(self):
            artist_count = defaultdict(int)

            # Count how many times each artist appears across all inputs.
            for input_target in self.input():
                with input_target.open('r') as in_file:
                    for line in in_file:
                        timestamp, artist, track = line.strip().split()
                        artist_count[artist] += 1

            with self.output().open('w') as out_file:
                for artist, count in artist_count.items():
                    print(artist, count, file=out_file)

Explanation of this class:

requires method: Specifies the dependencies of this task. In this case, AggregateArtists depends on Streams tasks, and each Streams task takes a date as a parameter.

Parameters: Each task can define one or more parameters, which must be declared at the class level. For example, the class above has one parameter, date_interval.

output method: Defines where the task's result is saved.

run method: For an ordinary task, you need to implement the run method. It can do anything: spawn subprocesses, perform long-running computations, and so on. Some Task subclasses do not require a run method; for example, JobTask (for Hadoop jobs) requires you to implement mapper and reducer methods instead.

LocalTarget: A built-in class that makes it easy to read and write files on the local disk, and ensures that writes are atomic.
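The atomicity mentioned for LocalTarget can be illustrated with a common pattern: write to a temporary file, then rename it into place. This is a sketch of the general technique using only the standard library, not Luigi's actual implementation; `atomic_write` is a hypothetical helper name.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write(path):
    """Yield a file object; the data appears at `path` only on success."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Create the temporary file in the target directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            yield f
        os.replace(tmp_path, path)  # an atomic rename on POSIX systems
    except BaseException:
        # On failure, remove the partial file and leave `path` untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Atomic output matters because a task is considered complete when its output exists: a half-written file left behind by a crashed run would otherwise be mistaken for a finished result.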

2.3 Streams

    import random

    import luigi


    class Streams(luigi.Task):
        date = luigi.DateParameter()

        def run(self):
            # Write 1000 lines of fake data: timestamp, artist, track.
            with self.output().open('w') as output:
                for _ in range(1000):
                    output.write('{} {} {}\n'.format(
                        random.randint(0, 999),
                        random.randint(0, 999),
                        random.randint(0, 999)))

        def output(self):
            # One output file per date.
            return luigi.LocalTarget(
                self.date.strftime('data/streams_%Y_%m_%d_faked.tsv'))

This class has no dependencies. Its only effect is to produce a file of fake stream data on the local file system.
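To illustrate how each Streams task maps its date to an output path, and how a date interval fans out into one task per day, here is a standard-library sketch. The helper names are hypothetical, and June 2012 is chosen only because that month appears in the directory listing in section 2.4; in the real pipeline, Luigi's DateIntervalParameter handles the expansion.

```python
import datetime


def stream_path(date):
    """Build the output path for one day, as in Streams.output()."""
    return date.strftime("data/streams_%Y_%m_%d_faked.tsv")


def dates_in_month(year, month):
    """Yield every date in the given month, in order."""
    day = datetime.date(year, month, 1)
    while day.month == month:
        yield day
        day += datetime.timedelta(days=1)


paths = [stream_path(d) for d in dates_in_month(2012, 6)]
print(len(paths))   # 30
print(paths[0])     # data/streams_2012_06_01_faked.tsv
```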

2.4 Running locally

    PYTHONPATH='.' luigi --module top_artists AggregateArtists --local-scheduler --date-interval ...

After execution, a data directory is created under the current directory. Its contents are as follows:

    (my_python_env) [data]# ls
    artist_streams_2012-..tsv     streams_2012_06_06_faked.tsv  streams_2012_06_12_faked.tsv  streams_2012_06_18_faked.tsv  streams_2012_06_24_faked.tsv  streams_2012_06_30_faked.tsv
    streams_2012_06_01_faked.tsv  streams_2012_06_07_faked.tsv  streams_2012_06_13_faked.tsv  streams_2012_06_19_faked.tsv  streams_2012_06_25_faked.tsv
    streams_2012_06_02_faked.tsv  streams_2012_06_08_faked.tsv  streams_2012_06_14_faked.tsv  streams_2012_06_20_faked.tsv  streams_2012_06_26_faked.tsv
    streams_2012_06_03_faked.tsv  streams_2012_06_09_faked.tsv  streams_2012_06_15_faked.tsv  streams_2012_06_21_faked.tsv  streams_2012_06_27_faked.tsv
    streams_2012_06_04_faked.tsv  streams_2012_06_10_faked.tsv  streams_2012_06_16_faked.tsv  streams_2012_06_22_faked.tsv  streams_2012_06_28_faked.tsv
    streams_2012_06_05_faked.tsv  streams_2012_06_11_faked.tsv  streams_2012_06_17_faked.tsv  streams_2012_06_23_faked.tsv  streams_2012_06_29_faked.tsv

streams_*: Generated by the Streams tasks, one file per date.

artist_*: Generated by AggregateArtists; there is only one such file.

2.5 Extensions

Running the same command again does nothing, because the outputs of all tasks already exist. This reflects the fact that Luigi tasks are meant to be idempotent: the output of a job should be the same no matter how many times it is executed.
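The "nothing to do on the second run" behavior can be sketched with the standard library alone. This is a simplified illustration of the idea, not Luigi's scheduler; `run_if_missing` and `write_report` are hypothetical names.

```python
import os
import tempfile


def run_if_missing(output_path, action):
    """Run `action(output_path)` only if the output does not exist yet.

    Returns True if the action ran, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False  # output already there: the task counts as complete
    action(output_path)
    return True


def write_report(path):
    with open(path, "w") as f:
        f.write("done\n")


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "report.txt")
print(run_if_missing(target, write_report))  # True: the first run does the work
print(run_if_missing(target, write_report))  # False: the second run is a no-op
```

To force a task to run again under this scheme, delete its output file first.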

--local-scheduler tells Luigi not to connect to a central scheduler server. This mode is not recommended for production use; it is intended for testing.
