Golang has proven ideal for concurrent programming: goroutines are more readable, elegant, and efficient than callback-style asynchronous code. This article presents a pipeline execution model implemented in Golang, suitable for batch processing of large volumes of data (ETL scenarios).
Imagine an application scenario like this:
(1) Load user reviews from database A (MySQL); the volume is large, e.g. 1 billion rows.
(2) For each review, join the associated user data from database B (MySQL) by user ID.
(3) Call an NLP (natural language processing) service to process each review.
(4) Write the results to database C (ElasticSearch).
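To make the usage example further below concrete, here is a minimal sketch of what the four steps might look like as Go functions. The function names match the usage code later in the article; the Review type, the signatures, and the stub bodies are illustrative assumptions, not part of the original code:

package main

// Review is a hypothetical record for one user comment; the fields are
// assumptions for illustration. Steps (2) and (3) modify the records in
// place, which is what the usage code below relies on.
type Review struct {
    ID        int64
    UserID    int64
    UserName  string // filled in by step (2)
    Text      string
    Sentiment string // filled in by step (3)
}

// (1) Load the next batch of at most n reviews after *checkpoint from
// database A, advancing *checkpoint past the last row read.
func extractReviewsFromA(checkpoint *int64, n int) ([]*Review, error) {
    // a real implementation would run something like
    // "SELECT ... WHERE id > ? ORDER BY id LIMIT ?" against MySQL
    return nil, nil
}

// (2) Join user data from database B by UserID.
func joinUserFromB(data []*Review) error { return nil }

// (3) Call the NLP service for each review.
func nlp(data []*Review) error { return nil }

// (4) Write the processed records to ElasticSearch, keyed by review ID
// so that a re-run overwrites instead of duplicating (see characteristic 5 below).
func loadDataToC(data []*Review) error { return nil }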
Various problems encountered in practice distilled into two requirements:
Requirement 1: When a problem occurs (for example, a failure in any of the databases), processing is interrupted, and a checkpoint lets it resume from where it stopped.
Requirement 2: Each step gets its own sensible concurrency limit, so that the databases and the NLP service carry a reasonable load (taking as many resources as possible to improve ETL throughput, without affecting other services). For example, steps (1) through (4) are given concurrency limits of 1, 8, 32, and 2, respectively.
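Requirement 1 only needs the checkpoint to be loaded at startup and persisted after each batch. A minimal sketch of the two helpers the usage code below calls, assuming the checkpoint is simply the last processed review ID kept in a local file (a database table would work just as well; the file name is a made-up example):

package main

import (
    "os"
    "strconv"
    "strings"
)

// checkpointFile is a hypothetical location for the persisted checkpoint.
const checkpointFile = "etl.checkpoint"

// loadCheckpoint returns the last committed review ID, or 0 on first run.
func loadCheckpoint() int64 {
    b, err := os.ReadFile(checkpointFile)
    if err != nil {
        return 0 // no checkpoint yet: start from the beginning
    }
    id, _ := strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
    return id
}

// saveCheckpoint persists the checkpoint. Writing to a temp file and then
// renaming keeps the checkpoint intact even if the process dies mid-write.
func saveCheckpoint(id int64) error {
    tmp := checkpointFile + ".tmp"
    if err := os.WriteFile(tmp, []byte(strconv.FormatInt(id, 10)), 0o644); err != nil {
        return err
    }
    return os.Rename(tmp, checkpointFile)
}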
This is a typical pipeline (assembly line) execution model. Each batch of data (say, 100 records) is a product on the line; the 4 steps correspond to 4 stations, and each station passes its semi-finished product to the next when it is done. The number of products each station can work on at the same time differs from station to station.
Let me paste the code first and then explain it. This is a Golang implementation of the pipeline, ready to use as-is:
package main

import "sync"

func hasClosed(c <-chan struct{}) bool {
    select {
    case <-c:
        return true
    default:
        return false
    }
}

type SyncFlag interface {
    Wait()
    Chan() <-chan struct{}
    Done() bool
}

func NewSyncFlag() (done func(), flag SyncFlag) {
    f := &syncFlag{c: make(chan struct{})}
    return f.done, f
}

type syncFlag struct {
    once sync.Once
    c    chan struct{}
}

func (f *syncFlag) done() {
    f.once.Do(func() { close(f.c) })
}

func (f *syncFlag) Wait() { <-f.c }

func (f *syncFlag) Chan() <-chan struct{} { return f.c }

func (f *syncFlag) Done() bool { return hasClosed(f.c) }

type pipelineThread struct {
    sigs         []chan struct{} // one signal channel per step
    chanExit     chan struct{}   // closed when this task exits
    interrupt    SyncFlag
    setInterrupt func()
    err          error
}

func newPipelineThread(l int) *pipelineThread {
    p := &pipelineThread{
        sigs:     make([]chan struct{}, l),
        chanExit: make(chan struct{}),
    }
    p.setInterrupt, p.interrupt = NewSyncFlag()
    for i := range p.sigs {
        p.sigs[i] = make(chan struct{})
    }
    return p
}

type Pipeline struct {
    mtx         sync.Mutex
    workerChans []chan struct{} // one buffered channel (semaphore) per step
    prevThd     *pipelineThread
}

// NewPipeline creates a pipeline. The number of arguments is the number of
// steps in each task; each argument sets the concurrency limit of the
// corresponding step.
func NewPipeline(workers ...int) *Pipeline {
    if len(workers) < 1 {
        panic("NewPipeline need at least one argument")
    }
    workersChan := make([]chan struct{}, len(workers))
    for i := range workersChan {
        workersChan[i] = make(chan struct{}, workers[i])
    }
    prevThd := newPipelineThread(len(workers))
    for _, sig := range prevThd.sigs {
        close(sig)
    }
    close(prevThd.chanExit)
    return &Pipeline{
        workerChans: workersChan,
        prevThd:     prevThd,
    }
}

// Async pushes a task into the pipeline. If the first step's concurrency has
// reached its limit, the call blocks. If another task in the pipeline has
// failed (returned non-nil), the task is not executed and Async returns false.
func (p *Pipeline) Async(works ...func() error) bool {
    if len(works) != len(p.workerChans) {
        panic("Async: arguments number not matched to NewPipeline(...)")
    }
    p.mtx.Lock()
    if p.prevThd.interrupt.Done() {
        p.mtx.Unlock()
        return false
    }
    prevThd := p.prevThd
    thisThd := newPipelineThread(len(p.workerChans))
    p.prevThd = thisThd
    p.mtx.Unlock()

    lock := func(idx int) bool {
        select {
        case <-prevThd.interrupt.Chan():
            return false
        case <-prevThd.sigs[idx]: // wait for signal
        }
        select {
        case <-prevThd.interrupt.Chan():
            return false
        case p.workerChans[idx] <- struct{}{}: // get lock
        }
        return true
    }

    if !lock(0) {
        thisThd.setInterrupt()
        <-prevThd.chanExit
        thisThd.err = prevThd.err
        close(thisThd.chanExit)
        return false
    }

    go func() { // watch interrupt of previous thread
        select {
        case <-prevThd.interrupt.Chan():
            thisThd.setInterrupt()
        case <-thisThd.chanExit:
        }
    }()

    go func() {
        var err error
        for i, work := range works {
            close(thisThd.sigs[i]) // signal next thread
            if work != nil {
                err = work()
            }
            if err != nil || (i+1 < len(works) && !lock(i+1)) {
                thisThd.setInterrupt()
                break
            }
            <-p.workerChans[i] // release lock
        }
        <-prevThd.chanExit
        if prevThd.interrupt.Done() {
            thisThd.setInterrupt()
        }
        if prevThd.err != nil {
            thisThd.err = prevThd.err
        } else {
            thisThd.err = err
        }
        close(thisThd.chanExit)
    }()
    return true
}

// Wait blocks until every task in the pipeline has completed or failed, and
// returns the first error, or nil if there was none.
func (p *Pipeline) Wait() error {
    p.mtx.Lock()
    lastThd := p.prevThd
    p.mtx.Unlock()
    <-lastThd.chanExit
    return lastThd.err
}
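Before the ETL program itself, a tiny standalone demo may help show the component's semantics: per-step concurrency caps, Async blocking or returning false once a task fails, and Wait returning the first error. The failing task and the timings are made up for illustration, and the Pipeline code above is assumed to be in the same package:

package main

import (
    "errors"
    "fmt"
    "time"
)

func main() {
    p := NewPipeline(2, 1) // step 1: up to 2 concurrent tasks; step 2: up to 1
    for i := 0; i < 3; i++ {
        i := i // capture loop variable for the closures
        ok := p.Async(func() error {
            time.Sleep(10 * time.Millisecond) // simulate step-1 work
            if i == 1 {
                return errors.New("task 1 failed in step 1") // made-up failure
            }
            return nil
        }, func() error {
            fmt.Println("task", i, "reached step 2")
            return nil
        })
        if !ok {
            fmt.Println("pipeline already interrupted; task", i, "rejected")
            break
        }
    }
    fmt.Println("wait:", p.Wait()) // prints the first error raised inside the pipeline
}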
With this pipeline component, our ETL program becomes simple, efficient, and reliable, freeing the programmer from tedious concurrency control:
package main

import "log"

func main() {
    checkpoint := loadCheckpoint()
    // step (1) runs outside the pipeline; the last step saves the checkpoint
    pipeline := NewPipeline(8, 32, 2, 1)
    for {
        // (1)
        // load 100 records and advance the checkpoint variable.
        // data is a slice; each element is one review. The join and NLP
        // steps below modify each record of data in place.
        data, err := extractReviewsFromA(&checkpoint, 100)
        if err != nil {
            log.Print(err)
            break
        }
        curCheckpoint := checkpoint
        ok := pipeline.Async(func() error {
            // (2)
            return joinUserFromB(data)
        }, func() error {
            // (3)
            return nlp(data)
        }, func() error {
            // (4)
            return loadDataToC(data)
        }, func() error {
            // (5) save checkpoint
            log.Print("done:", curCheckpoint)
            return saveCheckpoint(curCheckpoint)
        })
        if !ok {
            break
        }
        if len(data) < 100 {
            break // finished
        }
    }
    err := pipeline.Wait()
    if err != nil {
        log.Print(err)
    }
}
Characteristics of the pipeline execution model:
1. The pipeline limits the concurrency of each step independently. If step (4) is at its concurrency limit, a task that has finished step (3) blocks and waits until some task completes step (4). (The two channel idioms this is built on are isolated in a sketch after this list.)
2. In the scenario above, the pipeline works on at most 1+8+32+2+1 = 44 batches at once, i.e. 4,400 records, so memory overhead stays bounded.
3. Within each step, no task is scheduled earlier than the task submitted before it.
For example, suppose two tasks are executing, <1> submitted first and <2> second. If <2> would finish step (4) earlier than <1>, <2> must block and wait; only after <1> finishes step (4) and starts step (5) may <2> start step (5). And since step (5) has a concurrency limit of 1, <2>'s step (5) must additionally wait for <1>'s step (5) to finish before it starts. This mechanism guarantees that checkpoints are saved in the order the tasks were pushed with Async, so that recovering from an interruption can never skip data.
4. If any step of a task fails (for example, a database failure), that task aborts, the next call to Async returns false, and Pipeline.Wait() returns the first error, so the whole pipeline shuts down in a controlled way.
For example, suppose three tasks are executing: <1>, <2>, <3>. If <2> fails at step (4) (loadDataToC returns a non-nil error), then <3>, whatever step it is in at that moment, will not enter its next step. <1> is unaffected and runs to completion. Wait() blocks until <1>, <2>, and <3> have all completed or aborted, then returns loadDataToC's error.
5. It cannot prevent data beyond the last saved checkpoint from already having been written when an interruption happens. On the next start of the program, those batches are executed again and their data is written a second time.
For example: <2> fails at step (4) while <3>'s step (4) has already succeeded (its data is written). Neither <2>'s nor <3>'s step (5) runs, so the latest checkpoint is the one <1> saved. On the next start, <2> and <3> are re-executed, and <3>'s data is written again; therefore writes should overwrite by record ID (i.e., be idempotent).
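The two channel idioms behind characteristics 1 and 3 (the sketch referenced from item 1 above) can be shown in isolation: a buffered channel acting as a counting semaphore, and close() acting as a one-to-many broadcast. This is not the article's code, just the underlying patterns:

package main

import (
    "fmt"
    "sync"
)

func main() {
    // Idiom 1: a buffered channel as a counting semaphore. At most 2
    // goroutines hold a slot at a time; this is how workerChans caps
    // each step's concurrency (characteristic 1).
    sem := make(chan struct{}, 2)
    var wg sync.WaitGroup
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            sem <- struct{}{} // acquire: blocks while both slots are taken
            fmt.Println("worker", i, "holds a slot")
            <-sem // release the slot
        }(i)
    }
    wg.Wait()

    // Idiom 2: closing a channel as a broadcast. Each task closes
    // sigs[i] to tell the next task it may enter step i, which is what
    // keeps every step in Async submission order (characteristic 3).
    mayEnter := make(chan struct{})
    done := make(chan struct{})
    go func() {
        <-mayEnter // blocks until the channel is closed
        fmt.Println("next task enters the step")
        close(done)
    }()
    close(mayEnter) // unblocks every current and future receiver
    <-done
}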
To summarize: besides limiting per-step concurrency, the pipeline execution model keeps memory overhead bounded and gives failure recovery due consideration.