Go Concurrency Patterns: Pipelines and Cancellation


Original address: http://air.googol.im/2014/03/15/go-concurrency-patterns-pipelines-and-cancellation.html

Translated from http://blog.golang.org/pipelines.

This is an article from the official Go blog that describes how to write concurrent programs in Go. Following the evolution of one program, it introduces the problems encountered by different patterns and how to solve them. The main topics are how to connect stages of work with the pipeline pattern, and how to ensure that all goroutines and channel resources are correctly reclaimed when work is cancelled.

Go Concurrency Patterns: Pipelines and Cancellation

Author: Sameer Ajmani, blog.golang.org, March 13, 2014.

Introduction

Go's concurrency primitives make it easy to construct streaming data pipelines that make efficient use of I/O and multiple CPUs. This article presents examples of such pipelines, highlights the subtleties that arise when operations fail, and introduces techniques for dealing with failures cleanly.

What is a pipeline?

The Go language has no formal definition of a pipeline; it is just one kind of concurrent program. Informally, a pipeline is a series of stages connected by channels, where each stage is a group of goroutines running the same function. In each stage, the goroutines

    • Receive values from upstream via inbound channels
    • Perform some function on that data, usually producing new values
    • Send values downstream via outbound channels

Each stage has any number of inbound and outbound channels, except the first stage (which has only outbound channels) and the last (which has only inbound channels). The first stage is sometimes called the source or producer; the last stage is sometimes called the sink or consumer.

We'll begin with a simple example pipeline to explain these ideas and techniques. Later, we'll look at some more realistic examples.

Squaring numbers

Consider a pipeline with three stages.

The first stage, gen, is a function that emits a list of integers one by one on a channel. The gen function starts a goroutine that sends the integers on the channel and closes the channel once all the numbers have been sent:

func gen(nums ...int) <-chan int {
    out := make(chan int)
    go func() {
        for _, n := range nums {
            out <- n
        }
        close(out)
    }()
    return out
}

The second stage, sq, receives integers from one channel and sends the square of each integer on another channel. After the inbound channel is closed and this stage has sent all the values downstream, it closes the outbound channel:

func sq(in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        for n := range in {
            out <- n * n
        }
        close(out)
    }()
    return out
}

The main function sets up the pipeline and runs the final stage: it receives all the values from the second stage and prints each one, until the channel is closed:

func main() {
    // Set up the pipeline.
    c := gen(2, 3)
    out := sq(c)

    // Consume the output.
    fmt.Println(<-out) // 4
    fmt.Println(<-out) // 9
}

Because sq has the same type for its inbound and outbound channels, we can compose it any number of times. We can also rewrite main as a range loop, like the other stages:

func main() {
    // Set up the pipeline and consume the output.
    for n := range sq(sq(gen(2, 3))) {
        fmt.Println(n) // 16 then 81
    }
}

Fan-out, fan-in

Multiple functions can read from the same channel until that channel is closed; this is called fan-out. It provides a way to distribute work among a group of workers to parallelize CPU use and I/O.

A function can read from multiple inputs and proceed until all of them are closed, by multiplexing the input channels onto a single output channel that is closed when all the inputs are closed. This is called fan-in.

We can change our pipeline to run two instances of sq, each reading from the same input channel. We also introduce a new function, merge, to fan in all the results:

func main() {
    in := gen(2, 3)

    // Distribute the sq work across two goroutines that both read from in.
    c1 := sq(in)
    c2 := sq(in)

    // Consume the merged output from c1 and c2.
    for n := range merge(c1, c2) {
        fmt.Println(n) // 4 then 9, or 9 then 4
    }
}

merge converts a list of channels into a single channel by starting a goroutine for each inbound channel that copies the values to the sole outbound channel. Once all the output goroutines have been started, merge starts one more goroutine, which closes the outbound channel after all sends on that channel are done.

Sending on a closed channel panics, so it is important to ensure all sends are done before calling close. The sync.WaitGroup type provides a convenient way to arrange this synchronization:

func merge(cs ...<-chan int) <-chan int {
    var wg sync.WaitGroup
    out := make(chan int)

    // Start an output goroutine for each input channel in cs. output
    // copies values from c to out until c is closed, then calls wg.Done.
    output := func(c <-chan int) {
        for n := range c {
            out <- n
        }
        wg.Done()
    }
    wg.Add(len(cs))
    for _, c := range cs {
        go output(c)
    }

    // Start a goroutine to close out once all the output goroutines are
    // done, so out is closed exactly once. This goroutine must start
    // after the wg.Add call.
    go func() {
        wg.Wait()
        close(out)
    }()
    return out
}
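As an aside, here is a minimal sketch (illustrative, not part of the original article) of the failure that this synchronization prevents; sending on a channel that has already been closed panics at runtime:

package main

// Demonstrates why close must wait for all senders to finish:
// a send on an already-closed channel panics.
func main() {
    out := make(chan int, 1)
    close(out)
    out <- 1 // panic: send on closed channel
}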

Stopping short

There is a pattern to our pipeline functions:

    • Stages close their outbound channels when all the send operations are done
    • Stages keep receiving values from inbound channels until those channels are closed

This pattern allows each receiving stage to be written as a range loop and ensures that all goroutines exit once all values have been successfully sent downstream.

But in real pipelines, stages don't always receive all the inbound values. Sometimes this is by design: the receiver may only need a subset of the values to make progress. More often, a stage exits early because an inbound value represents an error in an earlier stage. In either case the receiver should not have to wait for the remaining values to arrive, and we want earlier stages to stop producing values that later stages don't need.

In our example pipeline, if a stage fails to consume all the inbound values, the goroutines attempting to send those values will block forever:

    // Consume the first value from the output.
    out := merge(c1, c2)
    fmt.Println(<-out) // 4 or 9
    return
    // Since we don't receive the second value from out,
    // one of the output goroutines hangs attempting to send it.
}

This is a resource leak: goroutines consume memory and runtime resources, and heap references held by goroutine stacks keep data from being garbage collected. Goroutines themselves are not garbage collected; they must exit on their own (rather than being killed from the outside).
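To make the leak concrete, here is a minimal, illustrative sketch (not from the original article) that counts goroutines before and after abandoning a sender; runtime.NumGoroutine and the short sleep are used only for demonstration:

package main

import (
    "fmt"
    "runtime"
    "time"
)

// leaky starts a goroutine that sends on an unbuffered channel that is
// never received from, so the goroutine blocks forever.
func leaky() {
    ch := make(chan int)
    go func() {
        ch <- 42 // blocks forever: no receiver
    }()
}

func main() {
    before := runtime.NumGoroutine()
    leaky()
    time.Sleep(100 * time.Millisecond) // give the goroutine time to start and block
    fmt.Println("leaked goroutines:", runtime.NumGoroutine()-before) // prints 1
}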

We need to arrange for the upstream stages of our pipeline to exit even when the downstream stages fail to receive all the inbound values. One way to do this is to change the outbound channels to have a buffer. A buffer can hold a fixed number of values; as long as there is room in the buffer, a send operation completes immediately (without blocking).

If the number of values to be sent is known at channel creation time, a buffer can simplify the code. For example, we can rewrite gen to copy the whole list of integers into a buffered channel and avoid creating a new goroutine:

func gen(nums ...int) <-chan int {
    out := make(chan int, len(nums))
    for _, n := range nums {
        out <- n
    }
    close(out)
    return out
}

Returning to the blocked goroutines in our pipeline, we might consider adding a buffer to the outbound channel of merge:

func merge(cs ...<-chan int) <-chan int {
    var wg sync.WaitGroup
    out := make(chan int, 1) // enough space for the unread inputs
    // ... the rest is unchanged ...

While this fixes the blocked goroutine in this program, it is bad code. The choice of buffer size 1 here depends on knowing the number of values merge will receive and the number of values the downstream stages will consume. This is fragile: if we pass an additional value to gen, or if the downstream stage reads any fewer values, we will again have blocked goroutines.

Instead of relying on buffers, we need to provide a way for downstream stages to indicate to the senders that they will stop accepting input.

Explicit cancellation

When main decides to exit without receiving all the values from out, it must tell the goroutines in the upstream stages to abandon the values they are trying to send. It does so by sending values on a channel called done. It sends two values, since there are potentially two blocked senders:

func main() {
    in := gen(2, 3)

    // Distribute the sq work across two goroutines that both read from in.
    c1 := sq(in)
    c2 := sq(in)

    // Consume the first value from output.
    done := make(chan struct{}, 2)
    out := merge(done, c1, c2)
    fmt.Println(<-out) // 4 or 9

    // Tell the remaining senders we're leaving.
    done <- struct{}{}
    done <- struct{}{}
}

The sending goroutines replace their send operation with a select statement that proceeds either when the send on out happens or when a value is received from done. The element type of done is the empty struct because the value doesn't matter: it is the receive event itself that indicates the send on out should be abandoned. The output goroutines continue looping on their inbound channel, c, so the upstream stages are not blocked:

func merge(done <-chan struct{}, cs ...<-chan int) <-chan int {
    var wg sync.WaitGroup
    out := make(chan int)

    // Start an output goroutine for each input channel in cs. output
    // copies values from c to out until c is closed or it receives a
    // value from done, then output calls wg.Done.
    output := func(c <-chan int) {
        for n := range c {
            select {
            case out <- n:
            case <-done:
            }
        }
        wg.Done()
    }
    // ... the rest is unchanged ...

This approach has a problem: each downstream receiver needs to know the number of potentially blocked upstream senders and arrange to signal each of them. Keeping track of these counts is tedious and error-prone.

We need a way to tell an unknown and unbounded number of goroutines to stop sending their values downstream. In Go, we can do this by closing a channel, because a receive operation on a closed channel can always proceed immediately, yielding the element type's zero value.
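Here is a minimal sketch (illustrative, not from the original article) of this broadcast behavior: a single close releases every goroutine blocked receiving on the channel:

package main

import (
    "fmt"
    "sync"
)

func main() {
    done := make(chan struct{})
    var wg sync.WaitGroup

    for i := 0; i < 3; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            <-done // blocks until done is closed, then yields struct{}{}
            fmt.Println("worker", id, "released")
        }(i)
    }

    close(done) // one close unblocks all three receivers at once
    wg.Wait()
}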

This means that main can unblock all the senders simply by closing the done channel. The close is effectively a broadcast signal to all the senders. We extend each of our pipeline functions to accept done as a parameter and arrange for the close to happen via a defer statement, so that all return paths from main will signal all the pipeline stages to exit.

func main() {
    // Set up a done channel that's shared by the whole pipeline,
    // and close that channel when this pipeline exits, as a signal
    // for all the goroutines we started to exit.
    done := make(chan struct{})
    defer close(done)

    in := gen(done, 2, 3)

    // Distribute the sq work across two goroutines that both read from in.
    c1 := sq(done, in)
    c2 := sq(done, in)

    // Consume the first value from output.
    out := merge(done, c1, c2)
    fmt.Println(<-out) // 4 or 9

    // done will be closed by the deferred call.
}

Each of our pipeline stages is now free to return as early as possible: sq can return from its loop once done is closed, since we know that in that case the upstream stage gen will stop sending as well. sq ensures its out channel is closed on all return paths via a defer statement:

func sq(done <-chan struct{}, in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        defer close(out)
        for n := range in {
            select {
            case out <- n * n:
            case <-done:
                return
            }
        }
    }()
    return out
}

Here are the guidelines for pipeline construction:

    • Stages close their outbound channels when all the send operations are done
    • Stages keep receiving values from inbound channels until those channels are closed or the senders are unblocked

Pipelines unblock senders either by ensuring there is enough buffer space for all the values that will be sent, or by explicitly signaling senders when the receiver may abandon the channel.

Digesting a tree

Let's consider a more realistic pipeline.

MD5 is a message-digest algorithm that is often used as a file checksum. The command-line utility md5sum prints digest values for a list of files.

Our example program is like md5sum, but instead takes a single directory as its argument and prints the digest values for each regular file under that directory, sorted by path name.

Our main function invokes a helper function, MD5All, which returns a map from path name to digest value, and then sorts and prints the results:

func main() {
    // Calculate the MD5 sum of all files under the specified directory,
    // then print the results sorted by path name.
    m, err := MD5All(os.Args[1])
    if err != nil {
        fmt.Println(err)
        return
    }
    var paths []string
    for path := range m {
        paths = append(paths, path)
    }
    sort.Strings(paths)
    for _, path := range paths {
        fmt.Printf("%x  %s\n", m[path], path)
    }
}

The MD5All function is the focus of our discussion. The non-concurrent implementation in serial.go simply walks the directory tree, reading and summing each file:

// MD5All reads all the files in the file tree rooted at root and returns
// a map from file path to the MD5 sum of the file's contents. If the
// directory walk fails or any read operation fails, MD5All returns an error.
func MD5All(root string) (map[string][md5.Size]byte, error) {
    m := make(map[string][md5.Size]byte)
    err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        if info.IsDir() {
            return nil
        }
        data, err := ioutil.ReadFile(path)
        if err != nil {
            return err
        }
        m[path] = md5.Sum(data)
        return nil
    })
    if err != nil {
        return nil, err
    }
    return m, nil
}

Parallel digestion

In parallel.go, we split MD5All into a two-stage pipeline. The first stage, sumFiles, walks the tree, digests each file in a new goroutine, and sends the results on a channel with value type result:

type result struct {
    path string
    sum  [md5.Size]byte
    err  error
}

sumFiles returns two channels: one for the results and another for the error returned by filepath.Walk. The walk function starts a new goroutine to process each regular file, then checks done. If done has been closed, the walk stops immediately:

func sumFiles(done <-chan struct{}, root string) (<-chan result, <-chan error) {
    // For each regular file, start a goroutine that sums the file and
    // sends the result on c. Send the result of the walk on errc.
    c := make(chan result)
    errc := make(chan error, 1)
    go func() {
        var wg sync.WaitGroup
        err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
            if err != nil {
                return err
            }
            if info.IsDir() {
                return nil
            }
            wg.Add(1)
            go func() {
                data, err := ioutil.ReadFile(path)
                select {
                case c <- result{path, md5.Sum(data), err}:
                case <-done:
                }
                wg.Done()
            }()
            // Abort the walk if done is closed.
            select {
            case <-done:
                return errors.New("walk canceled")
            default:
                return nil
            }
        })
        // Walk has returned, so all calls to wg.Add are done. Start a
        // goroutine to close c once all the sends are done.
        go func() {
            wg.Wait()
            close(c)
        }()
        // No select needed here, since errc is buffered.
        errc <- err
    }()
    return c, errc
}

MD5All receives the digest values from c. MD5All returns early on error, closing done via a defer:

func MD5All(root string) (map[string][md5.Size]byte, error) {
    // MD5All closes the done channel when it returns; it may do so
    // before receiving all the values from c and errc.
    done := make(chan struct{})
    defer close(done)

    c, errc := sumFiles(done, root)

    m := make(map[string][md5.Size]byte)
    for r := range c {
        if r.err != nil {
            return nil, r.err
        }
        m[r.path] = r.sum
    }
    if err := <-errc; err != nil {
        return nil, err
    }
    return m, nil
}

Bounded parallelism

The MD5All implementation in parallel.go starts a new goroutine for each file. In a directory with many large files, this may allocate more memory than is available on the machine.

We can limit these allocations by bounding the number of files read in parallel. In bounded.go, we do this by creating a fixed number of goroutines for reading files. Our pipeline now has three stages: walk the tree, read and digest the files, and collect the digests.

The first stage, walkFiles, emits the paths of regular files in the tree:

func walkFiles(done <-chan struct{}, root string) (<-chan string, <-chan error) {
    paths := make(chan string)
    errc := make(chan error, 1)
    go func() {
        // Close the paths channel after Walk returns.
        defer close(paths)
        // No select needed for this send, since errc is buffered.
        errc <- filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
            if err != nil {
                return err
            }
            if info.IsDir() {
                return nil
            }
            select {
            case paths <- path:
            case <-done:
                return errors.New("walk canceled")
            }
            return nil
        })
    }()
    return paths, errc
}

The middle stage starts a fixed number of digester goroutines that receive file names from paths and send results on channel c:

func digester(done <-chan struct{}, paths <-chan string, c chan<- result) {
    for path := range paths {
        data, err := ioutil.ReadFile(path)
        select {
        case c <- result{path, md5.Sum(data), err}:
        case <-done:
            return
        }
    }
}

Unlike our previous examples, digester does not close its output channel, since multiple goroutines are sending on a shared channel. Instead, the code in MD5All arranges for the channel to be closed when all the digesters are done:

    // Start a fixed number of goroutines to read and digest files.
    c := make(chan result)
    var wg sync.WaitGroup
    const numDigesters = 20
    wg.Add(numDigesters)
    for i := 0; i < numDigesters; i++ {
        go func() {
            digester(done, paths, c)
            wg.Done()
        }()
    }
    go func() {
        wg.Wait()
        close(c)
    }()

We could instead have each digester create and return its own output channel, but then we would need additional goroutines to fan in the results, as sketched below.
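For comparison, here is a hedged sketch of that alternative design (hypothetical, not the article's code; it reuses the article's result type and would be combined with a merge-style fan-in like the one shown earlier, at the cost of one extra goroutine per channel):

// digesterChan is a hypothetical variant: each digester owns its own
// output channel and can therefore close it safely, because it is the
// channel's only sender.
func digesterChan(done <-chan struct{}, paths <-chan string) <-chan result {
    c := make(chan result)
    go func() {
        defer close(c) // safe: this goroutine is c's only sender
        for path := range paths {
            data, err := ioutil.ReadFile(path)
            select {
            case c <- result{path, md5.Sum(data), err}:
            case <-done:
                return
            }
        }
    }()
    return c
}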

The final stage receives all the results from c, then checks the error from errc. This check cannot happen any earlier, since before this point walkFiles may be blocked sending values downstream:

    m := make(map[string][md5.Size]byte)
    for r := range c {
        if r.err != nil {
            return nil, r.err
        }
        m[r.path] = r.sum
    }
    // Check whether the Walk failed.
    if err := <-errc; err != nil {
        return nil, err
    }
    return m, nil
}

Conclusion

This article has presented techniques for constructing streaming data pipelines in Go. Failures in such pipelines must be handled carefully, since each stage in the pipeline can block while attempting to send values downstream, and the downstream stages may no longer care about the incoming data. We showed how closing a channel can broadcast a "done" signal to all the goroutines started by a pipeline, and defined guidelines for constructing pipelines correctly.

Further reading:

Go Concurrency Patterns (video) presents the basics of Go's concurrency primitives and several ways to apply them.
Advanced Go Concurrency Patterns (video) covers more complex uses of Go's primitives, especially select.
Douglas McIlroy's paper "Squinting at Power Series" shows how Go-like concurrency provides elegant support for complex calculations.
