Brief introduction
Go's concurrency primitives let developers build data pipelines in a style similar to Unix pipes, making efficient use of I/O and multiple CPU cores.
This article walks through some examples of building pipelines, with a particular focus on how pipelines handle errors and early termination.
Reading suggestions
Data pipelines take full advantage of multicore hardware, and they are built on the channel type and the go keyword.
Channels and goroutines appear throughout this article. If you are not familiar with these two concepts, it is recommended that you first read the two earlier articles published on this public account: The Go Memory Model (parts 1 and 2).
If you are familiar with the "producer/consumer" model from operating systems, that will also help you understand the pipelines in this article.
Most of the explanations in this article are driven by code. In other words, if a code snippet is unclear, it is recommended that you run it on your own machine or on play.golang.org after reading, and add print statements by hand for any details you do not understand.
Because the English original, Go Concurrency Patterns: Pipelines and cancellation, is rather long, this article covers only the theory and some simple examples.
In the next article, we will walk through a real-world example in detail: computing MD5 checksums in parallel.
What is a "pipeline"?
The Go language has no formal definition of a "pipeline"; it is just one of many kinds of concurrent programs. Here is an informal definition: a pipeline is made up of multiple stages, and adjacent stages are connected by channels.
Each stage consists of a group of goroutines running the same function. In each stage, these goroutines perform three actions:
Receive data from upstream via inbound channels
Perform some operations on the received data, typically generating new data
Send the newly generated data downstream via outbound channels
Except for the first and last stages, each stage can have any number of inbound and outbound channels.
Naturally, the first stage has only outbound channels, and the last stage has only inbound channels.
We usually call the first stage the "producer" or "source", and the last stage the "consumer" or "receiver".
First, we demonstrate the concept and its techniques through a simple example; a real-world example comes later.
Getting started with pipelines: computing squares
Suppose we have a pipeline that consists of three stages.
The first stage is the gen function, which converts a list of integers into a channel that emits those numbers.
gen starts a goroutine that sends each number on the channel and closes the channel once all numbers have been sent.
The code is as follows:
func gen(nums ...int) <-chan int {
    out := make(chan int)
    go func() {
        for _, n := range nums {
            out <- n
        }
        close(out)
    }()
    return out
}
The second stage is the sq function, which receives integers from a channel and returns a channel that emits the square of each received integer.
When its inbound channel is closed and all values have been sent downstream, it closes the outbound channel. The code is as follows:
func sq(in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        for n := range in {
            out <- n * n
        }
        close(out)
    }()
    return out
}
The main function sets up the pipeline and runs the final stage: it receives values from the second stage and prints each one until the inbound channel from upstream is closed. The code is as follows:
func main() {
    // Set up the pipeline.
    c := gen(2, 3)
    out := sq(c)

    // Consume the output.
    fmt.Println(<-out) // 4
    fmt.Println(<-out) // 9
}
Because sq's inbound and outbound channels have the same type, we can compose sq with itself any number of times. For example:
func main() {
    // Set up the pipeline and consume the output.
    for n := range sq(sq(gen(2, 3))) {
        fmt.Println(n) // 16 then 81
    }
}
If we modify the gen function a little, we can simulate Haskell-style lazy evaluation. Interested readers can try this themselves.
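As a rough illustration of that idea (this sketch is not part of the original article), a hypothetical genFrom could produce an unbounded stream whose values are only computed when a downstream stage is ready to receive them. Note that without the cancellation mechanism introduced later, its goroutine never exits:

// genFrom is a hypothetical variant of gen: it emits an unbounded,
// lazily produced stream of integers starting at start. Each value is
// generated only when a receiver is ready, much like a lazy list.
func genFrom(start int) <-chan int {
    out := make(chan int)
    go func() {
        for n := start; ; n++ {
            out <- n // blocks until a downstream stage receives
        }
    }()
    return out
}

func main() {
    squares := sq(genFrom(1)) // sq as defined above
    for i := 0; i < 5; i++ {
        fmt.Println(<-squares) // 1 4 9 16 25
    }
}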
Advanced pipelines: fan-out and fan-in
Fan-out: multiple functions read from the same channel until that channel is closed. This mechanism distributes work across a group of workers, making better parallel use of the CPU and I/O.
Fan-in: a single function reads from multiple channels, merging the data into one channel, until all of the input channels are closed.
(The original post includes a diagram that gives an intuitive picture of fan-in.)
Let's change the pipeline from the previous example to run two instances of sq, both reading from the same channel.
We introduce a new function, merge, to fan in the results:
func main() {
    in := gen(2, 3)

    // Start two sq instances, i.e. two goroutines both reading from channel "in".
    c1 := sq(in)
    c2 := sq(in)

    // merge combines channels c1 and c2; this loop consumes merge's output.
    for n := range merge(c1, c2) {
        fmt.Println(n) // prints 4 then 9, or 9 then 4
    }
}
The merge function converts a list of channels into a single channel by starting one goroutine per inbound channel, each of which copies data to the outbound channel.
The implementation of merge is shown below (note the wg variable):
func merge(cs ...<-chan int) <-chan int {
    var wg sync.WaitGroup
    out := make(chan int)

    // Start an output goroutine for each input channel in cs.
    // output copies values from c to out until c is closed, then calls wg.Done.
    output := func(c <-chan int) {
        for n := range c {
            out <- n
        }
        wg.Done()
    }
    wg.Add(len(cs))
    for _, c := range cs {
        go output(c)
    }

    // Start a goroutine that closes out once all the output goroutines are done.
    // This goroutine must start after the wg.Add call.
    go func() {
        wg.Wait()
        close(out)
    }()
    return out
}
In the code above, each inbound channel has a corresponding output goroutine. After all the output goroutines have been started, merge launches one additional goroutine, which waits for all sends on the inbound channels to finish and then closes the outbound channel.
Sending on a channel that has already been closed causes a panic, so we must ensure that all sends have finished before the channel is closed.
sync.WaitGroup provides a simple way to organize this synchronization: it ensures that out is closed only after every inbound channel (cs ...<-chan int) passed to merge has been drained and every output goroutine has finished.
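As a quick aside (not from the original article), here is a minimal snippet showing why the close must wait for all sends to finish:

ch := make(chan int, 1)
ch <- 1   // fine: the send completes before the close
close(ch)
ch <- 2   // panic: send on closed channel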
Stop and think for a moment
Functions that use pipelines follow a fixed pattern:
A stage closes its outbound channels when all of its send operations (ch<-) are finished.
A stage keeps receiving data from its inbound channels until all of those channels are closed.
This pattern allows each receiving stage to be written as a range loop, and it guarantees that every goroutine exits as soon as all of its data has been sent downstream.
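Here is a minimal, illustrative skeleton of such a stage (not code from the original article; process is a hypothetical stand-in for whatever work the stage does):

// stage receives from in, applies process to each value,
// and sends the results downstream until in is closed.
// process is a hypothetical helper representing the stage's work.
func stage(in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        defer close(out)    // close the outbound channel when all sends are done
        for n := range in { // receive until the inbound channel is closed
            out <- process(n)
        }
    }()
    return out
}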
In reality, a stage does not always receive all of its inbound data. Sometimes this is by design: the receiver may only need a subset of the data in order to make progress.
More often, a stage exits early because a value from an earlier stage represented an error.
In either case the receiver should not have to wait for the remaining values to arrive.
The result we want is this: when a later stage no longer needs the data, the upstream stages stop producing it.
In our example pipeline, if a stage fails to consume all of its inbound data, the goroutines attempting to send that data will block forever. Consider the following snippet:
    // Consume only the first value from out.
    out := merge(c1, c2)
    fmt.Println(<-out) // 4 or 9
    return
    // Since we no longer receive the second value from out,
    // one of the output goroutines blocks on its send.
}
This is clearly a resource leak. Goroutines consume memory and runtime resources, and the heap references held by a goroutine's stack keep that data from being garbage collected.
Goroutines cannot be reclaimed by the GC; they must exit on their own.
We need to rearrange the stages of the pipeline so that upstream stages can still exit cleanly when downstream stages fail to receive all of their data.
One way is to use a buffered channel as the outbound channel. A buffered channel can hold a fixed number of values;
if there is room left in the buffer, a send operation returns immediately. Consider the following example:
c := make(chan int, 2) // buffer size 2
c <- 1                 // returns immediately
c <- 2                 // returns immediately
c <- 3                 // blocks until another goroutine runs <-c and receives the 1
If the number of values to be sent is known when the channel is created, a buffer can simplify the code.
Staying with the squares example, we can rewrite gen to copy the list of integers into a buffered channel, which avoids creating a new goroutine:
func gen(nums ...int) <-chan int {
    out := make(chan int, len(nums))
    for _, n := range nums {
        out <- n
    }
    close(out)
    return out
}
Returning to the blocked goroutine in our pipeline, we could also have the merge function return a buffered channel:
func merge(cs ...<-chan int) <-chan int {
    var wg sync.WaitGroup
    out := make(chan int, 1) // enough room for the unread value in this example
    // ... the rest is unchanged ...
Although this fixes the blocked goroutine in this particular program, it is a bad idea in the long run.
Choosing a buffer size of 1 rests on two assumptions:
We already know that merge will receive exactly two inbound channels.
We already know how many values the downstream stage will consume.
This code is fragile: if we pass one more value to gen, or the downstream stage reads even one value fewer, goroutines will be blocked again.
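For example (a hypothetical variation, not in the original article), passing one extra value to gen reintroduces the leak even with the buffered out channel:

func main() {
    in := gen(2, 3, 4) // one more value than the buffer plan assumed

    c1 := sq(in)
    c2 := sq(in)

    out := merge(c1, c2) // out has a buffer of only 1
    fmt.Println(<-out)   // consume a single value

    // Three squares are produced, but only one is consumed and one fits
    // in the buffer, so one output goroutine blocks on its send forever.
}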
To solve the problem properly, we need a mechanism by which downstream stages can tell upstream senders that they will stop accepting input.
Let's look at that mechanism next.
Explicit cancellation
When main decides to exit without receiving all the values from out, it must tell the goroutines in the upstream stages to abandon the values they are trying to send. It does this by sending values on a channel called done. Since there are two potentially blocked senders, it sends two values. The code is as follows:
func main() {
    in := gen(2, 3)

    // Start two goroutines running sq; both read from in.
    c1 := sq(in)
    c2 := sq(in)

    // Consume the first value produced by output.
    done := make(chan struct{}, 2)
    out := merge(done, c1, c2)
    fmt.Println(<-out) // 4 or 9

    // Tell the remaining senders that we are leaving
    // and will not receive their data.
    done <- struct{}{}
    done <- struct{}{}
}
The sending goroutines replace their plain send with a select statement that proceeds either when the send on out succeeds or when a value is received from done. The value type of done is struct{} because the value itself does not matter: it is the receive event that signals that the send on out should be abandoned. The output goroutines keep looping over their inbound channel c, so the upstream stages are not blocked. (We will discuss shortly how to let this loop return early.) The merge function implemented with the done channel looks like this:
func merge(done <-chan struct{}, cs ...<-chan int) <-chan int {
    var wg sync.WaitGroup
    out := make(chan int)

    // Start an output goroutine for each input channel in cs.
    // output copies values from c to out until c is closed
    // or a done signal is received, then calls wg.Done().
    output := func(c <-chan int) {
        for n := range c {
            select {
            case out <- n:
            case <-done:
            }
        }
        wg.Done()
    }
    // ... the rest is unchanged ...
There is a problem with this approach: each downstream receiver needs to know how many upstream senders could potentially block, and must send a signal to each of them so they can return early.
Keeping track of these counts at all times is tedious and error prone.
We need a way to tell an unknown and unbounded number of goroutines to stop sending values downstream. In Go, we can do this by closing a channel, because a receive operation on a closed channel (<-ch) always returns immediately, yielding the zero value of the element type.
In other words, closing the done channel unblocks all of the senders at once. Closing a channel is effectively a broadcast signal to all of its receivers.
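As a quick illustration (not from the original article), receives on a closed channel never block, which is what makes the broadcast work:

done := make(chan struct{})
close(done)

// Any number of receivers can now proceed immediately;
// each receive yields the zero value struct{}{}.
<-done
<-done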
We pass the done channel as a parameter to every function in the pipeline and close it via a defer statement in main.
That way every function called from main receives the done signal, and every stage can exit cleanly. After refactoring main with done, the code looks like this:
func main() {
    // Set up a done channel shared by the whole pipeline.
    // It is closed when the pipeline exits; once every goroutine
    // receives the done signal, it exits cleanly.
    done := make(chan struct{})
    defer close(done)

    in := gen(done, 2, 3)

    // Distribute the sq work across two goroutines,
    // both reading from in.
    c1 := sq(done, in)
    c2 := sq(done, in)

    // Consume the first value produced by output.
    out := merge(done, c1, c2)
    fmt.Println(<-out) // 4 or 9

    // done is closed by the deferred call.
}
Now every stage of the pipeline can return as soon as the done channel is closed. The output code in merge can return without draining its inbound channel, because it knows that when done is closed, the upstream sender sq stops sending. Thanks to the defer statement, output guarantees that wg.Done() is called on every return path:
func merge(done <-chan struct{}, cs ...<-chan int) <-chan int {
    var wg sync.WaitGroup
    out := make(chan int)

    // Start an output goroutine for each channel in cs.
    // output copies values from c to out until c is closed
    // or done is closed, then calls wg.Done().
    output := func(c <-chan int) {
        defer wg.Done()
        for n := range c {
            select {
            case out <- n:
            case <-done:
                return
            }
        }
    }
    // ... the rest is unchanged ...
By the same principle, sq can also return as soon as done is closed. Thanks to its defer statement, sq guarantees that its out channel is closed on every return path. The code is as follows:
func sq(done <-chan struct{}, in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        defer close(out)
        for n := range in {
            select {
            case out <- n * n:
            case <-done:
                return
            }
        }
    }()
    return out
}
To sum up, here are the guidelines for building pipelines:
Each stage closes its outbound channels when all of its send operations are finished.
Each stage keeps receiving values from its inbound channels until those channels are closed or its senders are unblocked.
A pipeline unblocks senders in one of two ways:
Provide buffers large enough to hold all the values that will be sent.
Explicitly notify the senders when the receiver abandons the channel.
Conclusion
This article introduced some techniques for building data pipelines in Go. Error handling in such pipelines is tricky, because each stage may block while sending values downstream,
and downstream stages may stop caring about the data being sent from upstream. We showed how closing a channel can broadcast a "done" signal to every goroutine in the pipeline,
and laid out guidelines for building pipelines correctly.
In the next article, we will illustrate the concepts and techniques described here with a parallel MD5 example.
Original author Sameer Ajmani, translator Oscar
Next preview: Go concurrency model: parallel MD5 computation as an example (link to the English original).
RELATED LINKS
Original link: https://blog.golang.org/pipel ...
Go concurrency model: http://talks.golang.org/2012/...
Go advanced concurrency model: http://blog.golang.org/advanc ...