Go Concurrency Model: Computing MD5 Sums in Parallel as an Example

Source: Internet
Author: User

Introduction

Go's concurrency primitives let developers build data pipelines in a style similar to Unix pipes, making efficient use of both I/O and multicore CPUs.

This article walks through some examples of such pipelines, with particular attention to error handling inside a pipeline.
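To make the Unix pipe analogy concrete, here is a minimal sketch of a two-stage pipeline. The stage names gen and sq are hypothetical and do not appear in the md5sum programs discussed later; each stage simply owns its output channel and closes it when it has nothing more to send.

```go
package main

import "fmt"

// gen is a hypothetical source stage: it emits the given numbers and then
// closes its output channel.
func gen(nums ...int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for _, n := range nums {
			out <- n
		}
	}()
	return out
}

// sq is a hypothetical middle stage: it squares every value it receives and
// closes its output once the input is exhausted.
func sq(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for n := range in {
			out <- n * n
		}
	}()
	return out
}

func main() {
	// Stages compose much like commands joined by "|" in a shell.
	for v := range sq(gen(1, 2, 3)) {
		fmt.Println(v)
	}
}
```

Because every stage closes its own output, the final range loop ends naturally once all values have flowed through.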

Reading suggestions

This article is the second part of "Go language concurrency Model: using Channel as UNIX pipe", and it focuses on practice. If you are already familiar with channels, you can read it on its own.
If you are not yet familiar with channels and the go keyword, read the first part first.

The example used throughout is computing the MD5 sums of a batch of files, that is, reimplementing the md5sum command found on Linux.
We start with a single-threaded version of md5sum and then move step by step to a basic concurrent version and an advanced one.

Most of the explanation in this article is built around code. All three implementations of md5sum can be downloaded from the related links at the end of the article.

Single-threaded version of md5sum

MD5 is a hash algorithm widely used for file checksums. The md5sum command on Linux prints the MD5 sums of a set of files. It is used as follows:

    % md5sum *.go
    c33237079343a4d567a2a29df0b8e46e  bounded.go
    a7e3771f2ed58d4b34a73566d93ce63a  parallel.go
    1dc687202696d650594aaac56d579179  serial.go

Our sample program is similar to md5sum, but it takes a directory as its argument and prints the MD5 sum of every file in it, sorted by path.
The following example prints the MD5 sums of all files in the current directory:

    % go run serial.go .
    c33237079343a4d567a2a29df0b8e46e  bounded.go
    a7e3771f2ed58d4b34a73566d93ce63a  parallel.go
    1dc687202696d650594aaac56d579179  serial.go

The program's main function calls the helper function MD5All, which returns a map from path name to MD5 sum. main then sorts the paths and prints the results:

    func main() {
        // Calculate the MD5 sum of all files under the specified directory,
        // then print the results sorted by path name.
        m, err := MD5All(os.Args[1])
        if err != nil {
            fmt.Println(err)
            return
        }
        var paths []string
        for path := range m {
            paths = append(paths, path)
        }
        sort.Strings(paths)
        for _, path := range paths {
            fmt.Printf("%x  %s\n", m[path], path)
        }
    }

The function MD5All is the focus of this article. The implementation in serial.go uses no concurrency: it simply reads and hashes, one at a time, each file that filepath.Walk visits. The code is as follows:

    // MD5All reads all the files in the file tree rooted at root and returns
    // a map from file path to the MD5 sum of the file's contents.
    // If the directory walk fails or any ioutil.ReadFile call fails,
    // MD5All returns an error.
    func MD5All(root string) (map[string][md5.Size]byte, error) {
        m := make(map[string][md5.Size]byte)
        err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
            if err != nil {
                return err
            }
            if !info.Mode().IsRegular() {
                return nil
            }
            data, err := ioutil.ReadFile(path)
            if err != nil {
                return err
            }
            m[path] = md5.Sum(data)
            return nil
        })
        if err != nil {
            return nil, err
        }
        return m, nil
    }

In the code above, filepath.Walk takes two arguments: the root path and a function value.
Any function with the signature func(string, os.FileInfo, error) error can be passed as the second argument to filepath.Walk.

The single-threaded version of md5sum is available for download as serial.go.

Concurrent version of md5sum

The code for the concurrent version of md5sum is available for download as parallel.go.

In this version we split MD5All into a two-stage pipeline.
The first stage is sumFiles, which walks the file tree, computes the MD5 sum of each file in a new goroutine, and sends each result on a channel of type result.
The result type is defined as follows:

    type result struct {
        path string
        sum  [md5.Size]byte
        err  error
    }

sumFiles returns two channels: one carries the MD5 results, and the other carries the error returned by filepath.Walk.
The walk function starts a goroutine for each regular file and then checks the done channel. If done has been closed, the walk stops immediately. The code is as follows:

    func sumFiles(done <-chan struct{}, root string) (<-chan result, <-chan error) {
        // For each regular file, start a goroutine that computes the file's
        // MD5 sum and sends the result on c.
        // The error returned by Walk is sent on errc.
        c := make(chan result)
        errc := make(chan error, 1)
        go func() {
            var wg sync.WaitGroup
            err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
                if err != nil {
                    return err
                }
                if !info.Mode().IsRegular() {
                    return nil
                }
                wg.Add(1)
                go func() {
                    data, err := ioutil.ReadFile(path)
                    select {
                    case c <- result{path, md5.Sum(data), err}:
                    case <-done:
                    }
                    wg.Done()
                }()
                // Abort the walk if the done channel has been closed.
                select {
                case <-done:
                    return errors.New("walk canceled")
                default:
                    return nil
                }
            })
            // Walk has returned, so all calls to wg.Add are finished.
            // Start a goroutine that closes c once all sends are done.
            go func() {
                wg.Wait()
                close(c)
            }()
            // No select is needed here, because errc is buffered.
            errc <- err
        }()
        return c, errc
    }

MD5All receives the MD5 values from c. When it encounters an error it returns early, and the deferred close(done) closes the done channel:

    func MD5All(root string) (map[string][md5.Size]byte, error) {
        // MD5All closes the done channel when it returns; it may do so
        // before receiving all the values from c and errc.
        done := make(chan struct{})
        defer close(done)

        c, errc := sumFiles(done, root)

        m := make(map[string][md5.Size]byte)
        for r := range c {
            if r.err != nil {
                return nil, r.err
            }
            m[r.path] = r.sum
        }
        if err := <-errc; err != nil {
            return nil, err
        }
        return m, nil
    }
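The effect of closing done on an early return can be observed in isolation. The sketch below is a simplified stand-in for the situation above, not code from parallel.go: several goroutines try to send on an unbuffered channel, the receiver takes a single value and gives up, and closing done releases every remaining sender. The function name releaseAll is hypothetical, and it assumes n is at least 1.

```go
package main

import (
	"fmt"
	"sync"
)

// releaseAll starts n senders on an unbuffered channel, receives one value,
// then closes done. It returns how many senders exited. n must be >= 1.
func releaseAll(n int) int {
	done := make(chan struct{})
	c := make(chan int)

	var wg sync.WaitGroup
	var mu sync.Mutex
	exited := 0

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			select {
			case c <- id: // delivered to the receiver
			case <-done: // receiver gave up; stop instead of blocking forever
			}
			mu.Lock()
			exited++
			mu.Unlock()
		}(i)
	}

	<-c         // take one value, as if returning early on an error
	close(done) // a receive on a closed channel never blocks
	wg.Wait()   // returns only if every sender was released
	return exited
}

func main() {
	fmt.Println("senders that exited:", releaseAll(5))
}
```

Without the done case in the select, the senders that never get a receiver would block forever and leak.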

Limiting concurrency

The concurrent MD5All in parallel.go starts a goroutine for every file.
If a directory contains many large files, the program may run out of memory (OOM).

We limit memory use by bounding the number of files that are read concurrently; see bounded.go for this version of md5sum.
To enforce the bound, we create a fixed number of goroutines for reading files.
The pipeline now has three stages: walking the files and directories, reading files and computing their MD5 sums, and collecting the results.
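Before looking at the stages themselves, the bounding idea can be checked with a small experiment. This is a sketch with made-up names (runBounded is not part of bounded.go), and a short sleep stands in for reading a file: a fixed number of workers drain one shared channel, so the number of jobs in flight can never exceed the number of workers.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runBounded processes numJobs jobs with numWorkers goroutines. It returns
// the number of jobs processed and the highest number in flight at once.
func runBounded(numWorkers, numJobs int) (processed, peak int) {
	jobs := make(chan int)
	var wg sync.WaitGroup
	var mu sync.Mutex
	inFlight := 0

	wg.Add(numWorkers)
	for i := 0; i < numWorkers; i++ {
		go func() {
			defer wg.Done()
			for range jobs {
				mu.Lock()
				inFlight++
				if inFlight > peak {
					peak = inFlight
				}
				mu.Unlock()

				time.Sleep(time.Millisecond) // stand-in for reading a file

				mu.Lock()
				inFlight--
				processed++
				mu.Unlock()
			}
		}()
	}

	for j := 0; j < numJobs; j++ {
		jobs <- j
	}
	close(jobs) // lets every worker's range loop finish
	wg.Wait()
	return processed, peak
}

func main() {
	processed, peak := runBounded(4, 50)
	fmt.Println("processed:", processed, "peak in flight:", peak)
}
```

With one goroutine per job, the peak would instead grow with the number of jobs, which is the memory problem described above.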

The first stage is walkFiles, which emits the path of every regular file in the directory tree. The code is as follows:

    func walkFiles(done <-chan struct{}, root string) (<-chan string, <-chan error) {
        paths := make(chan string)
        errc := make(chan error, 1)
        go func() {
            // Close the paths channel after Walk returns.
            defer close(paths)
            // No select is needed for this send, because errc is buffered.
            errc <- filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
                if err != nil {
                    return err
                }
                if !info.Mode().IsRegular() {
                    return nil
                }
                select {
                case paths <- path:
                case <-done:
                    return errors.New("walk canceled")
                }
                return nil
            })
        }()
        return paths, errc
    }

The second stage starts a fixed number of digester goroutines. Each digester reads file names from the paths channel and sends its results on c. The code is as follows:

    func digester(done <-chan struct{}, paths <-chan string, c chan<- result) {
        for path := range paths {
            data, err := ioutil.ReadFile(path)
            select {
            case c <- result{path, md5.Sum(data), err}:
            case <-done:
                return
            }
        }
    }

Unlike the previous example, digester does not close its output channel c, because several digesters share that channel.
The close is performed in MD5All instead: once all digesters have finished, MD5All closes the channel. The code is as follows:

    // Start a fixed number of goroutines to read and digest files.
    c := make(chan result)
    var wg sync.WaitGroup
    const numDigesters = 20
    wg.Add(numDigesters)
    for i := 0; i < numDigesters; i++ {
        go func() {
            digester(done, paths, c)
            wg.Done()
        }()
    }
    go func() {
        wg.Wait()
        close(c)
    }()

Alternatively, each digester could create and return its own output channel. In that case we would need additional goroutines to merge (fan in) the results.

The third stage receives all the results from channel c and then reads and checks the error from channel errc.
This check cannot happen any earlier than the end of the loop over c, because until then walkFiles may still be blocked sending values downstream. The code is as follows:

    // ... some code omitted ...
    m := make(map[string][md5.Size]byte)
    for r := range c {
        if r.err != nil {
            return nil, r.err
        }
        m[r.path] = r.sum
    }
    // Check whether the Walk failed.
    if err := <-errc; err != nil {
        return nil, err
    }

That concludes, for now, our look at Go's concurrency model and at controlling concurrency with the built-in channel type and the go keyword.
Go 1.7, released recently, added context support widely across the standard library for better control over concurrency and timeouts.
Before that, the golang.org/x/net/context package had long been available. In the next installment we will discuss the context package and where it applies.

Related links

    1. Original link
    2. serial.go
    3. parallel.go
    4. bounded.go
    5. golang.org/x/net/context
