Asynchronously Splitting an io.Reader in Go


When working with any kind of stream data in Go, I have fallen deeply in love with the flexibility of io.Reader and io.Writer and cannot extricate myself. At the same time, I have suffered a bit over a challenge with the reader interface that might seem very simple: how to split a read.

I'm not even sure "split" is the right word. I simply want to consume what an io.Reader receives more than once, sometimes in parallel. But because a reader doesn't necessarily expose a Seek method to reset the read position, I need a way to duplicate it. Or should that be clone? Or fork?

The Situation

Suppose you have a web service that lets users upload a file, which the service then stores in the cloud. Before storing it, though, you need to do some light processing of the file. And for every such request, all you have to work with is an io.Reader.

Solutions

There is, of course, more than one way to handle this situation. Depending on the type of file, the throughput of the service, and how the file needs to be processed, some approaches will suit better than others. Below are five approaches of varying complexity and flexibility. There are surely more, but these make a good starting point.

Solution #1: The Simple bytes.Reader

If the source reader has no Seek method, why not implement one yourself? You can read the entire contents into a bytes.Reader and then seek back to the start as many times as you like:

func handleUpload(u io.Reader) (err error) {
    // capture all bytes from upload
    b, err := ioutil.ReadAll(u)
    if err != nil {
        return err
    }

    // wrap the bytes in a ReadSeeker
    r := bytes.NewReader(b)

    // process the metadata
    err = processMetaData(r)
    if err != nil {
        return err
    }

    // rewind to the beginning
    r.Seek(0, 0)

    // upload the data
    err = uploadFile(r)
    if err != nil {
        return err
    }

    return nil
}

If the data is small enough, this may be the most convenient choice; you could forget bytes.Reader entirely and work with the byte slice directly (see the sketch below). But suppose the file is large, a video or a RAW photo, say. These behemoths will devour your memory, especially if the service also sees high traffic. Not to mention, you can't perform the operations in parallel.
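As a minimal sketch of that byte-slice shortcut (processMetaDataBytes and uploadFileBytes are hypothetical []byte-based variants of the helpers above):

func handleSmallUpload(u io.Reader) error {
    b, err := ioutil.ReadAll(u)
    if err != nil {
        return err
    }

    // a slice can be re-read any number of times, so no Seek is needed
    if err := processMetaDataBytes(b); err != nil {
        return err
    }
    return uploadFileBytes(b)
}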

    • Pros: The simplest solution
    • Cons: Synchronous; doesn't scale to the many, very large files you expect

Solution #2: The Reliable File System

OK then, put the data in a file on disk (with the help of ioutil.TempFile) and avoid the pitfalls of holding it all in memory.

func handleUpload(u io.Reader) (err error) {
    // create a temporary file for the upload
    f, err := ioutil.TempFile("", "upload")
    if err != nil {
        return err
    }

    // destroy the file once done
    defer func() {
        n := f.Name()
        f.Close()
        os.Remove(n)
    }()

    // transfer the bytes to the file
    _, err = io.Copy(f, u)
    if err != nil {
        return err
    }

    // rewind the file
    f.Seek(0, 0)

    // upload the file
    err = uploadFile(f)
    if err != nil {
        return err
    }

    return nil
}

If you will ultimately store the file on the file system of the machine running the service, this approach may be the best option (though it does produce a real temporary file); here, though, we assume the file ends up in the cloud. Again, if the files are large, this incurs significant but unnecessary I/O. You also run the risk of file errors or a crash on the machine, so if your data is sensitive, I don't recommend this approach either.

    • Pros: Avoids holding the entire file in memory
    • Cons: Synchronous; potentially heavy I/O and disk usage; a single point of failure for the data

Solution #3: The Duct-Tape io.MultiReader

In some cases, the metadata you need lives in the first few bytes of the file. For example, identifying a JPEG only requires checking that the file's first two bytes are 0xFF 0xD8. This can be handled synchronously with io.MultiReader, which glues together a set of readers so they behave as one. Here's our JPEG example:

func handleUpload(u io.Reader) (err error) {
    // read in the first 2 bytes
    b := make([]byte, 2)
    _, err = u.Read(b)
    if err != nil {
        return err
    }

    // check that they match the JPEG header
    jpg := []byte{0xFF, 0xD8}
    if !bytes.Equal(b, jpg) {
        return errors.New("not a JPEG")
    }

    // glue those bytes back onto the reader
    r := io.MultiReader(bytes.NewReader(b), u)

    // upload the file
    err = uploadFile(r)
    if err != nil {
        return err
    }

    return nil
}

This is a great technique if you only intend to accept JPEG uploads. With just two bytes you can halt the transfer (note: "transfer" here means copying the file into memory or onto disk for processing, not the upload itself) without copying the entire file into memory or onto disk. You can also see where this method falls short: if you need to read more of the file to gather data, counting words, for example, that processing would block the upload, and it's ill-suited to intensive tasks. Finally, most third-party packages (and much of the standard library) will consume a reader entirely, preventing you from using io.MultiReader this way.

Another option is bufio.Reader.Peek. It does essentially the same thing but lets you skip the MultiReader, and it gives you access to the other useful methods on the buffered reader.
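Here's a minimal sketch of that variant, reusing the same hypothetical uploadFile helper as above:

func handleUpload(u io.Reader) (err error) {
    br := bufio.NewReader(u)

    // Peek returns the first 2 bytes without advancing the reader
    b, err := br.Peek(2)
    if err != nil {
        return err
    }

    // check that they match the JPEG header
    if !bytes.Equal(b, []byte{0xFF, 0xD8}) {
        return errors.New("not a JPEG")
    }

    // br still yields the entire stream, header included
    return uploadFile(br)
}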

    • Pros: Quick and dirty reads of the file header; can act as a gatekeeper for uploads
    • Cons: Unsuitable for indefinite-length reads, processing the whole file, intensive tasks, or many third-party packages

Solution #4: The Single-Split io.TeeReader and io.Pipe

Back to the big video files discussed earlier, let's change the storyline a little. Your users upload video in only one format, but you want your service to offer those videos in multiple formats. Say you have a third-party transcoder that takes an io.Reader of MP4 data and returns a reader of WEBM output. Your service uploads both the original MP4 and the transcoded WEBM to the cloud. The previous solutions must perform these steps synchronously; now you want them done in parallel.

Enter io.TeeReader, whose signature is func TeeReader(r Reader, w Writer) Reader. The documentation says: "TeeReader returns a Reader that writes to w what it reads from r." This is exactly what you need! Now, how do you make the data written to w readable? That's the job of io.Pipe, which establishes a connection between an io.PipeWriter and an io.PipeReader (a synchronous, first-in-first-out, in-memory pipe). Here's how the code looks:

func handleUpload(u io.Reader) (err error) {
    // create the pipe and tee reader
    pr, pw := io.Pipe()
    tr := io.TeeReader(u, pw)

    // create channels to synchronize
    done := make(chan bool)
    errs := make(chan error)
    defer close(done)
    defer close(errs)

    go func() {
        // close the PipeWriter after the
        // TeeReader completes to trigger EOF
        defer pw.Close()

        // upload the original MP4 data
        err := uploadFile(tr)
        if err != nil {
            errs <- err
            return
        }
        done <- true
    }()

    go func() {
        // transcode to WebM
        webmr, err := transcode(pr)
        if err != nil {
            errs <- err
            return
        }

        // upload to storage
        err = uploadFile(webmr)
        if err != nil {
            errs <- err
            return
        }
        done <- true
    }()

    // wait until both are done
    // or an error occurs
    for c := 0; c < 2; {
        select {
        case err := <-errs:
            return err
        case <-done:
            c++
        }
    }

    return nil
}

Because the uploader consumes tr, the transcoder receives and processes the same data while it is being stored. Everything happens without an extra buffer and in parallel. Note that goroutines run the two code paths: io.Pipe blocks until something writes to it and something reads from it. Trying to use the same io.Pipe from a single goroutine yields fatal error: all goroutines are asleep - deadlock! and a panic. Another thing to watch with pipes is triggering EOF by closing the io.PipeWriter at the right time; here, it must be closed after the TeeReader finishes.
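To see that blocking behavior in isolation, here's a tiny sketch separate from the upload code (the fmt and ioutil imports are assumed):

func pipeDemo() {
    pr, pw := io.Pipe()

    // the write side must run in its own goroutine:
    // each write blocks until the read side consumes it
    go func() {
        // closing the writer is what delivers EOF to the reader
        defer pw.Close()
        pw.Write([]byte("hello"))
    }()

    b, _ := ioutil.ReadAll(pr) // returns once pw is closed
    fmt.Println(string(b))     // prints "hello"
}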

This example also uses channels to communicate "doneness" and errors between the goroutines. If you expect more meaningful values to come back from the tasks, replace chan bool with a more appropriate type.
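For instance, if each task should report where its file ended up, a sketch might look like this (uploadResult and its fields are invented for illustration; the log import is assumed):

// a hypothetical replacement for the bare bool, letting each
// task report what it stored and where
type uploadResult struct {
    format string // e.g. "mp4" or "webm"
    url    string
}

// waitForUploads drains n results, stopping at the first error
func waitForUploads(n int, done <-chan uploadResult, errs <-chan error) error {
    for c := 0; c < n; c++ {
        select {
        case err := <-errs:
            return err
        case res := <-done:
            log.Printf("stored %s at %s", res.format, res.url)
        }
    }
    return nil
}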

    • Pros: Completely independent, parallel processing of the same data stream
    • Cons: Goroutines and channels add complexity

Solution #5: The Multi-Split io.MultiWriter and io.Copy

io.TeeReader solves the problem nicely when only one other consumer of the stream exists. But as the service takes on more parallel tasks (converting to still more formats, say), stacking tees bloats the code. Enter io.MultiWriter: "a writer that duplicates its writes to all the provided writers." This approach uses pipes to propagate the data just as the previous one does, but instead of io.TeeReader it uses io.Copy to fan the data out to all the pipes. Sample code:

func handleUpload(u io.Reader) (err error) {
    // create the pipes
    mp4R, mp4W := io.Pipe()
    webmR, webmW := io.Pipe()
    oggR, oggW := io.Pipe()
    wavR, wavW := io.Pipe()

    // create channels to synchronize
    done := make(chan bool)
    errs := make(chan error)
    defer close(done)
    defer close(errs)

    // spawn all the task goroutines. These look identical to
    // the TeeReader example, but pulled out into separate
    // methods for clarity
    go uploadMP4(mp4R, done, errs)
    go transcodeAndUploadWebM(webmR, done, errs)
    go transcodeAndUploadOgg(oggR, done, errs)
    go transcodeAndUploadWav(wavR, done, errs)

    go func() {
        // after completing the copy, we need to close
        // the PipeWriters to propagate the EOF to all
        // PipeReaders and avoid deadlock
        defer mp4W.Close()
        defer webmW.Close()
        defer oggW.Close()
        defer wavW.Close()

        // build the multiwriter for all the pipes
        mw := io.MultiWriter(mp4W, webmW, oggW, wavW)

        // copy the data into the multiwriter
        _, err := io.Copy(mw, u)
        if err != nil {
            errs <- err
        }
    }()

    // wait until all are done
    // or an error occurs
    for c := 0; c < 4; c++ {
        select {
        case err := <-errs:
            return err
        case <-done:
        }
    }

    return nil
}

This approach resembles the previous one but is noticeably cleaner when the data must be cloned several times. Because pipes are involved, goroutines and synchronizing channels are again needed to avoid deadlock, and we close all the pipe writers once the copy completes.

    • Pros: Can fork as many copies of the raw data as needed
    • Cons: Leans even more heavily on goroutines and channels for coordination

What About Channels?

Channels are one of the unique and powerful concurrency tools Go offers, a bridge between goroutines that provides both communication and synchronization. You can create buffered or unbuffered channels for sharing data. So why haven't I offered a solution that leverages channels for more than synchronization?

Look at the standard library's top-level packages and you'll find that channels rarely appear in function signatures:

    • time: for select timeouts
    • reflect: because reflection
    • fmt: for formatting a channel as a pointer
    • builtin: to expose the close function

io.Pipe itself forgoes channels in its implementation, using a sync.Mutex to move data safely between the reader and the writer. I suspect this is because channels didn't perform as well, so a mutex won out here.

When developing a reusable package, I would, like the standard library, avoid channels in the public API and use them internally for synchronization. If the complexity is low enough, replacing the channel with a mutex may even be preferable. That said, within an application, channels are a nicer abstraction: easier to reason about than locks and more flexible.

Parting Thoughts

These are just a handful of the ways to consume the data from an io.Reader more than once; there are undoubtedly more. Go's implicit interfaces, together with a standard library that uses them extensively, allow creative composition of different components without worrying about the bytes underneath. I hope some of this exploration is as helpful to you as it was to me.
