When I started learning Go, I had a hard time mastering the various APIs and techniques for reading files. In an attempt to capture my initial confusion, I wrote a multi-core word counting program ([kgrz/kwc](https://github.com/kgrz/kwc)) that uses a variety of file-reading methods in a single program. In this year's [Advent of Code](http://adventofcode.com/2017), some problems required different ways of reading the input. I ended up using each technique at least once, and this article writes up my understanding of them. The methods are listed in the order I tend to use them, not necessarily in order of difficulty.

## Some basic assumptions

* All code examples are wrapped in a `main()` function.
* Most of the time I will use "array" and "slice" interchangeably to refer to slices, but they are not the same thing. These two blog posts ([Go Slices: usage and internals](https://blog.golang.org/go-slices-usage-and-internals) and [Arrays, slices (and strings)](https://blog.golang.org/slices)) are great resources for understanding the difference.
* All the examples are uploaded to [kgrz/reading-files-in-go](https://github.com/kgrz/reading-files-in-go).

In Go, as in most low-level languages and some dynamic ones (such as Node), reading a file returns a stream of bytes. Not automatically converting what is read into a string has the advantage of avoiding expensive string allocations, which would increase pressure on the GC. To keep the conceptual model of this article simple, I will use `string(arrayOfBytes)` to convert a byte slice into a string. In general, though, this approach is not recommended in production.

### Reading the whole file into memory

First up, the standard library provides multiple functions and utilities for reading file data. Let's start with the basic case provided in the `os` package. This implies two prerequisites:

1. The file fits in memory.
2. We know the size of the file in advance, so that we can instantiate a buffer large enough to hold it.

With a handle to an `os.File` object, we can query its size and instantiate a slice of bytes:

```go
file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

fileinfo, err := file.Stat()
if err != nil {
	fmt.Println(err)
	return
}

filesize := fileinfo.Size()
buffer := make([]byte, filesize)

bytesread, err := file.Read(buffer)
if err != nil {
	fmt.Println(err)
	return
}

fmt.Println("bytes read:", bytesread)
fmt.Println("bytestream to string:", string(buffer))
```

[basic.go](https://github.com/kgrz/reading-files-in-go/blob/master/basic.go) on GitHub

### Reading a file in chunks

In most cases, reading the entire file at once is fine. Sometimes, though, we want a more memory-conservative approach: read a chunk of the file of a certain size, process it, and repeat until the whole file has been read. The following example uses a buffer size of 100 bytes.

```go
const BufferSize = 100

file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

buffer := make([]byte, BufferSize)

for {
	bytesread, err := file.Read(buffer)
	if err != nil {
		if err != io.EOF {
			fmt.Println(err)
		}
		break
	}

	fmt.Println("bytes read:", bytesread)
	fmt.Println("bytestream to string:", string(buffer[:bytesread]))
}
```

[reading-chunkwise.go](https://github.com/kgrz/reading-files-in-go/blob/master/reading-chunkwise.go) on GitHub

Compared with reading the entire file, the main differences are:

1. We keep reading until the `EOF` marker is reached, hence the specific check for `err == io.EOF`. If you're new to Go and confused about how errors are handled, check out this article by Rob Pike: [Errors are values](https://blog.golang.org/errors-are-values).
2. We define the buffer size, so we control the "chunk" size we want.
If used properly, this can improve performance, because the operating system caches the file that is being read.
3. If the file size is not an integer multiple of the buffer size, the last iteration only fills part of the buffer with the remaining bytes, hence the slice operation `buffer[:bytesread]`. Under normal circumstances, `bytesread` equals the buffer size.

This is similar to the following Ruby code:

```ruby
bufsize = 100
f = File.new "_config.yml", "r"

while readstring = f.read(bufsize)
  break if readstring.nil?
  puts readstring
end
```

On each loop iteration, the internal file pointer position is updated. On the next read, data is returned starting at that offset, up to the buffer size. This pointer is not a construct of the programming language; it is created by the operating system. On Linux, it lives in the file descriptor created by the OS. All the `read`/`Read` calls (in Ruby and Go, respectively) are translated internally into system calls and sent to the kernel, and the kernel manages this pointer.

### Reading file chunks concurrently

What if we want to speed up the chunked processing described above? One way is to use multiple goroutines! Compared with reading the chunks sequentially, we need one extra piece of bookkeeping: the offset each goroutine should read from. Note that `ReadAt` behaves slightly differently from `Read` when the buffer capacity is greater than the number of bytes left to read. Also note that I am not limiting the number of goroutines here; it is determined only by the buffer size. In practice, there is probably an upper limit.

```go
const BufferSize = 100

type chunk struct {
	bufsize int
	offset  int64
}

file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

fileinfo, err := file.Stat()
if err != nil {
	fmt.Println(err)
	return
}

filesize := int(fileinfo.Size())

// Number of goroutines we need to spawn.
concurrency := filesize / BufferSize

// Check for any left-over bytes. Add one more goroutine if required.
if remainder := filesize % BufferSize; remainder != 0 {
	concurrency++
}

// Compute each chunk's offset. The last chunk may have fewer than
// BufferSize bytes left to read; ReadAt handles that via io.EOF below.
chunksizes := make([]chunk, concurrency)
for i := 0; i < concurrency; i++ {
	chunksizes[i].bufsize = BufferSize
	chunksizes[i].offset = int64(BufferSize * i)
}

var wg sync.WaitGroup
wg.Add(concurrency)

for i := 0; i < concurrency; i++ {
	go func(chunksizes []chunk, i int) {
		defer wg.Done()

		chunk := chunksizes[i]
		buffer := make([]byte, chunk.bufsize)
		bytesread, err := file.ReadAt(buffer, chunk.offset)

		// As noted above, ReadAt differs slightly from Read when the
		// output buffer provided is larger than the data that's available
		// for reading. So, let's return early only if the error was
		// something other than an EOF. Returning early would run the
		// deferred function above.
		if err != nil && err != io.EOF {
			fmt.Println(err)
			return
		}

		fmt.Println("bytes read, string(bytestream):", bytesread)
		fmt.Println("bytestream to string:", string(buffer[:bytesread]))
	}(chunksizes, i)
}

wg.Wait()
```

[reading-chunkwise-multiple.go](https://github.com/kgrz/reading-files-in-go/blob/master/reading-chunkwise-multiple.go) on GitHub

This is considerably more involved than any previous method:

1. I create a specific number of goroutines, depending on the file size and the buffer size (100 in our case).
2. We need a way to "wait" for all the goroutines to finish. Here I use a `sync.WaitGroup`.
3. Each goroutine signals when it is done, instead of us waiting in an infinite loop. We call `wg.Done()` via `defer`, so it runs when the goroutine `return`s.

Note: always check the number of bytes returned, and re-slice the output buffer.

### Scanning

You can go a long way reading files with `Read()`, but sometimes you need something more convenient. Ruby has commonly used IO functions such as `each_line`, `each_char`, `each_codepoint`, and so on. We can achieve something similar using the `Scanner` type and the associated functions provided in the `bufio` package.
The `bufio.Scanner` type takes a "split" function as an argument and advances a pointer based on that function. For example, the built-in `bufio.ScanLines` split function advances the pointer on each iteration until the next newline character. At each step, the type exposes methods to get the byte slice/string between the start and end positions. For example:

```go
file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

scanner := bufio.NewScanner(file)
scanner.Split(bufio.ScanLines)

// Returns a boolean based on whether there's a next instance of the '\n'
// character in the IO stream. This step also advances the internal pointer
// to the next position (after '\n') if it did find that token.
read := scanner.Scan()

if read {
	fmt.Println("read byte array:", scanner.Bytes())
	fmt.Println("read string:", scanner.Text())
}

// goto Scan() line, and repeat.
```

[scanner-example.go](https://github.com/kgrz/reading-files-in-go/blob/master/scanner-example.go) on GitHub

So, to read the entire file line by line, we can use the following:

```go
file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

scanner := bufio.NewScanner(file)
scanner.Split(bufio.ScanLines)

// This is our buffer now.
var lines []string

for scanner.Scan() {
	lines = append(lines, scanner.Text())
}

fmt.Println("read lines:")
for _, line := range lines {
	fmt.Println(line)
}
```

[scanner.go](https://github.com/kgrz/reading-files-in-go/blob/master/scanner.go) on GitHub

### Scanning word by word

The `bufio` package contains several basic predefined split functions:

1. `ScanLines` (the default)
2. `ScanWords`
3. `ScanRunes` (very useful when working with UTF-8 text)
4. `ScanBytes`

So, to read a file, split it into words, and build a list of them, we can use something like:

```go
file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

scanner := bufio.NewScanner(file)
scanner.Split(bufio.ScanWords)

var words []string

for scanner.Scan() {
	words = append(words, scanner.Text())
}

fmt.Println("word list:")
for _, word := range words {
	fmt.Println(word)
}
```

The `ScanBytes` split function gives the same output as the earlier `Read()` example. One major difference between the two: in the scanner version, each time we dynamically append data to the byte/string slice. This can be circumvented with techniques such as pre-initializing the buffer and growing it only when the data length exceeds its size. Using the same example as above:

```go
file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

scanner := bufio.NewScanner(file)
scanner.Split(bufio.ScanWords)

// Initial size of our wordlist.
bufferSize := 50
words := make([]string, bufferSize)
pos := 0

for scanner.Scan() {
	if err := scanner.Err(); err != nil {
		// This error is a non-EOF error. End the iteration if we encounter
		// an error.
		fmt.Println(err)
		break
	}

	words[pos] = scanner.Text()
	pos++

	if pos >= len(words) {
		// Expand the slice by bufferSize again.
		newbuf := make([]string, bufferSize)
		words = append(words, newbuf...)
	}
}

fmt.Println("word list:")
// We are iterating only until the value of "pos" because the buffer size
// might be more than the number of words, since we increase the length by
// a constant value. Or the scanner loop might have terminated prematurely
// due to an error. In that case, "pos" contains the index of the last
// successful update.
for _, word := range words[:pos] {
	fmt.Println(word)
}
```

[scanner-word-list-grow.go](https://github.com/kgrz/reading-files-in-go/blob/master/scanner-word-list-grow.go) on GitHub

So we significantly reduce the number of slice "grow" operations, but depending on the buffer size and the file size, we may end up with empty slots at the end of the buffer. This is a trade-off.
### Splitting a long string into words

`bufio.NewScanner` takes a type satisfying the `io.Reader` interface as its argument, which means it works with any type that has a `Read` method. One of the string utility functions in the standard library, `strings.NewReader`, returns exactly such a reader type. We can combine the two to split a long string into words:

```go
longstring := "This is a very long string. Not."

var words []string

scanner := bufio.NewScanner(strings.NewReader(longstring))
scanner.Split(bufio.ScanWords)

for scanner.Scan() {
	words = append(words, scanner.Text())
}

fmt.Println("word list:")
for _, word := range words {
	fmt.Println(word)
}
```

### Scanning a comma-separated string

Manually parsing a CSV file/string with the basic `Read()` function or the `Scanner` type is cumbersome, because the `bufio.ScanWords` split function defines a "word" as a run of characters delimited by whitespace. Reading individual characters while keeping track of the buffer size and position (the kind of work lexers and parsers do) is too much effort.

This tedium can be avoided by defining a new split function that reads characters up to a comma and returns the token read so far when `Text()` or `Bytes()` is called. The signature of a `bufio.SplitFunc` looks like this:

```go
func(data []byte, atEOF bool) (advance int, token []byte, err error)
```

1. `data` is the input byte string.
2. `atEOF` is a flag indicating whether we have run out of input data.
3. `advance` determines how far to advance the pointer, based on how much of the input the current read consumed. This value is used to update the data pointer's position after the scan step completes.
4. `token` is the data resulting from the scan operation.
5. `err` lets you return an error.

For the sake of simplicity, I'm showing an example of reading a string rather than a file. A simple reader implementing the signature above, scanning a CSV string:

```go
csvstring := "name, age, occupation"

// An anonymous function declaration to avoid repeating main().
ScanCSV := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
	commaidx := bytes.IndexByte(data, ',')
	if commaidx > 0 {
		// We need to return the next position.
		buffer := data[:commaidx]
		return commaidx + 1, bytes.TrimSpace(buffer), nil
	}

	// If we are at the end of the string, just return the entire buffer.
	if atEOF {
		// But only do that when there is some data. If not, this might mean
		// that we've reached the end of our input CSV string.
		if len(data) > 0 {
			return len(data), bytes.TrimSpace(data), nil
		}
	}

	// When 0, nil, nil is returned, this is a signal to the interface to
	// read more data in from the input reader. In this case, that input is
	// our string reader, so this pretty much will never occur.
	return 0, nil, nil
}

scanner := bufio.NewScanner(strings.NewReader(csvstring))
scanner.Split(ScanCSV)

for scanner.Scan() {
	fmt.Println(scanner.Text())
}
```

### Ruby-style

We have seen multiple ways of reading files, roughly in order of convenience versus efficiency. But what if you just want to slurp a file into a buffer? `ioutil` is a standard library package with functions that do this in a single line.

### Reading the entire file

```go
bytes, err := ioutil.ReadFile("_config.yml")
if err != nil {
	log.Fatal(err)
}

fmt.Println("bytes read:", len(bytes))
fmt.Println("string read:", string(bytes))
```

This is closer to what we see in high-level scripting languages.

### Reading the entire directory of files

Needless to say, if you have large files, **do not run this script** :D

```go
filelist, err := ioutil.ReadDir(".")
if err != nil {
	log.Fatal(err)
}

for _, fileinfo := range filelist {
	if fileinfo.Mode().IsRegular() {
		bytes, err := ioutil.ReadFile(fileinfo.Name())
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("bytes read:", len(bytes))
		fmt.Println("string read:", string(bytes))
	}
}
```

### More helper functions

The standard library has many more functions for reading files (or, more accurately, from a `Reader`). To keep this article from getting even longer, here are some that I found:

1. `ioutil.ReadAll()` — takes an `io`-like object and returns the entire data as a byte slice.
2. `io.ReadFull()`
3. `io.ReadAtLeast()`
4. `io.MultiReader` — very useful for combining multiple `io`-like objects. If you have a list of files to read, you can treat them as a single contiguous block of data, without the complexity of managing the transition from one file to the next.

### Update

To keep the focus on the "read" functions, I chose to use an error-handling function that prints the error and closes the file:

```go
func handleFn(file *os.File) func(error) {
	return func(err error) {
		if err != nil {
			file.Close()
			log.Fatal(err)
		}
	}
}

// inside the main function:
file, err := os.Open("filetoread.txt")
handle := handleFn(file)
handle(err)
```

In doing so, I missed a crucial detail: I never closed the file handle when no error occurred and the program ran to completion. If the program ran repeatedly without any errors, it would leak file descriptors. This was pointed out by [u/shovelpost](https://www.reddit.com/r/golang/comments/7n2bee/various_ways_to_read_a_file_in_go/drzg32k/) on Reddit.

I had intended to avoid `defer` because `log.Fatal` internally calls `os.Exit`, which does not run deferred functions, so I chose to close the file explicitly — and overlooked the successful-run case. I have updated the examples to use `defer` and `return` instead of depending on `os.Exit`.
via: https://kgrz.io/reading-files-in-go-an-overview.html
Author: Kashyap Kondamudi. Translator: Althen. Proofreader: polaris1119.
This article was translated by GCTT and published by the Go Chinese Network (studygolang.com).
Translations are published for learning and exchange purposes only, under the terms of the CC-BY-NC-SA license. When reproducing, please keep the links to the original and the translation, as well as the author and translator information. If this work infringes your rights, please contact us promptly.