Reading files in Go: an overview

Tags: parse, csv, file, readfile
December 30, 2017 (updated January 1, 2018: there is an [update](http://kgrz.io/reading-files-in-go-an-overview.html#update) at the end of the article).

When I started learning Go, I found it hard to get comfortable with the various APIs for working with files. What confused me when I tried to write a multi-core word counter ([kgrz/kwc](https://github.com/kgrz/kwc)) was the number of different ways to operate on the same file. In this year's [Advent of Code](http://adventofcode.com/2017/), some of the problems required reading the input in several different ways, and I ended up using each method at least once. I now have a much clearer picture of these techniques, and I'm recording them in this post, listed in the order I ran into them rather than from easy to hard:

* Reading byte by byte
  * Reading the whole file into memory
  * Reading a file in chunks
  * Reading chunks of a file in parallel
* Scanning
  * Scanning word by word
  * Splitting a long string into words
  * Scanning a comma-separated string
* Ruby style
  * Reading the whole file
  * Reading all the files in a directory
* More helper methods
* Update

## Some basic assumptions

* All the code examples are wrapped in a `main()` function.
* I mostly use the words "array" and "slice" interchangeably to refer to slices, even though they are different things. These two articles ([one](https://blog.golang.org/go-slices-usage-and-internals), [two](https://blog.golang.org/slices)) explain the difference well.
* All the sample code is uploaded to [kgrz/reading-files-in-go](https://github.com/kgrz/reading-files-in-go).

In Go, as in most low-level languages and some dynamic languages such as Node, a read returns a byte stream. The reason a string is not returned automatically is to avoid expensive string allocations, which would increase pressure on the garbage collector.
To keep this article readable, I'll use `string()` to convert the byte slices to strings in the examples below, but this isn't recommended in production code.

## Reading byte by byte

### Reading the whole file into memory

The standard library provides numerous functions and utilities for reading file data. Let's start with the basic case, covered by the `os` package. This approach has two prerequisites:

1. The file has to fit in memory.
2. We need to know the size of the file in advance, in order to instantiate a buffer large enough to hold it.

Once we have a handle on the `os.File` object, we can query its size and instantiate a byte slice:

```go
file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

fileinfo, err := file.Stat()
if err != nil {
	fmt.Println(err)
	return
}

filesize := fileinfo.Size()
buffer := make([]byte, filesize)

bytesread, err := file.Read(buffer)
if err != nil {
	fmt.Println(err)
	return
}

fmt.Println("bytes read: ", bytesread)
fmt.Println("bytestream to string: ", string(buffer))
```

View the source on GitHub: [basic.go](https://github.com/kgrz/reading-files-in-go/blob/master/basic.go)

### Reading a file in chunks

Reading the whole file into memory works most of the time, but sometimes we want a more conservative memory strategy: read a chunk of a given size, process it, and repeat until the end of the file. The example below uses a 100-byte buffer:

```go
const BufferSize = 100

file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

buffer := make([]byte, BufferSize)

for {
	bytesread, err := file.Read(buffer)
	if err != nil {
		if err != io.EOF {
			fmt.Println(err)
		}
		break
	}
	fmt.Println("bytes read: ", bytesread)
	fmt.Println("bytestream to string: ", string(buffer[:bytesread]))
}
```

View the source on GitHub: [reading-chunkwise.go](https://github.com/kgrz/reading-files-in-go/blob/master/reading-chunkwise.go)

Compared to reading the entire file, the differences are:

1. We read until the EOF marker is hit, hence the specific check for `err == io.EOF`. If you're new to Go and confused about how errors work here, this article by Rob Pike might help: [Errors are values](https://blog.golang.org/errors-are-values).
2. We define the buffer size, so we control how large each read is. Set appropriately, this can improve performance because of the way the operating system works ([caching a file that's being read](http://www.tldp.org/LDP/sag/html/buffer-cache.html)).
3. If the file size is not an integer multiple of the buffer size, the last iteration reads only the remaining bytes into the buffer, hence the `buffer[:bytesread]`. In the normal case, `bytesread` equals the buffer size.

This is very similar to the following Ruby code:

```ruby
bufsize = 100
f = File.new "_config.yml", "r"

while readstring = f.read(bufsize)
  break if readstring.nil?
  puts readstring
end
```

On each iteration of the loop, the internal file pointer is advanced. The next read starts at that offset and returns up to a buffer's worth of data. This pointer is not a construct of the programming language, but of the operating system; on Linux, it is an attribute of the file descriptor that gets created. All the read calls (in Ruby and in Go alike) are translated internally into system calls and sent to the kernel, and the kernel manages this pointer.

### Reading chunks of a file in parallel

What if we want to speed up the chunked reads? One way is to use multiple goroutines. Compared to reading the file sequentially, we then need to know the offset of each chunk.
Note that when the remaining data is smaller than the buffer, `ReadAt` behaves [slightly differently](https://golang.org/pkg/io/#ReaderAt) from `Read`. Also, I'm not putting an upper limit on the number of goroutines here; it is simply determined by the file size and the buffer size. In practical applications, you would usually cap the number of goroutines.

```go
const BufferSize = 100

type chunk struct {
	bufsize int
	offset  int64
}

file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

fileinfo, err := file.Stat()
if err != nil {
	fmt.Println(err)
	return
}

filesize := int(fileinfo.Size())

// Number of goroutines we need to spawn.
concurrency := filesize / BufferSize

// Add one more goroutine if there are leftover bytes.
if remainder := filesize % BufferSize; remainder != 0 {
	concurrency++
}

chunksizes := make([]chunk, concurrency)

// All but (possibly) the last goroutine read BufferSize bytes.
for i := 0; i < concurrency; i++ {
	chunksizes[i].bufsize = BufferSize
	chunksizes[i].offset = int64(BufferSize * i)
}

// The last chunk may be smaller than BufferSize.
if remainder := filesize % BufferSize; remainder != 0 {
	chunksizes[concurrency-1].bufsize = remainder
}

var wg sync.WaitGroup
wg.Add(concurrency)

for i := 0; i < concurrency; i++ {
	go func(chunksizes []chunk, i int) {
		defer wg.Done()

		chunk := chunksizes[i]
		buffer := make([]byte, chunk.bufsize)
		bytesread, err := file.ReadAt(buffer, chunk.offset)

		// As described above, ReadAt differs slightly from Read when the
		// output buffer is larger than the data left to read, so we return
		// early from the function on any non-EOF error. The deferred
		// wg.Done() still runs before the goroutine exits.
		if err != nil && err != io.EOF {
			fmt.Println(err)
			return
		}

		fmt.Println("bytes read, string(bytestream): ", bytesread)
		fmt.Println("bytestream to string: ", string(buffer[:bytesread]))
	}(chunksizes, i)
}

wg.Wait()
```

View the source on GitHub: [reading-chunkwise-multiple.go](https://github.com/kgrz/reading-files-in-go/blob/master/reading-chunkwise-multiple.go)

This requires a bit more thought than the previous methods:

1. I create a specific number of goroutines, depending on the size of the file and the size of the buffer (100 in our case).
2. We need a way to know when all the goroutines have finished; in this example, we use a wait group.
3. Instead of using `break` to jump out of a loop, we signal the end of each goroutine: because `wg.Done()` is deferred, it is called every time a goroutine "returns".

Note: always check the number of bytes returned, and reslice the output buffer.

## Scanning

You can go a long way with `Read()`, but sometimes you want more convenient methods, like the `each_line`, `each_char`, `each_codepoint` IO functions that are used all the time in Ruby. We can achieve something similar with the `Scanner` type and the associated functions in the `bufio` package.

The `bufio.Scanner` type implements functions that take a "split" function and advance a pointer based on that function. For example, the built-in `bufio.ScanLines` split function advances the pointer, on each iteration, to the first character of the next line. At each step, the type exposes methods to obtain the byte slice/string between the start and end positions. For example:

```go
file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

scanner := bufio.NewScanner(file)
scanner.Split(bufio.ScanLines)

// Returns a boolean based on whether there's a next instance of '\n' in
// the IO stream. If it finds one, this step also advances the internal
// pointer to the next position (just past the '\n').
read := scanner.Scan()

if read {
	fmt.Println("read byte array: ", scanner.Bytes())
	fmt.Println("read string: ", scanner.Text())
}
// go back to the Scan() line, and repeat
```

View the source on GitHub: [scanner-example.go](https://github.com/kgrz/reading-files-in-go/blob/master/scanner-example.go)

So, to read a whole file line by line, you can do this:

```go
file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

scanner := bufio.NewScanner(file)
scanner.Split(bufio.ScanLines)

// This is our buffer now
var lines []string

for scanner.Scan() {
	lines = append(lines, scanner.Text())
}

fmt.Println("read lines:")
for _, line := range lines {
	fmt.Println(line)
}
```

View the source on GitHub: [scanner.go](https://github.com/kgrz/reading-files-in-go/blob/master/scanner.go)

### Scanning word by word

The `bufio` package contains some basic predefined split functions:

1. `ScanLines` (the default)
2. `ScanWords`
3. `ScanRunes` (useful for iterating over UTF-8 code points rather than bytes)
4. `ScanBytes`

To get a list of the words in a file, you can do this:

```go
file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

scanner := bufio.NewScanner(file)
scanner.Split(bufio.ScanWords)

var words []string

for scanner.Scan() {
	words = append(words, scanner.Text())
}

fmt.Println("word list:")
for _, word := range words {
	fmt.Println(word)
}
```

The `ScanBytes` split function gives the same output as the earlier `Read()` example. One major difference between the two: in the scanner loop, there is a dynamic allocation every time we append to the byte/string slice. We can work around that with techniques such as pre-initializing the buffer to a fixed size and growing it only when the limit is reached. For example:

```go
file, err := os.Open("filetoread.txt")
if err != nil {
	fmt.Println(err)
	return
}
defer file.Close()

scanner := bufio.NewScanner(file)
scanner.Split(bufio.ScanWords)

// initialize our word list
bufferSize := 50
words := make([]string, bufferSize)
pos := 0

for scanner.Scan() {
	if err := scanner.Err(); err != nil {
		// This is a non-EOF error. End the loop if we encounter one.
		fmt.Println(err)
		break
	}

	words[pos] = scanner.Text()
	pos++

	if pos >= len(words) {
		// expand the buffer by the same increment again
		newbuf := make([]string, bufferSize)
		words = append(words, newbuf...)
	}
}

fmt.Println("word list:")
// Because we grow the buffer in fixed increments, its length may be larger
// than the actual number of words, so we iterate only up to "pos". The
// scanner loop may also have terminated early on an error; in that case,
// "pos" holds the index of the last successful update.
for _, word := range words[:pos] {
	fmt.Println(word)
}
```

View the source on GitHub: [scanner-word-list-grow.go](https://github.com/kgrz/reading-files-in-go/blob/master/scanner-word-list-grow.go)

We end up doing far fewer "grow" operations, but depending on `bufferSize` we may be left with some empty slots at the end; this is a trade-off.

### Splitting a long string into words

`bufio.NewScanner` takes as its argument anything that satisfies the `io.Reader` interface, that is, any type that has a `Read` method. The `strings.NewReader` utility function in the standard library returns such a "reader" over a string. We can combine the two:

```go
longstring := "This is a very long string. Not."

var words []string

scanner := bufio.NewScanner(strings.NewReader(longstring))
scanner.Split(bufio.ScanWords)

for scanner.Scan() {
	words = append(words, scanner.Text())
}

fmt.Println("word list:")
for _, word := range words {
	fmt.Println(word)
}
```

### Scanning a comma-separated string

Parsing a CSV file/string with the basic `Read()` functions or the `Scanner` type would be cumbersome: a "word" according to the `bufio.ScanWords` split function is a run of runes delimited by Unicode spaces. Reading individual runes while keeping track of the buffer size and position (like what's done in lexing/parsing) is too much work.
Of course, this can be avoided. We can define a new split function that reads characters up to a comma and then returns that chunk when `Text()` or `Bytes()` is called. The signature of a `bufio.SplitFunc` function looks like this:

```go
func(data []byte, atEOF bool) (advance int, token []byte, err error)
```

1. `data` is the input byte string.
2. `atEOF` is a flag passed to the function indicating the end of the input.
3. `advance` lets us specify the number of bytes consumed by the current read; this value is used to update the cursor position when the scan loop finishes.
4. `token` is the actual data of the scan operation.
5. `err` is for any problems you may want to surface.

For simplicity, I'll demonstrate reading a string rather than a file. A simple CSV reader using the signature above:

```go
csvstring := "name, age, occupation"

// An anonymous function declaration, to avoid repeating main()
ScanCSV := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
	commaidx := bytes.IndexByte(data, ',')
	if commaidx > 0 {
		// we need to return the next position
		buffer := data[:commaidx]
		return commaidx + 1, bytes.TrimSpace(buffer), nil
	}

	// if we are at the end of the string, just return the entire buffer
	if atEOF {
		// but only do that when there is some data. If not, this might mean
		// we've reached the end of our input CSV string
		if len(data) > 0 {
			return len(data), bytes.TrimSpace(data), nil
		}
	}

	// returning 0, nil, nil is a signal to the interface to read more data
	// from the input reader. In this case, the input is a string reader,
	// so this branch is unlikely to be hit.
	return 0, nil, nil
}

scanner := bufio.NewScanner(strings.NewReader(csvstring))
scanner.Split(ScanCSV)

for scanner.Scan() {
	fmt.Println(scanner.Text())
}
```

View the source on GitHub: [comma-separated-string.go](https://github.com/kgrz/reading-files-in-go/blob/master/comma-separated-string.go#L10)

## Ruby style

We have covered many ways of reading files, in increasing order of convenience and added functionality. What if we just want to read a file into a buffer? `ioutil` in the standard library has some even simpler functions.

### Reading the whole file

```go
bytes, err := ioutil.ReadFile("_config.yml")
if err != nil {
	log.Fatal(err)
}

fmt.Println("Bytes read: ", len(bytes))
fmt.Println("String read: ", string(bytes))
```

This looks closer to what you'd write in a high-level scripting language.

### Reading all the files in a directory

Needless to say: do *not* run this on a directory with very large files :D

```go
filelist, err := ioutil.ReadDir(".")
if err != nil {
	log.Fatal(err)
}

for _, fileinfo := range filelist {
	if fileinfo.Mode().IsRegular() {
		bytes, err := ioutil.ReadFile(fileinfo.Name())
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("Bytes read: ", len(bytes))
		fmt.Println("String read: ", string(bytes))
	}
}
```

## More helper methods

There are many more functions in the standard library for reading files (or, more precisely, readers). To keep this already lengthy article from getting longer, I'll just list some of the ones I found:

1. `ioutil.ReadAll()`: takes an io-like object and returns the whole thing as a byte slice
2. `io.ReadFull()`
3. `io.ReadAtLeast()`
4. `io.MultiReader`: a very useful primitive for combining multiple io-like objects. You can treat a list of files as a single contiguous block of data, without the complexity of switching to the next file object after the previous one ends.

## Update

In my attempt to keep the focus on the `Read` functions, I had chosen this error-handling function to print errors and close the file:

```go
func handleFn(file *os.File) func(error) {
	return func(err error) {
		if err != nil {
			file.Close()
			log.Fatal(err)
		}
	}
}

// inside the main function:
file, err := os.Open("filetoread.txt")
handle := handleFn(file)
handle(err)
```

This misses an important detail: if no error occurs and the program runs to completion, the file is never closed. If the program runs repeatedly without hitting any errors, it will leak file descriptors. This was pointed out [on Reddit by u/shovelpost](https://www.reddit.com/r/golang/comments/7n2bee/various_ways_to_read_a_file_in_go/drzg32k/).

I had wanted to avoid `defer` because `log.Fatal` internally calls `os.Exit`, which does not run deferred functions, so I opted to close the file explicitly, but then ignored the happy path. I have since changed the examples to use `defer`, and to use `return` in place of `os.Exit()`.

Via:http://kgrz.io/reading-files-in-go-an-overview.html

Author: Kashyap Kondamudi. Translator: killernova. Proofreader: Unknwon

This article was translated by GCTT and published by the Go language Chinese network. The translation is published for learning and exchange purposes under the CC-BY-NC-SA license; please keep the original/translation links and the author/translator information when sharing.
