On the first day of the New Year's holiday, with nothing to do at home, I implemented a single-process version of MapReduce in Golang and put the code on GitHub. It finds the 10 most frequent words in a large file. Because the functionality is relatively simple, the design is not decoupled.
This article first introduces the concept of MapReduce in general and then walks through the code. If I have time in the next few days, I will implement a distributed, highly available version of MapReduce.
1. MapReduce General Architecture
The general structure of MapReduce is the one given in the figure in the paper. In general, MapReduce is a divide-and-conquer idea: the data is split into shards and processed by Mappers, which output intermediate files in key-value form. Reducers then merge the intermediate files output by the Mappers, combining entries with the same key, and output the result files. If necessary, a Combiner performs the final merge.
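The idea can be sketched in a few lines of Go. This is only an in-memory illustration, not the implementation described later: the names mapShard and reduceCounts are hypothetical, and the shards are plain strings instead of files.

```go
package main

import (
	"fmt"
	"strings"
)

// mapShard plays the role of a mapper: it turns one shard of text into
// intermediate key-value pairs (word -> count within this shard).
func mapShard(shard string) map[string]int {
	counts := make(map[string]int)
	for _, word := range strings.Fields(shard) {
		counts[word]++
	}
	return counts
}

// reduceCounts plays the role of a reducer: it merges the intermediate
// results so that the counts of equal keys are added together.
func reduceCounts(intermediate []map[string]int) map[string]int {
	total := make(map[string]int)
	for _, counts := range intermediate {
		for word, n := range counts {
			total[word] += n
		}
	}
	return total
}

func main() {
	shards := []string{"a b a", "b c", "a c c"} // the input, already split
	var intermediate []map[string]int
	for _, shard := range shards {
		intermediate = append(intermediate, mapShard(shard))
	}
	fmt.Println(reduceCounts(intermediate)) // prints map[a:3 b:2 c:3]
}
```

In a real deployment each call to mapShard would run on a different worker and the intermediate maps would be written to files, which is what the rest of this article does.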
In summary, there are 5 parts: User program, Master, Mapper, Reducer, and Combiner (not shown in the figure).
- User program: mainly splits the input data and supplies the code for the Mapper, Reducer, and Combiner.
- Master: the central control system. It controls how many Mappers and Reducers are started, for example m processes running Mappers and n processes running Reducers. From the Master's point of view, Mappers and Reducers are both workers that just run different programs: a Mapper runs the user-supplied map code and a Reducer runs the user-supplied reduce code. The Master also relays intermediate paths: it passes the intermediate files generated by the Mappers to the Reducers and returns the result files generated by the Reducers, or passes them to the Combiner if necessary. Because the Master is a single point and a potential performance bottleneck, it can be clustered, either in primary/standby mode or in distributed mode. ZooKeeper can be used for leader election, with some message middleware for data synchronization. The Master can also apply policies: for example, if a worker takes a very long time to execute, it is probably stuck, so the data assigned to that worker is reassigned to another worker. This of course requires deduplicating the results when the same data ends up being processed more than once.
- Mapper: responsible for cutting the input data into key-value format. When a Mapper finishes, it reports the path of its intermediate file to the Master, which passes it on to a Reducer for subsequent processing. If a Mapper fails before it has finished, or after it has finished but before the Reducer has finished reading its output file, the input assigned to that Mapper is re-executed by another Mapper.
- Reducer: receives from the Master the message with the Mapper output files, reads the files over RPC, processes them, and outputs a result file. n Reducers produce n output files.
- Combiner: performs the final merge. It is usually not necessary.
Overall, the architecture is not complex. Communication between components can use anything, such as RPC, HTTP, or a private protocol.
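The reassignment policy mentioned for the Master can be sketched with a timeout. This is a minimal illustration under assumptions, not part of the implementation below: the worker type, runWithBackup, stuckWorker, healthyWorker, and demoTimeout are all hypothetical names, and a late result from a stuck worker is simply discarded.

```go
package main

import (
	"fmt"
	"time"
)

// worker is a hypothetical stand-in for a mapper or reducer: it takes a
// task name and eventually returns a result.
type worker func(task string) string

// demoTimeout is how long the master waits before giving up on a worker.
const demoTimeout = 50 * time.Millisecond

// runWithBackup hands the task to each worker in turn. If a worker does not
// answer within timeout, the master stops waiting and re-assigns the task
// to the next worker. It reports false if every worker timed out.
func runWithBackup(task string, workers []worker, timeout time.Duration) (string, bool) {
	for _, w := range workers {
		done := make(chan string, 1) // buffered so a late worker never blocks
		go func(w worker) { done <- w(task) }(w)
		select {
		case res := <-done:
			return res, true
		case <-time.After(timeout):
			// treat this worker as stuck and try the next one
		}
	}
	return "", false
}

// stuckWorker simulates a worker that takes far too long.
func stuckWorker(task string) string {
	time.Sleep(time.Second)
	return "late:" + task
}

// healthyWorker answers immediately.
func healthyWorker(task string) string {
	return "done:" + task
}

func main() {
	res, ok := runWithBackup("piece-1", []worker{stuckWorker, healthyWorker}, demoTimeout)
	fmt.Println(res, ok)
}
```

Because the first worker's result is never read, only one copy of the output is used here. A distributed version would need an explicit rule for duplicate results.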
2. Introduction to the Implementation Code
This version of the code is a single-process implementation: the Mappers, Reducers, and Combiner run as goroutines and communicate through channels. The code was written casually and is not decoupled.
- Function: find the 10 most frequent words in a given file
- Input: a large file
- Output: the 10 most frequent words
- Implementation: 5 Mappers, 2 Reducers, 1 Combiner.
For convenience, the Combiner also selects the 10 most frequent words with heap sort. Strictly speaking, that step belongs in the user program.
The file directory is as follows. big_input_file.txt in the bin folder is the input file, which can be generated by running the main file under generate. The caller directory holds the entry point, that is, the user program, and the master directory holds the Master, Mapper, Reducer, and Combiner code:
    .
    ├── README.md
    ├── bin
    │   └── file-store
    │       └── big_input_file.txt
    └── src
        ├── caller
        │   └── main.go
        ├── generate
        │   └── main.go
        └── master
            ├── combiner.go
            ├── mapper.go
            ├── master.go
            └── reducer.go

    6 directories, 8 files
2.1 Caller
The user program reads the input file, splits it into pieces with a fixed number of lines, and then calls master.Handle to process them.
    package main

    import (
        "bufio"
        "os"
        "path"
        "path/filepath"
        "strconv"

        "master"

        "github.com/vinllen/go-logger/logger"
    )

    const (
        LIMIT int = 10000 // the line limit of every piece
    )

    func main() {
        curDir, err := filepath.Abs(filepath.Dir(os.Args[0]))
        if err != nil {
            logger.Error("Read path error: ", err.Error())
            return
        }
        fileDir := path.Join(curDir, "file-store")
        _ = os.Mkdir(fileDir, os.ModePerm)

        // 1. read file
        filename := "big_input_file.txt"
        inputFile, err := os.Open(path.Join(fileDir, filename))
        if err != nil {
            logger.Error("Read inputFile error: ", err.Error())
            return
        }
        defer inputFile.Close()

        // 2. split inputFile into several pieces of at most LIMIT lines each
        filePieceArr := []string{}
        scanner := bufio.NewScanner(inputFile)
        piece := 1
    Outter:
        for {
            outputFilename := "input_piece_" + strconv.Itoa(piece)
            outputFilePos := path.Join(fileDir, outputFilename)
            filePieceArr = append(filePieceArr, outputFilePos)

            outputFile, err := os.Create(outputFilePos)
            if err != nil {
                logger.Error("Split inputFile error: ", err.Error())
                return // retrying the same piece would loop forever
            }
            defer outputFile.Close()

            for cnt := 0; cnt < LIMIT; cnt++ {
                if !scanner.Scan() {
                    break Outter // no more input lines
                }
                _, err := outputFile.WriteString(scanner.Text() + "\n")
                if err != nil {
                    logger.Error("Split inputFile writing error: ", err.Error())
                    return
                }
            }
            piece++
        }

        // 3. pass to master
        res := master.Handle(filePieceArr, fileDir)
        logger.Warn(res)
    }
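The splitting step on its own can be sketched without touching the file system. The following is an illustration under assumptions, not the caller's actual code: splitLines and splitString are hypothetical helpers that keep the pieces in memory instead of writing input_piece_N files.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// splitLines reads r line by line and groups the lines into pieces of at
// most limit lines each. limit must be positive.
func splitLines(r io.Reader, limit int) ([][]string, error) {
	var pieces [][]string
	var current []string
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		current = append(current, scanner.Text())
		if len(current) == limit {
			pieces = append(pieces, current)
			current = nil // start a fresh piece
		}
	}
	if len(current) > 0 {
		pieces = append(pieces, current) // last, shorter piece
	}
	return pieces, scanner.Err()
}

// splitString is a convenience wrapper for in-memory input.
func splitString(text string, limit int) ([][]string, error) {
	return splitLines(strings.NewReader(text), limit)
}

func main() {
	pieces, err := splitString("l1\nl2\nl3\nl4\nl5\n", 2)
	if err != nil {
		fmt.Println("split error:", err)
		return
	}
	fmt.Println(len(pieces), pieces) // prints 3 [[l1 l2] [l3 l4] [l5]]
}
```

Unlike the caller above, this variant never produces an empty trailing piece when the number of lines is an exact multiple of the limit.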
2.2 Master
The Master program starts the Combiner, Reducers, and Mappers in turn, relays messages between them, and outputs the final result.
    package master

    import (
        "sync/atomic"

        "github.com/vinllen/go-logger/logger"
    )

    var (
        MapChanIn      chan MapInput // channel produced by master while consumed by mapper
        MapChanOut     chan string   // channel produced by mapper while consumed by master
        ReduceChanIn   chan string   // channel produced by master while consumed by reducer
        ReduceChanOut  chan string   // channel produced by reducer while consumed by master
        CombineChanIn  chan string   // channel produced by master while consumed by combiner
        CombineChanOut chan []Item   // channel produced by combiner while consumed by master
    )

    func Handle(inputArr []string, fileDir string) []Item {
        logger.Info("handle called")

        const (
            mapperNumber  int = 5
            reducerNumber int = 2
        )

        MapChanIn = make(chan MapInput)
        MapChanOut = make(chan string)
        ReduceChanIn = make(chan string)
        ReduceChanOut = make(chan string)
        CombineChanIn = make(chan string)
        CombineChanOut = make(chan []Item)

        // number of results still to be forwarded to the next stage
        reduceJobNum := int32(len(inputArr))
        combineJobNum := int32(reducerNumber)

        // start combiner
        go combiner()

        // start reducer
        for i := 1; i <= reducerNumber; i++ {
            go reducer(i, fileDir)
        }

        // start mapper
        for i := 1; i <= mapperNumber; i++ {
            go mapper(i, fileDir)
        }

        go func() {
            for i, v := range inputArr {
                MapChanIn <- MapInput{
                    Filename: v,
                    Nr:       i + 1,
                } // pass job to mapper
            }
            close(MapChanIn) // close map input channel when no more job
        }()

        var res []Item
    outter:
        for {
            select {
            case v := <-MapChanOut:
                go func() {
                    ReduceChanIn <- v
                    // The counter is shared by several forwarding goroutines,
                    // so it is decremented atomically; the last one closes
                    // the channel.
                    if atomic.AddInt32(&reduceJobNum, -1) <= 0 {
                        close(ReduceChanIn)
                    }
                }()
            case v := <-ReduceChanOut:
                go func() {
                    CombineChanIn <- v
                    if atomic.AddInt32(&combineJobNum, -1) <= 0 {
                        close(CombineChanIn)
                    }
                }()
            case v := <-CombineChanOut:
                res = v
                break outter
            }
        }
        close(MapChanOut)
        close(ReduceChanOut)
        close(CombineChanOut)

        return res
    }
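The Master above decides when to close a channel by counting outstanding jobs. An alternative, shown here only as a sketch, is to let each stage close its own output channel with sync.WaitGroup once all of its workers have returned. The names runStage, feed, and collect are hypothetical, and the stages pass strings rather than real file paths.

```go
package main

import (
	"fmt"
	"sync"
)

// runStage starts n workers that read jobs from in, apply work to each job
// and send the result on the returned channel. The output channel is closed
// once every worker has returned, so the next stage can range over it.
func runStage(n int, in <-chan string, work func(string) string) <-chan string {
	out := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range in {
				out <- work(job)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// feed sends every job on a new channel and then closes it.
func feed(jobs []string) <-chan string {
	ch := make(chan string)
	go func() {
		for _, j := range jobs {
			ch <- j
		}
		close(ch)
	}()
	return ch
}

// collect drains a channel into a slice.
func collect(ch <-chan string) []string {
	var res []string
	for s := range ch {
		res = append(res, s)
	}
	return res
}

func main() {
	pieces := feed([]string{"piece-1", "piece-2", "piece-3"})
	mapped := runStage(5, pieces, func(s string) string { return "mapped(" + s + ")" })
	reduced := runStage(2, mapped, func(s string) string { return "reduced(" + s + ")" })
	for _, r := range collect(reduced) {
		fmt.Println(r) // the order depends on goroutine scheduling
	}
}
```

With this shape no job counter is shared between goroutines. The trade-off is that each stage here emits one result per job, whereas the Reducer in this article emits a single file after its input channel is closed.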
2.3 Mapper
The Mapper program reads an input piece, generates an intermediate file in key-value format, and notifies the Master.
    package master

    import (
        "bufio"
        "fmt"
        "os"
        "path"
        "strconv"

        "github.com/vinllen/go-logger/logger"
    )

    type MapInput struct {
        Filename string
        Nr       int
    }

    func mapper(nr int, fileDir string) {
        for {
            val, ok := <-MapChanIn // val: input piece and its number
            if !ok {               // channel closed
                break
            }

            inputFilename := val.Filename
            nr := val.Nr // the job number, used to name the output file
            file, err := os.Open(inputFilename)
            if err != nil {
                errMsg := fmt.Sprintf("Read file(%s) error in mapper(%d)", inputFilename, nr)
                logger.Error(errMsg)
                MapChanOut <- "" // still report, so the master's job count stays correct
                continue
            }

            // count the words of this piece
            mp := make(map[string]int)
            scanner := bufio.NewScanner(file)
            scanner.Split(bufio.ScanWords)
            for scanner.Scan() {
                str := scanner.Text()
                mp[str]++
            }
            file.Close()

            // write one "word count" line per key to the intermediate file
            outputFilename := path.Join(fileDir, "mapper-output-"+strconv.Itoa(nr))
            outputFileHandler, err := os.Create(outputFilename)
            if err != nil {
                errMsg := fmt.Sprintf("Write file(%s) error in mapper(%d)", outputFilename, nr)
                logger.Error(errMsg)
            } else {
                for k, v := range mp {
                    str := fmt.Sprintf("%s %d\n", k, v)
                    outputFileHandler.WriteString(str)
                }
                outputFileHandler.Close()
            }

            MapChanOut <- outputFilename
        }
    }
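The intermediate format is one "word count" pair per line. A minimal sketch of producing it from a string, assuming a hypothetical helper mapperOutput that sorts the words so that the result is reproducible (the Mapper above writes them in map iteration order):

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// mapperOutput counts the whitespace-separated words in text and renders
// them as "word count" lines, sorted by word.
func mapperOutput(text string) (string, error) {
	counts := make(map[string]int)
	scanner := bufio.NewScanner(strings.NewReader(text))
	scanner.Split(bufio.ScanWords)
	for scanner.Scan() {
		counts[scanner.Text()]++
	}
	if err := scanner.Err(); err != nil {
		return "", err
	}

	words := make([]string, 0, len(counts))
	for w := range counts {
		words = append(words, w)
	}
	sort.Strings(words)

	var sb strings.Builder
	for _, w := range words {
		fmt.Fprintf(&sb, "%s %d\n", w, counts[w])
	}
	return sb.String(), nil
}

func main() {
	out, err := mapperOutput("to be or not to be")
	if err != nil {
		fmt.Println("mapper error:", err)
		return
	}
	fmt.Print(out)
	// prints:
	// be 2
	// not 1
	// or 1
	// to 2
}
```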
2.4 Reducer
The Reducer program reads the intermediate files passed by the Master and merges them.
    package master

    import (
        "bufio"
        "fmt"
        "os"
        "path"
        "strconv"
        "strings"

        "github.com/vinllen/go-logger/logger"
    )

    func reducer(nr int, fileDir string) {
        mp := make(map[string]int) // store the frequency of words

        // read files and do reduce
        for {
            val, ok := <-ReduceChanIn
            if !ok { // channel closed: no more intermediate files
                break
            }
            logger.Debug("reducer called: ", nr)
            file, err := os.Open(val)
            if err != nil {
                errMsg := fmt.Sprintf("Read file(%s) error in reducer", val)
                logger.Error(errMsg)
                continue
            }

            scanner := bufio.NewScanner(file)
            for scanner.Scan() {
                str := scanner.Text()
                arr := strings.Split(str, " ") // each line is "word count"
                if len(arr) != 2 {
                    errMsg := fmt.Sprintf("Read file(%s) error: line(%s) has %d fields instead of 2 in reducer", val, str, len(arr))
                    logger.Warn(errMsg)
                    continue
                }
                v, err := strconv.Atoi(arr[1])
                if err != nil {
                    errMsg := fmt.Sprintf("Read file(%s) error: line(%s) parse error in reducer", val, str)
                    logger.Warn(errMsg)
                    continue
                }
                mp[arr[0]] += v
            }
            if err := scanner.Err(); err != nil {
                logger.Error("reducer: reading input: ", err)
            }
            file.Close()
        }

        // write the merged counts to this reducer's output file
        outputFilename := path.Join(fileDir, "reduce-output-"+strconv.Itoa(nr))
        outputFileHandler, err := os.Create(outputFilename)
        if err != nil {
            errMsg := fmt.Sprintf("Write file(%s) error in reducer(%d)", outputFilename, nr)
            logger.Error(errMsg)
        } else {
            for k, v := range mp {
                str := fmt.Sprintf("%s %d\n", k, v)
                outputFileHandler.WriteString(str)
            }
            outputFileHandler.Close()
        }

        ReduceChanOut <- outputFilename
    }
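The parse-and-merge step at the heart of the Reducer can be isolated as follows. This is a sketch under assumptions: mergeLine and mergeIntermediate are hypothetical names, the intermediate files are passed as strings, and malformed lines are counted and skipped instead of logged.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// mergeLine parses one "word count" line and adds the count to total.
func mergeLine(total map[string]int, line string) error {
	fields := strings.Fields(line)
	if len(fields) != 2 {
		return fmt.Errorf("malformed line %q: want 2 fields, got %d", line, len(fields))
	}
	n, err := strconv.Atoi(fields[1])
	if err != nil {
		return fmt.Errorf("malformed count in line %q: %v", line, err)
	}
	total[fields[0]] += n
	return nil
}

// mergeIntermediate merges the contents of several intermediate files and
// reports how many malformed lines were skipped.
func mergeIntermediate(contents []string) (map[string]int, int) {
	total := make(map[string]int)
	skipped := 0
	for _, content := range contents {
		scanner := bufio.NewScanner(strings.NewReader(content))
		for scanner.Scan() {
			if err := mergeLine(total, scanner.Text()); err != nil {
				skipped++
			}
		}
	}
	return total, skipped
}

func main() {
	total, skipped := mergeIntermediate([]string{
		"a 2\nb 1\n",
		"b 4\nbroken\nc x\n",
	})
	fmt.Println(total, skipped) // prints map[a:2 b:5] 2
}
```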
2.5 Combiner
The Combiner program reads the Reducer result files passed by the Master, merges them into one, and then uses heap sort to output the 10 most frequent words.
    package master

    import (
        "bufio"
        "container/heap"
        "fmt"
        "os"
        "strconv"
        "strings"

        "github.com/vinllen/go-logger/logger"
    )

    type Item struct {
        key string
        val int
    }

    // PriorityQueue is a max-heap of items ordered by val.
    type PriorityQueue []*Item

    func (pq PriorityQueue) Len() int {
        return len(pq)
    }

    func (pq PriorityQueue) Less(i, j int) bool {
        return pq[i].val > pq[j].val
    }

    func (pq PriorityQueue) Swap(i, j int) {
        pq[i], pq[j] = pq[j], pq[i]
    }

    func (pq *PriorityQueue) Push(x interface{}) {
        item := x.(*Item)
        *pq = append(*pq, item)
    }

    func (pq *PriorityQueue) Pop() interface{} {
        old := *pq
        n := len(old)
        item := old[n-1]
        *pq = old[0 : n-1]
        return item
    }

    func combiner() {
        mp := make(map[string]int) // store the frequency of words

        // read files and do combine
        for {
            val, ok := <-CombineChanIn
            if !ok { // channel closed: no more reducer output
                break
            }
            logger.Debug("combiner called")
            file, err := os.Open(val)
            if err != nil {
                errMsg := fmt.Sprintf("Read file(%s) error in combiner", val)
                logger.Error(errMsg)
                continue
            }

            scanner := bufio.NewScanner(file)
            for scanner.Scan() {
                str := scanner.Text()
                arr := strings.Split(str, " ") // each line is "word count"
                if len(arr) != 2 {
                    errMsg := fmt.Sprintf("Read file(%s) error: line(%s) does not have 2 fields in combiner", val, str)
                    logger.Warn(errMsg)
                    continue
                }
                v, err := strconv.Atoi(arr[1])
                if err != nil {
                    errMsg := fmt.Sprintf("Read file(%s) error: line(%s) parse error in combiner", val, str)
                    logger.Warn(errMsg)
                    continue
                }
                mp[arr[0]] += v
            }
            file.Close()
        }

        // heap sort: push every word, then pop the 10 largest
        pq := make(PriorityQueue, 0)
        heap.Init(&pq)
        for k, v := range mp {
            node := &Item{
                key: k,
                val: v,
            }
            heap.Push(&pq, node)
        }

        res := []Item{}
        for i := 0; i < 10 && pq.Len() > 0; i++ {
            node := heap.Pop(&pq).(*Item)
            res = append(res, *node)
        }

        CombineChanOut <- res
    }
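The Combiner pushes every word onto a max-heap and pops 10 times. A common variation, given here only as a sketch and not as the author's method, keeps a min-heap of at most k entries so that memory stays proportional to k. The names entry, minHeap, and topK are hypothetical, and ties are broken alphabetically so that the result is deterministic.

```go
package main

import (
	"container/heap"
	"fmt"
)

type entry struct {
	word  string
	count int
}

// minHeap keeps the weakest candidate (lowest count) at the root.
type minHeap []entry

func (h minHeap) Len() int { return len(h) }
func (h minHeap) Less(i, j int) bool {
	if h[i].count != h[j].count {
		return h[i].count < h[j].count
	}
	return h[i].word > h[j].word // on ties the alphabetically later word is weaker
}
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(entry)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	e := old[n-1]
	*h = old[:n-1]
	return e
}

// topK returns the k most frequent words, most frequent first.
func topK(counts map[string]int, k int) []entry {
	h := &minHeap{}
	for w, c := range counts {
		heap.Push(h, entry{w, c})
		if h.Len() > k {
			heap.Pop(h) // drop the current weakest candidate
		}
	}
	// Popping yields the weakest first, so fill the result from the back.
	res := make([]entry, h.Len())
	for i := len(res) - 1; i >= 0; i-- {
		res[i] = heap.Pop(h).(entry)
	}
	return res
}

func main() {
	counts := map[string]int{"the": 5, "a": 3, "go": 3, "map": 1}
	fmt.Println(topK(counts, 3)) // prints [{the 5} {a 3} {go 3}]
}
```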
3. Summary
Shortcomings and unimplemented features:
- High coupling between the modules
- The Master is a single point of failure and has not been extended into a cluster
- No multi-process implementation with RPC communication between processes
- No reassignment of a task to another worker when a single worker takes too long.
Next time I'm free, I'll implement distributed, highly available code, with RPC communication between modules.
Note
When reprinting, please cite the source: http://vinllen.com/golangshi-xian-mapreducedan-jin-cheng-ban-ben/
Reference
https://research.google.com/archive/mapreduce.html