MapReduce is a programming model proposed by Google for parallel computation over large-scale datasets (larger than 1 TB). In short, it splits a big job into many small tasks, executes each one separately, and finally merges the results, much like what our teachers kept telling us: turn big problems into small ones, and small ones into nothing. Following that idea, this article looks at how to approach massive data processing this way, with a simple implementation in Go.
Getting on board
Let's briefly introduce a few concepts:
The concepts of Map and Reduce, and their main ideas, are borrowed from functional programming languages, along with features taken from vector programming languages. A typical software implementation specifies a Map function that transforms a set of key/value pairs into a new set of intermediate key/value pairs, and a Reduce function that guarantees all intermediate pairs sharing the same key are grouped together and merged.
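To make these two roles concrete, here is a minimal Go sketch of the shapes involved. The KeyValue type matches what the framework code below uses; the mapFunc/reduceFunc type names are just illustrative:

```go
// KeyValue is the unit of data flowing between the two phases.
type KeyValue struct {
	Key   string
	Value string
}

// A map function turns one input (e.g. a file) into a list of
// intermediate key/value pairs.
type mapFunc func(file string, contents string) []KeyValue

// A reduce function folds all values that share one key into a
// single result value.
type reduceFunc func(key string, values []string) string
```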
Let's start with a simple example:
Word frequency counting (word count): in practice we may have a requirement like counting how many times each word appears in an article. A real-life variant is the Top N problem, for example, the whole school holds a commendation meeting and wants to find the 10 best students. Such Top N examples abound, and word count is one way to implement them; the final result simply keeps the front of the ranking.
Take the requirement of finding the 10 best students and think about how to implement it. This demand probably comes from the headmaster at a meeting. The concrete implementation is that each grade leader finds the top 10 students of their own grade, and the headmaster then summarizes those lists into the final top 10. And how does each grade do it? In the same way: the top 10 students of each class are found and aggregated at the grade level.
Departure
With the basic overview and ideas understood, we can now start building the whole MapReduce framework. First, let's be clear about the plan: split the job into tasks of a suitable size, compute them, and then merge the results of each step. These two phases are defined as the map process and the reduce process respectively.
Again, take word count as an example:
The Map phase reads the given file and initializes the count of every word in the file to 1.
The Reduce phase adds up the counts of identical words. The purpose of our MapReduce framework, then, is to invoke these Map and Reduce processes at the right time.
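As a sketch of what the user supplies for word count (assuming the KeyValue type above; the names mapF and reduceF mirror the function parameters used in the framework code below):

```go
import (
	"strconv"
	"strings"
	"unicode"
)

// mapF splits the contents into words and emits ("word", "1") for each.
func mapF(file string, contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kvs := make([]KeyValue, 0, len(words))
	for _, w := range words {
		kvs = append(kvs, KeyValue{Key: w, Value: "1"})
	}
	return kvs
}

// reduceF receives every "1" emitted for one word, so the word's
// count is simply the number of values.
func reduceF(key string, values []string) string {
	return strconv.Itoa(len(values))
}
```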
The doMap method in common_map.go takes the given file, reads its data, and then invokes the map process. The code contains comments; briefly, the main steps are:
- Read the input file;
- Pass the file contents to the user's map function, producing KeyValue pairs;
- Finally, partition by the Key of each KeyValue and write the contents into intermediate files, to make the subsequent Reduce process easy to execute.
```go
import (
	"encoding/json"
	"io/ioutil"
	"log"
	"os"
)

func doMap(
	jobName string, // the name of the MapReduce job
	mapTaskNumber int, // which map task this is
	inFile string, // the input file this task reads
	nReduce int, // the number of reduce tasks that will be run
	mapF func(file string, contents string) []KeyValue,
) {
	// Step 1: read the input file
	contents, err := ioutil.ReadFile(inFile)
	if err != nil {
		log.Fatal("doMap: read file ", inFile, " error: ", err)
	}

	// Step 2: call the user's map function to get the key/value pairs
	kvResult := mapF(inFile, string(contents))

	// Step 3: partition the pairs into nReduce intermediate files by key:
	//   a. create the temp files
	//   b. create a JSON encoder for each temp file
	//   c. hash each key to pick a partition, then encode the pair into it
	tmpFiles := make([]*os.File, nReduce)
	encoders := make([]*json.Encoder, nReduce)
	for i := 0; i < nReduce; i++ {
		tmpFileName := reduceName(jobName, mapTaskNumber, i)
		tmpFiles[i], err = os.Create(tmpFileName)
		if err != nil {
			log.Fatal(err)
		}
		defer tmpFiles[i].Close()
		encoders[i] = json.NewEncoder(tmpFiles[i])
	}
	for _, kv := range kvResult {
		hashKey := int(ihash(kv.Key)) % nReduce
		if err := encoders[hashKey].Encode(&kv); err != nil {
			log.Fatal("doMap: encode error: ", err)
		}
	}
}
```
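The code above relies on a few helpers that live in the project's common.go: reduceName and mergeName build the intermediate and output file names, and ihash picks the reduce partition for a key. Here is a plausible sketch of them; the exact mrtmp.* naming scheme and the FNV hash are assumptions based on the usual lab skeleton:

```go
import (
	"hash/fnv"
	"strconv"
)

// reduceName builds the name of the intermediate file that map task
// mapTask produces for reduce task reduceTask.
func reduceName(jobName string, mapTask int, reduceTask int) string {
	return "mrtmp." + jobName + "-" + strconv.Itoa(mapTask) + "-" + strconv.Itoa(reduceTask)
}

// mergeName builds the name of the output file of reduce task reduceTask.
func mergeName(jobName string, reduceTask int) string {
	return "mrtmp." + jobName + "-res-" + strconv.Itoa(reduceTask)
}

// ihash hashes a key so it can be assigned to a reduce partition.
func ihash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}
```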
The doReduce function in common_reduce.go performs these main steps:
- Read the intermediate files produced during the doMap process;
- Merge values that share the same Key, and sort the keys in dictionary order;
- Iterate over the sorted keys, call the user's Reduce function for each, and write the computed results to the output file.
```go
import (
	"encoding/json"
	"log"
	"os"
	"sort"
)

func doReduce(
	jobName string, // the name of the whole MapReduce job
	reduceTaskNumber int, // which reduce task this is
	nMap int, // the number of map tasks that were run ("M" in the paper)
	reduceF func(key string, values []string) string,
) {
	// Step 1: read the files written by the map tasks, grouping values
	// that share the same key
	kvs := make(map[string][]string)
	for i := 0; i < nMap; i++ {
		fileName := reduceName(jobName, i, reduceTaskNumber)
		file, err := os.Open(fileName)
		if err != nil {
			log.Fatal("doReduce: open error: ", err)
		}
		dec := json.NewDecoder(file)
		for {
			var kv KeyValue
			if err := dec.Decode(&kv); err != nil {
				break
			}
			kvs[kv.Key] = append(kvs[kv.Key], kv.Value)
		}
		file.Close()
	}

	// Step 2: sort the keys in dictionary order
	var keys []string
	for k := range kvs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	// Step 3: create the result file
	p := mergeName(jobName, reduceTaskNumber)
	file, err := os.Create(p)
	if err != nil {
		log.Fatal("doReduce: create error: ", err)
	}
	enc := json.NewEncoder(file)

	// Step 4: call the user's reduce function for each key and write
	// the result as a KeyValue record
	for _, k := range keys {
		res := reduceF(k, kvs[k])
		enc.Encode(KeyValue{k, res})
	}
	file.Close()
}
```
Merge process
Of course, at the end the results produced by each Reduce task go through a merge process. During the merge, the records also need to be sorted by Key in dictionary order and then written to the final file. The code is similar to reduce, so it is not repeated in full here.
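For readers who want to see it anyway, here is a minimal sketch of that merge step, assuming the doReduce output format above; the function name merge and the final file name mrtmp.<jobName> are assumptions:

```go
import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"sort"
)

// merge reads the per-reduce result files, re-sorts all keys in
// dictionary order, and writes one final "key: value" line per key.
func merge(jobName string, nReduce int) {
	kvs := make(map[string]string)
	for i := 0; i < nReduce; i++ {
		file, err := os.Open(mergeName(jobName, i))
		if err != nil {
			log.Fatal("merge: open error: ", err)
		}
		dec := json.NewDecoder(file)
		for {
			var kv KeyValue
			if err := dec.Decode(&kv); err != nil {
				break
			}
			kvs[kv.Key] = kv.Value
		}
		file.Close()
	}
	var keys []string
	for k := range kvs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	out, err := os.Create("mrtmp." + jobName)
	if err != nil {
		log.Fatal("merge: create error: ", err)
	}
	w := bufio.NewWriter(out)
	for _, k := range keys {
		fmt.Fprintf(w, "%s: %s\n", k, kvs[k])
	}
	w.Flush()
	out.Close()
}
```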
Distributed task execution is achieved with goroutines; the key piece is the schedule method in schedule.go. The main steps:
- Depending on the phase (map or reduce), determine how many map (or reduce) tasks need to be executed, then call the DoTask method in the remote worker.go for each one;
- Wait for all the tasks to complete before returning. Several Go features are used here; see the Go RPC documentation and Concurrency in Go.
```go
import "fmt"

func (mr *Master) schedule(phase jobPhase) {
	var ntasks int
	var nios int // number of inputs (for reduce) or outputs (for map)
	switch phase {
	case mapPhase:
		ntasks = len(mr.files)
		nios = mr.nReduce
	case reducePhase:
		ntasks = mr.nReduce
		nios = len(mr.files)
	}
	fmt.Printf("Schedule: %v %v tasks (%d I/Os)\n", ntasks, phase, nios)

	// Start one goroutine per task. Each goroutine takes an idle worker
	// from registerChannel, asks it to run the task via the
	// Worker.DoTask RPC, and retries with another worker on failure.
	done := make(chan bool)
	for i := 0; i < ntasks; i++ {
		go func(number int) {
			args := DoTaskArgs{mr.jobName, mr.files[number], phase, number, nios}
			var worker string
			reply := new(struct{})
			ok := false
			for ok != true {
				worker = <-mr.registerChannel
				ok = call(worker, "Worker.DoTask", args, reply)
			}
			done <- true
			mr.registerChannel <- worker
		}(i)
	}

	// Wait for all tasks to complete.
	for i := 0; i < ntasks; i++ {
		<-done
	}
	fmt.Printf("Schedule: %v phase done\n", phase)
}
```
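One subtlety worth noting in this design: each goroutine signals done before handing the worker back to registerChannel. If the order were reversed on an unbuffered channel, the last goroutine of a phase could block forever sending the worker back (no task goroutine is left to receive it) while the master is still waiting on done, deadlocking the phase. Signaling done first lets schedule return, and the blocked send is picked up when the next phase starts pulling workers again.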
Arriving at the station
- Test results for the inverted index:
Source repository: https://github.com/happyer/distributed-computing