MapReduce: Simplified Data Processing on Large Clusters

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. The user specifies a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details: partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented, and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1. Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents and web request logs, to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, and the most frequent queries in a given day. Most of these computations are conceptually straightforward. However, the input data is usually large, and the computation has to be distributed across hundreds or thousands of machines in order to finish in an acceptable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code written to deal with these problems.

As a reaction to this complexity, we designed a new abstraction that lets us express the simple computations we were trying to perform while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in the input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all values that shared the same key in order to combine the derived data appropriately. The use of a functional model with user-specified map and reduce operations lets us parallelize large computations easily and use re-execution as the primary mechanism for fault tolerance.

The major contribution of this work is a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored to our cluster-based computing environment.
Section 4 describes several refinements of the programming model that we have found useful. Section 5 presents performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google, including our experience using it to rewrite our production indexing system. Section 7 discusses related and future work.

2. Programming Model

The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: map and reduce.

The map function, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.

The reduce function, also written by the user, accepts an intermediate key I and the set of values associated with that key. It merges these values together to form a possibly smaller set of values. Typically just zero or one output value is produced per reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

2.1 Example

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudocode:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just "1" in this simple example). The reduce function sums together all the counts emitted for a particular word.

In addition, the user fills in a MapReduce specification object with the names of the input and output files and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.

2.2 Types

Even though the previous pseudocode is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

map    (k1, v1)       -> list(k2, v2)
reduce (k2, list(v2)) -> list(v2)

That is, the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.
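To make the dataflow of the model concrete, the following is a minimal, self-contained C++ sketch (not the Google MapReduce library API) that simulates the map, group-by-key, and reduce phases in memory for the word-count example above. All names and the single-process driver structure are illustrative assumptions; a real MapReduce run distributes these phases across many machines.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map: (document name, contents) -> list of (word, "1") pairs.
std::vector<std::pair<std::string, std::string>> Map(
    const std::string& /*key: document name*/, const std::string& value) {
  std::vector<std::pair<std::string, std::string>> out;
  std::istringstream words(value);
  std::string w;
  while (words >> w) out.emplace_back(w, "1");
  return out;
}

// reduce: (word, list of counts) -> total count as a string.
std::string Reduce(const std::string& /*key: word*/,
                   const std::vector<std::string>& values) {
  int result = 0;
  for (const std::string& v : values) result += std::stoi(v);
  return std::to_string(result);
}

int main() {
  // Input key/value pairs: document name -> document contents.
  const std::vector<std::pair<std::string, std::string>> inputs = {
      {"doc1", "the quick brown fox"}, {"doc2", "the lazy dog"}};

  // Map phase, then group all intermediate values by intermediate key.
  // The library does this grouping by partitioning and sorting; an
  // in-memory std::map is enough for illustration.
  std::map<std::string, std::vector<std::string>> groups;
  for (const auto& kv : inputs)
    for (const auto& inter : Map(kv.first, kv.second))
      groups[inter.first].push_back(inter.second);

  // Reduce phase: one call per distinct intermediate key.
  for (const auto& g : groups)
    std::cout << g.first << "\t" << Reduce(g.first, g.second) << "\n";
  return 0;
}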
2.3 More Examples

Here are a few simple programs of interest that can easily be expressed as MapReduce computations.

Distributed grep (string search within files, as in the Unix utility): the map function emits a line if it matches a supplied pattern. The reduce function is an identity function that simply copies the supplied intermediate data to the output.

Count of URL access frequency: the map function processes logs of web page requests and outputs (URL, 1). The reduce function adds together all values for the same URL and emits a (URL, total count) pair.

Reverse web-link graph: the map function outputs a (target, source) pair for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target, list(source)).

Term vector per host: a term vector summarizes the most important words that occur in a document or a set of documents as a list of (word, frequency) pairs. The map function emits a (hostname, term vector) pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throws away infrequent terms, and then emits a final (hostname, term vector) pair.

Inverted index: the map function parses each document and emits a sequence of (word, document ID) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a (word, list(document ID)) pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.

Distributed sort: the map function extracts the key from each record and emits a (key, record) pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.

3. Implementation

Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multiprocessor, and yet another for an even larger collection of networked machines.

This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together by a switched network. In our environment:

1. Machines run Linux and typically have dual processors and 2-4 GB of memory.
2. Commodity networking hardware is used, typically either 100 megabits/second or 1 gigabit/second per machine, but averaging considerably less than half of that in overall bandwidth.
3. A cluster consists of hundreds or thousands of machines, so machine failures are common.
4. Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.
5. Users submit jobs to a scheduling system. Each job consists of a set of tasks and is mapped by the scheduler to a set of available machines within the cluster.

3.1 Execution Overview

The map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines.
Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (for example, hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.

Figure 1 shows the overall flow of a MapReduce operation in our implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbers below correspond to the numbered labels in Figure 1):

1. The MapReduce library in the user program first splits the input files into M pieces, typically 16 to 64 MB per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker that is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined map function. The intermediate key/value pairs produced by the map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, which is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all of the intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sort is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
6. The reduce worker iterates over the sorted intermediate data, and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns to the user code.

After successful completion, the output of the MapReduce execution is available in the R output files (one per reduce task, with file names specified by the user). Typically, users do not need to combine these R output files into one file; they often pass these files as input to another MapReduce call or use them from another distributed application that is able to deal with input that is partitioned into multiple files.

3.2 Master Data Structures

The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks).
The master is the conduit through which the locations of intermediate file regions are propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by that map task. Updates to this location and size information are received as map tasks are completed, and the information is pushed incrementally to workers that have in-progress reduce tasks.
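As a rough illustration of the bookkeeping described in Sections 3.1 and 3.2, the following C++ sketch shows one plausible shape for the master's per-task state and the map-output location table. The type and field names are assumptions made for illustration, not details of the actual implementation.

#include <cstdint>
#include <string>
#include <vector>

// Per-task state tracked by the master (Section 3.2).
enum class TaskState { kIdle, kInProgress, kCompleted };

struct MapOutputRegion {
  std::string location;  // e.g. "worker-17:/local/mr/map-0042-r-3" (hypothetical)
  int64_t size_bytes;    // size of this intermediate region
};

struct MapTask {
  TaskState state = TaskState::kIdle;
  std::string worker;                    // assigned worker (non-idle tasks only)
  std::vector<MapOutputRegion> regions;  // R regions, filled in on completion
};

struct ReduceTask {
  TaskState state = TaskState::kIdle;
  std::string worker;
  // Locations of intermediate data, pushed incrementally by the master
  // as map tasks finish.
  std::vector<MapOutputRegion> pending_inputs;
};

struct MasterState {
  std::vector<MapTask> map_tasks;        // M entries
  std::vector<ReduceTask> reduce_tasks;  // R entries
};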
3.3 Fault Tolerance

Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.

Worker failure: the master pings every worker periodically. If no response is received from a worker within a certain amount of time, the master marks the worker as failed. Any map task completed by the failed worker is reset to its initial idle state and therefore becomes eligible for scheduling on other workers. Similarly, any map or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.

Completed map tasks are re-executed on a failure because their output is stored on the local disk of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed because their output is stored in the global file system.

When a map task is executed first by worker A and later by worker B (because A failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker A will read the data from worker B.

MapReduce is resilient to large-scale worker failures. For example, during one MapReduce operation, network maintenance on a running cluster caused 80 machines at a time to become unreachable for several minutes. The MapReduce master simply re-executed the work done by the unreachable workers and continued to make forward progress, eventually completing the MapReduce operation.

Master failure: it is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, given that there is only a single master, its failure is unlikely; our current implementation therefore aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.

Semantics in the presence of failures: when the user-supplied map and reduce operators are deterministic functions of their input values, our distributed implementation produces the same output as a non-faulting sequential execution of the entire program would.

We rely on atomic commits of map and reduce task outputs to achieve this property. Each in-progress task writes its output to private temporary files. A reduce task produces one such file, and a map task produces R such files (one per reduce task). When a map task completes, the worker sends a message to the master that includes the names of the R temporary files. If the master receives a completion message for an already completed map task, it ignores the message. Otherwise, it records the names of the R files in its data structures.

When a reduce task completes, the reduce worker atomically renames its temporary output file to the final output file. If the same reduce task is executed on multiple machines, multiple rename calls will be executed for the same final output file. We rely on the atomic rename operation provided by the underlying file system to guarantee that the final file system state contains just the data produced by one execution of the reduce task.

The vast majority of our map and reduce operators are deterministic, and the fact that our semantics are equivalent to a sequential execution in this case makes it very easy for programmers to reason about their program's behavior. When the map and/or reduce operators are non-deterministic, we provide weaker but still reasonable semantics.
In the presence of non-deterministic operators, the output of a particular reduce task R1 is equivalent to the output for R1 produced by some sequential execution of the non-deterministic program. However, the output for a different reduce task R2 may correspond to the output for R2 produced by a different sequential execution of the non-deterministic program.

Consider map task M and reduce tasks R1 and R2. Let e(Ri) be the execution of Ri that committed (there is exactly one such execution). The weaker semantics arise because e(R1) may have read the output produced by one execution of M, while e(R2) may have read the output produced by a different execution of M.

3.4 Locality

Network bandwidth is a relatively scarce resource in our computing environment. We conserve network bandwidth by taking advantage of the fact that the input data (managed by GFS) is stored on the local disks of the machines that make up our cluster. GFS divides each file into 64 MB blocks and stores several copies of each block (typically 3 copies) on different machines. The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data (for example, on a worker machine that is on the same network switch as the machine containing the data). When running large MapReduce operations on a significant fraction of the machines in a cluster, most input data is read locally and consumes no network bandwidth.

3.5 Task Granularity

We subdivide the map phase into M pieces and the reduce phase into R pieces, as described above. Ideally, M and R should be much larger than the number of worker machines. Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails: the many map tasks it has completed can be spread out across all the other worker machines.

There are practical bounds on how large M and R can be in our implementation, since the master must make O(M + R) scheduling decisions and keep O(M * R) state in memory. (The constant factors for memory usage are small, however: the O(M * R) piece of the state consists of approximately one byte of data per map task/reduce task pair.) Furthermore, R is often constrained by users because the output of each reduce task ends up in a separate output file. In practice, we tend to choose M so that each individual task processes roughly 16 to 64 MB of input data (so that the locality optimization described above is most effective), and we make R a small multiple of the number of worker machines we expect to use. We often perform MapReduce computations with M = 200,000 and R = 5,000, using 2,000 worker machines.

3.6 Backup Tasks

One of the common causes that lengthens the total time taken for a MapReduce operation is a straggler: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation. Stragglers can arise for many reasons. For example, a machine with a bad disk may experience frequent correctable errors that slow its read performance from 30 MB/s to 1 MB/s. The cluster scheduling system may have scheduled other tasks on the machine, causing it to execute the MapReduce code more slowly due to competition for CPU, memory, local disk, or network bandwidth. A recent problem we experienced was a bug in machine initialization code that caused processor caches to be disabled: computations on affected machines slowed down by over a factor of one hundred.
We have a general mechanism to alleviate the problem of stragglers. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes. We have tuned this mechanism so that it typically increases the computational resources used by the operation by no more than a few percent. We have found that this significantly reduces the time to complete large MapReduce operations. As an example, the sort program described in Section 5.3 takes 44% longer to complete when the backup task mechanism is disabled.
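The following self-contained C++ fragment sketches one way the backup-task policy of Section 3.6 could be expressed. The data structures, the 90% "close to completion" threshold, and the helper names are illustrative assumptions, not details taken from the paper.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

enum class TaskState { kIdle, kInProgress, kCompleted };

struct Task {
  TaskState state = TaskState::kIdle;
  std::string worker;         // primary assignment
  std::string backup_worker;  // empty until a backup execution is scheduled
};

// Near the end of the job, schedule backup executions of the tasks that are
// still in progress. The task is considered done as soon as either the
// primary or the backup execution completes; here, recording the backup
// worker stands in for launching the duplicate execution.
void MaybeScheduleBackups(std::vector<Task>& tasks,
                          const std::vector<std::string>& idle_workers) {
  std::size_t completed = 0;
  for (const Task& t : tasks)
    if (t.state == TaskState::kCompleted) completed++;
  if (completed * 10 < tasks.size() * 9) return;  // not yet near completion

  std::size_t next_idle = 0;
  for (Task& t : tasks) {
    if (t.state == TaskState::kInProgress && t.backup_worker.empty() &&
        next_idle < idle_workers.size()) {
      t.backup_worker = idle_workers[next_idle++];
    }
  }
}

int main() {
  // Nine completed tasks and one straggler: the straggler gets a backup.
  std::vector<Task> tasks(10);
  for (int i = 0; i < 9; i++) tasks[i].state = TaskState::kCompleted;
  tasks[9].state = TaskState::kInProgress;
  MaybeScheduleBackups(tasks, {"spare-worker-1"});
  std::cout << "backup scheduled on: " << tasks[9].backup_worker << "\n";
  return 0;
}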
4. Refinements

Although the basic functionality provided by simply writing map and reduce functions is sufficient for most needs, we have found a few extensions useful. They are described in this section.

4.1 Partitioning Function

The users of MapReduce specify the number of reduce tasks/output files that they desire (R). Data gets partitioned across these tasks using a partitioning function on the intermediate key. A default partitioning function is provided that uses hashing (for example, hash(key) mod R). This tends to result in fairly well-balanced partitions. In some cases, however, it is useful to partition data by some other function of the key. For example, sometimes the output keys are URLs, and we want all entries for a single host to end up in the same output file. To support situations like this, the user of the MapReduce library can provide a special partitioning function. For example, using hash(Hostname(urlkey)) mod R as the partitioning function causes all URLs from the same host to end up in the same output file. A sketch of such a partitioner appears after Section 4.4 below.

4.2 Ordering Guarantees

We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or when users of the output find it convenient to have the data sorted.

4.3 Combiner Function

In some cases, there is significant repetition in the intermediate keys produced by each map task, and the user-specified reduce function is commutative and associative. A good example of this is the word counting example in Section 2.1. Since word frequencies tend to follow a Zipf distribution, each map task will produce hundreds or thousands of records of the form <the, 1>. All of these counts would be sent over the network to a single reduce task and then added together by the reduce function to produce one number. We allow the user to specify an optional combiner function that does partial merging of this data locally before it is sent over the network.

The combiner function is executed on each machine that performs a map task. Typically the same code is used to implement both the combiner and the reduce functions. The only difference between a combiner function and a reduce function is how the MapReduce library handles the output of the function. The output of a reduce function is written to the final output file. The output of a combiner function is written to an intermediate file that will be sent to a reduce task. Partial combining significantly speeds up certain classes of MapReduce operations. Appendix A contains an example that uses a combiner.

4.4 Input and Output Types

The MapReduce library provides support for reading input data in several different formats. For example, "text" mode input treats each line as a key/value pair: the key is the offset in the file and the value is the contents of the line. Another commonly supported format stores a sequence of key/value pairs sorted by key. Each input type implementation knows how to split itself into meaningful ranges for processing by separate map tasks (for example, text mode's range splitting ensures that splits occur only at line boundaries). Although most users use only one of a small number of predefined input types, users can add support for a new input type by providing an implementation of a simple reader interface. A reader does not necessarily need to read data from a file; for example, it is easy to define a reader that reads records from a database or from data structures in memory.
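As an illustration of the host-based partitioning described in Section 4.1, a custom partitioning function might look roughly like the following C++ sketch. The free-function interface, the Hostname helper, and the use of std::hash are assumptions made for illustration; the paper does not show the library's actual partitioner signature.

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// Extract the hostname portion of a URL key, e.g.
// "http://example.com/a/b" -> "example.com".
std::string Hostname(const std::string& url) {
  std::size_t start = url.find("://");
  start = (start == std::string::npos) ? 0 : start + 3;
  std::size_t end = url.find('/', start);
  return url.substr(start, end == std::string::npos ? std::string::npos
                                                    : end - start);
}

// hash(Hostname(urlkey)) mod R: all URLs from the same host land in the
// same reduce partition, and therefore in the same output file.
int Partition(const std::string& urlkey, int R) {
  return static_cast<int>(std::hash<std::string>{}(Hostname(urlkey)) % R);
}

int main() {
  const int R = 16;
  std::cout << Partition("http://example.com/a", R) << "\n"
            << Partition("http://example.com/b", R) << "\n";  // same partition
  return 0;
}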
4.5 Side Effects

In some cases, users of MapReduce have found it convenient to produce auxiliary files as additional outputs from their map and/or reduce operators. We rely on the application writer to make such side effects atomic. Typically, the application writes to a temporary file and atomically renames this file once it has been fully generated. We do not provide support for atomic two-phase commits of multiple output files produced by a single task. Therefore, tasks that produce multiple output files with cross-file consistency requirements should be deterministic. This restriction has not been a problem in practice.

4.6 Skipping Bad Records

Sometimes there are bugs in user code that cause the map or reduce function to crash deterministically on certain records. Such bugs prevent a MapReduce operation from completing. The usual course of action is to fix the bug, but sometimes this is not feasible; perhaps the bug is in a third-party library for which source code is unavailable. Also, it is sometimes acceptable to ignore a few records, for example when doing statistical analysis on a large data set. We provide an optional mode of execution where the MapReduce library detects which records cause deterministic crashes and skips these records in order to make forward progress.

Each worker process installs a signal handler that catches segmentation violations and bus errors. Before invoking a user-defined map or reduce operation, the MapReduce library stores the sequence number of the record in a global variable. If the user code generates a signal, the signal handler sends a "last gasp" UDP packet containing the sequence number to the MapReduce master. When the master has seen more than one failure on a particular record, it indicates that the record should be skipped when it issues the next re-execution of the corresponding map or reduce task.

4.7 Local Execution

Debugging problems in map or reduce functions can be tricky, since the actual computation happens in a distributed system, often on several thousand machines, with work assignment decisions made dynamically by the master. To help facilitate debugging and testing, we have developed an alternative implementation of the MapReduce library that sequentially executes all of the work for a MapReduce operation on the local machine. Controls are provided so that the user can limit the computation to particular map tasks. Users invoke their program with a special flag and can then easily use any debugging or testing tools they find useful (for example, gdb).

4.8 Status Information

The master runs an internal HTTP server and exports a set of status pages for human consumption. The status pages show the progress of the computation, such as how many tasks have been completed, how many are still in progress, bytes of input, bytes of intermediate data, bytes of output, processing rates, and so on. The pages also contain links to the standard error and standard output files generated by each task. The user can use this data to predict how long the computation will take and whether more resources should be added. These pages can also be used to figure out when the computation is much slower than expected. In addition, the top-level status page shows which workers have failed and which map and reduce tasks they were processing when they failed. This information is useful when attempting to diagnose bugs in the user code.

4.9 Counters

The MapReduce library provides a counter facility to count occurrences of various events. For example, user code may want to count the total number of words processed or the number of German documents indexed.
To use this facility, user code creates a named counter object and then increments the counter appropriately in the map and/or reduce functions. For example:

Counter* uppercase;
uppercase = GetCounter("uppercase");

map(String name, String contents):
  for each word w in contents:
    if (IsCapitalized(w)):
      uppercase->Increment();
    EmitIntermediate(w, "1");

The counter values from individual worker machines are periodically propagated to the master (piggybacked on the ping response). The master aggregates the counter values from successful map and reduce tasks and returns them to the user code when the MapReduce operation is completed. The current counter values are also displayed on the master status page, so that a human can watch the progress of the live computation. When aggregating counter values, the master eliminates the effects of duplicate executions of the same map or reduce task to avoid double counting. (Duplicate executions can arise from our use of backup tasks and from re-execution of tasks due to failures.)

Some counter values are automatically maintained by the MapReduce library, such as the number of input key/value pairs processed and the number of output key/value pairs produced.
Users have found the counter facility useful for sanity checking the behavior of MapReduce operations. For example, in some MapReduce operations, the user code may want to ensure that the number of output pairs produced exactly equals the number of input pairs processed, or that the fraction of German documents processed is within some tolerable fraction of the total number of documents processed.
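As a small illustration of the write-then-atomically-rename pattern that Sections 3.3 and 4.5 rely on, the following C++ sketch writes an auxiliary output under a temporary name and renames it into place only once it is complete. It uses plain POSIX-style rename on a local path; the real system depends on the analogous atomic rename guarantee of its underlying distributed file system, and the function and file names here are hypothetical.

#include <cstdio>
#include <fstream>
#include <string>

// Write the auxiliary output under a temporary name, then atomically rename
// it into place so that readers never observe a partially written file.
bool WriteAuxiliaryFileAtomically(const std::string& final_path,
                                  const std::string& contents) {
  const std::string tmp_path = final_path + ".tmp";
  {
    std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out << contents;
    if (!out.flush()) return false;
  }
  // std::rename is atomic for paths on the same POSIX file system; the
  // MapReduce implementation relies on the same property from GFS.
  return std::rename(tmp_path.c_str(), final_path.c_str()) == 0;
}

int main() {
  return WriteAuxiliaryFileAtomically("aux-output.txt", "hello\n") ? 0 : 1;
}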
5. Performance

In this section, we measure the performance of MapReduce on two computations running on a large cluster of machines. One computation searches through approximately one terabyte of data looking for a particular pattern. The other computation sorts approximately one terabyte of data. These two programs are representative of a large subset of the real programs written by users of MapReduce: one class of programs shuffles data from one representation to another, and another class extracts a small amount of interesting data from a large data set.

5.1 Cluster Configuration

All of the programs were executed on a cluster that consisted of approximately 1800 machines. Each machine had two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE disks, and a gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth available at the root. All of the machines were in the same hosting facility, so the round-trip time between any pair of machines was less than a millisecond.

Out of the 4 GB of memory, approximately 1-1.5 GB was reserved by other tasks running on the cluster. The programs were executed on a weekend afternoon, when the CPUs, disks, and network were mostly idle.

5.2 Grep

The grep program scans through 10^10 100-byte records, searching for a relatively rare three-character pattern (the pattern occurs in 92,337 records). The input is split into approximately 64 MB pieces (M = 15000), and the entire output is placed in one file (R = 1).

Figure 2 shows the progress of the computation over time. The Y-axis shows the rate at which the input data is scanned. The rate gradually picks up as more machines are assigned to this MapReduce computation and peaks at over 30 GB/s when 1764 workers have been assigned. As the map tasks finish, the rate starts dropping and hits zero about 80 seconds into the computation. The entire computation takes approximately 150 seconds from start to finish. This includes about a minute of startup overhead. The overhead is due to the propagation of the program to all worker machines, and delays interacting with GFS to open the set of 1000 input files and to get the information needed for the locality optimization.

5.3 Sort

The sort program sorts 10^10 100-byte records (approximately one terabyte of data). This program is modeled after the TeraSort benchmark.

The sorting program consists of less than 50 lines of user code. A three-line map function extracts a 10-byte sorting key from a text line and emits the key and the original text line as the intermediate key/value pair. We used a built-in identity function as the reduce operator: it passes the intermediate key/value pairs through unchanged as the output key/value pairs. The final sorted output is written to a set of 2-way replicated GFS files (that is, 2 terabytes are written as the output of the program).

As before, the input data is split into 64 MB pieces (M = 15000). We partition the sorted output into 4000 files (R = 4000). The partitioning function uses the initial bytes of the key to segregate the data into one of R pieces.

Our partitioning function for this benchmark has built-in knowledge of the distribution of keys. In a general sorting program, we would add a pre-pass MapReduce operation that collects a sample of the keys and uses the distribution of the sampled keys to compute split points for the final sorting pass.

Figure 3 (a) shows the progress of a normal execution of the sort program. The top-left graph shows the rate at which input is read.
The rate peaks at about 13 GB/s and dies off fairly quickly since all map tasks finish before 200 seconds have elapsed. Note that the input rate is lower than for grep. This is because the sort map tasks spend about half their time and I/O bandwidth writing intermediate output to their local disks, whereas the corresponding intermediate output for grep has negligible size.

The middle-left graph shows the rate at which data is sent over the network from the map tasks to the reduce tasks. This shuffling starts as soon as the first map task completes. The first hump in the graph is for the first batch of approximately 1700 reduce tasks (the entire MapReduce was assigned about 1700 machines, and each machine executes at most one reduce task at a time). Roughly 300 seconds into the computation, some of these first-batch reduce tasks finish and we start shuffling data for the remaining reduce tasks. All of the shuffling is done about 600 seconds into the computation.

The bottom-left graph shows the rate at which sorted data is written to the final output files by the reduce tasks. There is a delay between the end of the first shuffling period and the start of the writing period because the machines are busy sorting the intermediate data. The writes continue at a rate of about 2-4 GB/s for a while, and all of the writes finish about 850 seconds into the computation. Including startup overhead, the entire computation takes 891 seconds. This is similar to the current best reported result of 1057 seconds for the TeraSort benchmark.

A few things to note: the input rate is higher than the shuffle rate and the output rate because of our locality optimization; most data is read from a local disk and bypasses our relatively bandwidth-constrained network. The shuffle rate is higher than the output rate because the output phase writes two copies of the sorted data (we make two replicas of the output for reliability and availability). We write two replicas because that is the mechanism for reliability and availability provided by our underlying file system. Network bandwidth requirements for writing data would be reduced if the underlying file system used erasure coding rather than replication.

5.4 Effect of Backup Tasks

Figure 3 (b) shows an execution of the sort program with backup tasks disabled. The execution flow is similar to that shown in Figure 3 (a), except that there is a very long tail during which hardly any write activity occurs. After 960 seconds, all except 5 of the reduce tasks are completed. However, these last few stragglers do not finish until 300 seconds later. The entire computation takes 1283 seconds, an increase of 44% in elapsed time.

5.5 Machine Failures

In Figure 3 (c), we show an execution of the sort program in which we intentionally killed 200 out of the 1746 worker processes several minutes into the computation. The underlying cluster scheduler immediately restarted new worker processes on these machines (since only the processes were killed, the machines were still functioning properly).

The worker deaths show up as a negative input rate, since some previously completed map work is lost (because the corresponding map workers were killed) and needs to be redone. The re-execution of this map work happens relatively quickly. The entire computation finishes in 933 seconds including startup overhead, just an increase of 5% over the normal execution time.
6. Experience

We wrote the first version of the MapReduce library in February of 2003 and made significant enhancements to it in August of 2003, including the locality optimization, dynamic load balancing of task execution across worker machines, and so on. Since that time, we have been pleasantly surprised at how broadly applicable the MapReduce library has been for the kinds of problems we work on. It is now used widely within Google, including for:

large-scale machine learning problems,
clustering problems for the Google News and Froogle products,
extraction of data used to produce reports of popular queries (for example, Google Zeitgeist),
extraction of properties of web pages for new experiments and products (for example, extraction of geographical locations from a large corpus of web pages for localized search), and
large-scale graph computations.

Figure 4 shows the significant growth in the number of separate MapReduce programs checked into our primary source code management system over time, from 0 in early 2003 to almost 900 separate instances as of September 2004. MapReduce has been so successful because it makes it possible to write a simple program and run it efficiently on a thousand machines in the course of half an hour, greatly speeding up the development and prototyping cycle. Furthermore, it allows programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily.

At the end of each job, the MapReduce library logs statistics about the computational resources used by the job. In Table 1, we show some statistics for a subset of MapReduce jobs run at Google in August 2004.

6.1 Large-Scale Indexing

One of our most significant uses of MapReduce to date has been a complete rewrite of the production indexing system that produces the data structures used for the Google web search service. The indexing system takes as input a large set of documents retrieved by our crawling system, stored as a set of GFS files. The raw contents of these documents are more than 20 terabytes of data. The indexing process runs as a sequence of roughly five to ten MapReduce operations. Using MapReduce (instead of the ad-hoc distributed passes in the prior version of the indexing system) has provided several benefits.

The indexing code is simpler, smaller, and easier to understand, because the code that deals with fault tolerance, distribution, and parallelization is hidden within the MapReduce library. For example, the size of one phase of the computation dropped from approximately 3800 lines of C++ code to approximately 700 lines when expressed using MapReduce.

The performance of the MapReduce library is good enough that we can keep conceptually unrelated computations separate, instead of mixing them together to avoid extra passes over the data. This makes it easy to change the indexing process. For example, one change that took a few months to make in our old indexing system took only a few days to implement in the new system.
The indexing process has become much easier to operate, because most of the problems caused by machine failures, slow machines, and networking hiccups are dealt with automatically by the MapReduce library without operator intervention. Furthermore, it is easy to improve the performance of the indexing process simply by adding new machines to the indexing cluster.
7. Related Work

Many systems have provided restricted programming models and used the restrictions to parallelize computations automatically. For example, an associative function can be computed over all prefixes of an N-element array in log N time on N processors using parallel prefix computations. MapReduce can be considered a simplification and distillation of some of these models based on our experience with large real-world computations. More significantly, we provide a fault-tolerant implementation that scales to thousands of processors. In contrast, most parallel processing systems have only been implemented on smaller scales and leave the details of handling machine failures to the programmer.

Bulk Synchronous Programming and some MPI primitives provide higher-level abstractions that make it easier for programmers to write parallel programs. A key difference between these systems and MapReduce is that MapReduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault tolerance.

Our locality optimization draws its inspiration from techniques such as active disks, where computation is pushed into processing elements that are close to local disks in order to reduce the amount of data sent across I/O subsystems or the network. We run on commodity processors to which a small number of disks are directly connected, instead of running directly on disk controller processors, but the general approach is similar.

Our backup task mechanism is similar to the eager scheduling mechanism employed in the Charlotte system. One shortcoming of simple eager scheduling is that if a given task causes repeated failures, the entire computation fails to complete. We fix some instances of this problem with our mechanism for skipping bad records.

The MapReduce implementation relies on an in-house cluster management system that is responsible for distributing and running user tasks on a large collection of shared machines. Though not the focus of this paper, the cluster management system is similar in spirit to systems such as Condor.

The sorting facility that is part of the MapReduce library is similar in operation to NOW-Sort. Source machines (map workers) partition the data to be sorted and send each partition to one of R reduce workers. Each reduce worker sorts its data locally (in memory if possible). Of course, NOW-Sort does not have the user-definable map and reduce functions that make our library widely applicable.

River provides a programming model in which processes communicate with each other by sending data over distributed queues. Like MapReduce, the River system tries to provide good average-case performance for a variety of applications, even in the presence of non-uniformities introduced by heterogeneous hardware or system perturbations. River achieves this by carefully scheduling disk and network transfers to balance task completion times. MapReduce takes a different approach. By restricting the programming model, the MapReduce framework is able to partition the problem into a large number of fine-grained tasks. These tasks are dynamically scheduled on available workers so that faster workers process more tasks. The restricted programming model also allows us to schedule redundant executions of tasks near the end of the job, which greatly reduces completion time in the presence of non-uniformities (such as slow or stuck workers).
BAD-FS has a programming model very different from MapReduce and, unlike MapReduce, is targeted at the execution of jobs across a wide-area network. However, there are two fundamental similarities: (1) both systems use redundant execution to recover from data loss caused by failures, and (2) both use locality-aware scheduling to reduce the amount of data sent across congested network links.

TACC is a system designed to simplify the construction of highly available networked services. Like MapReduce, it relies on re-execution as a mechanism for implementing fault tolerance.

8. Conclusions

The MapReduce programming model has been successfully used at Google for many different purposes. We attribute this success to several reasons. First, the model is easy to use, even for programmers without experience with parallel and distributed systems, because it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. Second, a large variety of problems are easily expressible as MapReduce computations. For example, MapReduce is used to generate data for Google's production web search service, for sorting, for data mining, for machine learning, and for many other systems. Third, we have developed an implementation of MapReduce that scales to large clusters comprising thousands of machines. The implementation makes efficient use of these machine resources and is therefore suitable for many of the large computational problems encountered at Google.

We have learned several things from this work. First, restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault tolerant. Second, network bandwidth is the scarce resource in our system. A number of optimizations are therefore aimed at reducing the amount of data sent across the network: the locality optimization allows us to read data from local disks, and writing a single copy of the intermediate data to local disk saves network bandwidth. Third, redundant execution can be used to reduce the impact of slow machines and to handle machine failures and data loss.

Acknowledgements

Josh Levenberg has been instrumental in revising and extending the user-level MapReduce API with a number of new features, based on his experience using MapReduce and other people's suggestions for enhancements. MapReduce reads its input from and writes its output to GFS. We would like to thank Mohit Aron, Howard Gobioff, Markus Gutschke, David Kramer, Shun-Tak Leung, and Josh Redstone for their work on GFS. We would also like to thank Percy Liang and Olcan Sercinoglu for their work on the cluster management system used by MapReduce. Mike Burrows, Wilson Hsieh, Josh Levenberg, Sharon Perl, Rob Pike, and Debby Wallach provided valuable comments on earlier drafts of this paper. The anonymous OSDI reviewers and our shepherd, Eric Brewer, provided many useful suggestions on how the paper could be improved. Finally, we thank all the users of MapReduce within Google's engineering organization for their helpful feedback, suggestions, and bug reports.

A. Word Frequency

This appendix contains a complete program that counts the number of occurrences of each unique word in a set of input files specified on the command line.
#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i]))
        i++;
      // Find the end of the word
      int start = i;
      while ((i < n) && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);

// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit the sum for this input key
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);

  MapReduceSpecification spec;

  // Store the list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

  // Specify the output files:
  //   /gfs/test/freq-00000-of-00100
  //   /gfs/test/freq-00001-of-00100
  //   ...
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");

  // Optional: do partial sums within map tasks to save network bandwidth
  out->set_combiner_class("Adder");

  // Tuning parameters: use at most 2000 machines and 100 MB of memory per task
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();

  // Done: the 'result' structure contains information about counters,
  // time taken, number of machines used, etc.
  return 0;
}
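As the argument loop in main shows, the program expects one or more input file patterns on the command line and registers each as a text-format input; the word counts are then written to the 100 output files named /gfs/test/freq-000NN-of-00100. The program must be linked against the MapReduce library; the exact build and invocation procedure is internal to Google and is not described in the paper.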