Introduction to Skynet, a Ruby implementation of Google's Map/Reduce framework
Skynet is a rather intimidating name: it is the supercomputer network that rules over humanity in Terminator, the classic film starring Arnold Schwarzenegger. The Skynet in this article is far less scary, though. It is simply the name of a Ruby implementation of Google's Map/Reduce framework.
Google's Map/Reduce framework is famous. It can split a task into many parts, hand them to N machines for parallel execution, then merge the returned results, again in parallel, to produce the final answer. Google reportedly maps a single search query across some 7,000 servers for parallel execution; that is a formidable amount of distributed computing power. With Map/Reduce, programmers can write robust, parallel, distributed applications with very little code and without worrying about the distributed plumbing, while still making full use of a cluster's computing capacity.
Several frameworks already implement the Map/Reduce algorithm; the best known is Hadoop, the open-source project backed by Yahoo. Hadoop, however, is written in Java, not Ruby. In the Ruby world, Adam Pisoni has developed a Ruby Map/Reduce framework of his own: Skynet.
Adam Pisoni built Skynet because his company, Geni.com, runs a family-tree social networking site. The site's news push feature needs to sift through the content generated by a huge number of users, extract the items a particular user cares about, and push them to that user. This is really a distributed computing problem: tasks must be farmed out to many servers for execution and the results merged back together at the end. Pisoni could not find a suitable framework, so he built Skynet himself, using the Map/Reduce algorithm to provide the distributed computing platform.
Developing a Map/Reduce distributed application with Skynet is very simple. Take a basic example: suppose we have a 1 GB text file and our task is to count how many times each word appears in it. The traditional approach is trivial: read the file sequentially and tally the words, but execution will undoubtedly be slow. If we have a computing cluster of 1,000 servers, how can we use Skynet to run this job concurrently and cut the counting time?
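For reference, the single-machine version of the word count is just a few lines of Ruby; the file name here is only a placeholder:
# Naive sequential word count: one process reads the whole file and tallies
# every word in a hash. Correct, but slow for a 1 GB file.
counts = Hash.new(0)
File.foreach("words.txt") do |line|   # "words.txt" stands in for the real file
  line.split.each { |word| counts[word] += 1 }
end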
The Map/Reduce algorithm process is as follows:
1. Partition
The data is split into 1,000 parts; Skynet does this automatically.
2. Map
Besides partitioning the data, the code that processes it must also be mapped out to each computing node for concurrent execution. Each of the 1,000 nodes runs its own task and returns its result when it finishes.
3. Partition
The execution results from these 1,000 nodes now need to be merged, so the data is partitioned again, say into 10 parts; Skynet also does this automatically.
4. Reduce
The Reduce code and the re-partitioned data are distributed to 10 nodes for execution, and each node returns its data when done. If another Reduce pass is needed, it can be run again, until everything is finally reduced to a single combined result. A toy illustration of the data at each stage follows this list.
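To make those four phases concrete, here is what the data might look like at each stage for a toy input; the values are hand-written for illustration, and the class that actually implements the map and reduce steps appears below:
# Illustrative data shapes only, for a toy six-word input.
input       = %w[to be or not to be]
partitioned = [%w[to be or], %w[not to be]]                                # 1. partition
mapped      = [{"to"=>1, "be"=>1, "or"=>1}, {"not"=>1, "to"=>1, "be"=>1}]  # 2. map
reduced     = {"to"=>2, "be"=>2, "or"=>1, "not"=>1}                        # 3-4. partition again and reduce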
The principle behind Map/Reduce is really quite simple. So how do we implement it with Skynet? The code we need to write consists of only two methods: a map method that tells Skynet how to process each piece of data, and a reduce method that tells Skynet how to merge the pieces back together. Written with Skynet, this parallel algorithm is therefore very short:
class MapreduceTest
  include SkynetDebugger

  # Map: count how many times each word appears in this partition of the data.
  def self.map(datas)
    results = {}
    datas.each do |data|
      results[data] ||= 0
      results[data] += 1
    end
    [results]
  end

  # Reduce: merge the per-partition word counts into a single hash.
  def self.reduce(datas)
    results = {}
    datas.each do |hashes|
      hashes.each do |key, value|
        results[key] ||= 0
        results[key] += value
      end
    end
    results
  end
end
This is the simplest complete Ruby Map/Reduce program. We write a map method that tells Skynet how to count word occurrences, and a reduce method that tells Skynet how to merge the per-map results. Everything else is taken care of by Skynet. Easy, isn't it?
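The two class methods can also be exercised locally, without a Skynet cluster, which is a handy way to see how the map output feeds into reduce; the word lists below are made up:
# Run the map step on two hand-made partitions, then reduce the results.
chunks = [%w[ruby skynet ruby], %w[map reduce map]]
mapped = chunks.map { |chunk| MapreduceTest.map(chunk) }.flatten(1)
# mapped => [{"ruby"=>2, "skynet"=>1}, {"map"=>2, "reduce"=>1}]
totals = MapreduceTest.reduce(mapped)
# totals => {"ruby"=>2, "skynet"=>1, "map"=>2, "reduce"=>1}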
Of course, some setup is still needed before this Map/Reduce job can run, such as installing Skynet and configuring its worker nodes. For those details, refer to Skynet's own documentation: http://skynet.rubyforge.org/doc/index.html.
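Submitting the job to a running Skynet cluster then looks roughly like the following. This is only a sketch modelled on the examples in the documentation linked above, so the option names and values should be checked against it rather than taken as authoritative:
# Hypothetical job submission: spread map_data across 1,000 mappers and
# merge on 10 reducers, using the MapreduceTest class defined above.
job = Skynet::Job.new(
  :map_reduce_class => MapreduceTest,
  :map_data         => File.read("words.txt").split,  # "words.txt" is a placeholder
  :mappers          => 1000,
  :reducers         => 10
)
results = job.run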
It is worth mentioning that Skynet integrates nicely with the Rails framework. You can offload time-consuming Map/Reduce jobs from Rails to Skynet for asynchronous execution in the background. For example:
MyModel.distributed_find(:all, :conditions => "created_on < '#{3.days.ago}'").each(:some_method)
This hands some_method, a time-consuming operation to be run on every model record created more than three days ago, over to Skynet, which executes it on its powerful computing network.
Individual method calls can also be executed asynchronously:
model_object.send_later(:method, options, :save)
This hands the time-consuming call over to Skynet for asynchronous execution.
For websites that have a large computing network and a lot of heavy, time-consuming computation to do, Skynet is a great tool: it lets programmers write robust, efficient distributed applications with ease!