Reprinted: http://www.ibm.com/developerworks/cn/opensource/os-riak1/index.html
Introduction
Typical modern relational databases perform poorly in some types of applications and cannot meet the performance and scalability requirements of today's Internet applications, so different approaches are needed. In the past few years, a new type of data store, commonly referred to as NoSQL, has become very popular because it directly addresses some of the shortcomings of relational databases. Riak is one such data store.
Riak is not the only NoSQL data store. Two other popular data stores are MongoDB and Cassandra. Although similar in many ways, they also differ in significant ways. For example, Riak is a distributed system, while MongoDB is a single-system database; that is, Riak has no concept of a master node, which makes it more resilient to failures. Cassandra, although also based on Amazon's Dynamo paper, does not use vector clocks and consistent hashing to organize its data. Riak's data model is also more flexible: in Riak, buckets are created dynamically, the first time they are accessed. Cassandra's data model is defined in an XML file, so the whole cluster must be restarted after the data model is modified.
Another advantage of Riak is that it is written in Erlang, whereas MongoDB and Cassandra are written in general-purpose languages (C++ and Java respectively). Erlang was designed from the start to support distributed, fault-tolerant applications, which makes it well suited to developing applications such as NoSQL data stores that share those characteristics.
Map/reduce jobs in Riak can be written only in Erlang or JavaScript. For this article we chose to write the map and reduce functions in JavaScript, but they could be written in Erlang as well. Erlang code may execute faster, but JavaScript has a wider audience. See References for more information about Erlang.
Getting started
If you want to try the examples in this article, you need to install Riak and Erlang on your system (see References).
You also need to build a cluster of three nodes running on your local machine. All data stored in Riak is replicated to a number of nodes in the cluster. A property (n_val) of the bucket the data is stored in determines how many nodes it is replicated to. The default value of this property is 3, so to make the example work we need a cluster with at least three nodes (beyond that, you can create as many as you like).
After the source code is downloaded, you need to build it. The basic steps are as follows:
- Decompress the source code:
$ tar xzvf riak-1.0.1.tar.gz
- Change into the directory:
$ cd riak-1.0.1
- Build:
$ make all rel
This builds Riak in ./rel/riak. To run several nodes locally, you need to make a copy of ./rel/riak for each additional node. Copy ./rel/riak to ./rel/riak2 and ./rel/riak3, and then make the following changes to each copy:
- In riakN/etc/app.config, change the following values: the port specified in the http{} section, handoff_port, and pb_port all need to be unique.
- Open riakN/etc/vm.args and change the node name, again to a unique value, for example:
-name riak2@127.0.0.1
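For illustration only, here is roughly what the changed values in riak2/etc/app.config might look like. The port numbers below are arbitrary examples (any free, unique ports will do), and the real file contains many other settings that stay untouched:

```erlang
%% riak2/etc/app.config (fragment; illustrative port values)
{riak_core, [
    %% ...
    {http, [ {"127.0.0.1", 8198 } ]},  %% default node uses 8098
    {handoff_port, 8199 }              %% default node uses 8099
    %% ...
]},
{riak_kv, [
    %% ...
    {pb_port, 8187 }                   %% default node uses 8087
    %% ...
]}
```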
Start each node in turn, as shown in Listing 1.
Listing 1. Starting each node
$ cd rel
$ ./riak/bin/riak start
$ ./riak2/bin/riak start
$ ./riak3/bin/riak start
Finally, join the nodes together to form a cluster, as shown in Listing 2.
Listing 2. Forming a cluster
$ ./riak2/bin/riak-admin join riak@127.0.0.1
$ ./riak3/bin/riak-admin join riak@127.0.0.1
You now have a three-node cluster running locally. To verify this, run the following command:

$ ./riak/bin/riak-admin status | grep ring_members

You should see that each node is part of the cluster you just created:

ring_members : ['riak2@127.0.0.1','riak3@127.0.0.1','riak@127.0.0.1']
Riak API
There are currently three ways to access Riak: an HTTP API (a RESTful interface), Protocol Buffers, and a native Erlang interface. Having multiple interfaces gives you a choice of how to integrate your application. If your application is written in Erlang, the native Erlang interface is the natural choice, as the two integrate tightly. Other factors, such as performance, also influence the choice of interface. For example, clients that use the Protocol Buffers interface perform somewhat better than clients that use the HTTP API: less data goes over the wire, and the relative overhead of all those HTTP headers is higher. However, the advantage of the HTTP API is that most developers today (web developers in particular) are very familiar with RESTful interfaces, and most programming languages have built-in primitives for requesting resources over HTTP (for example, opening a URL), so no additional software is required. In this article, we will focus on the HTTP API.
All the examples use curl to interact with Riak through the HTTP interface; this is done to better understand the underlying API. Many languages have a wide choice of client libraries available, and these should be considered when developing applications that use Riak as a data store. The client libraries provide APIs for connecting to Riak and integrate easily with applications; with curl, by contrast, you have to handle the responses yourself.
The API supports the usual HTTP methods: GET, PUT, POST, and DELETE, which are used to retrieve, update, create, and delete objects respectively. We will look at each in turn.
Storing objects
You can think of Riak as a distributed mapping from keys (strings) to values (objects). Riak stores values in buckets. You do not need to create a bucket explicitly before saving an object into it; if you save an object to a bucket that does not exist, the bucket is created automatically.
A bucket is a virtual concept in Riak, used mainly to group related objects. Buckets also have properties whose values define how Riak handles the objects stored in them. Examples of bucket properties include:
- n_val: the number of times an object is replicated within the cluster
- allow_mult: whether concurrent updates to an object are allowed
You can view a bucket's properties (and their current values) by sending a GET request to the bucket's URL.
To store an object, we execute an HTTP POST against one of the URLs shown in Listing 3.
Listing 3. Storing objects
POST -> /riak/<bucket>          (1)
POST -> /riak/<bucket>/<key>    (2)
Keys can either be allocated automatically by Riak (1) or defined by the user (2).
You can also create an object by executing an HTTP PUT operation.
Recent versions of Riak also support URLs of the form /buckets/<bucket>/keys/<key>, but in this article we use the older format to remain backward-compatible with earlier Riak versions.
If no key is specified, Riak assigns one to the object automatically. For example, let's store a plain-text object in the bucket "foo" without explicitly specifying a key (see Listing 4).
Listing 4. Storing a plain-text object without explicitly specifying a key
$ curl -i -H "Content-Type: text/plain" -d "Some text" \
  http://localhost:8098/riak/foo/

HTTP/1.1 201 Created
Vary: Accept-Encoding
Location: /riak/foo/3vbskqUuCdtLZjX5hx2JHKD2FTK
Content-Type: text/plain
Content-Length: ...
By examining the Location header, you can see the key Riak assigned to the object. It is not very memorable, so the other option is to supply the key ourselves. Let's create an artists bucket and add an artist named Bruce (see Listing 5).
Listing 5. Creating an artists bucket and adding an artist
$ curl -i -d '{"name":"Bruce"}' -H "Content-Type: application/json" \
  http://localhost:8098/riak/artists/Bruce

HTTP/1.1 204 No Content
Vary: Accept-Encoding
Content-Type: application/json
Content-Length: ...
If the object is stored successfully under the key we specified, we get a 204 No Content response from the server.
In this example, the object's value is JSON, but it could just as well be plain text or any other format. When storing an object, it is important to set the Content-Type header correctly: for example, to store a JPEG image, set the content type to image/jpeg.
Retrieving objects
To retrieve a stored object, execute a GET on the resource. If the object exists, it is returned in the response body; otherwise, the server returns a 404 Object Not Found response (see Listing 6).
Listing 6. Executing the GET method
$ curl http://localhost:8098/riak/artists/Bruce

HTTP/1.1 200 OK
...
{ "name" : "Bruce" }
Updating objects
When updating an object, set the Content-Type header just as when storing one. For example, let's add Bruce's nickname, as shown in Listing 7.
Listing 7. Adding Bruce's nickname
$ curl -i -X PUT -d '{"name":"Bruce", "nickname":"The Boss"}' \
  -H "Content-Type: application/json" http://localhost:8098/riak/artists/Bruce
As mentioned earlier, Riak creates buckets automatically, and those buckets have properties; one of them, allow_mult, determines whether concurrent writes are allowed. By default it is set to false. If concurrent updates are allowed, however, then an X-Riak-Vclock header must be sent with each update, set to the value the client saw when it last read the object.
Riak uses vector clocks to determine the causal ordering of modifications to objects. How vector clocks work is beyond the scope of this article, but be aware that when concurrent writes are allowed, conflicts can occur, and your application will need to resolve them (see References).
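To give a flavor of the idea, here is a minimal sketch of vector-clock comparison in plain JavaScript. This is an illustration of the concept only, not Riak's internal representation or API; the function names and actor ids are our own:

```javascript
// Each clock maps an actor id to an update counter. Clock a "descends from"
// clock b if, for every actor in b, a's counter is at least as large.
function descends(a, b) {
  return Object.keys(b).every(function (actor) {
    return (a[actor] || 0) >= b[actor];
  });
}

// Two clocks where neither descends from the other represent concurrent
// updates: a conflict the application has to resolve.
function compare(a, b) {
  if (descends(a, b) && descends(b, a)) return "equal";
  if (descends(a, b)) return "a-newer";
  if (descends(b, a)) return "b-newer";
  return "concurrent";
}

console.log(compare({ x: 2, y: 1 }, { x: 1, y: 1 })); // a-newer
console.log(compare({ x: 2 }, { y: 1 }));             // concurrent
```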
Deleting objects
Deleting an object follows a pattern similar to the previous commands: we simply execute an HTTP DELETE on the resource:

$ curl -i -X DELETE http://localhost:8098/riak/artists/Bruce
If the object is deleted successfully, we get a 204 No Content response from the server; if the object does not exist, the server returns a 404 Object Not Found response.
Links
So far we have seen how to store objects by associating an object with a particular key, which can later be used to retrieve the object. It would be useful if we could extend this simple model to express how (and whether) objects are related to one another. We can, and Riak achieves this with links.
So what is a link? A link allows you to create relationships between objects. If you are familiar with UML class diagrams, you can think of a link as an association between objects, with a tag describing the relationship. In a relational database, the same relationship would be expressed with a foreign key.
A link is "attached" to an object via the Link header. Below is an example of what a link header looks like. The link target (the object we are linking to) is the part in angle brackets; the nature of the link ("performer" in this example) is expressed with the riaktag attribute:

Link: </riak/artists/Bruce>; riaktag="performer"
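As a small illustration of the header's structure, the following sketch pulls the bucket, key, and tag out of a link header value. It assumes the simple single-link form shown above (not the full header grammar), and the function name is our own:

```javascript
// Parse a single Riak Link header value of the form
//   </riak/<bucket>/<key>>; riaktag="<tag>"
// Returns null if the value does not match that shape.
function parseLink(header) {
  var m = /^<\/riak\/([^/>]+)\/([^>]+)>;\s*riaktag="([^"]+)"$/.exec(header);
  if (!m) return null;
  return { bucket: m[1], key: m[2], tag: m[3] };
}

console.log(parseLink('</riak/artists/Bruce>; riaktag="performer"'));
// { bucket: 'artists', key: 'Bruce', tag: 'performer' }
```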
Now let's add some albums and relate them to Bruce, the artist who performed them (see Listing 8).
Listing 8. Adding some albums
$ curl -H "Content-Type: text/plain" \
  -H 'Link: </riak/artists/Bruce>; riaktag="performer"' \
  -d "The River" http://localhost:8098/riak/albums/TheRiver

$ curl -H "Content-Type: text/plain" \
  -H 'Link: </riak/artists/Bruce>; riaktag="performer"' \
  -d "Born To Run" http://localhost:8098/riak/albums/BornToRun
Now that some relationships are in place, we can query them using link walking. Link walking is the process by which relationships between objects are queried. For example, to find the artist who performed the album The River, you would do this:

$ curl -i http://localhost:8098/riak/albums/TheRiver/artists,performer,1
The part at the end of the URL is the link specification; this is what link walking queries look like. The first part (artists) specifies the bucket we want to restrict the query to; the second part (performer) specifies the tag we want to filter the results with; and the last part (1) indicates that we want to include the results of this phase of the query.
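The URL scheme just described can be captured in a small helper function. This is purely illustrative; the function and parameter names are our own, and "_" is the wildcard discussed below:

```javascript
// Build a link-walking URL from a list of {bucket, tag, keep} phases.
// Missing bucket or tag becomes the "_" wildcard; keep is rendered as 1 or 0.
function linkWalkUrl(bucket, key, phases) {
  var path = phases.map(function (p) {
    return [p.bucket || "_", p.tag || "_", p.keep ? 1 : 0].join(",");
  }).join("/");
  return "/riak/" + bucket + "/" + key + "/" + path;
}

console.log(linkWalkUrl("albums", "TheRiver",
  [{ bucket: "artists", tag: "performer", keep: true }]));
// /riak/albums/TheRiver/artists,performer,1
```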
Queries can also be transitive. Suppose we have relationships between albums and artists as shown in Figure 1.
Figure 1. Relationships between albums and artists
By executing the following command, we can pose a query like "which artists have collaborated with the artist who performed The River?":

$ curl -i http://localhost:8098/riak/albums/TheRiver/artists,_,0/artists,collaborator,1

The underscore in the link specification acts like a wildcard, indicating that we do not care what the relationship is.
Running map/reduce queries
Map/reduce is a framework, popularized by Google, for running distributed computations in parallel over large data sets. Riak also supports map/reduce, which permits more powerful queries by running functions on the data where it lives in the cluster.
A map/reduce job consists of a map phase and a reduce phase. The map phase is applied to pieces of the data and produces zero or more results; this is analogous to mapping a function over each element of a list. The map phases run in parallel. The reduce phase takes all the results of the map phases and combines them.
For example, consider counting how often each word occurs across a large set of documents. Each map phase computes the counts for the words in a particular document. These intermediate counts are then sent to the reduce function, which totals them, yielding the counts across all the documents. See References for a link to Google's map/reduce paper.
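The word-count example can be sketched locally in plain JavaScript. This mirrors the shape of a map/reduce job (map emits partial results per document, reduce merges them) but runs outside Riak; the function names and sample strings are our own:

```javascript
// Map: count occurrences of each word in one document.
function map(doc) {
  var counts = {};
  doc.toLowerCase().split(/\W+/).filter(Boolean).forEach(function (w) {
    counts[w] = (counts[w] || 0) + 1;
  });
  return [counts];
}

// Reduce: merge the partial counts from all map phases into totals.
function reduce(partials) {
  return [partials.reduce(function (total, counts) {
    Object.keys(counts).forEach(function (w) {
      total[w] = (total[w] || 0) + counts[w];
    });
    return total;
  }, {})];
}

var results = reduce([].concat(map("to be or not to be"), map("be brief")));
console.log(results[0]); // { to: 2, be: 3, or: 1, not: 1, brief: 1 }
```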
Example: distributed grep
For this article, we will develop a map/reduce job that performs a distributed grep over a set of documents stored in Riak. As with grep, the final output is a set of lines that match the supplied pattern. In addition, each result indicates the line number in the document where the match occurred.
To execute a map/reduce query, we POST to the /mapred resource. The body of the request is a JSON representation of the query. As in the earlier examples, the Content-Type header must be supplied, and it must always be application/json. Listing 9 shows the query we will execute to perform the distributed grep. We will look at each part of the query in turn.
Listing 9. Example map/reduce query
{
  "inputs": [["documents","s1"], ["documents","s2"]],
  "query": [
    { "map": {
        "language": "javascript",
        "name": "GrepUtils.map",
        "keep": true,
        "arg": "[s|S]herlock" } },
    { "reduce": {
        "language": "javascript",
        "name": "GrepUtils.reduce" } }
  ]
}
Each query consists of a set of inputs, that is, the documents we want to run the computation over, along with the names of the functions to run in the map and reduce phases. Instead of naming the functions, you can also include the source of the map and reduce functions directly by using a source property in place of name, but we did not do that in this example. To use named functions, however, you need to change the default Riak configuration. Save the code in Listing 10 to a file in some directory. Then, for each node in the cluster, open the file etc/app.config and set the property js_source_dir to the directory where you saved the code. Restart every node in the cluster for the change to take effect.
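As a sketch, the relevant entry in each node's etc/app.config would look something like the following (the directory path here is purely illustrative; use wherever you saved the file):

```erlang
%% etc/app.config (fragment)
{riak_kv, [
    %% ...
    %% Point js_source_dir at the directory containing GrepUtils.js
    {js_source_dir, "/var/riak/js_source"}
    %% ...
]}
```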
The code in Listing 10 contains the functions that will be executed in the map and reduce phases. The map function looks at each line of a document and checks whether it matches the supplied pattern (the arg parameter). The reduce function in this example does not do very much: it behaves like an identity function and simply returns its input.
Listing 10. GrepUtils.js
var GrepUtils = {
  map: function (v, k, arg) {
    var i, len, lines, r = [], re = new RegExp(arg);
    lines = v.values[0].data.split(/\r?\n/);
    for (i = 0, len = lines.length; i < len; i += 1) {
      var match = re.exec(lines[i]);
      if (match) {
        r.push((i + 1) + ". " + lines[i]);
      }
    }
    return r;
  },
  reduce: function (v) {
    return [v];
  }
};
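Because the map function is plain JavaScript, it can be smoke-tested locally (for example, in Node.js) before deploying it to the cluster, by passing a mock of the object structure Riak hands to map functions: the stored object's data is exposed as v.values[0].data. The mock text below is our own:

```javascript
// The map function from GrepUtils.js, exercised against a mock Riak object.
var GrepUtils = {
  map: function (v, k, arg) {
    var i, len, lines, r = [], re = new RegExp(arg);
    lines = v.values[0].data.split(/\r?\n/);
    for (i = 0, len = lines.length; i < len; i += 1) {
      if (re.exec(lines[i])) {
        r.push((i + 1) + ". " + lines[i]);
      }
    }
    return r;
  }
};

// Mock of the value object Riak passes to a map function.
var mock = { values: [{ data: "Sherlock spoke.\nNo match here.\nsherlock again." }] };
console.log(GrepUtils.map(mock, null, "[s|S]herlock"));
// [ '1. Sherlock spoke.', '3. sherlock again.' ]
```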
Before we can run the query, we need some data. I downloaded a couple of Sherlock Holmes e-books from the Project Gutenberg web site (see References). The first text is stored in the "documents" bucket under the key "s1"; the second is in the same bucket under the key "s2".
Listing 11 shows how to upload such documents to Riak.
Listing 11. Uploading documents to Riak
$ curl -i -X POST http://localhost:8098/riak/documents/s1 \
  -H "Content-Type: text/plain" --data-binary @s1.txt
Once the documents have been uploaded, we can search them. In this example, we want to output all lines that match the regular expression "[s|S]herlock" (see Listing 12).
Listing 12. Searching the documents
$ curl -X POST -H "Content-Type: application/json" \
  http://localhost:8098/mapred --data @-<<\EOF
{
  "inputs": [["documents","s1"], ["documents","s2"]],
  "query": [
    { "map": {
        "language": "javascript",
        "name": "GrepUtils.map",
        "keep": true,
        "arg": "[s|S]herlock" } },
    { "reduce": {
        "language": "javascript",
        "name": "GrepUtils.reduce" } }
  ]
}
EOF
The arg property of the query contains the pattern we want to grep the documents for; that value is passed to the map function as its arg parameter.
Listing 13 shows the output of running a map/reduce job on the sample data.
Listing 13. Sample output of running the map/reduce job
[["1. Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle",
  "9. Title: The Adventures of Sherlock Holmes",
  "62. To Sherlock Holmes she is always THE woman. I have seldom heard",
  "819. as I had pictured it from Sherlock Holmes' succinct description,",
  "1017. \"Good-night, Mister Sherlock Holmes.\"",
  "1034. \"You have really got it!\" he cried, grasping Sherlock Holmes by"
  …]]
Streaming map/reduce
In this final section on map/reduce, we take a brief look at Riak's streaming map/reduce feature. It is useful for jobs whose map phases take some time to complete, because streaming the results lets you access the result of each map phase as soon as it is produced, before the reduce phase runs.
We can apply this to the distributed grep query. As noted, the reduce step in that example does not do much real work; in fact, we can drop the reduce phase entirely and simply send the results of each map phase straight to the client. To accomplish this, we modify the query to remove the reduce step and append ?chunked=true to the URL, indicating that we want the results streamed back (see Listing 14).
Listing 14. Modifying the query to stream the results
$ curl -X POST -H "Content-Type: application/json" \
  http://localhost:8098/mapred?chunked=true --data @-<<\EOF
{
  "inputs": [["documents","s1"], ["documents","s2"]],
  "query": [
    { "map": {
        "language": "javascript",
        "name": "GrepUtils.map",
        "keep": true,
        "arg": "[s|S]herlock" } }
  ]
}
EOF
The results of each map phase (in this example, the lines matching the query string) are now returned to the client as each phase completes. This approach is useful for applications that can process intermediate results as soon as they become available.
Conclusion
Riak is an open source, highly scalable key-value store based on the principles described in Amazon's Dynamo paper. Riak is easy to deploy and to scale: additional nodes can be added to a cluster seamlessly. Features such as link walking and support for map/reduce allow more complex queries. In addition to the HTTP API, Riak also provides a native Erlang API and support for Protocol Buffers. In the second part of this series, we will explore some of the client libraries available in various languages and show how Riak can be used as a highly scalable cache.
References
Learning
- See Basic cluster setup and Building a development environment for details on setting up a three-node cluster.
- Read Google's paper MapReduce: Simplified Data Processing on Large Clusters.
- Erlang programming introduction, Part 1 (Martin Brown, developerWorks): compares Erlang's functional programming style with other paradigms, such as imperative, procedural, and object-oriented programming.
- Reading the Amazon Dynamo paper, which covers the fundamentals underlying Riak, is strongly recommended.
- Read the article How to analyze Apache logs to learn how to use Riak to process your server logs.
- Learn about vector clocks, and why they are easier to understand than you might think.
- Find an excellent introduction to vector clocks on the Riak wiki, along with more information about link walking.
- If you need some text to test with, the Project Gutenberg site is a good source.
Obtain products and technologies
- Download the Riak from basho.com.
- Download the Erlang programming language.