Working with text is a common use of the MapReduce process, because text processing is relatively complex and processor-intensive. The basic word count is often used to demonstrate Hadoop's ability to process large amounts of text and produce basic summary content.
To get the number of words, split the text from an input file into individual words (using a basic string tokenizer), emit each word from the Map with a count of 1, and use the Reduce to total the counts for each word. For example, from the phrase The quick brown fox jumps over the lazy dog, the Map phase generates the output in Listing 1.
Listing 1. Map Phase Output
the, 1
quick, 1
brown, 1
fox, 1
jumps, 1
over, 1
the, 1
lazy, 1
dog, 1
The Reduce phase then aggregates the number of occurrences of each unique word, producing the output shown in Listing 2.
Listing 2. The output of the Reduce phase
the, 2
quick, 1
brown, 1
fox, 1
jumps, 1
over, 1
lazy, 1
dog, 1
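For reference, a minimal mapper and reducer for this word count might look like the following sketch. It uses the standard org.apache.hadoop.mapreduce API; the class names and the surrounding job setup are illustrative rather than taken from the article.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the line into words and emit (word, 1) for each one
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts emitted for each unique word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}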
Although this approach is useful for basic word counts, you often want to recognize the presence of important words or phrases instead. For example, consider the reviews that Amazon customers leave on different films and videos.
Using the Stanford University large dataset collection, you can download the movie review data. The data contains the original reviews along with their rating and helpfulness scores (as reported on Amazon), as shown in Listing 3.
Listing 3. Movie review data
product/productId: B003AI2VGA
review/userId: A3QYDL5CDNYN66
review/profileName: Abra "a devoted reader"
review/helpfulness: 0/0
review/score: 2.0
review/time: 1229040000
review/summary: Pretty pointless fictionalization
review/text: The murders in Juarez are real. This movie is a badly acted fantasy of revenge and Holy. If there is a good movie about Juarez, I don't know what it is, but it isn't this one.
Note that although the reviewer gave a score of 2 (where 1 is the worst and 5 is the best), the review text describes the film very negatively. We need a confidence score to understand whether the given rating matches the actual review.
Many tools are available to perform advanced heuristic analysis, but basic processing can be implemented using a simple index or regular expression. Then we can count the positive and negative regular expression matches to get a score for a movie.
Figure 1. Counting positive and negative regular expression matches to derive a score for a movie
In the Map phase, we count the occurrences of individual words or phrases in the movie review, producing a single count that combines the positive and negative matches. The Map operation computes this score for each product review, and the Reduce operation then sums the scores by product ID to provide an overall positive or negative rating. The Map function is therefore similar to Listing 4.
Listing 4. A Map function that provides a single count for positive and negative comments
// List of positive words/phrases
static String[] pwords = { "OK", "excellent", "brilliant movie" };
// List of negative words/phrases
static String[] nwords = { "awful", "bad", "unwatchable" };

int count = 0;
for (String word : pwords) {
    String regex = "\\b" + word + "\\b";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    while (m.find()) {
        count++;
    }
}
for (String word : nwords) {
    String regex = "\\b" + word + "\\b";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    while (m.find()) {
        count--;
    }
}
output.collect(productId, count);
The Reduce function can then be a traditional sum of the counts.
Listing 5. Reduce function that sums positive and negative comments by product ID
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
The result is the confidence score of the comment. You can extend the word list to contain the phrases you want to match.
Reading and writing JSON data
JSON has become a useful data interchange format. Its practicality stems, in part, from its simple nature and structure, as well as the ease of parsing in so many languages and environments.
The most common format for incoming JSON data is one JSON record per input line.
Listing 6. One JSON record per input line
{"ProductId": "B003AI2VGA", "Score": 2.0, "text": "" "{productId": "B007bi4dat", "score": 3.4, "text": "" {" ProductId ":" B006ai2fdh "," Score ": 4.1," text ":" ""}
This data can be easily parsed by converting each incoming string to a JSON object using an appropriate class, such as Gson. When you use Gson for this, you need to deserialize into a predefined class.
Listing 7. Deserializing into a predefined class
class AmazonRank {
    private String productId;
    private float score;
    private String text;

    AmazonRank() {}
}
Then parse the incoming text as shown in Listing 8.
Listing 8. Parsing incoming text
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    try {
        AmazonRank rank = gson.fromJson(value.toString(), AmazonRank.class);
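The fragment above ends inside the try block. A fuller version of the mapper might look like the sketch below, assuming AmazonRank is declared as a nested class of the same job class (so its private fields are visible here) and that we want to emit the product ID and score; the Gson instance and the error handling are assumptions, not part of the original listing.

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import com.google.gson.Gson;
import com.google.gson.JsonSyntaxException;

// Nested inside the job class, alongside the AmazonRank class shown in Listing 7
public static class JsonMapper extends Mapper<Object, Text, Text, FloatWritable> {
    private final Gson gson = new Gson();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            AmazonRank rank = gson.fromJson(value.toString(), AmazonRank.class);
            // Emit the product ID and its score for aggregation in the Reduce phase
            context.write(new Text(rank.productId), new FloatWritable(rank.score));
        } catch (JsonSyntaxException e) {
            // Skip lines that are not valid JSON records
        }
    }
}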
To write JSON data, you can do the reverse: create an output class that matches the JSON output you want within the MapReduce definition, and then use the Gson class to convert an instance of that structure to its JSON representation.
Listing 9. Writing JSON Data
class RecipeRecord {
    private String recipe;
    private String recipeText;
    private int recipeId;
    private float calories;
    private float fat;
    private float weight;

    RecipeRecord() {}
}
You can now populate an instance of an object in the output and convert it to a single JSON record.
Listing 10. Populating an instance of an object during output
RecipeRecord recipe = new RecipeRecord();
recipe.recipeId = Integer.parseInt(key.toString());
recipe.calories = sum;

Gson json = new Gson();
output.collect(key, new Text(json.toJson(recipe)));
If you want to use a third-party library in a Hadoop processing job, be sure to include the library JAR file with the MapReduce code, for example: $ jar -cvf recipenutrition.jar -C recipenutrition/* google-gson/gson.jar.
Although it runs outside the Hadoop MapReduce processor, another alternative is to use JAQL, which can parse and process JSON data directly.
Merging datasets
Three types of merges are typically performed in a MapReduce job:
Combining the contents of multiple files with the same structure.
Combining the contents of multiple files with similar structures that you want to unify.
Joining data from multiple sources on a specific ID or keyword.
The first option is best handled outside a typical MapReduce job, because it can be done with the Hadoop Distributed File System (HDFS) getmerge operation or a similar operation. This operation takes a directory as input and merges its contents into a single specified output file. For example, $ hadoop fs -getmerge srcfiles megafile merges all the files in the srcfiles directory into one file: megafile.
Merging similar files
To merge files that are similar but not identical, the main problem is identifying the format used by each input and specifying the format of the output. For example, given one file with the fields name, phone, count and a second file with the fields name, email, phone, count, you must determine which format each input record uses and have the Map emit the desired unified structure. For more complex records, the Map phase may need to perform a more involved merge, handling fields that may or may not contain null values.
In fact, Hadoop is not an ideal choice for this kind of processing unless you also use it as an opportunity to simplify, count, or reduce the information at the same time. That is, as you identify the incoming records and their possible formats, you perform a Reduce on the fields you want to keep.
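As an illustration of that idea (not code from the article), a Map function that normalizes the two hypothetical layouts name,phone,count and name,email,phone,count into a single name,email,phone,count structure could look like this sketch; the field positions and class name are assumptions.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class NormalizeMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        String name, email, phone, count;
        if (fields.length == 3) {
            // Shorter layout: name,phone,count -- no email column available
            name = fields[0]; email = ""; phone = fields[1]; count = fields[2];
        } else {
            // Longer layout: name,email,phone,count
            name = fields[0]; email = fields[1]; phone = fields[2]; count = fields[3];
        }
        // Emit the unified structure keyed by name so matching records meet in the Reduce
        context.write(new Text(name), new Text(email + "," + phone + "," + count));
    }
}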
Join
Although there are potential solutions for performing joins, they generally rely on processing the information into a structured form and then using that structure to determine what to do with the output.
For example, given two different sources of information, such as email addresses with the number of email messages sent and the number received, the goal is to merge the data into one output format. The input files are email, sent-count and email, received-count. The output should be in the format email, sent-count, received-count.
The trick is to process the incoming files and output their content in a way that distinguishes where each value came from, then rely on the Reduce function to perform the simplification. In most cases, this is a multi-stage process:
A stage that processes the "sent" emails and outputs information in the form email, fake#sent.
Note: We use the fake prefix so that the sent count can be distinguished from a plain received value and reconciled by that prefix rather than confused with the received count. In effect, this lets the data be joined in an implied order.
A stage that processes the "received" emails and outputs information in the form email, received.
As the Map function reads the files, it generates rows like those in Listing 11.
Listing 11. Generated rows
dev@null.org, 0#sent
dev@null.org, received
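A sketch of the two Map stages that would generate rows like these might look as follows; the class names are illustrative, and the 0# dummy prefix on the sent count follows the convention described above.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Stage 1: reads lines of the form email,sent-count and emits email -> 0#sent-count
public static class SentMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        context.write(new Text(parts[0].trim()), new Text("0#" + parts[1].trim()));
    }
}

// Stage 2: reads lines of the form email,received-count and emits email -> received-count
public static class ReceivedMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        context.write(new Text(parts[0].trim()), new Text(parts[1].trim()));
    }
}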
The Map recognizes the input records and outputs a unified version with a single key (the email address). The Reduce then parses the #-separated structure, determining whether each value should be split into its sent and received counts or summed as a single received value.
Listing 12. Output a unified version with one key
int sent = 0;
int received = 0;
StringBuilder buf = new StringBuilder();
for (Text val : values) {
    String strVal = val.toString();
    buf.append(strVal).append(",");
    if (strVal.contains("#")) {
        // If the content contains a hash, assume it holds both the received and sent counts
        String[] tokens = strVal.split("#");
        int recvThis = Integer.parseInt(tokens[0]);
        int sentThis = Integer.parseInt(tokens[1]);
        received = received + recvThis;
        sent = sent + sentThis;
    } else {
        // Otherwise, it's simply the received value
        received = received + Integer.parseInt(strVal);
    }
}
context.write(key, new Text(sent + "," + received));
In this case, we rely on the reduction within Hadoop itself to simplify the output data by key (in this case, the email address), condensing the information we need. Because the key is the email address, the records can easily be merged by email.
Tips for using keys
Keep in mind that some aspects of the MapReduce process can be used to our advantage. In essence, MapReduce is a two-phase process:
The Map phase accesses the data, picks out the information you need, and then outputs it using a key and the associated information.
The Reduce phase simplifies the data by using the common keys to combine, summarize, or count the mapped data into a simpler form.
Keys are an important concept because they can be used to format and summarize data in different ways. For example, if you are summarizing population data by country and city, you can output only the country as the key to summarize the data at the country level.
Listing 13. Outputting only the country as the key
France
United Kingdom
USA
To aggregate by both country and city, the key is a composite of the two.
Listing 14. The key is a composite version of the country and the city
France#Paris
France#Lyon
France#Grenoble
United Kingdom#Birmingham
United Kingdom#London
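A minimal sketch of a Map function that emits such a composite key, assuming input lines of the hypothetical form country,city,population (all names here are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class CityPopulationMapper extends Mapper<Object, Text, Text, LongWritable> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        String country = fields[0].trim();
        String city = fields[1].trim();
        long population = Long.parseLong(fields[2].trim());
        // The composite key country#city aggregates per city within each country;
        // emitting only country as the key would aggregate at the country level instead.
        context.write(new Text(country + "#" + city), new LongWritable(population));
    }
}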
This is a useful basic technique when dealing with certain kinds of data, for example material that shares a common key, because we can use it to simulate pseudo-joins. It is handy, for instance, when combining blog posts (identified by a blogpostid) with blog comments (identified by a blogpostid and a blogcommentid).
To simplify the output (for example, to count the number of words in posts and their comments), we first process the blog posts and blog comments through the Map, outputting the common ID as part of the key.
Listing 15. Output keyed by post and comment IDs
blogpostid,the,quick,brown,fox
blogpostid#blogcommentid,jumps,over,the,lazy,dog
This uses two keys to output the information as two different rows. We can also reverse the relationship, identifying the words that came from comments by attaching the comment ID to each word.
Listing 16. Reverse relationship
blogpostid,the,quick,brown,fox,jumps#blogcommentid,over#blogcommentid,the#blogcommentid,lazy#blogcommentid,dog#blogcommentid
During processing, we can then tell whether a word belongs to the blog post itself or to one of its comments simply by looking at whether a comment ID is attached to it.
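A trivial sketch of that check, assuming word tokens shaped like the ones in Listing 16 (the helper names are illustrative):

public class WordOrigin {
    // A token is either "word" (from the post body) or "word#blogcommentid" (from a comment)
    public static boolean isCommentWord(String token) {
        return token.contains("#");
    }

    public static String word(String token) {
        return token.split("#", 2)[0];
    }

    public static void main(String[] args) {
        System.out.println(isCommentWord("fox"));                // false: post word
        System.out.println(isCommentWord("over#blogcommentid")); // true: comment word
    }
}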
Simulating traditional database operations
Hadoop is not really a database, in part because we can't perform updates, deletes, or inserts row by row. Although this is not a problem in many cases (you can dump and load the active data you want to work with), sometimes you don't want to export and reload the data.
A trick to avoid exporting and reloading the data is to create a change file that contains a list of differences from the original dump file. For the moment, ignore the process of generating this data from SQL or another database. As long as the data has a unique ID, we can take advantage of it by using that ID as the key. Consider a source file similar to Listing 17.
Listing 17. Source file
1,London
2,Paris
3,New York
Suppose you have a change file similar to listing 18.
Listing 18. Change file
1,delete
2,update,Munich
4,insert,Tokyo
Merging these two files should produce the result shown in Listing 19.
Listing 19. Merging source and change files
2,Munich
3,New York
4,Tokyo
How do we implement such a merge with Hadoop?
One way to implement this merge with Hadoop is to treat the current data as inserts (because every row is effectively new data inserted into the destination file), and then convert each update operation into a delete followed by an insert of the new data. In practice, it's easier to do this by modifying the change file into the content shown in Listing 20.
Listing 20. Merging with Hadoop
1,delete
2,delete
2,insert,Munich
4,insert,Tokyo
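One way to produce that rewritten change file is with a small preparatory Map step that splits each update into a delete plus an insert. The sketch below assumes the id,operation[,value] layout from Listing 18; the class name is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class ChangeFileMapper extends Mapper<Object, Text, Text, NullWritable> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", 3);
        String id = fields[0];
        String op = fields[1];
        if (op.equals("update")) {
            // An update becomes a delete of the old row plus an insert of the new value
            context.write(new Text(id + ",delete"), NullWritable.get());
            context.write(new Text(id + ",insert," + fields[2]), NullWritable.get());
        } else {
            // Deletes and inserts pass through unchanged
            context.write(new Text(value.toString()), NullWritable.get());
        }
    }
}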
The problem is that we can't physically merge the two files, but we can process them in a way that achieves the same result. If a row is an original insert or delete, we output its key with a counter. If it is the insert of new data created from an update, we want a different key that won't be reduced together with the original, so we generate an interstitial (gap) file similar to Listing 21.
Listing 21. Interstitial (gap) file
1,1,London
2,1,Paris
3,1,New York
1,-1,London
2,-1,Paris
2#new,1,Munich
4#new,1,Tokyo
During the Reduce, we sum the counters for each unique key and generate the output in Listing 22.
Listing 22. Summarize the contents of the counters for each unique key
1,0,London
2,0,Paris
3,1,New York
2#new,1,Munich
4#new,1,Tokyo
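A sketch of how that Reduce step could be written, assuming the interstitial rows have been mapped to key = row ID (with or without the #new suffix) and value = counter,value as in Listing 21; all names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class ChangeMergeReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int counter = 0;
        String payload = "";
        for (Text val : values) {
            // Each value looks like "1,London" or "-1,London"
            String[] parts = val.toString().split(",", 2);
            counter += Integer.parseInt(parts[0]);
            if (parts.length > 1 && !parts[1].isEmpty()) {
                payload = parts[1];  // keep the row's value for the output
            }
        }
        // Emit id -> counter,value; rows whose counters cancel out show a 0 counter
        context.write(key, new Text(counter + "," + payload));
    }
}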
We can then run this content through a secondary MapReduce function, using the basic structure shown in Listing 23.
Listing 23. Running content with a secondary MapReduce function
map:
    if (key contains #new): emit(row)
    if (count > 0): emit(row)
The secondary MapReduce will get the expected output, as shown in Listing 24.
Listing 24. Expected output of the secondary MapReduce function
3,1,New York
2,1,Munich
4,1,Tokyo
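For completeness, one possible Java rendering of the pseudocode in Listing 23 is sketched below; stripping the #new marker restores the original row IDs as in Listing 24, and the names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class FilterMapper extends Mapper<Object, Text, Text, NullWritable> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Rows look like id,counter,value (for example "2#new,1,Munich")
        String row = value.toString();
        String[] fields = row.split(",");
        boolean isNew = fields[0].contains("#new");
        int count = Integer.parseInt(fields[1]);
        if (isNew || count > 0) {
            // Strip the #new marker so the surviving rows carry their original IDs
            context.write(new Text(row.replace("#new", "")), NullWritable.get());
        }
    }
}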
Figure 2 illustrates the two-stage process of first formatting and simplifying, and then streamlining the output.
Figure 2. The two-stage process of formatting and reducing the data, then streamlining the output
This process requires more work than a traditional database, but it provides a much simpler solution for exchanging constantly updated data.
Conclusion
This article has described many different scenarios for MapReduce queries. You have seen the power of these queries across a variety of data, and you should now be able to take advantage of these examples in your own MapReduce solutions.