Problem introduction: suppose we need to pull, out of a data set of more than 2 billion records (about 100 GB), the rows belonging to a set of roughly 200 million user IDs. Without thinking about the cluster's computing power or the data volume, a straightforward MapReduce join can be written as follows; the reduce stage simply filters the rows for each key.
public static class UserChainSixMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static String prefix1 = "tm";
    private Text outKey = new Text();
    private Text outVal = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String path = ((FileSplit) context.getInputSplit()).getPath().toString();
        if (path.contains("userchain")) {
            String[] vals = value.toString().split(",");
            if (vals.length == 8) {
                if (UserChainFirst.isUid(vals[0]) && UserChainFirst.isUid(vals[1])) {
                    outKey.set(vals[0]);
                    outVal.set(value);
                    context.write(outKey, outVal);
                }
            }
        } else if (path.contains("userid")) {
            String val = value.toString();
            if (UserChainFirst.isUid(val)) {
                outKey.set(val);
                outVal.set(prefix1);
                context.write(outKey, outVal);
            }
        }
    }
}

public static class UserChainSixReducer extends Reducer<Text, Text, Text, Text> {
    private static String prefix1 = "tm";
    private Text outKey = new Text();
    Boolean flag = false;
    int count = 0;
    List<String> lists = new ArrayList<String>();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        flag = false;
        count = 0;
        lists.clear();
        // single pass: remember whether the "tm" marker is present, buffer all data rows
        while (iter.hasNext()) {
            String value = iter.next().toString();
            // System.out.println("key:" + key + "," + "value:" + value);
            if (value.contains(prefix1)) {
                flag = true;
            } else {
                lists.add(value);
            }
        }
        // only emit the buffered rows if the key also appeared in the userid input
        if (flag) {
            for (String s : lists) {
                count++;
                if (0 == count % 1000) {
                    context.progress();
                    Thread.sleep(1 * 1000);
                }
                outKey.set(s);
                context.write(outKey, null);
            }
        } else {
            lists.clear();
        }
    }
}
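The original post does not show the driver, but since the mapper branches on the input path ("userchain" vs. "userid"), both data sets have to be fed into the same job. A minimal sketch of what such a driver might look like follows; the class name UserChainSixDriver, the input/output paths, and the assumption that the mapper and reducer above are visible from the driver are all hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver; paths and class names are assumptions, not the author's code.
public class UserChainSixDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "user chain six");
        job.setJarByClass(UserChainSixDriver.class);

        job.setMapperClass(UserChainSixMapper.class);
        job.setReducerClass(UserChainSixReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Both data sets go into one job; the mapper tells them apart by split path.
        FileInputFormat.addInputPath(job, new Path("/data/userchain"));
        FileInputFormat.addInputPath(job, new Path("/data/userid"));
        FileOutputFormat.setOutputPath(job, new Path("/data/userchain_out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}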
When the job ran, it stalled at about 99.6%: one reduce task failed repeatedly with an out-of-memory error and the job could not finish. The reduce stage caches too much data in memory, so lists.add(value) is what overflows the heap. My first thought was that the keys were skewed and a single reducer was carrying too much load, so the data should be spread out with a custom partitioner. Not knowing the actual distribution of the UID keys, I partitioned on the hashcode of the first 8 characters of the UID (UIDs are 10 or 11 digits long):
public static class PartitionByUid extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // partition on the first 8 characters of the UID
        return (key.toString().substring(0, 8).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
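The partitioner then has to be registered on the job, roughly like this (assuming the driver sketch above):

job.setPartitionerClass(PartitionByUid.class);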
Re-running the job made no difference. This path was a dead end because we did not know the distribution of the data (I only analyzed the distribution later, with Hive). Since partitioning did not help, the next idea was to avoid caching data in the reduce stage altogether. At that point I did not know that the values Iterator in the reduce stage can only be traversed once, so I wrote the following broken code:
public static class UserChainSixReducer extends Reducer<Text, Text, Text, Text> {
    private static String prefix1 = "tm";
    private Text outKey = new Text();
    List<String> lists = new ArrayList<String>();
    Boolean flag = false;
    int count = 0;

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        lists.clear();
        flag = false;
        count = 0;
        // first pass: only look for the "tm" marker, buffer nothing
        while (iter.hasNext()) {
            String value = iter.next().toString();
            if (value.contains(prefix1)) {
                flag = true;
            }
        }
        if (flag) {
            // BUG: a key's values can only be traversed once;
            // this second iterator is already exhausted and yields nothing
            iter = values.iterator();
            while (iter.hasNext()) {
                String value = iter.next().toString();
                if (!value.contains(prefix1)) {
                    count++;
                    if (0 == count % 1000) {
                        context.progress();
                        Thread.sleep(1 * 1000);
                    }
                    outKey.set(value);
                    context.write(outKey, null);
                }
            }
        }
    }
}
Only after running it did I discover that reduce produced no output at all. Googling turned up the explanation: the values Iterator cannot be traversed twice. The reason is that the reduce stage does not buffer the map output in memory; if all of a key's values were cached, the heap would overflow very easily. Instead, the framework streams values forward from the sorted, merged map output, as the nextKeyValue() implementation shows:
public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!hasMore) {
        key = null;
        value = null;
        return false;
    }
    firstValue = !nextKeyIsSame;
    DataInputBuffer next = input.getKey();
    currentRawKey.set(next.getData(), next.getPosition(),
        next.getLength() - next.getPosition());
    buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
    key = keyDeserializer.deserialize(key);
    next = input.getValue();
    buffer.reset(next.getData(), next.getPosition(), next.getLength());
    value = valueDeserializer.deserialize(value);
    hasMore = input.next();
    if (hasMore) {
        next = input.getKey();
        nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
            currentRawKey.getLength(), next.getData(), next.getPosition(),
            next.getLength() - next.getPosition()) == 0;
    } else {
        nextKeyIsSame = false;
    }
    inputValueCounter.increment(1L);
    return true;
}
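Because nextKeyValue() deserializes each value into the same reused object and only moves forward through the merged map output, a key's values can be consumed exactly once. If two passes really are needed, every value has to be deep-copied first, which is precisely the buffering that blows up memory on hot keys. A minimal sketch (not the author's code) of what such copying looks like inside a reduce() method:

// Sketch only: deep-copy each value so it survives beyond the single pass.
// For a key with millions of values this is the same OOM trap as before.
List<Text> buffered = new ArrayList<Text>();
for (Text value : values) {
    buffered.add(new Text(value));  // copy, because the framework reuses the Text instance
}
// buffered can now be iterated as many times as needed
for (Text t : buffered) {
    // second pass over the copied values
}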
The workable fix was to shrink the input of the reduce stage by shrinking the output of the map stage: in the map stage the UIDs were split into odd and even sets, and the job was run over each half separately, which finally let it complete. The general lesson is to cut down the reduce input as much as possible by splitting the map output.
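The original post does not include the final mapper, so the following is only a sketch of the idea: a job parameter (here called uid.parity, a made-up name) selects which half of the UID space the current run keeps, and the job is submitted twice.

// Hypothetical filtering mapper: only UIDs whose last digit matches the configured
// parity ("odd" or "even") are emitted, so each run feeds the reducers about half the data.
public static class ParityFilterMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int parity;               // 0 = keep even UIDs, 1 = keep odd UIDs
    private Text outKey = new Text();

    @Override
    protected void setup(Context context) {
        parity = "odd".equals(context.getConfiguration().get("uid.parity")) ? 1 : 0;
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] vals = value.toString().split(",");
        String uid = vals[0];
        int lastDigit = uid.charAt(uid.length() - 1) - '0';
        if (lastDigit % 2 == parity) {
            outKey.set(uid);
            context.write(outKey, value);
        }
    }
}

The same job would then be submitted once with uid.parity=even and once with uid.parity=odd (for example via -D options if the driver uses ToolRunner), and the two outputs concatenated afterwards.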