Problem introduction: suppose we need to find 100-odd records among 20 billion records (roughly 200 GB). Ignoring the cluster's computing capacity, we could write the MapReduce job naively: pay no attention to the data volume and do all of the filtering in a single pass in the reduce phase.
public static class UserChainSixMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static String prefix1 = "tm";
    private Text outKey = new Text();
    private Text outVal = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tell the two inputs apart by the file path of the current split.
        String path = ((FileSplit) context.getInputSplit()).getPath().toString();
        if (path.contains("userchain")) {
            // Chain record: emit the whole line keyed by the first uid.
            String[] vals = value.toString().split(",");
            if (vals.length == 8) {
                if (UserChainFirst.isUid(vals[0]) && UserChainFirst.isUid(vals[1])) {
                    outKey.set(vals[0]);
                    outVal.set(value);
                    context.write(outKey, outVal);
                }
            }
        } else if (path.contains("userid")) {
            // Lookup uid: emit a marker value so the reducer knows this uid is wanted.
            String val = value.toString();
            if (UserChainFirst.isUid(val)) {
                outKey.set(val);
                outVal.set(prefix1);
                context.write(outKey, outVal);
            }
        }
    }
}

public static class UserChainSixReducer extends Reducer<Text, Text, Text, Text> {

    private static String prefix1 = "tm";
    private Text outKey = new Text();
    Boolean flag = false;
    int count = 0;
    List<String> lists = new ArrayList<String>();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        flag = false;
        count = 0;
        lists.clear();
        // Buffer every chain record for this key in memory and remember
        // whether the "tm" marker was seen among the values.
        while (iter.hasNext()) {
            String value = iter.next().toString();
            if (value.contains(prefix1)) {
                flag = true;
            } else {
                lists.add(value);
            }
        }
        // Only emit the buffered records if the uid appeared in the lookup list.
        if (flag) {
            for (String s : lists) {
                count++;
                if (0 == count % 1000) {
                    context.progress();
                    Thread.sleep(1 * 1000);
                }
                outKey.set(s);
                context.write(outKey, null);
            }
        } else {
            lists.clear();
        }
    }
}
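For context, here is a minimal driver sketch for wiring this job together, assuming both the userchain data and the userid list are simply added as input paths so they flow through the same mapper. The driver class name, job name, and paths are placeholders, not taken from the original code:

// Hypothetical driver class; names and paths are assumptions for illustration only.
public class UserChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "user-chain-filter");
        job.setJarByClass(UserChainDriver.class);
        job.setMapperClass(UserChainSixMapper.class);
        job.setReducerClass(UserChainSixReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Both inputs go through the same mapper, which tells them apart by file path.
        FileInputFormat.addInputPath(job, new Path("/data/userchain"));   // large chain records
        FileInputFormat.addInputPath(job, new Path("/data/userid"));      // small list of wanted uids
        FileOutputFormat.setOutputPath(job, new Path("/data/userchain_out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}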
When this job ran, it stalled at 99.6% and could not get past it: one reducer kept failing with an out-of-memory error. The cause was that too much data was being buffered in memory in the reduce phase (the lists.add(value) call), so the job could not finish at all. Thinking it through, the keys are unevenly distributed, so one reducer carries far too much load. I therefore considered using a custom partitioner, but since I did not know how the uid keys were actually distributed, I guessed at partitioning by the hashcode of the first 8 characters of the uid (uids are 10 or 11 digits long).
public static class PartitionByUid extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Spread keys across reducers by the hashcode of the first 8 characters of the uid.
        return (key.toString().substring(0, 8).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
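Registering the partitioner is a one-line change in the driver (a sketch against the hypothetical driver shown earlier; the reducer count is only an example value):

job.setPartitionerClass(PartitionByUid.class);   // route keys by the first 8 characters of the uid
job.setNumReduceTasks(32);                       // example count; more reducers only help if the keys actually spread out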
I reran the job with the partitioner, and it made no difference. This approach is a dead end, because we do not know how the data is actually distributed. (I did consider using Hive to analyze the key distribution.) Since partitioning did not work, the next idea was to stop buffering data in the reduce phase. At that point I did not yet know that the values iterator in reduce can only be traversed once, so I wrote the following incorrect code:
public static class UserChainSixReducer extends Reducer<Text, Text, Text, Text> {

    private static String prefix1 = "tm";
    private Text outKey = new Text();
    List<String> lists = new ArrayList<String>();
    Boolean flag = false;
    int count = 0;

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        lists.clear();
        flag = false;
        count = 0;
        // First pass: only look for the marker value, buffering nothing.
        while (iter.hasNext()) {
            String value = iter.next().toString();
            if (value.contains(prefix1)) {
                flag = true;
            }
        }
        if (flag) {
            // Second pass: this is the mistake -- values.iterator() does not
            // start over, so nothing is ever written.
            iter = values.iterator();
            while (iter.hasNext()) {
                String value = iter.next().toString();
                if (!value.contains(prefix1)) {
                    count++;
                    if (0 == count % 1000) {
                        context.progress();
                        Thread.sleep(1 * 1000);
                    }
                    outKey.set(value);
                    context.write(outKey, null);
                }
            }
        }
    }
}
Only after running it did I discover that the reducer produced no output at all. A quick search explained why: the values iterator cannot be iterated twice. The reason is that the reduce phase does not buffer all of a key's map output in memory; when you think about it, that is the only sensible design, since buffering everything would itself overflow memory on large data. The framework's nextKeyValue() implementation shows this: each value is deserialized from the sorted input stream on demand as the iterator advances.
public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!hasMore) {
        key = null;
        value = null;
        return false;
    }
    firstValue = !nextKeyIsSame;
    DataInputBuffer next = input.getKey();
    currentRawKey.set(next.getData(), next.getPosition(),
        next.getLength() - next.getPosition());
    buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
    key = keyDeserializer.deserialize(key);
    next = input.getValue();
    buffer.reset(next.getData(), next.getPosition(), next.getLength());
    value = valueDeserializer.deserialize(value);
    hasMore = input.next();
    if (hasMore) {
        next = input.getKey();
        nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
            currentRawKey.getLength(),
            next.getData(), next.getPosition(),
            next.getLength() - next.getPosition()) == 0;
    } else {
        nextKeyIsSame = false;
    }
    inputValueCounter.increment(1L);
    return true;
}
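To make the consequence concrete, here is a small illustrative fragment of a reduce() body (a sketch, not taken from the original job), showing why the reducer above wrote nothing:

// "values" is the Iterable<Text> parameter of reduce(); this only illustrates the single-pass behavior.
Iterator<Text> first = values.iterator();
while (first.hasNext()) {
    first.next();                      // the first loop consumes every value for this key
}
Iterator<Text> second = values.iterator();
// second.hasNext() is now false: the framework has already advanced the underlying
// input past this key's values, so the "second pass" in the reducer above emits nothing.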
Thinking it over again: to shrink the reducer's input, the map output itself has to shrink. So the idea became to split the uids in the map phase into odd and even groups, emit each group separately, and run the job on each half. In short, when the reduce input is too large, one workable method is to split the map output.
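A minimal sketch of that idea, assuming the job is run twice with a configuration flag; the property name uid.parity and the mapper class name are assumptions, not from the original code, and only the userchain branch is sketched (the userid branch would apply the same parity check):

public static class UserChainParityMapper extends Mapper<LongWritable, Text, Text, Text> {

    private boolean wantOdd;   // which half of the uids this run handles
    private Text outKey = new Text();
    private Text outVal = new Text();

    @Override
    protected void setup(Context context) {
        // "uid.parity" is a made-up property: set it to "odd" for one run and "even" for the other.
        wantOdd = "odd".equals(context.getConfiguration().get("uid.parity", "odd"));
    }

    private boolean matchesParity(String uid) {
        int lastDigit = uid.charAt(uid.length() - 1) - '0';
        return (lastDigit % 2 == 1) == wantOdd;
    }

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] vals = value.toString().split(",");
        // Same filtering as before (sketched), but only emit uids of the wanted parity,
        // so each run's reducers see roughly half of the keys.
        if (vals.length == 8 && matchesParity(vals[0])) {
            outKey.set(vals[0]);
            outVal.set(value);
            context.write(outKey, outVal);
        }
    }
}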