Use MapReduce to Filter Big Data

Source: Internet
Author: User

Problem introduction: we need to pull the rows for a given list of user IDs out of a dataset of more than 2 billion records (about 100 GB). Without giving much thought to the cluster's computing power or the data size, a first MapReduce job can be written as follows, with the reduce stage doing the filtering:

public static class UserChainSixMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static String prefix1 = "tm";
    private Text outKey = new Text();
    private Text outVal = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tell the two inputs apart by the path of the current split.
        String path = ((FileSplit) context.getInputSplit()).getPath().toString();
        if (path.contains("userchain")) {
            // User-chain records: emit the full row, keyed by the first UID.
            String[] vals = value.toString().split(",");
            if (vals.length == 8) {
                if (UserChainFirst.isUid(vals[0]) && UserChainFirst.isUid(vals[1])) {
                    outKey.set(vals[0]);
                    outVal.set(value);
                    context.write(outKey, outVal);
                }
            }
        } else if (path.contains("userid")) {
            // UID list: emit a marker value so the reducer knows this UID is wanted.
            String val = value.toString();
            if (UserChainFirst.isUid(val)) {
                outKey.set(val);
                outVal.set(prefix1);
                context.write(outKey, outVal);
            }
        }
    }
}

public static class UserChainSixReducer extends Reducer<Text, Text, Text, Text> {
    private static String prefix1 = "tm";
    private Text outKey = new Text();
    Boolean flag = false;
    int count = 0;
    List<String> lists = new ArrayList<String>();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        flag = false;
        count = 0;
        lists.clear();

        // Buffer every user-chain row for this key and remember whether
        // the marker from the UID list was seen.
        while (iter.hasNext()) {
            String value = iter.next().toString();
            // System.out.println("key:" + key + "," + "value:" + value);
            if (value.contains(prefix1)) {
                flag = true;
            } else {
                lists.add(value);
            }
        }

        // Only output the buffered rows if the UID appeared in the UID list.
        if (flag) {
            for (String s : lists) {
                count++;
                if (0 == count % 1000) {
                    context.progress();
                    Thread.sleep(1 * 1000);
                }
                outKey.set(s);
                context.write(outKey, null);
            }
        } else {
            lists.clear();
        }
    }
}
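For context, here is a minimal driver sketch showing how both input directories could be fed to the same mapper; the class name UserChainSix, the argument order, and the assumption that the mapper and reducer above are nested in this class are mine, not taken from the original post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserChainSix {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "user chain filter");
        job.setJarByClass(UserChainSix.class);
        // Assumes UserChainSixMapper and UserChainSixReducer are nested in this class.
        job.setMapperClass(UserChainSixMapper.class);
        job.setReducerClass(UserChainSixReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Both directories go through the same mapper; it tells them apart by path.
        FileInputFormat.addInputPath(job, new Path(args[0]));   // userchain data
        FileInputFormat.addInputPath(job, new Path(args[1]));   // userid list
        FileOutputFormat.setOutputPath(job, new Path(args[2])); // result directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}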
When the job ran, it stalled at about 99.6%: one reduce task could not finish and eventually failed with an out-of-memory error. Too much data was being cached in memory in the reduce stage; the lists.add(value) calls caused the overflow, so the job could not complete. The first guess was that the keys were skewed and one reducer was carrying too heavy a load, so a custom partition seemed worth trying. Since the distribution of the UID keys was unknown, the partitioner below hashes the first 8 characters of the UID (a UID is 10 or 11 characters long):
public static class PartitionByUid extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Partition on the hash of the first 8 characters of the UID.
        return (key.toString().substring(0, 8).hashCode() & Integer.MAX_VALUE)
                % numPartitions;
    }
}
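For completeness, the custom partitioner still has to be registered on the job; a one-line sketch, assuming the driver shown earlier:

        // Route each key to a reducer by the hash of the first 8 UID characters.
        job.setPartitionerClass(PartitionByUid.class);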

Re-running the job with the partitioner made no difference. This path does not work, because the distribution of the data was unknown (I only analyzed the distribution with Hive later). Since partitioning did not help, the next idea was to avoid caching data in the reduce stage at all; at that point I did not yet know that the values iterator in the reduce stage can only be traversed once, so I wrote the following incorrect code:

public static class UserChainSixReducer extends Reducer<Text, Text, Text, Text> {
    private static String prefix1 = "tm";
    private Text outKey = new Text();
    List<String> lists = new ArrayList<String>();
    Boolean flag = false;
    int count = 0;

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        lists.clear();
        flag = false;
        count = 0;

        // First pass: only look for the marker, without buffering anything.
        while (iter.hasNext()) {
            String value = iter.next().toString();
            if (value.contains(prefix1)) {
                flag = true;
            }
        }

        if (flag) {
            // Second "pass": this does NOT rewind -- the values can only be
            // traversed once, so nothing is ever written here.
            iter = values.iterator();
            while (iter.hasNext()) {
                String value = iter.next().toString();
                if (!value.contains(prefix1)) {
                    count++;
                    if (0 == count % 1000) {
                        context.progress();
                        Thread.sleep(1 * 1000);
                    }
                    outKey.set(value);
                    context.write(outKey, null);
                }
            }
        }
    }
}

Only after running it did I find that the reducer produced no output at all. A bit of searching showed why: the values iterator cannot be traversed twice. The reduce stage does not buffer all of the map output for a key in memory; it streams the values from the sorted input, because caching everything would make a memory overflow very likely. This can be seen in nextKeyValue() from Hadoop's ReduceContext implementation:
public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!hasMore) {
        key = null;
        value = null;
        return false;
    }
    firstValue = !nextKeyIsSame;
    DataInputBuffer next = input.getKey();
    currentRawKey.set(next.getData(), next.getPosition(),
            next.getLength() - next.getPosition());
    buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
    key = keyDeserializer.deserialize(key);
    next = input.getValue();
    buffer.reset(next.getData(), next.getPosition(), next.getLength());
    value = valueDeserializer.deserialize(value);
    hasMore = input.next();
    if (hasMore) {
        next = input.getKey();
        nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                currentRawKey.getLength(), next.getData(),
                next.getPosition(), next.getLength() - next.getPosition()) == 0;
    } else {
        nextKeyIsSame = false;
    }
    inputValueCounter.increment(1L);
    return true;
}
The eventual fix was to shrink the input to the reduce stage by shrinking the map output: in the map stage the UIDs are split into an odd group and an even group, and the job is run once for each group. In short, cut the reduce-side input as much as possible by splitting the map output; a sketch of the split is shown below.
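A minimal sketch of what the odd/even split could look like, assuming a hypothetical configuration key uid.parity (0 = even, 1 = odd) that selects which half of the UIDs a given run handles; the job is then submitted once per half:

// Sketch only: "uid.parity" is an assumed configuration key, not from the original post.
public static class UserChainSplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int parity;
    private Text outKey = new Text();
    private Text outVal = new Text();

    @Override
    protected void setup(Context context) {
        // 0 = handle even UIDs in this run, 1 = handle odd UIDs.
        parity = context.getConfiguration().getInt("uid.parity", 0);
    }

    private boolean matchesParity(String uid) {
        // The last digit of the UID decides which run handles the record.
        int lastDigit = uid.charAt(uid.length() - 1) - '0';
        return lastDigit % 2 == parity;
    }

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String path = ((FileSplit) context.getInputSplit()).getPath().toString();
        if (path.contains("userchain")) {
            String[] vals = value.toString().split(",");
            if (vals.length == 8 && UserChainFirst.isUid(vals[0])
                    && UserChainFirst.isUid(vals[1]) && matchesParity(vals[0])) {
                outKey.set(vals[0]);
                outVal.set(value);
                context.write(outKey, outVal);
            }
        } else if (path.contains("userid")) {
            String val = value.toString();
            if (UserChainFirst.isUid(val) && matchesParity(val)) {
                outKey.set(val);
                outVal.set("tm");
                context.write(outKey, outVal);
            }
        }
    }
}

Each run only emits half of the UIDs from the map stage, so each reducer sees roughly half as much input, which is the point of the split.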