Use MapReduce to Filter Big Data

Source: Internet
Author: User

Problem introduction: we need to pull the rows for a given list of user IDs out of a dataset of more than 2 billion records (about 100 GB). Without giving much thought to the cluster's computing power or the data size, a first MapReduce job can be written as follows, with the reduce stage doing the filtering:

public static class UserChainSixMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static String prefix1 = "tm";
    private Text outKey = new Text();
    private Text outVal = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tell the two inputs apart by the path of the current split.
        String path = ((FileSplit) context.getInputSplit()).getPath().toString();
        if (path.contains("userchain")) {
            // User-chain records: emit the full row, keyed by the first UID.
            String[] vals = value.toString().split(",");
            if (vals.length == 8) {
                if (UserChainFirst.isUid(vals[0]) && UserChainFirst.isUid(vals[1])) {
                    outKey.set(vals[0]);
                    outVal.set(value);
                    context.write(outKey, outVal);
                }
            }
        } else if (path.contains("userid")) {
            // UID list: emit a marker value so the reducer knows this UID is wanted.
            String val = value.toString();
            if (UserChainFirst.isUid(val)) {
                outKey.set(val);
                outVal.set(prefix1);
                context.write(outKey, outVal);
            }
        }
    }
}

public static class UserChainSixReducer extends Reducer<Text, Text, Text, Text> {
    private static String prefix1 = "tm";
    private Text outKey = new Text();
    Boolean flag = false;
    int count = 0;
    List<String> lists = new ArrayList<String>();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        flag = false;
        count = 0;
        lists.clear();

        // Buffer every user-chain row for this key and remember whether
        // the marker from the UID list was seen.
        while (iter.hasNext()) {
            String value = iter.next().toString();
            // System.out.println("key:" + key + "," + "value:" + value);
            if (value.contains(prefix1)) {
                flag = true;
            } else {
                lists.add(value);
            }
        }

        // Only output the buffered rows if the UID appeared in the UID list.
        if (flag) {
            for (String s : lists) {
                count++;
                if (0 == count % 1000) {
                    context.progress();
                    Thread.sleep(1 * 1000);
                }
                outKey.set(s);
                context.write(outKey, null);
            }
        } else {
            lists.clear();
        }
    }
}
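For context, here is a minimal driver sketch showing how both input directories could be fed to the same mapper; the class name UserChainSix, the argument order, and the assumption that the mapper and reducer above are nested in this class are mine, not taken from the original post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserChainSix {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "user chain filter");
        job.setJarByClass(UserChainSix.class);
        // Assumes UserChainSixMapper and UserChainSixReducer are nested in this class.
        job.setMapperClass(UserChainSixMapper.class);
        job.setReducerClass(UserChainSixReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Both directories go through the same mapper; it tells them apart by path.
        FileInputFormat.addInputPath(job, new Path(args[0]));   // userchain data
        FileInputFormat.addInputPath(job, new Path(args[1]));   // userid list
        FileOutputFormat.setOutputPath(job, new Path(args[2])); // result directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}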
When the job ran, it stalled at about 99.6%: one reduce task could not finish and eventually failed with an out-of-memory error. Too much data was being cached in memory in the reduce stage; the lists.add(value) calls caused the overflow, so the job could not complete. The first guess was that the keys were skewed and one reducer was carrying too heavy a load, so a custom partition seemed worth trying. Since the distribution of the UID keys was unknown, the partitioner below hashes the first 8 characters of the UID (a UID is 10 or 11 characters long):
public static class PartitionByUid extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Partition on the hash of the first 8 characters of the UID.
        return (key.toString().substring(0, 8).hashCode() & Integer.MAX_VALUE)
                % numPartitions;
    }
}
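For completeness, the custom partitioner still has to be registered on the job; a one-line sketch, assuming the driver shown earlier:

        // Route each key to a reducer by the hash of the first 8 UID characters.
        job.setPartitionerClass(PartitionByUid.class);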

Re-running the job with the partitioner made no difference. This path does not work, because the distribution of the data was unknown (I only analyzed the distribution with Hive later). Since partitioning did not help, the next idea was to avoid caching data in the reduce stage at all; at that point I did not yet know that the values iterator in the reduce stage can only be traversed once, so I wrote the following incorrect code:

public static class UserChainSixReducer extends Reducer<Text, Text, Text, Text> {
    private static String prefix1 = "tm";
    private Text outKey = new Text();
    List<String> lists = new ArrayList<String>();
    Boolean flag = false;
    int count = 0;

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        lists.clear();
        flag = false;
        count = 0;

        // First pass: only look for the marker, without buffering anything.
        while (iter.hasNext()) {
            String value = iter.next().toString();
            if (value.contains(prefix1)) {
                flag = true;
            }
        }

        if (flag) {
            // Second "pass": this does NOT rewind -- the values can only be
            // traversed once, so nothing is ever written here.
            iter = values.iterator();
            while (iter.hasNext()) {
                String value = iter.next().toString();
                if (!value.contains(prefix1)) {
                    count++;
                    if (0 == count % 1000) {
                        context.progress();
                        Thread.sleep(1 * 1000);
                    }
                    outKey.set(value);
                    context.write(outKey, null);
                }
            }
        }
    }
}

Only after running it did I find that the reducer produced no output at all. A bit of searching showed why: the values iterator cannot be traversed twice. The reduce stage does not buffer all of the map output for a key in memory; it streams the values from the sorted input, because caching everything would make a memory overflow very likely. This can be seen in nextKeyValue() from Hadoop's ReduceContext implementation:
public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!hasMore) {
        key = null;
        value = null;
        return false;
    }
    firstValue = !nextKeyIsSame;
    DataInputBuffer next = input.getKey();
    currentRawKey.set(next.getData(), next.getPosition(),
            next.getLength() - next.getPosition());
    buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
    key = keyDeserializer.deserialize(key);
    next = input.getValue();
    buffer.reset(next.getData(), next.getPosition(), next.getLength());
    value = valueDeserializer.deserialize(value);
    hasMore = input.next();
    if (hasMore) {
        next = input.getKey();
        nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                currentRawKey.getLength(), next.getData(),
                next.getPosition(), next.getLength() - next.getPosition()) == 0;
    } else {
        nextKeyIsSame = false;
    }
    inputValueCounter.increment(1L);
    return true;
}
The eventual fix was to shrink the input to the reduce stage by shrinking the map output: in the map stage the UIDs are split into an odd group and an even group, and the job is run once for each group. In short, cut the reduce-side input as much as possible by splitting the map output; a sketch of the split is shown below.
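A minimal sketch of what the odd/even split could look like, assuming a hypothetical configuration key uid.parity (0 = even, 1 = odd) that selects which half of the UIDs a given run handles; the job is then submitted once per half:

// Sketch only: "uid.parity" is an assumed configuration key, not from the original post.
public static class UserChainSplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int parity;
    private Text outKey = new Text();
    private Text outVal = new Text();

    @Override
    protected void setup(Context context) {
        // 0 = handle even UIDs in this run, 1 = handle odd UIDs.
        parity = context.getConfiguration().getInt("uid.parity", 0);
    }

    private boolean matchesParity(String uid) {
        // The last digit of the UID decides which run handles the record.
        int lastDigit = uid.charAt(uid.length() - 1) - '0';
        return lastDigit % 2 == parity;
    }

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String path = ((FileSplit) context.getInputSplit()).getPath().toString();
        if (path.contains("userchain")) {
            String[] vals = value.toString().split(",");
            if (vals.length == 8 && UserChainFirst.isUid(vals[0])
                    && UserChainFirst.isUid(vals[1]) && matchesParity(vals[0])) {
                outKey.set(vals[0]);
                outVal.set(value);
                context.write(outKey, outVal);
            }
        } else if (path.contains("userid")) {
            String val = value.toString();
            if (UserChainFirst.isUid(val) && matchesParity(val)) {
                outKey.set(val);
                outVal.set("tm");
                context.write(outKey, outVal);
            }
        }
    }
}

Each run only emits half of the UIDs from the map stage, so each reducer sees roughly half as much input, which is the point of the split.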