A Collection of Offline Data Analysis Problems Based on Hadoop and Hive
Keywords: hadoop hive offline data analysis
1. Merging small files. Logs are collected and stored with an nginx + flume + hdfs architecture, but the files that flume writes to hdfs end up as a large number of small files, and hdfs is not well suited to handling many small files. Fortunately, the mapreduce framework that hadoop provides can merge the small files in batches; the code is given directly below:
package baobei.data.etl;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
 * Flume writes many small files to hdfs. This job merges those small files into
 * one large file, which makes the subsequent data processing easier.
 */
public class SmallFileCombiner {
    static class SmallFileCombinerMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        NullWritable v = NullWritable.get();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Write every line back out unchanged; value is one line of an input file,
            // so the output is simply the concatenated content of all the input files.
            context.write(value, v);
        }
    }
    /**
     * In a production environment there may be a huge number of small files, so their total size
     * adds up quickly; in that case the maximum split size must be set, for example:
     * CombineTextInputFormat.setMaxInputSplitSize(job, 1024*1024*150);
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SmallFileCombiner.class);
        job.setMapperClass(SmallFileCombinerMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // CombineTextInputFormat packs many small files into each input split.
        job.setInputFormatClass(CombineTextInputFormat.class);

        // If the total size of the small files is 224M and the second argument of setMaxInputSplitSize
        // is 300M, only one file (part-m-00000) is generated under hdfs://maoyunchao:9000/output;
        // if the second argument is 150M, two files (part-m-00000 and part-m-00001) are generated.
        CombineTextInputFormat.setMaxInputSplitSize(job, 1024*1024*150);
        CombineTextInputFormat.setInputPaths(job, new Path("hdfs://maoyunchao:9000/in/"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://maoyunchao:9000/output/"));

        // Map-only job: skip the reduce phase so the merged data is written directly as part-m-xxxxx files.
        job.setNumReduceTasks(0);

        job.waitForCompletion(true);
    }
}
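To run the job, the class can be packaged into a jar and submitted with hadoop jar; this is only a sketch, and the jar name below is a placeholder while the output path is the one hard-coded in main() above:

# Package the class (e.g. with mvn package), then submit it to the cluster.
hadoop jar baobei-data-etl.jar baobei.data.etl.SmallFileCombiner

# Inspect the merged output written by the job.
hadoop fs -ls /output/
hadoop fs -cat /output/part-m-00000 | head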
2. Cluster memory is insufficient: expand the cluster and add DataNodes.
3. When writing a shell script, add source /etc/profile at the top so that environment variables are loaded; otherwise, when the script is run from crontab, commands such as hive -f and hive -e fail with a "command not found" error. A minimal sketch follows.
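A minimal sketch of such a script and its crontab entry; the paths and the schedule are placeholders, not taken from the original setup:

#!/bin/bash
# crontab starts with an almost empty environment, so load the profile first
# to pick up HIVE_HOME, HADOOP_HOME, PATH, etc.
source /etc/profile

# run the daily HQL file and keep a log (paths are placeholders)
hive -f /home/hadoop/etl/daily_report.hql >> /home/hadoop/logs/daily_report.log 2>&1

# example crontab entry (run at 01:30 every day):
# 30 1 * * * /bin/bash /home/hadoop/etl/daily_report.sh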
4. In an interactive terminal, the statement after hive -e can be wrapped in single quotes, but inside a shell script the statement after hive -e should be wrapped in double quotes, for example so that shell variables inside the statement are expanded; see the example below.
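A small illustration of the difference; the table name baobei.pv is the one mentioned in tip 7 below, and the day partition column is made up for the example:

# on the command line, single quotes are fine:
hive -e 'select count(*) from baobei.pv'

# in a script, use double quotes so that shell variables are expanded inside the statement:
dt=$(date -d "-1 day" +%Y%m%d)
hive -e "select count(*) from baobei.pv where day='${dt}'"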
5. An error is reported when sqoop exports hive data to mysql: "check the manual that corresponds to your MySQL server version for the right syntax to use near ...". After investigation it turned out to be caused by the version of the mysql connector jar: version 5.1.6 was used at first; the 5.1.32 mysql jar is more stable and does not have this bug. A reference sqoop command is below.
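For reference, a typical sqoop export invocation looks roughly like this; the connection string, credentials, table and directory are placeholders, and '\001' is Hive's default field delimiter:

sqoop export \
  --connect jdbc:mysql://master:3306/report \
  --username root \
  --password 123456 \
  --table pv_stat \
  --export-dir /user/hive/warehouse/baobei.db/pv_stat \
  --input-fields-terminated-by '\001'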
6. The query speed was too slow. The cluster has 12 machines (16 GB memory + 8 cores each) and the data volume is around 14 GB, roughly 20 million records. When the business first went live it took about 40 minutes to run everything; in the end, after various optimizations such as JVM reuse, parallel job execution, turning off speculative execution, setting the number of reducers, salting to handle data skew (a two-stage group by with random key suffixes), and join optimization (converting reduce-side joins to map joins, SMB joins), the full run completes in about an hour. The slowest piece was the external-link analysis, whose data is skewed; it took about 15 minutes, and about 10 minutes after optimization. The Hive settings involved are sketched below.
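A sketch of how these switches might be set before the business query runs; the property names are the standard Hive/MapReduce ones (a few differ slightly between Hadoop versions) and the values are illustrative, not the ones used on the original cluster:

hive -e "
set mapred.job.reuse.jvm.num.tasks=10;        -- JVM reuse (older Hadoop property name)
set hive.exec.parallel=true;                  -- run independent job stages in parallel
set hive.exec.parallel.thread.number=8;
set mapreduce.map.speculative=false;          -- turn speculative execution off
set mapreduce.reduce.speculative=false;
set mapreduce.job.reduces=24;                 -- fix the number of reducers
set hive.auto.convert.join=true;              -- let hive rewrite small-table joins as map joins
set hive.groupby.skewindata=true;             -- two-stage aggregation for skewed group by
-- the business query itself follows these settings
"

hive.groupby.skewindata=true performs the "random suffix, aggregate twice" trick automatically: a first job spreads the skewed group-by keys randomly across reducers for a partial aggregation, and a second job merges the partial results by the real key.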
7. Prefix table names with the database name when writing hql statements, e.g. select * from baobei.pv; otherwise hive looks for the table in the default database.
8. Put udf registrations and common settings into a .hiverc file under $HIVE_HOME/bin; hive executes them automatically every time it starts. A sketch of such a file follows.
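A hypothetical .hiverc created from the shell; the jar path, function name and settings are placeholders:

cat > $HIVE_HOME/bin/.hiverc <<'EOF'
-- executed automatically on every hive startup
add jar /home/hadoop/udf/baobei-udf.jar;
create temporary function parse_ua as 'baobei.data.udf.ParseUserAgent';
set hive.cli.print.header=true;
EOF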
9. Sometimes zombie tasks left over from debugging need to be killed. Use hadoop job -kill jobId or yarn application -kill applicationId to kill the mapreduce tasks.
10. With orcFile, parquet and other storage formats you cannot simply create the table and load the raw data into it; queries on such a table report errors. The data has to be inserted from another table with a query, e.g. insert overwrite table xxx select…; load data inpath …. cannot be used directly. A sketch of the pattern follows.
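A minimal sketch of the pattern, run through hive -e; the table and column names are made up for the example:

hive -e "
-- a plain text staging table that load data inpath can populate
create table if not exists baobei.pv_text (url string, ip string, ts string)
row format delimited fields terminated by '\t';

load data inpath '/output/part-m-00000' into table baobei.pv_text;

-- the ORC table has to be filled with insert ... select, not with load data
create table if not exists baobei.pv_orc (url string, ip string, ts string)
stored as orc;

insert overwrite table baobei.pv_orc
select url, ip, ts from baobei.pv_text;
"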