A Collection of Offline Data Analysis Problems Based on Hadoop and Hive
Keywords: hadoop hive offline data analysis
1. Merging small files. Logs are collected and stored with an nginx + flume + hdfs architecture, but the files that flume writes to hdfs end up as a large number of small files, and hdfs is not well suited to handling many small files. Fortunately, the mapreduce framework that hadoop provides can merge the small files in batches; the code is given directly below:
package baobei.data.etl;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
 * Flume writes many small files to hdfs. This job merges those small files into
 * one large file, which makes the subsequent data processing easier.
 */
public class SmallFileCombiner {
    static class SmallFileCombinerMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        NullWritable v = NullWritable.get();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Write every line back out unchanged; value is one line of an input file,
            // so the output is simply the concatenated content of all the input files.
            context.write(value, v);
        }
    }
    /**
     * In a production environment there may be a huge number of small files, so their total size
     * adds up quickly; in that case the maximum split size must be set, for example:
     * CombineTextInputFormat.setMaxInputSplitSize(job, 1024*1024*150);
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SmallFileCombiner.class);
        job.setMapperClass(SmallFileCombinerMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // CombineTextInputFormat packs many small files into each input split.
        job.setInputFormatClass(CombineTextInputFormat.class);

        // If the total size of the small files is 224M and the second argument of setMaxInputSplitSize
        // is 300M, only one file (part-m-00000) is generated under hdfs://maoyunchao:9000/output;
        // if the second argument is 150M, two files (part-m-00000 and part-m-00001) are generated.
        CombineTextInputFormat.setMaxInputSplitSize(job, 1024*1024*150);
        CombineTextInputFormat.setInputPaths(job, new Path("hdfs://maoyunchao:9000/in/"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://maoyunchao:9000/output/"));

        // Map-only job: skip the reduce phase so the merged data is written directly as part-m-xxxxx files.
        job.setNumReduceTasks(0);

        job.waitForCompletion(true);
    }
}
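To run the job, the class can be packaged into a jar and submitted with hadoop jar; this is only a sketch, and the jar name below is a placeholder while the output path is the one hard-coded in main() above:

# Package the class (e.g. with mvn package), then submit it to the cluster.
hadoop jar baobei-data-etl.jar baobei.data.etl.SmallFileCombiner

# Inspect the merged output written by the job.
hadoop fs -ls /output/
hadoop fs -cat /output/part-m-00000 | head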
2. Cluster memory is insufficient: expand the cluster and add DataNodes.
3. When writing a shell script, add source /etc/profile at the top so that environment variables are loaded; otherwise, when the script is run from crontab, commands such as hive -f and hive -e fail with a "command not found" error. A minimal sketch follows.
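A minimal sketch of such a script and its crontab entry; the paths and the schedule are placeholders, not taken from the original setup:

#!/bin/bash
# crontab starts with an almost empty environment, so load the profile first
# to pick up HIVE_HOME, HADOOP_HOME, PATH, etc.
source /etc/profile

# run the daily HQL file and keep a log (paths are placeholders)
hive -f /home/hadoop/etl/daily_report.hql >> /home/hadoop/logs/daily_report.log 2>&1

# example crontab entry (run at 01:30 every day):
# 30 1 * * * /bin/bash /home/hadoop/etl/daily_report.sh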
4. In an interactive terminal, the statement after hive -e can be wrapped in single quotes, but inside a shell script the statement after hive -e should be wrapped in double quotes, for example so that shell variables inside the statement are expanded; see the example below.
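A small illustration of the difference; the table name baobei.pv is the one mentioned in tip 7 below, and the day partition column is made up for the example:

# on the command line, single quotes are fine:
hive -e 'select count(*) from baobei.pv'

# in a script, use double quotes so that shell variables are expanded inside the statement:
dt=$(date -d "-1 day" +%Y%m%d)
hive -e "select count(*) from baobei.pv where day='${dt}'"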
5. An error is reported when sqoop exports hive data to mysql: "check the manual that corresponds to your MySQL server version for the right syntax to use near ...". After investigation it turned out to be caused by the version of the mysql connector jar: version 5.1.6 was used at first; the 5.1.32 mysql jar is more stable and does not have this bug. A reference sqoop command is below.
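For reference, a typical sqoop export invocation looks roughly like this; the connection string, credentials, table and directory are placeholders, and '\001' is Hive's default field delimiter:

sqoop export \
  --connect jdbc:mysql://master:3306/report \
  --username root \
  --password 123456 \
  --table pv_stat \
  --export-dir /user/hive/warehouse/baobei.db/pv_stat \
  --input-fields-terminated-by '\001'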
6. The query speed was too slow. The cluster has 12 machines (16 GB memory + 8 cores each) and the data volume is around 14 GB, roughly 20 million records. When the business first went live it took about 40 minutes to run everything; in the end, after various optimizations such as JVM reuse, parallel job execution, turning off speculative execution, setting the number of reducers, salting to handle data skew (a two-stage group by with random key suffixes), and join optimization (converting reduce-side joins to map joins, SMB joins), the full run completes in about an hour. The slowest piece was the external-link analysis, whose data is skewed; it took about 15 minutes, and about 10 minutes after optimization. The Hive settings involved are sketched below.
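A sketch of how these switches might be set before the business query runs; the property names are the standard Hive/MapReduce ones (a few differ slightly between Hadoop versions) and the values are illustrative, not the ones used on the original cluster:

hive -e "
set mapred.job.reuse.jvm.num.tasks=10;        -- JVM reuse (older Hadoop property name)
set hive.exec.parallel=true;                  -- run independent job stages in parallel
set hive.exec.parallel.thread.number=8;
set mapreduce.map.speculative=false;          -- turn speculative execution off
set mapreduce.reduce.speculative=false;
set mapreduce.job.reduces=24;                 -- fix the number of reducers
set hive.auto.convert.join=true;              -- let hive rewrite small-table joins as map joins
set hive.groupby.skewindata=true;             -- two-stage aggregation for skewed group by
-- the business query itself follows these settings
"

hive.groupby.skewindata=true performs the "random suffix, aggregate twice" trick automatically: a first job spreads the skewed group-by keys randomly across reducers for a partial aggregation, and a second job merges the partial results by the real key.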
7. Prefix table names with the database name when writing hql statements, e.g. select * from baobei.pv; otherwise hive looks for the table in the default database.
8. Put udf registrations and common settings into a .hiverc file under $HIVE_HOME/bin; hive executes them automatically every time it starts. A sketch of such a file follows.
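A hypothetical .hiverc created from the shell; the jar path, function name and settings are placeholders:

cat > $HIVE_HOME/bin/.hiverc <<'EOF'
-- executed automatically on every hive startup
add jar /home/hadoop/udf/baobei-udf.jar;
create temporary function parse_ua as 'baobei.data.udf.ParseUserAgent';
set hive.cli.print.header=true;
EOF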
9. Sometimes zombie tasks left over from debugging need to be killed. Use hadoop job -kill jobId or yarn application -kill applicationId to kill the mapreduce tasks.
10. With orcFile, parquet and other storage formats you cannot simply create the table and load the raw data into it; queries on such a table report errors. The data has to be inserted from another table with a query, e.g. insert overwrite table xxx select…; load data inpath …. cannot be used directly. A sketch of the pattern follows.
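A minimal sketch of the pattern, run through hive -e; the table and column names are made up for the example:

hive -e "
-- a plain text staging table that load data inpath can populate
create table if not exists baobei.pv_text (url string, ip string, ts string)
row format delimited fields terminated by '\t';

load data inpath '/output/part-m-00000' into table baobei.pv_text;

-- the ORC table has to be filled with insert ... select, not with load data
create table if not exists baobei.pv_orc (url string, ip string, ts string)
stored as orc;

insert overwrite table baobei.pv_orc
select url, ip, ts from baobei.pv_text;
"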