- After the past few days of study I can more or less try my hand at writing some small programs. Before doing that, a few pieces of preparation:
- Be clear about what I want to use hadoop for: data extraction, data exploration, data mining.
- Get a rough idea of mapreduce: map = chop the data up, reduce = merge it back together.
- After ubuntu is rebooted, starting hadoop again fails with a connection exception. The reboot wipes the tmp directory, where the namenode keeps its data by default, so the property below has to be added to core-site.xml (don't forget to create the directory; if you lack permission, create it as root and chmod it to 777; a fuller core-site.xml sketch also appears right after this list):
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
</property>
- Big data: my first thought was how the data already sitting in a relational database can be combined with hadoop. Some searching online pointed to DBInputFormat, so let me write a small example that reads rows from the database, processes them, and writes out a file.
- Keep the database side simple: id is an integer column and test is a string column, and the requirement is really easy: count how many times each value of the TEST column appears.
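For the restart problem above, here is a minimal core-site.xml sketch showing where the property goes. It is only an assumption-laden example: it presumes a Hadoop 1.x-style setup (hence fs.default.name) and reuses the namenode address that shows up later in the driver code.

<configuration>
    <!-- assumed namenode address, taken from the driver code below -->
    <property>
        <name>fs.default.name</name>
        <value>hdfs://192.168.203.137:9000</value>
    </property>
    <!-- keep namenode data out of the directory that is wiped on reboot -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/tmp</value>
    </property>
</configuration>

With that out of the way, on to the example itself: first the record class, then the mapper, reducer and driver.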
// DBRecoder.java: the record type that DBInputFormat hands to the mapper.
// Writable covers Hadoop's own serialization, DBWritable covers reading/writing via JDBC.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DBRecoder implements Writable, DBWritable {
    String test;
    int id;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(test);
        out.writeInt(id);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        test = in.readUTF();
        id = in.readInt();
    }

    // Called when a row is read from the database.
    @Override
    public void readFields(ResultSet arg0) throws SQLException {
        test = arg0.getString("test");
        id = arg0.getInt("id");
    }

    // Called when a record is written back to the database (unused in this example).
    @Override
    public void write(PreparedStatement arg0) throws SQLException {
        arg0.setString(1, test);
        arg0.setInt(2, id);
    }
}
// DataCountTest.java: mapper, reducer and driver for counting occurrences of the TEST column.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class DataCountTest {

    // Emits (test, 1) for every row pulled out of the database.
    public static class TokenizerMapper extends Mapper<LongWritable, DBRecoder, Text, IntWritable> {
        public void map(LongWritable key, DBRecoder value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(value.test), new IntWritable(1));
        }
    }

    // Sums the 1s per distinct test value.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // Output path is hardcoded here instead of being taken from the command line.
        args = new String[1];
        args[0] = "hdfs://192.168.203.137:9000/user/chenph/output1111221";

        Configuration conf = new Configuration();
        // JDBC driver, connection URL, user and password for the Oracle source.
        DBConfiguration.configureDB(conf, "oracle.jdbc.driver.OracleDriver",
                "jdbc:oracle:thin:@192.168.101.179:1521:orcl", "chenph", "chenph");
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        Job job = new Job(conf, "DB count");
        job.setJarByClass(DataCountTest.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Read columns id and test from table t1, no WHERE condition, ordered by id;
        // setInput also registers DBInputFormat as the job's input format.
        String[] fields1 = { "id", "test" };
        DBInputFormat.setInput(job, DBRecoder.class, "t1", null, "id", fields1);

        FileOutputFormat.setOutputPath(job, new Path(otherArgs[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
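With the default TextOutputFormat, the result ends up under the output path as part-r-00000 (one file per reducer), each line holding a distinct TEST value, a tab, and its count.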
--------------------------------------------------------------------------------
Problems encountered during development:
- The Job constructor is flagged as deprecated, and I haven't yet tracked down what should be used instead (a sketch of the usual replacement follows after this list).
- Encoding problems: hadoop works in utf8 by default, so if the data being read is gbk it needs extra handling (see the mapper sketch after this list).
- Examples of this kind are scarce online, and the ones that exist target old versions with nothing for the new API; I pieced this together entirely by myself, still don't really understand many parts, and need to dig further into the official documentation.
- While searching I also read that this approach is not recommended for real big-data problems: the concurrency is so high that it can kill the database in an instant, so the usual practice is to export the data to text files first.
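On the deprecated Job constructor: in the newer MapReduce API the usual replacement is the static factory Job.getInstance, so, assuming a Hadoop version that already ships it, the driver would be created roughly like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobFactoryExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Non-deprecated replacement for: new Job(conf, "DB count")
        Job job = Job.getInstance(conf, "DB count");
        System.out.println(job.getJobName());
    }
}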
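On the gbk point: Hadoop's Text type always stores UTF-8, so when the input files are gbk-encoded a common workaround is to decode the raw bytes of the incoming value with the GBK charset inside the mapper. This is only a sketch for the file-input case (the job above reads from Oracle through JDBC, where the driver already returns Java strings):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a mapper that re-decodes gbk input lines before emitting them as UTF-8 Text.
public class GbkLineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Text holds raw bytes; decode them explicitly as GBK instead of letting
        // toString() interpret them as UTF-8.
        String line = new String(value.getBytes(), 0, value.getLength(), "GBK");
        // From here on the String is proper Unicode; writing it into a new Text
        // object produces UTF-8 output.
        context.write(new Text(line), ONE);
    }
}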