Learn Hadoop with Me, Step by Step (7): Connecting Hadoop to a MySQL Database for Reading and Writing Data


Tags: hadoop, mysql, map-reduce, MySQL import/export

To make it easy for MapReduce to access relational databases (MySQL, Oracle) directly, Hadoop provides two classes: DBInputFormat and DBOutputFormat. DBInputFormat reads database table rows into HDFS, while DBOutputFormat writes the result set produced by MapReduce into a database table.

If the MapReduce job fails with java.io.IOException: com.mysql.jdbc.Driver, the program usually cannot find the MySQL driver jar. The fix is to make sure every tasktracker can locate the driver jar when it runs the MapReduce program.

There are two ways to add the jar:

(1) Copy the jar into ${HADOOP_HOME}/lib on every node and restart the cluster. This is the more primitive approach.

(2) a) Upload the jar to the cluster: hadoop fs -put mysql-connector-java-5.1.0-bin.jar /hdfsPath/

       b) Before the MR program submits the job, add: DistributedCache.addFileToClassPath(new Path("/hdfsPath/mysql-connector-java-5.1.0-bin.jar"), conf); (see the sketch below)
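As a minimal driver sketch of option (2), assuming the jar has already been uploaded to /hdfsPath/ as in step a); the class name DriverWithMysqlJar is made up for illustration:

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class DriverWithMysqlJar {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(DriverWithMysqlJar.class);
        // Put the MySQL driver jar (already uploaded with hadoop fs -put)
        // on the classpath of every map and reduce task.
        DistributedCache.addFileToClassPath(
                new Path("/hdfsPath/mysql-connector-java-5.1.0-bin.jar"), conf);
        // ... the rest of the job configuration, as in the examples below ...
        JobClient.runJob(conf);
    }
}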

Storing MySQL data in Hadoop HDFS

MySQL table creation and data initialization
DROP TABLE IF EXISTS `wu_testhadoop`;

CREATE TABLE `wu_testhadoop` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(255) DEFAULT NULL,
  `content` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;

-- ----------------------------
-- Records of wu_testhadoop
-- ----------------------------
INSERT INTO `wu_testhadoop` VALUES ('1', '123', '122312');
INSERT INTO `wu_testhadoop` VALUES ('2', '123', '123456');
Defining Hadoop data access

Once the MySQL table is in place, we need to define the rules by which Hadoop accesses MySQL.

Hadoop provides the org.apache.hadoop.io.Writable interface for implementing a simple, efficient serialization protocol based on DataInput and DataOutput.

For database access, Hadoop also provides the org.apache.hadoop.mapred.lib.db.DBWritable interface: its write method sets values on a PreparedStatement, and its readFields method binds column values from a row read out of the database.

Both interfaces are used as follows (the examples are taken from the Hadoop source code):

Writable
public class MyWritable implements Writable {
    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public static MyWritable read(DataInput in) throws IOException {
        MyWritable w = new MyWritable();
        w.readFields(in);
        return w;
    }
}

DBWritable
public class MyWritable implements Writable, DBWritable {
    // Some data
    private int counter;
    private long timestamp;

    // Writable#write() implementation
    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    // Writable#readFields() implementation
    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public void write(PreparedStatement statement) throws SQLException {
        statement.setInt(1, counter);
        statement.setLong(2, timestamp);
    }

    public void readFields(ResultSet resultSet) throws SQLException {
        counter = resultSet.getInt(1);
        timestamp = resultSet.getLong(2);
    }
}
The class corresponding to the database table
package com.wyg.hadoop.mysql.bean;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class DBRecord implements Writable, DBWritable {
    private int id;
    private String title;
    private String content;

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    @Override
    public void readFields(ResultSet set) throws SQLException {
        this.id = set.getInt("id");
        this.title = set.getString("title");
        this.content = set.getString("content");
    }

    @Override
    public void write(PreparedStatement pst) throws SQLException {
        pst.setInt(1, id);
        pst.setString(2, title);
        pst.setString(3, content);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id = in.readInt();
        this.title = Text.readString(in);
        this.content = Text.readString(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(this.id);
        Text.writeString(out, this.title);
        Text.writeString(out, this.content);
    }

    @Override
    public String toString() {
        return this.id + " " + this.title + " " + this.content;
    }
}

Implementing Map/Reduce

package com.wyg.hadoop.mysql.mapper;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

import com.wyg.hadoop.mysql.bean.DBRecord;

@SuppressWarnings("deprecation")
public class DBRecordMapper extends MapReduceBase
        implements Mapper<LongWritable, DBRecord, LongWritable, Text> {

    @Override
    public void map(LongWritable key, DBRecord value,
            OutputCollector<LongWritable, Text> collector, Reporter reporter)
            throws IOException {
        collector.collect(new LongWritable(value.getId()), new Text(value.toString()));
    }
}

Testing: connect Hadoop to MySQL and store the data in HDFS

package com.wyg.hadoop.mysql.db;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;

import com.wyg.hadoop.mysql.bean.DBRecord;
import com.wyg.hadoop.mysql.mapper.DBRecordMapper;

public class DBAccess {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(DBAccess.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setInputFormat(DBInputFormat.class);

        Path path = new Path("hdfs://192.168.44.129:9000/user/root/dbout");
        FileOutputFormat.setOutputPath(conf, path);

        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://your_ip:3306/your_database", "username", "password");
        String[] fields = {"id", "title", "content"};
        DBInputFormat.setInput(conf, DBRecord.class, "wu_testhadoop",
                null, "id", fields);

        conf.setMapperClass(DBRecordMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        JobClient.runJob(conf);
    }
}
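For reference, DBInputFormat also offers a query-based setInput overload; the call below is a hedged equivalent of the table-based setup above (the two SQL strings are illustrative, not from the original article):

// Query-based input setup: the first statement fetches the rows,
// the second tells Hadoop how many records there are for splitting.
DBInputFormat.setInput(conf, DBRecord.class,
        "SELECT id, title, content FROM wu_testhadoop ORDER BY id",
        "SELECT COUNT(id) FROM wu_testhadoop");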

Run the program; the output is as follows:

15/08/11 16:46:18 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/08/11 16:46:18 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/08/11 16:46:18 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/08/11 16:46:19 INFO mapred.JobClient: Running job: job_local_0001
15/08/11 16:46:19 INFO mapred.MapTask: numReduceTasks: 1
15/08/11 16:46:19 INFO mapred.MapTask: io.sort.mb = 100
15/08/11 16:46:19 INFO mapred.MapTask: data buffer = 79691776/99614720
15/08/11 16:46:19 INFO mapred.MapTask: record buffer = 262144/327680
15/08/11 16:46:19 INFO mapred.MapTask: Starting flush of map output
15/08/11 16:46:19 INFO mapred.MapTask: Finished spill 0
15/08/11 16:46:19 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
15/08/11 16:46:19 INFO mapred.LocalJobRunner:
15/08/11 16:46:19 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
15/08/11 16:46:19 INFO mapred.LocalJobRunner:
15/08/11 16:46:19 INFO mapred.Merger: Merging 1 sorted segments
15/08/11 16:46:19 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 48 bytes
15/08/11 16:46:19 INFO mapred.LocalJobRunner:
15/08/11 16:46:19 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
15/08/11 16:46:19 INFO mapred.LocalJobRunner:
15/08/11 16:46:19 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
15/08/11 16:46:19 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://192.168.44.129:9000/user/root/dbout
15/08/11 16:46:19 INFO mapred.LocalJobRunner: reduce > reduce
15/08/11 16:46:19 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
15/08/11 16:46:20 INFO mapred.JobClient:  map 100% reduce 100%
15/08/11 16:46:20 INFO mapred.JobClient: Job complete: job_local_0001
15/08/11 16:46:20 INFO mapred.JobClient: Counters: 14
15/08/11 16:46:20 INFO mapred.JobClient:   FileSystemCounters
15/08/11 16:46:20 INFO mapred.JobClient:     FILE_BYTES_READ=34606
15/08/11 16:46:20 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=69844
15/08/11 16:46:20 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=30
15/08/11 16:46:20 INFO mapred.JobClient:   Map-Reduce Framework
15/08/11 16:46:20 INFO mapred.JobClient:     Reduce input groups=2
15/08/11 16:46:20 INFO mapred.JobClient:     Combine output records=0
15/08/11 16:46:20 INFO mapred.JobClient:     Map input records=2
15/08/11 16:46:20 INFO mapred.JobClient:     Reduce shuffle bytes=0
15/08/11 16:46:20 INFO mapred.JobClient:     Reduce output records=2
15/08/11 16:46:20 INFO mapred.JobClient:     Spilled Records=4
15/08/11 16:46:20 INFO mapred.JobClient:     Map output bytes=42
15/08/11 16:46:20 INFO mapred.JobClient:     Map input bytes=2
15/08/11 16:46:20 INFO mapred.JobClient:     Combine input records=0
15/08/11 16:46:20 INFO mapred.JobClient:     Map output records=2
15/08/11 16:46:20 INFO mapred.JobClient:     Reduce input records=2


At the same time, a new dbout directory appears in the HDFS file system; the files inside hold the data read from the database table, with the following content:

1	1 123 122312
2	2 123 123456
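To inspect the file directly, a command along these lines can be used (assuming the single reducer wrote its output to the default part-00000 file):

hadoop fs -cat hdfs://192.168.44.129:9000/user/root/dbout/part-00000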

Importing HDFS data into MySQL

Storing HDFS files back into MySQL also relies on the DBRecord class above, because the database operations all go through DBInputFormat and DBOutputFormat.

First we need to define the map and reduce implementations (map parses the HDFS file; reduce takes map's output and writes it to the database).

package com.wyg.hadoop.mysql.mapper;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

import com.wyg.hadoop.mysql.bean.DBRecord;

public class WriteDB {

    // Map phase: parse each line of the HDFS file produced by the previous job.
    public static class Map extends MapReduceBase implements
            Mapper<Object, Text, Text, DBRecord> {

        private final static DBRecord one = new DBRecord();
        private Text word = new Text();

        @Override
        public void map(Object key, Text value,
                OutputCollector<Text, DBRecord> output, Reporter reporter)
                throws IOException {
            // Each input line looks like "1<TAB>1 123 122312":
            // the key written by TextOutputFormat, a tab, then "id title content".
            String line = value.toString();
            String[] keyAndValue = line.split("\t");
            String[] infos = keyAndValue[1].split(" ");
            String id = infos[0];
            one.setId(Integer.parseInt(id));
            one.setTitle(infos[1]);
            one.setContent(infos[2]);
            word.set(id);
            output.collect(word, one);
        }
    }

    // Reduce phase: emit the DBRecord as the key so DBOutputFormat writes it to MySQL.
    public static class Reduce extends MapReduceBase implements
            Reducer<Text, DBRecord, DBRecord, Text> {

        @Override
        public void reduce(Text key, Iterator<DBRecord> values,
                OutputCollector<DBRecord, Text> collector, Reporter reporter)
                throws IOException {
            DBRecord record = values.next();
            collector.collect(record, new Text());
        }
    }
}
Testing: import HDFS data into the database

package com.wyg.hadoop.mysql.db;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

import com.wyg.hadoop.mysql.bean.DBRecord;
import com.wyg.hadoop.mysql.mapper.WriteDB;

public class DBInsert {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WriteDB.class);

        // Set the input and output formats
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(DBOutputFormat.class);

        // Without the two map-output lines the job does not pass, although the
        // examples found online omit them: the map output is <Text, DBRecord>.
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(DBRecord.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(DBRecord.class);

        // Set the Map and Reduce classes
        conf.setMapperClass(WriteDB.Map.class);
        conf.setReducerClass(WriteDB.Reduce.class);

        // Set the input directory
        FileInputFormat.setInputPaths(conf,
                new Path("hdfs://192.168.44.129:9000/user/root/dbout"));

        // Configure the database connection
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://database_ip:3306/database_name", "username", "password");
        String[] fields = {"id", "title", "content"};
        DBOutputFormat.setOutput(conf, "wu_testhadoop", fields);

        JobClient.runJob(conf);
    }
}

The test output is as follows:

15/08/11 18:10:15 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/08/11 18:10:15 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/08/11 18:10:15 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/08/11 18:10:15 INFO mapred.FileInputFormat: Total input paths to process : 1
15/08/11 18:10:15 INFO mapred.JobClient: Running job: job_local_0001
15/08/11 18:10:15 INFO mapred.FileInputFormat: Total input paths to process : 1
15/08/11 18:10:15 INFO mapred.MapTask: numReduceTasks: 1
15/08/11 18:10:15 INFO mapred.MapTask: io.sort.mb = 100
15/08/11 18:10:15 INFO mapred.MapTask: data buffer = 79691776/99614720
15/08/11 18:10:15 INFO mapred.MapTask: record buffer = 262144/327680
15/08/11 18:10:15 INFO mapred.MapTask: Starting flush of map output
15/08/11 18:10:16 INFO mapred.MapTask: Finished spill 0
15/08/11 18:10:16 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
15/08/11 18:10:16 INFO mapred.LocalJobRunner: hdfs://192.168.44.129:9000/user/root/dbout/part-00000:0+30
15/08/11 18:10:16 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
15/08/11 18:10:16 INFO mapred.LocalJobRunner:
15/08/11 18:10:16 INFO mapred.Merger: Merging 1 sorted segments
15/08/11 18:10:16 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 40 bytes
15/08/11 18:10:16 INFO mapred.LocalJobRunner:
15/08/11 18:10:16 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
15/08/11 18:10:16 INFO mapred.LocalJobRunner: reduce > reduce
15/08/11 18:10:16 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
15/08/11 18:10:16 INFO mapred.JobClient:  map 100% reduce 100%
15/08/11 18:10:16 INFO mapred.JobClient: Job complete: job_local_0001
15/08/11 18:10:16 INFO mapred.JobClient: Counters: 14
15/08/11 18:10:16 INFO mapred.JobClient:   FileSystemCounters
15/08/11 18:10:16 INFO mapred.JobClient:     FILE_BYTES_READ=34932
15/08/11 18:10:16 INFO mapred.JobClient:     HDFS_BYTES_READ=60
15/08/11 18:10:16 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=70694
15/08/11 18:10:16 INFO mapred.JobClient:   Map-Reduce Framework
15/08/11 18:10:16 INFO mapred.JobClient:     Reduce input groups=2
15/08/11 18:10:16 INFO mapred.JobClient:     Combine output records=0
15/08/11 18:10:16 INFO mapred.JobClient:     Map input records=2
15/08/11 18:10:16 INFO mapred.JobClient:     Reduce shuffle bytes=0
15/08/11 18:10:16 INFO mapred.JobClient:     Reduce output records=2
15/08/11 18:10:16 INFO mapred.JobClient:     Spilled Records=4
15/08/11 18:10:16 INFO mapred.JobClient:     Map output bytes=34
15/08/11 18:10:16 INFO mapred.JobClient:     Map input bytes=30
15/08/11 18:10:16 INFO mapred.JobClient:     Combine input records=0
15/08/11 18:10:16 INFO mapred.JobClient:     Map output records=2
15/08/11 18:10:16 INFO mapred.JobClient:     Reduce input records=2

Before the test I emptied the original table; after the job runs you can see that two rows have been added to the database.

Running the job a second time will fail, which is expected: we assign the id explicitly when importing, so the primary key collides. If the id were left out, rows could be inserted again and again.
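As a minimal sketch of that id-free variant (an assumption about how one might do it, not code from the original article; the class name DBRecordNoId is made up), the DBWritable would write only title and content, letting MySQL fill in the id via AUTO_INCREMENT:

package com.wyg.hadoop.mysql.bean;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.lib.db.DBWritable;

// Hypothetical variant of DBRecord that never touches the id column,
// so MySQL assigns it via AUTO_INCREMENT and repeated imports do not collide.
public class DBRecordNoId implements Writable, DBWritable {
    private String title;
    private String content;

    public void setTitle(String title) { this.title = title; }
    public void setContent(String content) { this.content = content; }

    public void write(PreparedStatement pst) throws SQLException {
        pst.setString(1, title);     // matches fields[0] = "title"
        pst.setString(2, content);   // matches fields[1] = "content"
    }

    public void readFields(ResultSet rs) throws SQLException {
        title = rs.getString("title");
        content = rs.getString("content");
    }

    public void write(DataOutput out) throws IOException {
        Text.writeString(out, title);
        Text.writeString(out, content);
    }

    public void readFields(DataInput in) throws IOException {
        title = Text.readString(in);
        content = Text.readString(in);
    }
}

In the driver, the matching change would be DBOutputFormat.setOutput(conf, "wu_testhadoop", new String[] {"title", "content"}); and the map/reduce classes would emit DBRecordNoId instead of DBRecord.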

Source code

The source code has been uploaded to download.csdn.net/detail/wuyinggui10000/8974585.


