標籤:style blog http io ar color os 使用 sp
本篇隨筆記錄如何匯入google-cluster-data-2011-1-2的
job_events和task_events到MySQL
1. 下載資料
download_job_events:
import urllib2url = ‘https://commondatastorage.googleapis.com/clusterdata-2011-2/‘f = open(‘C:\\SHA256SUM‘)l = f.readlines()f.close()for i in l: if i.count(‘job_events‘)>0: fileAddr = i.split()[1][1:] fileName = fileAddr.split(‘/‘)[1] print ‘downloading‘, fileName data = urllib2.urlopen(url+fileAddr).read() print ‘saving‘, fileName fileDown = open(‘C:\\job_events\\‘+fileName, ‘wb‘) fileDown.write(data) fileDown.close()
download_task_events:
import urllib2url = ‘https://commondatastorage.googleapis.com/clusterdata-2011-2/‘f = open(‘C:\\SHA256SUM‘)l = f.readlines()f.close()for i in l: if i.count(‘task_events‘)>0: fileAddr = i.split()[1][1:] fileName = fileAddr.split(‘/‘)[1] print ‘downloading‘, fileName data = urllib2.urlopen(url+fileAddr).read() print ‘saving‘, fileName fileDown = open(‘C:\\task_events\\‘+fileName, ‘wb‘) fileDown.write(data) fileDown.close()
注意:這次用的資料是
clusterdata-2011-2
不同於之前重畫GoogleCLusterData中的
clusterdata-2011-1
2. 解壓縮
由於不能直接匯入壓縮包裡的資料到mysql,故先將它們解壓縮
unzip_job_events:
import gzipimport osfileNames = os.listdir(‘C:\\task_events‘)for l in fileNames: print ‘now at: ‘+ l f = gzip.open(‘C:\\job_events\\‘+l) fOut = open(‘C:\\job_events_unzip\\‘+l[:-3], ‘w‘) content = f.read() fOut.write(content) f.close() fOut.close() #raw_input()
unzip_task_events:
import gzipimport osfileNames = os.listdir(‘C:\\task_events‘)for l in fileNames: print ‘now at: ‘+ l f = gzip.open(‘C:\\task_events\\‘+l) fOut = open(‘C:\\task_events_unzip\\‘+l[:-3], ‘w‘) content = f.read() fOut.write(content) f.close() fOut.close()
3. 建資料庫
create_job_events:
create table job_events(time bigint,missing_info int,job_id bigint,event_type int,user text,scheduling_class int,job_name text,logical_job_name text)engine = myisam;
create_task_events:
create table task_events(time bigint,missing_info int,job_id bigint,task_index bigint,machine_id bigint,event_type int,user text,scheduling_class int,priority int,cpu_request float,memory_request float,disk_space_request float,difference_machine_restriction boolean)engine = myisam;
注意:由於資料量非常大,這裡一定要選擇myisam作為engine。
4. 匯入資料
由於資料中有部分為空白的值,需要先設定mysql使其能夠匯入空值。
具體方法為:
在mysql的控制台輸入
SET @@GLOBAL.sql_mode="NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION";
之後就可以開始匯入資料了。
loadJobEvents2MySQL.py
import osimport MySQLdbfileNames = os.listdir(‘C:\\task_events_unzip‘)conn=MySQLdb.connect(host="localhost",user="root",passwd="123456",db="googleclusterdata",charset="utf8")cursor = conn.cursor()cursor.execute(‘truncate job_events‘)for f in fileNames: print ‘now at: ‘+ f order = "load data infile ‘C:/job_events_unzip/%s‘ into table job_events fields terminated by ‘,‘ lines terminated by ‘\n‘" %f print order cursor.execute(order) conn.commit()
loadTaskEvents2MySQL.py
import osimport MySQLdbfileNames = os.listdir(‘C:\\task_events_unzip‘)conn=MySQLdb.connect(host="localhost",user="root",passwd="123456",db="googleclusterdata",charset="utf8")cursor = conn.cursor()cursor.execute(‘truncate task_events‘)for f in fileNames: print ‘now at: ‘+ f order = "load data infile ‘C:/task_events_unzip/%s‘ into table task_events fields terminated by ‘,‘ lines terminated by ‘\n‘" %f print order cursor.execute(order) conn.commit()
注意:這裡需要相應的修改密碼和使用的資料庫名(db)
匯入GoogleClusterData到MySQL