I recently bought the book "Python treasure" and have been reading it. It covers a broad range of Python topics but does not go very deep, so it is best suited to beginners and to readers who want to move up a level. The chapter on big data processing with Python caught my interest. After reading it, I modified the sample source code provided with the book and used it to simulate the MapReduce process.
Goal: count how many times each page resource is accessed, based on the Apache access log access.log. Assume that the file is very large.
Record format in access.log: 66.249.68.43 - - [04/Aug/2011:01:06:48 +0800] "GET /page-address HTTP/1.1" 200 4100
Meaning: 66.249.68.43 is the source IP; [04/Aug/2011:01:06:48 +0800] is the date and time; "GET /page-address HTTP/1.1" is the request (method, requested page, protocol); 200 is the returned status code; 4100 is the number of bytes returned.
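The field breakdown above can be sketched as a small parser. This is only an illustration: the pattern is an assumption derived from the single sample record, not a complete Apache log format parser, and the names LOG_PATTERN and parse_line as well as the URL /index.html are made up for the example.

```python
import re

# Assumed pattern, built from the one sample record shown above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of the named fields, or None if the line does not match."""
    found = LOG_PATTERN.match(line)
    return found.groupdict() if found else None


# Hypothetical record in the same shape as the sample.
sample = '66.249.68.43 - - [04/Aug/2011:01:06:48 +0800] "GET /index.html HTTP/1.1" 200 4100'
fields = parse_line(sample)
print(fields['ip'], fields['url'], fields['status'], fields['size'])
```

Only the requested page is needed for the count; the other fields are captured here just to mirror the description.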
Steps:
1. Split the large file into multiple small files.
2. Count the page accesses in each small file (the map process); each small file produces one result file.
3. Merge and sum the results of all the small files (the reduce process) to get the final access count for each page resource. The result is saved to reduceResult.txt.
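Before looking at the file-based scripts, the three steps can be sketched entirely in memory. This is a simplified illustration rather than the book's code: the function names split, map_count and reduce_counts and the three log lines are hypothetical, and collections.Counter stands in for the hand-written dictionaries.

```python
import re
from collections import Counter

# Captures the requested resource between the method and the protocol.
REQUEST = re.compile(r'(GET|POST)\s(.*?)\sHTTP', re.IGNORECASE)


def split(lines, size):
    """Step 1: cut the list of lines into chunks of at most `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_count(chunk):
    """Step 2 (map): count accesses per resource within one chunk."""
    counts = Counter()
    for line in chunk:
        found = REQUEST.search(line)
        if found:
            counts[found.group(2)] += 1
    return counts


def reduce_counts(partials):
    """Step 3 (reduce): sum the per-chunk counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical log lines, for illustration only.
log_lines = [
    '1.1.1.1 - - [04/Aug/2011:01:06:48 +0800] "GET /a.html HTTP/1.1" 200 10',
    '1.1.1.2 - - [04/Aug/2011:01:06:49 +0800] "GET /b.html HTTP/1.1" 200 20',
    '1.1.1.3 - - [04/Aug/2011:01:06:50 +0800] "GET /a.html HTTP/1.1" 200 10',
]
partials = [map_count(chunk) for chunk in split(log_lines, 2)]
totals = reduce_counts(partials)
print(totals.most_common())  # [('/a.html', 2), ('/b.html', 1)]
```

Each chunk is counted independently of the others, which is the property that lets the map step be spread over many files.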
File structure:
Here, access.log is the original log file, the smallfiles directory holds the small files produced by the split, the mapfiles directory holds the result file for each small file, and reduceResult.txt holds the final result.
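The scripts write into the smallfiles and mapfiles directories, so those directories presumably need to exist before the scripts run. A minimal sketch of creating the layout follows; the helper name make_layout is hypothetical, and a temporary directory is used as the root so the example does not touch real data.

```python
import os
import tempfile


def make_layout(root):
    """Create the smallfiles and mapfiles directories under root if missing."""
    small_dir = os.path.join(root, 'smallfiles')
    map_dir = os.path.join(root, 'mapfiles')
    os.makedirs(small_dir, exist_ok=True)  # no error if it already exists
    os.makedirs(map_dir, exist_ok=True)
    return small_dir, map_dir


# Temporary root standing in for the real 'files' directory.
root = os.path.join(tempfile.mkdtemp(), 'files')
small_dir, map_dir = make_layout(root)
print(sorted(os.listdir(root)))  # ['mapfiles', 'smallfiles']
```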
The following is the source code:

'''
Created on 2014-12-19
@author: guoxiyue
@file: 0filesplit.py
@function: file split
'''
import os, os.path, time

sourceFile = open('files/access.log', 'r', encoding='utf8')  # open the original large file
targetDir = 'files/smallfiles'  # directory where the small files are saved
smallFileSize = 30  # number of records in each small file
tempList = []  # temporary list holding the records for the current small file
fileNum = 1  # small file sequence number
readLine = sourceFile.readline()  # read the first line
while readLine:  # loop until the large file is exhausted
    lineNum = 1
    while lineNum <= smallFileSize:  # keep each small file within the record limit
        tempList.append(readLine)  # save the current line to the temporary list
        readLine = sourceFile.readline()  # read the next line
        lineNum += 1  # increment the line counter
        if not readLine:
            break  # an empty read means the large file is finished; leave the inner loop
    tempFile = open('files/smallfiles/access_' + str(fileNum) + '.txt', 'w', encoding='utf8')  # save the records just read (at most 30) to a file
    tempFile.writelines(tempList)
    tempFile.close()
    tempList = []  # clear the temporary list
    print('files/smallfiles/access_' + str(fileNum) + '.txt created at ' + str(time.asctime()))
    fileNum += 1  # increment the file sequence number
sourceFile.close()
'''
Created on 2014-12-19
@author: guoxiyue
@file: 1map.py
@function: map process; counts the page resource accesses in each small file
'''
import os, os.path, re, time

sourceFileList = os.listdir('files/smallfiles/')  # list the names of all small files
targetDir = 'files/mapfiles/'  # directory where the map results are saved
p_re = re.compile(r'(GET|POST)\s(.*?)\sHTTP', re.IGNORECASE)  # regular expression that extracts the requested resource
for eachFile in sourceFileList:  # iterate over the small files
    currentFile = open('files/smallfiles/' + eachFile, 'r', encoding='utf8')  # open the small file
    currentLine = currentFile.readline()  # read the first line
    tempDict = {}  # temporary dictionary: resource -> access count
    while currentLine:
        match = p_re.findall(currentLine)  # extract the requested resource from the current line
        if match:
            url = match[0][1]  # the requested page
            if url in tempDict:  # increment the count, or add the resource to the dictionary
                tempDict[url] += 1
            else:
                tempDict[url] = 1
        currentLine = currentFile.readline()  # read the next line
    currentFile.close()
    # sort the statistics of the current small file and save them
    tList = []
    for key, value in sorted(tempDict.items(), key=lambda data: data[0], reverse=True):
        tList.append(key + ' ' + str(value))
        tList.append('\n')
    tempFile = open(targetDir + eachFile, 'w', encoding='utf8')  # 'w' so that a rerun overwrites instead of appending duplicate lines
    tempFile.writelines(tList)
    tempFile.close()
    print(targetDir + eachFile + ' created at ' + str(time.asctime()))

'''
Created on 2014-12-19
@author: guoxiyue
@file: 2reduce.py
@function: reduce process; totals the final page resource access counts
'''
import os, os.path, re, time

sourceFileList = os.listdir('files/mapfiles/')  # list the map result files of the small files
targetFile = 'files/reduceResult.txt'  # file where the final result is saved
tempDict = {}  # temporary dictionary: resource -> total access count
p_re = re.compile(r'(.*?)\s(\d+)$')  # regular expression: the resource, then its access count at the end of the line
for eachFile in sourceFileList:  # iterate over the map files
    currentFile = open('files/mapfiles/' + eachFile, 'r', encoding='utf8')  # open the current file
    currentLine = currentFile.readline()  # read the first line
    while currentLine:
        subData = p_re.findall(currentLine)  # extract the resource and its access count from the current line
        if subData:  # skip any line that does not end with a count
            if subData[0][0] in tempDict:  # accumulate the result
                tempDict[subData[0][0]] += int(subData[0][1])
            else:
                tempDict[subData[0][0]] = int(subData[0][1])
        currentLine = currentFile.readline()  # read the next line
    currentFile.close()
# sort the combined statistics of all map files by access count and save them
tList = []
for key, value in sorted(tempDict.items(), key=lambda data: data[1], reverse=True):
    tList.append(key + ' ' + str(value))
    tList.append('\n')
tempFile = open(targetFile, 'w', encoding='utf8')  # 'w' so that a rerun overwrites the previous result
tempFile.writelines(tList)
tempFile.close()
print(targetFile + ' created at ' + str(time.asctime()))
Python 3: simulating MapReduce to analyze a big data file ("Python treasure")