Python 3 Simulation of MapReduce for Processing and Analyzing a Big Data File -- "Python Treasure"


I recently bought and have been reading "Python Treasure". The book covers Python broadly but not in much depth, so it is best suited to beginners and to readers looking to round out their skills. The chapter on big data processing with Python caught my interest; after reading it, I modified the sample source code provided in the book and used it to simulate the MapReduce process.

Goal: count the number of accesses to each page resource in the Apache user access log access.log. Assume the file is very large.

Record structure in access.log: 66.249.68.43 - - [04/Aug/2011:01:06:48 +0800] "GET /page address HTTP/1.1" 200 4100

Field meanings: 66.249.68.43 is the source IP, [04/Aug/2011:01:06:48 +0800] is the date and time of the request, "GET /page address HTTP/1.1" is the request method and resource, 200 is the returned status code, and 4100 is the number of bytes returned.
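To make the format concrete, here is a small sketch that pulls the requested resource out of one log line with a regular expression; the sample path /page.html is made up for illustration, and the pattern mirrors the one used in the map script further below:

import re

# Hypothetical sample line in the access.log format described above.
sample = '66.249.68.43 - - [04/Aug/2011:01:06:48 +0800] "GET /page.html HTTP/1.1" 200 4100'
p_re = re.compile(r'(GET|POST)\s(.*?)\sHTTP', re.IGNORECASE)  # extract method and resource
match = p_re.findall(sample)
if match:
    print(match[0][1])  # prints: /page.html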

Steps:

1. Split the large file into many small files.
2. Count the page resource accesses within each small file (the map step); each small file gets a corresponding result file.
3. Merge and accumulate the per-file results (the reduce step) to obtain the final page resource access counts, and save them to reduceResult.txt.
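Before walking through the actual scripts, here is a conceptual, in-memory sketch of the three steps (not the book's code). It ignores the huge-file constraint by loading the whole log at once, and the chunk size of 30 lines matches the split size used below:

import re
from collections import Counter

p_re = re.compile(r'(GET|POST)\s(.*?)\sHTTP', re.IGNORECASE)

def map_chunk(lines):
    # Map step: count resource accesses in one chunk of log lines.
    counts = Counter()
    for line in lines:
        match = p_re.findall(line)
        if match:
            counts[match[0][1]] += 1
    return counts

def reduce_counts(partials):
    # Reduce step: merge the per-chunk counters into one total.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

with open('files/access.log', 'r', encoding='utf8') as f:
    lines = f.readlines()
chunks = [lines[i:i + 30] for i in range(0, len(lines), 30)]  # "split" step
result = reduce_counts(map_chunk(chunk) for chunk in chunks)
print(result.most_common(10))  # ten most visited resources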

File structure:


Here, access.log is the original log file, the smallfiles directory holds the split small files, the mapfiles directory holds the result file produced for each small file, and reduceResult.txt holds the final result.

The following is the source code:

"' Created on 2014-12-19@author:guoxiyue@file:0filesplit.py@function: File Split ' ' Import os,os.path,time;sourcefile= Open (' Files/access.log ', ' R ', encoding= ' UTF8 '); #打开原始大文件targetDir = ' files/smallfiles '; # Set small files to save directory smallfilesize=30; #设置每个小文件中的记录条数tempList =[]; #临时列表, Filenum=1 for recording content in small files; # Small file ordinal readline=sourcefile.readline (); #先读一行while (readLine): #循环    linenum=1 while    (linenum<=smallfilesize): #控制每个小文件的记录条数不超过设定值        Templist.append (ReadLine); #将当前行的读取结果保存到临时列表中        readline=sourcefile.readline (); #再读下一行        linenum+=1;# line number self-increment        if not readline:break;# If read empty, then the large file is read, exit the inner loop    tempfile=open (' Files/smallfiles/access_ ' +str (filenum) + '. txt ', ' w ', encoding= ' UTF8 '); Save the read 30 records to a file    tempfile.writelines (templist);    Tempfile.close ();    Templist=[]; #清空临时列表    print (' Files/smallfiles/access_ ' +str (filenum) + '. txt is  created in ' +str (Time.asctime ()));    Filenum+=1; #文件序号自增sourceFile. Close ();      

The "' Created on 2014-12-19@author:guoxiyue@file:1map.py@function:map process, respectively, calculates the number of page resource accesses in each small file" ' Import Os,os.path,re, Time;sourcefilelist=os.listdir (' files/smallfiles/'); #获取所有小文件文件名列表targetDir = "files/mapfiles/"; #设置处理结果保存目录for eachfile in Sourcefilelist: #遍历小文件 currentfile=open (' files/smallfiles/' +eachfile, ' r ', encoding= ' UTF8 ' ); #打开小文件 Currentline=currentfile.readline (); #先读一行 tempdict={}; #临时字典 while (currentline): P_re=re.compile ("(get| POST) \s (. *?) \shttp ", Re. IGNORECASE); #用正则表达式来提取访问资源 Match=p_re.findall (CurrentLine); #从当前行中提取出访问资源 if (match): url=match[0][1];             #提出资源页面 if URL in tempdict: #获取当前资源的访问次数, and add tempdict[url]+=1 to the dictionary;        Else:tempdict[url]=1; Currentline=currentfile.readline ();        #再读下一行 Currentfile.close ();    #以下是将当前小文件的统计结果排序并保存 tlist=[];        For Key,value in Sorted (Tempdict.items (), Key=lambda data:data[0],reverse=true): Tlist.append (key+ "+str (value)); TLiSt.append (' \ n ') Tempfile=open (Targetdir+eachfile, ' a ', encoding= ' UTF8 ');    Tempfile.writelines (tList);     Tempfile.close () print (targetdir+eachfile+ '. txt is created in ' +str (Time.asctime ()));

"' Created on 2014-12-19@author:guoxiyue@file:2reduce.py@function:reduce process, summarizes the final page resource traffic" ' Import os,os.path,re, Time;sourcefilelist=os.listdir (' files/mapfiles/'); #获取小文件的map结果文件名列表targetFile = ' files/reduceresult.txt '; # Set Final result save file tempdict={}; #临时字典 P_re=re.compile (' (. *?) (\d{1,}$) ', Re. IGNORECASE); #利用正则表达式抽取资源访问次数for eachfile in Sourcefilelist: #遍历map文件 currentfile=open (' files/mapfiles/' +eachfile, ' r ', encoding= ' UTF8 '); #打开当前文件 Currentline=currentfile.readline ();  #读一行 while (currentline): Subdata=p_re.findall (CurrentLine) #提取出当前行中资源的访问次数 if (subdata[0][0] in tempdict):        #将结果累加 Tempdict[subdata[0][0]]+=int (subdata[0][1]);        Else:tempdict[subdata[0][0]]=int (subdata[0][1]); Currentline=currentfile.readline (); #再读一行 currentfile.close ();    #以下是将所有map文件的统计结果排序并保存tList =[];for Key,value in Sorted (Tempdict.items (), Key=lambda data:data[1],reverse=true):    Tlist.append (key+ ' +str (value)); Tlist.append (' \ n ') Tempfile=open (targetfile, ' a ', Encoding= ' UTF8 '); Tempfile.writelines (tList); Tempfile.close () print (targetfile+ ' created in ' +str (Time.asctime ()));     
