Today I helped a classmate with a small corpus task. The corpus is fairly large, its paragraphs are separated by two consecutive newline characters, and he wanted to split it into multiple small files, with every 3 paragraphs forming a new file. I had never done this kind of operation before, and the similar methods I found online looked somewhat complicated. So I tried writing a short piece of code myself, and it solved the problem.
The basic idea is to read the content of the original file and split it with a regular expression on the newline-based paragraph separator. The result is a list in which each element holds one slice of the content. Next, create a file handle for writing, then traverse the list of slices and write each slice to the current file. After each write, check whether 3 paragraphs have been written. If not, continue with the next slice. If so, close the current write handle, create a new one with a different file name, and continue the loop with the next slice.
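The split-and-group step described above can be tried in memory before touching any files. The following is a minimal sketch, assuming paragraphs are separated by exactly two newline characters; the helper name `group_paragraphs` and the sample text are made up for illustration.

```python
import re


def group_paragraphs(text, size=3):
    """Split text into paragraphs and return lists of at most `size` paragraphs."""
    # Assumption: a paragraph break is two consecutive newline characters.
    # Empty or whitespace-only slices are dropped.
    paragraphs = [p for p in re.split(r'\n\n', text) if p.strip()]
    # Step through the paragraph list `size` items at a time.
    return [paragraphs[i:i + size] for i in range(0, len(paragraphs), size)]


sample = 'p1\n\np2\n\np3\n\np4\n\np5'
print(group_paragraphs(sample))  # [['p1', 'p2', 'p3'], ['p4', 'p5']]
```

Each inner list corresponds to the content of one output file; the last list may hold fewer than 3 paragraphs.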
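# -*- coding: utf8 -*-
import re

# Paragraphs are separated by two consecutive newline characters.
p = re.compile(r'\n\n')
# Read the content of the source file.
with open('files/office.txt', 'r', encoding='utf8') as f:
    fileContent = f.read()
# Split the text into a list of paragraphs.
paraList = p.split(fileContent)
# Create the first write handle. Mode 'a' appends, so remove old output files before re-running.
fileWriter = open('files/0.txt', 'a', encoding='utf8')
# Traverse the list of slices.
for paraIndex in range(len(paraList)):
    # Write the current slice, restoring the separator that split() removed.
    fileWriter.write(paraList[paraIndex] + '\n\n')
    # Check whether 3 slices have been written to the current file.
    if (paraIndex + 1) % 3 == 0:
        # Close the current handle.
        fileWriter.close()
        # Open a new handle for the next slices. Note how the file name is built:
        # integer division (//) gives 1.txt, 2.txt, ... instead of 1.0.txt.
        fileWriter = open('files/' + str((paraIndex + 1) // 3) + '.txt', 'a', encoding='utf8')
# Close the last write handle that was created.
fileWriter.close()
print('finished')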
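Because the corpus is large, one alternative is to stream the file line by line instead of reading it all at once. This is a sketch of a different technique, not the author's method: the function name `split_by_paragraphs` and its parameters are hypothetical, and it assumes paragraphs are separated by blank lines (whitespace-only lines are treated as blank).

```python
import os


def split_by_paragraphs(src_path, out_dir, per_file=3):
    """Stream src_path and write every `per_file` paragraphs to out_dir/N.txt.

    Returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    file_count = 0
    group = []    # finished paragraphs waiting to be written
    current = []  # lines of the paragraph currently being read

    def flush():
        # Write the collected paragraphs to the next numbered file.
        nonlocal file_count
        out_path = os.path.join(out_dir, '%d.txt' % file_count)
        with open(out_path, 'w', encoding='utf8') as out:
            out.write('\n\n'.join(group))
        file_count += 1
        group.clear()

    with open(src_path, 'r', encoding='utf8') as src:
        for line in src:
            if line.strip():
                current.append(line.rstrip('\n'))
                continue
            # Assumption: a blank line ends the current paragraph, if any.
            if current:
                group.append('\n'.join(current))
                current = []
                if len(group) == per_file:
                    flush()
    # Handle a final paragraph that is not followed by a blank line.
    if current:
        group.append('\n'.join(current))
    if group:
        flush()
    return file_count
```

With the paths used in this article, a call such as `split_by_paragraphs('files/office.txt', 'files')` would be expected to produce `files/0.txt`, `files/1.txt`, and so on. Unlike the loop above, this version opens a file only when there is something to write, so it does not leave an empty trailing file when the paragraph count is a multiple of 3.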
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
A simple way in Python to split a large file into multiple small files by paragraph