Today I helped a classmate with a bit of corpus processing.
The corpus is fairly large, with paragraphs separated by two consecutive newline characters, and he wanted to split it into multiple small files by paragraph, so that every 3 paragraphs form a new file. I had never done anything like this before, and the approaches I found online looked a little complicated.
So, after some experimenting, I wrote a piece of code myself, and it solves the problem perfectly.
The basic idea is to read the original file's content and split it with a regular expression on the paragraph separator. The result is a list in which each element holds the content of one slice. Next, create a file handle for writing, then iterate over the list of slices and write the current slice to the file. After each write, check whether 3 paragraphs have been written. If not, continue with the next slice. If so, close the current write handle, create a new one with a different file name, and carry on with the next slice.
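To make the slicing and grouping step concrete, here is a minimal sketch that works on an in-memory string rather than a file. The sample text and the helper name `group_paragraphs` are made up for illustration, and it assumes paragraphs are separated by exactly two newline characters.

```python
import re

def group_paragraphs(text, per_file=3):
    """Split text on two consecutive newlines and group the paragraphs per_file at a time."""
    # Drop empty slices, e.g. the one produced by a trailing separator.
    paragraphs = [p for p in re.split(r'\n\n', text) if p.strip()]
    # Take consecutive runs of per_file paragraphs; the last group may be shorter.
    return [paragraphs[i:i + per_file] for i in range(0, len(paragraphs), per_file)]

sample = "p1\n\np2\n\np3\n\np4\n\np5"  # hypothetical five-paragraph corpus
groups = group_paragraphs(sample)
print(groups)  # [['p1', 'p2', 'p3'], ['p4', 'p5']]
```

Each inner list corresponds to the content of one output file.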
# -*- coding: utf8 -*-
import re

p = re.compile(r'\n\n')  # paragraphs are separated by two consecutive newlines
fileContent = open('files/office.txt', 'r', encoding='utf8').read()  # read the file content
paraList = p.split(fileContent)  # slice the text on the paragraph separator
fileWriter = open('files/0.txt', 'a', encoding='utf8')  # create a handle for writing
for paraIndex in range(len(paraList)):  # iterate over the list of slices
    fileWriter.write(paraList[paraIndex] + '\n\n')  # write the slice, restoring the separator removed by split
    if (paraIndex + 1) % 3 == 0:  # check whether 3 slices have been written; if so...
        fileWriter.close()  # ...close the current handle
        # ...and create a new handle for the next slices.
        # Note how the file name is built: integer division (//) gives 1.txt, 2.txt, ...
        # (a plain / would produce 1.0.txt in Python 3).
        fileWriter = open('files/' + str((paraIndex + 1) // 3) + '.txt', 'a', encoding='utf8')
fileWriter.close()  # close the last handle that was created
print('finished')
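As a variation on the script above (a sketch, not the original code), the same logic can be wrapped in a function that uses `with` so each output file is closed automatically, and that only creates a file when there is something to write to it. The function name `split_file_by_paragraphs` and its parameters are hypothetical, and it again assumes paragraphs are separated by two newline characters.

```python
import re
from pathlib import Path

def split_file_by_paragraphs(src, out_dir, per_file=3, encoding='utf8'):
    """Write every per_file paragraphs of src to out_dir/0.txt, 1.txt, ...

    Returns the list of paths that were written.
    """
    text = Path(src).read_text(encoding=encoding)
    # Split on two consecutive newlines and ignore empty slices.
    paragraphs = [p for p in re.split(r'\n\n', text) if p.strip()]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(paragraphs), per_file):
        target = out_dir / f'{start // per_file}.txt'
        # 'w' overwrites the file instead of appending to output from an earlier run.
        with open(target, 'w', encoding=encoding) as writer:
            writer.write('\n\n'.join(paragraphs[start:start + per_file]))
        written.append(target)
    return written
```

It could be called as `split_file_by_paragraphs('files/office.txt', 'files')`. Opening with `'w'` rather than `'a'` is a deliberate change here: running the function twice then produces the same files instead of doubling their content.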
Python: a simple way to split a large file into multiple small files by paragraph