Splitting a large file into multiple small files by paragraph with Python
Today I was helping a fellow student process a corpus. The corpus file is fairly large and uses two consecutive line feeds as the paragraph separator. He wanted to split it into multiple small files by paragraph, with every three paragraphs forming a new file. I had never done this kind of operation before, and the similar methods I found online all looked rather complicated. So after some experimenting, I wrote a short piece of code that solves the problem.
The basic idea is to first read the content of the original file and use a regular expression to split it on \n\n. The result is a list in which each element holds the content of one slice (one paragraph). Then create a file handle for writing. Next, traverse the slice list: write the current slice to the file, then check whether three paragraphs have been written. If not, continue with the next slice. If three have been written, close the current write handle and create a new one with a different file name. The loop then moves on to the next slice and repeats until the list is exhausted.
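As a quick illustration of the slicing step and of how a paragraph index maps to an output file, here is a minimal sketch on an in-memory string. The sample text and the helper name file_number are made up for this illustration and are not part of the original script.

```python
import re

# Made-up sample text: four paragraphs separated by two consecutive line feeds.
sample = "para one\n\npara two\n\npara three\n\npara four"

# Split on the paragraph separator; each list element is one paragraph.
paragraphs = re.split(r"\n\n", sample)
print(paragraphs)


def file_number(para_index, per_file=3):
    """Hypothetical helper: output file number for a 0-based paragraph index."""
    return para_index // per_file


# Paragraphs 0, 1, 2 go to file 0; paragraph 3 starts file 1.
print([file_number(i) for i in range(len(paragraphs))])
```

Integer division (//) is what keeps the file numbers whole; plain division (/) would give a float such as 1.0 in Python 3.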
# -*- coding: utf8 -*-
import re

p = re.compile('\n\n', re.S)
# Read the file content
fileContent = open('files/office.txt', 'r', encoding='utf8').read()
# Split the text into slices on two consecutive line feeds
paraList = p.split(fileContent)
# Create the first write handle
fileWriter = open('files/0.txt', 'a', encoding='utf8')
# Traverse the list of slices
for paraIndex in range(len(paraList)):
    # Write the current slice to the file
    fileWriter.write(paraList[paraIndex])
    # Check whether three slices have been written
    if (paraIndex + 1) % 3 == 0:
        # Close the current handle
        fileWriter.close()
        # Create a new handle for the next slices. Note how the file name is built:
        # integer division (//) gives 1.txt, 2.txt, ... rather than 1.0.txt
        fileWriter = open('files/' + str((paraIndex + 1) // 3) + '.txt', 'a', encoding='utf8')
# Close the last write handle that was created
fileWriter.close()
print('finished')
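The same idea can also be wrapped in a function. The following is only a sketch under a few assumptions, not the script above: the name split_by_paragraphs and its parameters are hypothetical, the paragraphs are grouped by slicing the list rather than by counting inside the loop, and the blank line between paragraphs is written back so the paragraphs in each output file stay separated. Like the script, it reads the whole file into memory.

```python
import re
from pathlib import Path


def split_by_paragraphs(src_path, out_dir, per_file=3):
    """Write every `per_file` paragraphs of src_path to out_dir/0.txt, 1.txt, ...

    Hypothetical helper for illustration; returns the list of files written.
    """
    text = Path(src_path).read_text(encoding="utf8")
    # Split on blank lines and drop empty pieces (e.g. from trailing line feeds).
    paragraphs = [p for p in re.split(r"\n\n+", text) if p.strip()]

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for start in range(0, len(paragraphs), per_file):
        chunk = paragraphs[start:start + per_file]
        target = out_dir / f"{start // per_file}.txt"
        # write_text overwrites, so running the function twice does not append duplicates.
        target.write_text("\n\n".join(chunk) + "\n", encoding="utf8")
        written.append(target)
    return written
```

A call such as split_by_paragraphs('files/office.txt', 'files') would mirror the paths used in the script. Because each file is opened and closed inside write_text, no handle is left open, and no empty trailing file is created when the number of paragraphs is an exact multiple of three.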
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.