Machine learning starts with data, and dataset processing is its foundation. This article shows how to use Python's built-in csv module for some simple dataset processing.
Task
Process the specified SemEval file into a .csv file in the prescribed format.
A preview of the SemEval file is shown in the following figure:
Each row of the dataset consists of three parts: an ID number, the emotion scores, and the text. The parts are separated from one another by a tab character, and the items within each part are separated by spaces.
The .csv file that the Python program must produce looks like this:
Algorithm
The algorithm for this problem is fairly simple. First we write the header row. Then we read each line in turn, split it into its three parts on \t, split the middle (emotion) part on spaces, split each resulting item on the colon, and keep the value after the colon. Finally we write the values out in the required format.
Code Implementation
For beginners, turning this into code still takes some effort: you need some familiarity with file reading and writing in Python 3, string processing, and the csv module.
First, look at the final code:
import csv

# print the header:
with open("new.csv", "w") as csvfile:
    fileheader = ["id", "text", 'all', 'anger', 'disgust', 'fear',
                  'joy', 'sad', 'surprise']
    writer = csv.writer(csvfile)
    writer.writerow(fileheader)

with open('/path/semeval', 'r') as inputfile:
    currentlinecontent = inputfile.readline()
    while (currentlinecontent):
        part = currentlinecontent.split('\t')
        # split in 3 parts:
        num = part[0]
        emo = part[1]
        text = part[2].rstrip('\n')  # drop the trailing newline
        # split different emotions:
        emo = emo.split(' ')
        csvall = emo[0].split(':')[1]
        anger = emo[1].split(':')[1]
        disgust = emo[2].split(':')[1]
        fear = emo[3].split(':')[1]
        joy = emo[4].split(':')[1]
        sad = emo[5].split(':')[1]
        surprise = emo[6].split(':')[1]
        # append one row to the output file
        with open("new.csv", "a") as csvfile:
            dict_writer = csv.DictWriter(csvfile, fileheader)
            dict_writer.writerow({"id": num, "text": text, 'all': csvall,
                                  'anger': anger, 'disgust': disgust, 'fear': fear,
                                  'joy': joy, 'sad': sad, 'surprise': surprise})
        currentlinecontent = inputfile.readline()
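The program above reopens the output file once for every input line. As a possible alternative, here is a minimal sketch that opens each file only once. The function name convert, the temporary files, and the sample line are made up for illustration, and the sketch assumes that the emotion names in the file match the header names exactly.

```python
import csv
import os
import tempfile

FIELDS = ["id", "text", "all", "anger", "disgust", "fear", "joy", "sad", "surprise"]

def convert(src_path, dst_path):
    """Convert a tab-separated SemEval-style file to CSV, opening each file once."""
    # newline="" is what the csv documentation recommends for files passed to csv writers.
    with open(src_path, "r") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, FIELDS)
        writer.writeheader()
        for line in src:
            num, emo, text = line.rstrip("\n").split("\t")
            row = {"id": num, "text": text}
            for pair in emo.split(" "):
                name, value = pair.split(":")  # "anger:0" -> "anger", "0"
                row[name] = value
            writer.writerow(row)

# Demo on a temporary file holding one made-up line (not real SemEval data).
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "semeval")
dst_path = os.path.join(tmpdir, "new.csv")
with open(src_path, "w") as f:
    f.write("1\tall:30 anger:0 disgust:0 fear:10 joy:15 sad:5 surprise:0\tA made-up headline\n")

convert(src_path, dst_path)
with open(dst_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

If an emotion name in the input did not appear in FIELDS, DictWriter would raise a ValueError, so the header list has to agree with the data.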
Reading Files
Let's take a look at the part that reads the file:
with open('/users/youzunzhi/desktop/semeval', 'r') as inputfile:
    currentlinecontent = inputfile.readline()
    while (currentlinecontent):
        ...
        currentlinecontent = inputfile.readline()
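The same line-by-line loop can also be written by iterating over the file object directly, which avoids the two readline() calls. The sketch below is only an illustration: the temporary file and its three lines are made up so that the example can run on its own.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

lines_seen = []
with open(path, "r") as inputfile:
    # The file object yields one line at a time, so only the current
    # line has to be held in memory.
    for currentlinecontent in inputfile:
        lines_seen.append(currentlinecontent.rstrip("\n"))

print(lines_seen)
```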
Python's built-in open() function opens a file and returns a file object:
f = open('/path/file.txt', 'r')
Here 'r' means the file is opened for reading. Other modes include:
'w': write mode (an existing file is overwritten)
'a': append mode (new content is added after the existing content)
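To make the difference between the modes concrete, here is a small sketch that writes, appends, and then reads back a throwaway file. The file name and contents are invented for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w") as f:   # 'w' creates the file, or empties it if it exists
    f.write("one\n")
with open(path, "a") as f:   # 'a' keeps the existing content and adds to the end
    f.write("two\n")
with open(path, "w") as f:   # opening with 'w' again discards "one" and "two"
    f.write("three\n")
with open(path, "a") as f:
    f.write("four\n")

with open(path, "r") as f:   # 'r' only reads
    content = f.read()
print(repr(content))
```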
A file object returned by open() keeps holding operating-system resources until it is closed, so call the close() method when you are done with it:
f.close()
Because this is tedious and easy to forget, Python provides the with statement, which calls close() automatically:
with open('/path/file.txt', 'r') as f:
    pass
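As a rough comparison, the sketch below reads the same throwaway file twice, once with an explicit close() guarded by try/finally and once with a with statement, and then checks that both file objects end up closed. The file name and content are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w") as f:
    f.write("hello\n")

# Manual version: try/finally makes sure close() runs even if reading fails.
f = open(path, "r")
try:
    manual = f.read()
finally:
    f.close()

# The with statement does the same closing for us when the block ends.
with open(path, "r") as g:
    automatic = g.read()

print(f.closed, g.closed, manual == automatic)
```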
Once the file is open, you can work with it. A very small file can be read in one go with read(), but for a large file that can exhaust memory, so here we use the readline() method to read one line at a time. At the end of the file, readline() returns an empty string, which makes the while condition false and ends the loop.
String Handling
Now let's look at the string-processing part:
part = currentlinecontent.split('\t')
# split in 3 parts:
num = part[0]
emo = part[1]
text = part[2].rstrip('\n')  # drop the trailing newline
# split different emotions:
emo = emo.split(' ')
csvall = emo[0].split(':')[1]
anger = emo[1].split(':')[1]
disgust = emo[2].split(':')[1]
fear = emo[3].split(':')[1]
joy = emo[4].split(':')[1]
sad = emo[5].split(':')[1]
surprise = emo[6].split(':')[1]
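To see what each split() call produces, the following sketch applies the same steps to a single line. The line is a made-up example in the format described earlier, not real data from the SemEval file.

```python
# Hypothetical input line: id, emotion scores, text, separated by tabs.
line = "1\tall:30 anger:0 disgust:0 fear:10 joy:15 sad:5 surprise:0\tA made-up headline\n"

part = line.split('\t')
print(part)                  # a list with three elements

emo = part[1].split(' ')
print(emo)                   # one "name:value" string per emotion

pair = emo[1].split(':')
print(pair)                  # the name and the value as two list elements

anger = pair[1]
print(anger)                 # the value is still a string, not a number
```

The third element of part still ends with the newline character from the file; strip it (for example with rstrip('\n')) if it should not end up in the text column of the CSV.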
Python 3 strings provide a split() method for splitting a string; it returns a list. After splitting on \t, we assign the three elements of the resulting list to num, emo, and text. We then split the emotion part on spaces and each item on the colon, and assign the second element of each resulting list to the corresponding emotion variable.
CSV Text Output
Finally, we use Python 3's built-in csv module to write the dataset to the .csv file in the required format.
import csv
# print the header:
with open("new.csv", "w") as csvfile:
    fileheader = ["id", "text", 'all', 'anger', 'disgust', 'fear',
                  'joy', 'sad', 'surprise']
    writer = csv.writer(csvfile)
    writer.writerow(fileheader)
...
with open("new.csv", "a") as csvfile:
    dict_writer = csv.DictWriter(csvfile, fileheader)
    dict_writer.writerow({"id": num, "text": text, 'all': csvall,
                          'anger': anger, 'disgust': disgust, 'fear': fear,
                          'joy': joy, 'sad': sad, 'surprise': surprise})
First we write the header using the writerow() method of the writer object returned by csv.writer().
writer.writerow(fileheader)
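A small sketch of what writerow() actually emits may help here. It writes into an in-memory io.StringIO buffer instead of a real file, and the shortened header and the data row are made up for the example.

```python
import csv
import io

buffer = io.StringIO()       # in-memory stand-in for a real file
writer = csv.writer(buffer)
writer.writerow(["id", "text", "all"])
writer.writerow(["1", "Hello, world", "30"])   # the text itself contains a comma

output = buffer.getvalue()
print(repr(output))
```

The writer puts quotes around the field that contains a comma, which is one reason to prefer the csv module over joining strings by hand. When writing to a real file, the csv documentation recommends opening it with newline='' so that the writer's own line endings are not translated a second time.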
The remaining rows could also be written as lists in the right order, but here we introduce csv.DictWriter, which takes a dictionary and writes each value under the column named by its key, so we do not have to worry about the order.
dict_writer = csv.DictWriter(csvfile, fileheader)
A DictWriter object also has a writerow() method; its argument is a dictionary.
dict_writer.writerow({"id": num, "text": text, 'all': csvall,
                      'anger': anger, 'disgust': disgust, 'fear': fear,
                      'joy': joy, 'sad': sad, 'surprise': surprise})
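The order independence can be checked with a short sketch. It uses an in-memory buffer, a shortened header, and made-up values, and it also uses DictWriter's writeheader() method, which is another way to write the header row.

```python
import csv
import io

fileheader = ["id", "text", "all", "anger"]   # shortened header for the example
buffer = io.StringIO()                        # in-memory stand-in for a real file

dict_writer = csv.DictWriter(buffer, fileheader)
dict_writer.writeheader()
# The keys are deliberately given in a scrambled order.
dict_writer.writerow({"anger": "0", "text": "A made-up headline", "id": "1", "all": "30"})

output = buffer.getvalue()
print(repr(output))
```

The columns come out in the order given by fileheader, regardless of the order of the keys in the dictionary. By default, a key that is not listed in fileheader raises a ValueError, and a missing key is written as an empty field.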
At this point, the task of processing the text dataset with Python3 and outputting it in CSV format is complete.