Processing a Text Dataset with Python 3

Source: Internet
Author: User
Tags: readline

Machine learning starts with data, so dataset processing is its foundation. This article shows how to use Python's built-in csv module for some simple dataset processing.

Task

Convert the specified SemEval file into a .csv file in the prescribed format.

A preview of the SemEval file is shown in the following figure:

Each row of the dataset consists of three parts: an ordinal number, the emotions, and the text. The parts are separated by a tab character, and the items inside each part are separated by spaces.
The .csv file that the Python program must produce looks like this:

Algorithm

The algorithm for this problem is fairly simple. First we write the header row. Then we read the file one line at a time, split each line into its three parts on the tab character (\t), split the middle emotion part first on spaces and then on colons, and keep the value after each colon. Finally, we write the values out in the required format.

Code Implementation

For beginners, turning this into code can still be somewhat difficult: it requires some understanding of file reading and writing, string processing, and the csv module in Python 3.

First, look at the final code:

import csv

# Write the header row:
with open("new.csv", "w", newline="") as csvfile:
    fileheader = ["id", "text", 'All', 'anger', 'disgust', 'fear',
                  'joy', 'sad', 'surprise']
    writer = csv.writer(csvfile)
    writer.writerow(fileheader)

with open('/path/semeval', 'r') as inputfile:
    currentlinecontent = inputfile.readline()
    while (currentlinecontent):
        # Drop the trailing newline, then split into 3 parts:
        part = currentlinecontent.rstrip('\n').split('\t')
        num = part[0]
        emo = part[1]
        text = part[2]
        # Split the different emotions:
        emo = emo.split(' ')
        csvall = emo[0].split(':')[1]
        anger = emo[1].split(':')[1]
        disgust = emo[2].split(':')[1]
        fear = emo[3].split(':')[1]
        joy = emo[4].split(':')[1]
        sad = emo[5].split(':')[1]
        surprise = emo[6].split(':')[1]
        # Append one row to the output file:
        with open("new.csv", "a", newline="") as csvfile:
            dict_writer = csv.DictWriter(csvfile, fileheader)
            dict_writer.writerow({"id": num, "text": text, 'All': csvall,
                                  'anger': anger, 'disgust': disgust, 'fear': fear,
                                  'joy': joy, 'sad': sad, 'surprise': surprise})
        currentlinecontent = inputfile.readline()

Reading Files

Let's take a look at the part that reads the file:

with open('/path/semeval', 'r') as inputfile:
    currentlinecontent = inputfile.readline()
    while (currentlinecontent):
        ...
        currentlinecontent = inputfile.readline()

Python's built-in open() function opens a file and returns a file object:
f = open('/path/file.txt', 'r')
Here 'r' opens the file in read mode. You can also use 'w' (write mode, which creates the file or overwrites an existing one) and 'a' (append mode, which adds new content to the end of the existing file).

A file opened this way keeps holding operating system resources until it is closed, so call the close() method when you are finished with it:
f.close()
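
To make the three modes concrete, here is a minimal sketch. The scratch file name modes_demo.txt and the temporary directory are assumptions made only for this illustration; they are not part of the original task.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# 'w' creates the file, or empties it if it already exists.
f = open(path, "w")
f.write("first\n")
f.close()  # release the file before opening it again

# 'a' keeps the existing content and writes at the end.
f = open(path, "a")
f.write("second\n")
f.close()

# 'r' opens the file for reading.
f = open(path, "r")
content = f.read()
f.close()

print(repr(content))
```

Opening the same file with 'w' a second time instead of 'a' would have discarded the first line.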

Because calling close() by hand is tedious and easy to forget, Python provides the with statement, which calls close() automatically when the block ends:

with open('/path/file.txt', 'r') as f:
    pass
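
As a small check of both ideas (automatic closing, and reading with readline() until it returns an empty string), the following sketch can be run on its own. The three-line sample file is made up for the demonstration.

```python
import os
import tempfile

# Hypothetical three-line input file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("line one\nline two\nline three\n")

lines_seen = []
with open(path, "r") as f:
    line = f.readline()
    while line:  # readline() returns '' (falsy) only at end of file
        lines_seen.append(line.rstrip("\n"))
        line = f.readline()

# Leaving the with block has already closed the file.
print(f.closed)
print(lines_seen)
```

Note that an empty line in the middle of a file is read as "\n", which is still truthy, so the loop does not stop early.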
    

Once the file is open, you can work with it. If the file is small, you can read it all at once with read(). For a large file, that may use too much memory, so here we use the readline() method to read one line at a time. At the end of the file readline() returns an empty string, which makes the while condition false and ends the loop.

String Handling

Let's look at the string-processing part:

        # Drop the trailing newline, then split into 3 parts:
        part = currentlinecontent.rstrip('\n').split('\t')
        num = part[0]
        emo = part[1]
        text = part[2]

        # Split the different emotions:
        emo = emo.split(' ')
        csvall = emo[0].split(':')[1]
        anger = emo[1].split(':')[1]
        disgust = emo[2].split(':')[1]
        fear = emo[3].split(':')[1]
        joy = emo[4].split(':')[1]
        sad = emo[5].split(':')[1]
        surprise = emo[6].split(':')[1]
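
To see what these splits produce, here is a short sketch run on a single made-up line. The sample values and the exact key names in the emotion part are assumptions based on the format described earlier, not data from the real file.

```python
# A made-up line in the assumed "id<TAB>emotions<TAB>text" layout.
sample_line = (
    "1\tall:30 anger:0 disgust:5 fear:10 joy:0 sad:60 surprise:2"
    "\tSome headline text\n"
)

# Remove the trailing newline, then split on tabs into the three parts.
num, emo, text = sample_line.rstrip("\n").split("\t")

# "anger:0".split(":") gives ["anger", "0"]; index 1 is the score.
scores = [item.split(":")[1] for item in emo.split(" ")]

print(num)
print(scores)
print(text)
```

The list comprehension is only a compact way of writing the seven separate assignments; the values come out in the same order.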

Python 3 strings provide a split() method that splits a string and returns a list. After splitting on \t, we assign the three elements of the resulting list to num, emo, and text. We then split emo on spaces, split each resulting item on the colon, and assign the second element of each result to the corresponding emotion variable.

CSV Text Output

Finally, we use Python 3's built-in csv module to write the dataset to the .csv file in the required format.

import csv

# Write the header row:
with open("new.csv", "w", newline="") as csvfile:
    fileheader = ["id", "text", 'All', 'anger', 'disgust', 'fear',
                  'joy', 'sad', 'surprise']
    writer = csv.writer(csvfile)
    writer.writerow(fileheader)
...
with open("new.csv", "a", newline="") as csvfile:
    dict_writer = csv.DictWriter(csvfile, fileheader)
    dict_writer.writerow({"id": num, "text": text, 'All': csvall,
                          'anger': anger, 'disgust': disgust, 'fear': fear,
                          'joy': joy, 'sad': sad, 'surprise': surprise})

First we write the header using the writerow method of the writer object created by csv.writer():

writer.writerow(fileheader)

The remaining rows could also be written as lists in column order, but here we introduce the DictWriter class. It takes each row as a dictionary and writes every value under the column named by its key, so the order of the keys does not matter.

dict_writer = csv.DictWriter(csvfile, fileheader)

A DictWriter object also has a writerow method, and its argument is a dictionary:

dict_writer.writerow({"id": num, "text": text, 'All': csvall,
                      'anger': anger, 'disgust': disgust, 'fear': fear,
                      'joy': joy, 'sad': sad, 'surprise': surprise})
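
The following sketch illustrates that the key order does not matter. It writes to an in-memory io.StringIO buffer instead of a file and uses a shortened, hypothetical header; it also uses DictWriter's own writeheader() method, which is an alternative to writing the header with a separate csv.writer as the article does.

```python
import csv
import io

fieldnames = ["id", "text", "anger", "joy"]  # shortened, hypothetical header

buffer = io.StringIO()  # in-memory stand-in for an open file
dict_writer = csv.DictWriter(buffer, fieldnames)
dict_writer.writeheader()

# Keys are deliberately given in a different order than fieldnames.
dict_writer.writerow({"joy": "40", "id": "1", "anger": "0", "text": "Some headline"})

output = buffer.getvalue()
print(output)
```

The row comes out in the column order given by fieldnames, regardless of how the dictionary was built.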

At this point, the task of processing the text dataset with Python 3 and writing it out in CSV format is complete.
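
As a possible variation, and not the article's own code, the same steps can be arranged so that the output file is opened only once rather than once per row. The function name convert, the sample line, and the temporary file names below are all assumptions made for this sketch.

```python
import csv
import os
import tempfile

FIELDNAMES = ["id", "text", "All", "anger", "disgust",
              "fear", "joy", "sad", "surprise"]


def convert(input_path, output_path):
    """Convert a tab-separated emotion file to CSV (hypothetical helper)."""
    with open(input_path, "r") as infile, \
            open(output_path, "w", newline="") as outfile:
        writer = csv.DictWriter(outfile, FIELDNAMES)
        writer.writeheader()
        for line in infile:  # iterating a file yields one line at a time
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            num, emo, text = line.split("\t")
            values = [item.split(":")[1] for item in emo.split(" ")]
            row = {"id": num, "text": text}
            # Pair the seven emotion columns with the values by position.
            row.update(zip(FIELDNAMES[2:], values))
            writer.writerow(row)


# Usage on a made-up one-line input file.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "semeval_sample")
dst = os.path.join(workdir, "new.csv")
with open(src, "w") as fh:
    fh.write("1\tall:30 anger:0 disgust:5 fear:10 joy:0 sad:60 surprise:2\tSome text\n")

convert(src, dst)

with open(dst, newline="") as fh:
    rows = list(csv.reader(fh))
print(rows)
```

Opening the output once avoids repeated open and close calls inside the loop; the per-line logic is otherwise the same as in the walkthrough above.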
