"Data analysis using Python" reading notes-data loading, storage and file formats

Source: Internet
Author: User

Input and output generally fall into a few categories: reading text files and other, more efficient on-disk formats; loading data from databases; and interacting with network resources through Web APIs.

1. Reading and writing data in text format

I feel that reading and writing files sometimes "requires luck" and often needs manual adjustment. Python has become a popular language for text and file processing because of its simple syntax for working with files, its intuitive data structures, and convenient features such as tuple packing and unpacking. pandas provides several functions for reading tabular data into DataFrame objects. See the table below:

The following is an overview of what these functions can do when converting text data into a DataFrame. Their options fall into several categories:

    • Indexing: treat one or more columns as the index of the returned DataFrame, and choose whether column names come from the file or from the user.
    • Type inference and data conversion: includes user-defined value conversions, custom lists of missing-value markers, and so on.
    • Date parsing: includes combining capabilities, such as merging date and time information spread across multiple columns into a single column in the result.
    • Iteration: supports iterating over large files chunk by chunk.
    • Unclean data issues: skipping rows, footers, comments, or other unwanted content.

pandas infers data types automatically when it reads a file, so you do not have to specify them.
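As a rough illustration of type inference and date parsing, here is a minimal sketch. The sample data and column names are made up for this example, and only the single-column date case is shown, not the multi-column combination mentioned in the list above.

```python
import io

import pandas as pd

# Hypothetical sample data, made up for this sketch.
raw = (
    "day,name,count,price,flag\n"
    "2024-01-01,x,1,1.5,True\n"
    "2024-01-02,y,2,2.5,False\n"
)

# No dtypes are given: pandas infers them from the text.
df = pd.read_csv(io.StringIO(raw))
print(df.dtypes)

# Dates are left as text unless requested; parse_dates opts in per column.
dated = pd.read_csv(io.StringIO(raw), parse_dates=["day"], index_col="day")
print(dated.index)
```

Reading from an in-memory io.StringIO keeps the sketch self-contained; a file path works the same way.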

read_csv as an example

Use names to set the column names and index_col to specify the index; passing several columns to index_col produces a hierarchical index. The delimiter can be given as a regular expression. skiprows skips selected rows. pandas recognizes a set of common missing-value markers, such as NA, -1.#IND, and NULL, and na_values lets you supply additional markers. The read_csv/read_table parameters are as follows:
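A few of the options mentioned here (names, index_col, skiprows, na_values, and a regular-expression delimiter) can be sketched as follows. The sample text, the column names, and the extra missing-value marker "missing" are assumptions made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data: one comment line to skip, no header row.
raw = (
    "# exported by some tool\n"
    "a,1,2.5,foo\n"
    "b,NULL,3.5,missing\n"
    "c,3,NA,bar\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=[0],                       # skip the comment line
    names=["key", "x", "y", "label"],   # supply the column names
    index_col="key",                    # use the 'key' column as the index
    na_values={"label": ["missing"]},   # extra NA marker for one column
)
print(df)
print(df.isnull().sum())

# A regular expression as the delimiter: one or more whitespace characters.
ws = pd.read_csv(io.StringIO("A  B   C\n1 2  3\n"), sep=r"\s+")
print(ws)
```

NULL and NA are treated as missing without any extra configuration; the na_values dictionary only adds a marker for the label column.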

Reading a text file in chunks

When working with very large files, or when working out the right set of parameters for processing a large file, you may want to read only a small part of the file or iterate over it in chunks. nrows specifies how many rows to read. To read a file in chunks, set chunksize (the number of rows per chunk).
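A minimal sketch of both approaches, assuming a small made-up data set with key and value columns: nrows previews the first few rows, and chunksize returns an iterator whose pieces can be aggregated one at a time.

```python
import io

import pandas as pd

# Hypothetical data: ten rows whose keys cycle through a, b, c.
raw = "key,value\n" + "".join(f"{'abc'[i % 3]},{i}\n" for i in range(10))

# Read only the first three rows.
head = pd.read_csv(io.StringIO(raw), nrows=3)
print(head)

# Iterate over the file four rows at a time and sum 'value' per key.
totals = {}
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    for key, subtotal in chunk.groupby("key")["value"].sum().items():
        totals[key] = totals.get(key, 0) + int(subtotal)
print(totals)
```

Only one chunk is held in memory at a time, which is the point of chunksize for files that are too large to load whole.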

Writing data out to text format

Write data to a CSV file with the to_csv method. The sep parameter sets the delimiter, and na_rep sets the string written in place of missing values (an empty string by default). The index and header parameters control whether the row and column labels are written; both are written by default. Use cols (named columns in newer pandas versions) to write only a subset of the columns, in the order you specify.
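The options described here can be sketched as below. The frame contents are made up, the output goes to an in-memory buffer rather than a file, and the keyword columns is used on the assumption that a recent pandas version is installed.

```python
import io

import pandas as pd

# Hypothetical frame with one missing value.
df = pd.DataFrame(
    {"a": [1, 5], "b": [2.0, float("nan")], "c": ["x", "y"]},
    index=["r1", "r2"],
)

buf = io.StringIO()
df.to_csv(
    buf,
    sep="|",              # delimiter
    na_rep="NULL",        # text written for missing values
    index=False,          # do not write the row labels
    header=True,          # write the column labels
    columns=["c", "b"],   # subset of columns, in this order
)
text = buf.getvalue()
print(text)
```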

Series also has a to_csv method. With a little wrangling (no header row, first column as the index) you can read the file back as a Series with read_csv. The more convenient Series.from_csv did the same in one step, but it has been removed from newer pandas versions.
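A hedged sketch of the Series round trip using only read_csv, assuming a recent pandas version: the file is written without a header, read back with the first column as the index, and the single remaining column is squeezed into a Series. The values are made up.

```python
import io

import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

buf = io.StringIO()
s.to_csv(buf, header=False)   # no header row
buf.seek(0)

# No header, first column as index; squeeze the one-column frame to a Series.
restored = pd.read_csv(buf, header=None, index_col=0).squeeze("columns")
print(restored)
```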

Manually handling delimited formats

Some oddly formatted files need manual processing before they can be loaded. Python's built-in csv module can read any file with a single-character delimiter: pass the open file to csv.reader. CSV files come in many flavors; defining a subclass of csv.Dialect lets you describe a new format (a special delimiter, string quoting convention, line terminator, and so on):

# -*- encoding: utf-8 -*-
import csv

# Iterate over the rows of a file with a single-character delimiter
with open('Ex6.csv') as f:
    for line in csv.reader(f):
        print(line)

# Read all rows, then separate the header row from the data rows
with open('Ex7.csv') as f:
    lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
print(header)
print(values)

# zip(*values) unpacks the rows and transposes them into columns
data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)

# A csv.Dialect subclass describes a new format
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # needed for the dialect to validate in Python 3

with open('Ex6.csv') as f:
    reader = csv.reader(f, dialect=my_dialect)

# Individual dialect options can also be given as keyword arguments
with open('Ex6.csv') as f:
    reader = csv.reader(f, delimiter='|')

For files with more complicated or multi-character delimiters, the csv module is powerless. In those cases, split the lines and do the cleanup with the string split method or re.split. To write delimited files manually, use csv.writer. It accepts an open, writable file object and the same dialect and format options as csv.reader.

# -*- encoding: utf-8 -*-
import csv

# newline='' lets the csv module control the line endings itself
with open('Mydata.csv', 'w', newline='') as f:
    writer = csv.writer(f, lineterminator='\n')
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
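For the multi-character delimiter case mentioned above, where the csv module does not help, a minimal sketch with re.split might look like this. The "::" delimiter and the sample lines are assumptions made up for illustration.

```python
import re

# Hypothetical lines using '::' as the delimiter, with uneven spacing.
lines = ["a :: b::c\n", "1::  2 ::3\n"]

# Split on '::' and swallow any whitespace around it.
rows = [re.split(r"\s*::\s*", line.strip()) for line in lines]
print(rows)
```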

JSON data

Apart from its null value, null, and a few other nuances (such as not allowing a trailing comma at the end of a list), JSON is very close to valid Python code. The basic types are objects (dictionaries), arrays (lists), strings, numbers, Booleans, and null. All keys in an object must be strings (very important). With the json module, json.loads converts a JSON string into Python form, so a JSON object is read as a Python dictionary.

Conversely, json.dumps converts a Python object to JSON format.
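A short sketch of the round trip with the standard json module. The JSON text is a made-up example; the last lines show what happens to a non-string key when it is dumped.

```python
import json

# Hypothetical JSON text: note the null value and the string keys.
text = '{"name": "Wes", "pets": null, "siblings": [{"name": "Scott", "age": 25}]}'

obj = json.loads(text)    # JSON object -> Python dict, null -> None
print(obj["pets"], obj["siblings"][0]["age"])

back = json.dumps(obj)    # Python object -> JSON string
print(back)

# JSON keys must be strings, so an integer key is converted on the way out.
keyed = json.dumps({1: "one"})
print(keyed)
```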

XML and HTML: Web scraping

lxml can read and process data in XML and HTML formats. I will study this part when I have time.

2. Binary data formats

One of the simplest ways to store data in binary format is Python's built-in pickle serialization. For convenience, pandas objects have a save method that writes the data to disk as a pickle (in newer pandas versions this method is named to_pickle).
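A minimal sketch of the pickle round trip, assuming a recent pandas version where the method is to_pickle and the reader is pandas.read_pickle. The frame contents and the file name frame.pkl are made up, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")   # hypothetical file name
    df.to_pickle(path)                      # write the frame as a pickle
    restored = pd.read_pickle(path)         # read it back

print(restored.equals(df))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.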