Data loading, storage, and file formats for data analysis with Python
Before we start, we need to install the pandas module. Since the Python version I installed is 2.7, download pandas 0.16.2 from https://pypi.python.org/pypi/pandas/0.16.2/#downloads, unpack it, open the corresponding directory in a DOS command window, and run python setup.py install. The installation may fail with the error: "error: Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat). Get it from http://aka.ms/vcpython27". In that case, go to http://www.microsoft.com/en-us/download/confirmation.aspx?id=44266, which automatically downloads Microsoft Visual C++ Compiler for Python 2.7; install it, then run python setup.py install again. The installation takes about 30 seconds. Once it succeeds, open an IDLE window and run
import pandas
to check whether the import succeeds. Once it works, you can move on to the next step.
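As an extra sanity check (a minimal sketch; the version string printed depends on what you actually installed), you can also confirm the installed version from the interpreter:

import pandas
# print the installed version, e.g. 0.16.2
print(pandas.__version__)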
Input and output typically fall into a few categories: reading text files and other, more efficient on-disk formats; loading data from databases; and interacting with network data sources through web APIs.
I. Reading text format data
pandas provides functions to read tabular data into a DataFrame object.
read_csv | Load delimited data from a file, URL, or file-like object. The default delimiter is a comma.
read_table | Load delimited data from a file, URL, or file-like object. The default delimiter is a tab ('\t').
read_fwf | Read data in fixed-width column format (no delimiters).
read_clipboard | Read data from the clipboard.
read_csv assigns default column names to the data. You can also specify column names yourself, for example pd.read_csv('ch06/ex2.csv', names=['A', 'B', 'C', 'D', 'message']).
If you want to use the message column as the DataFrame index, pass it to the index_col parameter:
names = ['A', 'B', 'C', 'D', 'message']
pd.read_csv('ch06/ex2.csv', names=names, index_col='message')
Writing data to text format
1. Using a DataFrame's to_csv method, you can write the data out to a comma-separated file. You can also choose a different delimiter with the sep parameter when calling data.to_csv (see the sketch after this list).
2. Missing values appear in the output as empty strings. You can use the na_rep parameter to represent them with some other sentinel value.
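A minimal sketch of both options, assuming a hypothetical DataFrame called data that contains a missing value (the file name out.csv is made up for illustration):

import pandas as pd
import numpy as np

# hypothetical DataFrame with one missing value
data = pd.DataFrame({'a': [1, 2], 'b': [3, np.nan]})
# write with '|' as the delimiter and 'NULL' as the sentinel for missing values
data.to_csv('out.csv', sep='|', na_rep='NULL')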
Manually processing delimited formats
For any file with a single-character delimiter, you can use Python's built-in csv module directly: pass any open file or file-like object to csv.reader:
import csv

f = open('ch06/ex7.csv')
reader = csv.reader(f)
Iterating over this reader yields a list of values for each line. To get the data into the shape you need, a bit of manual wrangling is required:
lines = list(csv.reader(open('ch06/ex7.csv')))
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
CSV files come in many different flavors. To define a new format, you only need to define a subclass of csv.Dialect:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # csv.Dialect requires quoting to be set

reader = csv.reader(f, dialect=my_dialect)
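The same dialect works for writing. A minimal sketch reusing my_dialect from above (the file name mydata.csv is made up for illustration):

import csv

# 'wb' mode for Python 2; on Python 3 use open('mydata.csv', 'w', newline='')
with open('mydata.csv', 'wb') as f_out:
    writer = csv.writer(f_out, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))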
II. JSON data
JSON has become one of the standard formats for sending data between web browsers and other applications via HTTP requests. It is a much more flexible data format than tabular text formats.
JSON is very close to valid Python code. The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and null. json.loads converts a JSON string into a Python object.
import json

# obj is a string containing JSON
result = json.loads(obj)
json.dumps, on the other hand, converts a Python object back to JSON.
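A minimal round-trip sketch (the JSON string obj below is made up for illustration):

import json

obj = '{"name": "example", "values": [1, 2, 3], "ok": true}'  # hypothetical JSON string
result = json.loads(obj)     # -> Python dict
asjson = json.dumps(result)  # -> back to a JSON string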
III. XML and HTML: web information collection
lxml can efficiently parse very large files. lxml has several programming interfaces; first we use lxml.html to process HTML, and then lxml.objectify for some XML processing.
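As a small sketch of the lxml.html interface (this assumes lxml is installed; the URL is simply the one used later in this post, and the links you get back depend on the page):

from urllib2 import urlopen  # Python 2.7, matching the setup above
from lxml.html import parse

# parse an HTML page and collect the href of every <a> tag
parsed = parse(urlopen('http://www.baidu.com'))
doc = parsed.getroot()
links = [lnk.get('href') for lnk in doc.findall('.//a')]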
(To be continued)
IV. Binary data formats
One of the simplest ways to store data in binary format is to use Python's built-in pickle serialization. pandas objects have a to_pickle method (the save method in older pandas versions) that writes the data to disk as a pickle, and the pandas.read_pickle function reads the data back into Python:
frame = pd.read_csv('ch06/ec1.csv')
frame.to_pickle('ch06/frame_pickle')

pd.read_pickle('ch06/frame_pickle')
Using the HDF5 format
HDF5 stands for hierarchical data format. Each HDF5 file contains a file-system-like node structure, which lets you store multiple datasets along with supporting metadata. HDF5 also supports on-the-fly compression with a variety of compressors.
Python has two interfaces for working with HDF5: PyTables and h5py.
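pandas also ships an HDFStore class built on top of PyTables. A minimal sketch, assuming PyTables is installed and reusing the frame from the pickle example (the file name mydata.h5 is made up):

import pandas as pd

store = pd.HDFStore('mydata.h5')  # requires the PyTables package
store['frame'] = frame            # store a DataFrame under a key
store['frame']                    # read it back, dict-style
store.close()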
Reading Excel files
pandas' ExcelFile class supports reading tabular data stored in Excel files. Since ExcelFile uses the xlrd and openpyxl packages, you have to install them first (https://pypi.python.org/pypi/xlrd). You create an ExcelFile instance by passing in the path of an xls or xlsx file; data stored in a sheet can then be read into a DataFrame with parse:
xls_file = pd.ExcelFile('data.xls')
table = xls_file.parse('Sheet1')
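For a one-off read, the pd.read_excel shortcut does the same thing in a single call (a sketch assuming the same hypothetical data.xls file and sheet name):

# equivalent one-liner; also needs xlrd/openpyxl installed
table = pd.read_excel('data.xls', 'Sheet1')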
V. Using HTML and web APIs
Many websites have public APIs that provide data in JSON or some other format. To access these APIs from Python, I recommend the requests package:
import requests

url = 'http://www.baidu.com'
resp = requests.get(url)
resp

import json
# json.loads expects the response body to be JSON (i.e. an API endpoint rather than a plain HTML page)
data = json.loads(resp.text)

After reading the response, you can go on to more advanced processing.
VI. Using databases
In real applications, data is rarely loaded from text files; far more often it comes from databases, both relational (SQL-based) and non-relational (NoSQL).
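As a minimal sketch, the standard-library sqlite3 module together with pandas.read_sql can load the result of a SQL query straight into a DataFrame (the table name and rows are made up for illustration):

import sqlite3
import pandas as pd

# an in-memory SQLite database with a hypothetical table
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a VARCHAR(20), b REAL)')
con.executemany('INSERT INTO test VALUES (?, ?)',
                [('Atlanta', 1.25), ('Tallahassee', 2.6)])
con.commit()

# read the query result back into a DataFrame
frame = pd.read_sql('SELECT * FROM test', con)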