Data loading, storage, and file formats for data analysis using Python

Source: Internet
Author: User

Before starting, we need to install the pandas module. Since the Python version I installed is 2.7, I downloaded pandas 0.16.2 from https://pypi.python.org/pypi/pandas/0.16.2/#downloads, decompressed it, opened the extracted folder in a DOS command prompt, and ran:

python setup.py install

The installation may fail with this error:

error: Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat). Get it from http://aka.ms/vcpython27

If so, go to http://www.microsoft.com/en-us/download/confirmation.aspx?id=44266, where Microsoft Visual C++ Compiler for Python 2.7 is downloaded automatically, and install it. Then run python setup.py install again; the installation takes about 30 seconds. After it succeeds, open an IDLE window and run:

import pandas

Check whether the import succeeds. If it does, you can move on to the next step.

Data input and output generally falls into a few categories: reading text files and other, more efficient on-disk storage formats; loading data from databases; and interacting with network sources such as web APIs.

I. Reading text format data

pandas provides several functions for reading tabular data into a DataFrame object:

read_csv        Loads delimited data from a file, URL, or file-like object. The default delimiter is a comma.
read_table      Loads delimited data from a file, URL, or file-like object. The default delimiter is a tab ('\t').
read_fwf        Reads data in fixed-width column format (that is, with no delimiters).
read_clipboard  Reads data from the clipboard.

If the file has no header row, read_csv can assign default column names (pass header=None), or you can specify the column names yourself, for example pd.read_csv('ch06/ex2.csv', names=['A', 'B', 'C', 'D', 'message']).

If you want to use the message column as the DataFrame index, pass its name through the index_col parameter:

names = ['A', 'B', 'C', 'D', 'message']
pd.read_csv('ch06/ex2.csv', names=names, index_col='message')
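To make the effect of names and index_col concrete, here is a minimal sketch. The file contents are an assumption: the in-memory string below is a hypothetical stand-in for a header-less CSV file, not the actual contents of ch06/ex2.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file that has no header row.
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

names = ['A', 'B', 'C', 'D', 'message']

# Passing names= tells pandas that the first row is data, not column labels.
# index_col='message' moves that column out of the data and into the row index.
frame = pd.read_csv(io.StringIO(raw), names=names, index_col='message')

print(frame)
print(frame.loc['world', 'B'])  # look up a value by its message label
```

Because message is now the index, rows can be selected by label with frame.loc instead of by position.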

 

Writing data to text format

1. The to_csv method of a DataFrame writes data to a comma-separated file. You can specify a different delimiter with the sep parameter, for example data.to_csv(path, sep='|'), where path is the output file.

2. Missing values are written as empty strings by default. You can use na_rep to write them as some other marker value instead.
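The two points above can be sketched as follows. The DataFrame contents and the 'NULL' marker are made up for illustration. The sketch relies on to_csv returning the CSV text as a string when it is called without a path, so nothing is written to disk.

```python
import pandas as pd

# Illustrative data with one missing value in column b.
data = pd.DataFrame({'a': [1, 2], 'b': [3.5, None], 'c': ['x', 'y']})

# Default output: comma-separated, missing value written as an empty string.
default_text = data.to_csv(index=False)

# Custom delimiter and an explicit marker for missing values.
piped_text = data.to_csv(index=False, sep='|', na_rep='NULL')

print(default_text)
print(piped_text)
```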

Manually processing delimited formats

For any file with a single-character delimiter, you can use Python's built-in csv module directly: pass any open file or file-like object to csv.reader:

import csv
f = open('ch06/ex7.csv')
reader = csv.reader(f)

Iterating over this reader yields a list for each row. To get the data into the shape you need, some manual wrangling is required:

lines = list(csv.reader(open('ch06/ex7.csv')))
header, values = lines[0], lines[1:]
# Transpose the rows into columns and key each column by its header
data_dict = {h: v for h, v in zip(header, zip(*values))}

CSV files come in many flavors. To define a new format, you only need to define a subclass of csv.Dialect:

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # csv.Dialect requires quoting to be set

reader = csv.reader(f, dialect=my_dialect)
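The pieces of this section can be combined into one self-contained sketch. The semicolon-delimited text is a hypothetical sample, not the contents of ch06/ex7.csv, and the class name SemicolonDialect is chosen here for illustration. Note that the dialect sets quoting explicitly, because csv.Dialect leaves it unset by default and the reader rejects an unset value.

```python
import csv
import io


class SemicolonDialect(csv.Dialect):
    """Semicolon-delimited text with double-quote quoting."""
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # must be set; csv.Dialect leaves it as None


# Hypothetical sample; the quoted field contains the delimiter itself.
raw = 'a;b;c\n1;2;3\n1;"x;y";4\n'

lines = list(csv.reader(io.StringIO(raw), dialect=SemicolonDialect))
header, values = lines[0], lines[1:]

# zip(*values) transposes rows into columns; each column is keyed by its header.
data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)
```

All values come back as strings; converting them to numbers is part of the manual wrangling that csv.reader leaves to the caller.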

II. JSON data

JSON has become one of the standard formats for sending data between web browsers and other applications over HTTP requests. It is a much more flexible data format than tabular text.

JSON is very close to valid Python code. Its basic types are objects, arrays, strings, numbers, Booleans, and null. json.loads converts a JSON string into a Python object:

import json
result = json.loads(obj)  # obj is a string containing JSON

json.dumps converts a Python object back to a JSON string (json.dump writes it to a file object instead).
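A short round trip shows how the JSON types map to Python types. The JSON document below is a made-up example; the names and values in it are assumptions for illustration only.

```python
import json

# Hypothetical JSON text: an object holding a string, an array, null,
# and a nested array of objects.
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 25}]}
"""

result = json.loads(obj)          # JSON object -> dict, array -> list, null -> None
print(type(result).__name__)
print(result['siblings'][0]['age'])

as_json = json.dumps(result)      # Python object -> JSON string
round_tripped = json.loads(as_json)
print(round_tripped == result)
```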

III. XML and HTML: web scraping

lxml can parse large files efficiently. It has multiple programming interfaces: first we use lxml.html to process HTML, and then lxml.objectify for some XML processing.

(To be continued)

IV. Binary data formats

One of the simplest ways to store data in binary format is Python's built-in pickle serialization. pandas objects all have a to_pickle method (called save in early pandas releases) that writes the data to disk as a pickle, and the pandas.read_pickle function (formerly pandas.load) reads the data back into Python:

frame = pd.read_csv('ch06/ec1.csv')
frame.to_pickle('ch06/frame_pickle')
pd.read_pickle('ch06/frame_pickle')
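In current pandas releases the pickle round trip is spelled to_pickle and read_pickle. The sketch below writes to a temporary directory so it can run anywhere; the DataFrame contents are invented for illustration. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import os
import tempfile

import pandas as pd

# Illustrative data; any DataFrame works the same way.
frame = pd.DataFrame({'a': [1, 5], 'b': [2, 6], 'message': ['hello', 'world']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'frame_pickle')
    frame.to_pickle(path)            # serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # read it back into a new object

# The restored object has the same values, column types, and index.
print(restored.equals(frame))
```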

Using the HDF5 format

HDF stands for hierarchical data format. Each HDF5 file contains an internal file-system-like node structure, which lets you store multiple datasets along with supporting metadata. HDF5 supports on-the-fly compression with a variety of compressors.

Python has two interfaces to HDF5: PyTables and h5py.

Reading Excel files

The pandas ExcelFile class supports reading tabular data stored in Excel. Because ExcelFile uses the xlrd and openpyxl packages, you have to install them first (https://pypi.python.org/pypi/xlrd). You create an ExcelFile instance by passing in the path to an xls or xlsx file; the data stored in a worksheet can then be read into a DataFrame with parse:

xls_file = pd.ExcelFile('data.xls')
table = xls_file.parse('Sheet1')

V. Using HTML and Web APIs

Many websites have public APIs that provide data in JSON or other formats. To access these APIs from Python, we recommend the requests package, as follows:

import requests
url = 'http://www.baidu.com'
resp = requests.get(url)
resp

If the response body is JSON, you can parse it into Python objects; json.loads raises an error for a non-JSON body such as an HTML page:

import json
data = json.loads(resp.text)

After reading the page contents, you can perform more advanced processing.

VI. Using databases

In many real applications, data rarely comes from text files. More often it comes from databases, both relational and non-relational.
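As a minimal sketch of the relational case, the example below uses Python's built-in sqlite3 module with an in-memory database and turns the query result into a DataFrame. The table name, column names, and rows are all assumptions made up for illustration; a real application would connect to its own database instead.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database, so the sketch leaves no file behind.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a VARCHAR(20), b REAL, c INTEGER)')

# Hypothetical rows inserted with parameter placeholders.
rows = [('Atlanta', 1.25, 6), ('Tallahassee', 2.6, 3)]
con.executemany('INSERT INTO test VALUES (?, ?, ?)', rows)
con.commit()

cursor = con.execute('SELECT * FROM test')

# cursor.description holds one tuple per result column; item 0 is the name.
columns = [col[0] for col in cursor.description]
frame = pd.DataFrame(cursor.fetchall(), columns=columns)
con.close()

print(frame)
```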
