Data loading, storage, and file formats for data analysis with Python
Before we start, we need to install the pandas module. Since the Python version I installed is 2.7, download pandas 0.16.2 from https://pypi.python.org/pypi/pandas/0.16.2/#downloads, unpack it, open the corresponding directory in a DOS command window, and run python setup.py install. The installation may fail with the error: "error: Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat). Get it from http://aka.ms/vcpython27". In that case, go to http://www.microsoft.com/en-us/download/confirmation.aspx?id=44266, which automatically downloads Microsoft Visual C++ Compiler for Python 2.7; install it, then run python setup.py install again. The installation takes about 30 seconds. Once it succeeds, open an IDLE window and run
import pandas
to check whether the import succeeds. Once it works, you can move on to the next step.
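As an extra sanity check (a minimal sketch; the version string printed depends on what you actually installed), you can also confirm the installed version from the interpreter:

import pandas
# print the installed version, e.g. 0.16.2
print(pandas.__version__)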
Input and output typically fall into a few categories: reading text files and other, more efficient on-disk formats; loading data from databases; and interacting with network data sources through web APIs.
I. Reading text format data
pandas provides functions to read tabular data into a DataFrame object.
read_csv | Load delimited data from a file, URL, or file-like object. The default delimiter is a comma.
read_table | Load delimited data from a file, URL, or file-like object. The default delimiter is a tab ('\t').
read_fwf | Read data in fixed-width column format (no delimiters).
read_clipboard | Read data from the clipboard.
read_csv assigns default column names to the data. You can also specify column names yourself, for example pd.read_csv('ch06/ex2.csv', names=['A', 'B', 'C', 'D', 'message']).
If you want to use the message column as the DataFrame index, pass it to the index_col parameter:
names = ['A', 'B', 'C', 'D', 'message']
pd.read_csv('ch06/ex2.csv', names=names, index_col='message')
Writing data to text format
1. Using a DataFrame's to_csv method, you can write the data out to a comma-separated file. You can also choose a different delimiter with the sep parameter when calling data.to_csv (see the sketch after this list).
2. Missing values appear in the output as empty strings. You can use the na_rep parameter to represent them with some other sentinel value.
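A minimal sketch of both options, assuming a hypothetical DataFrame called data that contains a missing value (the file name out.csv is made up for illustration):

import pandas as pd
import numpy as np

# hypothetical DataFrame with one missing value
data = pd.DataFrame({'a': [1, 2], 'b': [3, np.nan]})
# write with '|' as the delimiter and 'NULL' as the sentinel for missing values
data.to_csv('out.csv', sep='|', na_rep='NULL')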
Manually processing delimited formats
For any file with a single-character delimiter, you can use Python's built-in csv module directly: pass any open file or file-like object to csv.reader:
import csv

f = open('ch06/ex7.csv')
reader = csv.reader(f)
Iterating over this reader yields a list of values for each line. To get the data into the shape you need, a bit of manual wrangling is required:
lines = list(csv.reader(open('ch06/ex7.csv')))
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
CSV files come in many different flavors. To define a new format, you only need to define a subclass of csv.Dialect:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # csv.Dialect requires quoting to be set

reader = csv.reader(f, dialect=my_dialect)
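The same dialect works for writing. A minimal sketch reusing my_dialect from above (the file name mydata.csv is made up for illustration):

import csv

# 'wb' mode for Python 2; on Python 3 use open('mydata.csv', 'w', newline='')
with open('mydata.csv', 'wb') as f_out:
    writer = csv.writer(f_out, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))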
II. JSON data
JSON has become one of the standard formats for sending data between web browsers and other applications via HTTP requests. It is a much more flexible data format than tabular text formats.
JSON is very close to valid Python code. The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and null. json.loads converts a JSON string into a Python object.
import json

# obj is a string containing JSON
result = json.loads(obj)
json.dumps, on the other hand, converts a Python object back to JSON.
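A minimal round-trip sketch (the JSON string obj below is made up for illustration):

import json

obj = '{"name": "example", "values": [1, 2, 3], "ok": true}'  # hypothetical JSON string
result = json.loads(obj)     # -> Python dict
asjson = json.dumps(result)  # -> back to a JSON string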
III. XML and HTML: web information collection
lxml can efficiently parse very large files. lxml has several programming interfaces; first we use lxml.html to process HTML, and then lxml.objectify for some XML processing.
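As a small sketch of the lxml.html interface (this assumes lxml is installed; the URL is simply the one used later in this post, and the links you get back depend on the page):

from urllib2 import urlopen  # Python 2.7, matching the setup above
from lxml.html import parse

# parse an HTML page and collect the href of every <a> tag
parsed = parse(urlopen('http://www.baidu.com'))
doc = parsed.getroot()
links = [lnk.get('href') for lnk in doc.findall('.//a')]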
(To be continued)
IV. Binary data formats
One of the simplest ways to store data in binary format is to use Python's built-in pickle serialization. pandas objects have a to_pickle method (the save method in older pandas versions) that writes the data to disk as a pickle, and the pandas.read_pickle function reads the data back into Python:
frame = pd.read_csv('ch06/ec1.csv')
frame.to_pickle('ch06/frame_pickle')

pd.read_pickle('ch06/frame_pickle')
Using the HDF5 format
HDF5 stands for hierarchical data format. Each HDF5 file contains a file-system-like node structure, which lets you store multiple datasets along with supporting metadata. HDF5 also supports on-the-fly compression with a variety of compressors.
Python has two interfaces for working with HDF5: PyTables and h5py.
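pandas also ships an HDFStore class built on top of PyTables. A minimal sketch, assuming PyTables is installed and reusing the frame from the pickle example (the file name mydata.h5 is made up):

import pandas as pd

store = pd.HDFStore('mydata.h5')  # requires the PyTables package
store['frame'] = frame            # store a DataFrame under a key
store['frame']                    # read it back, dict-style
store.close()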
Reading Excel files
pandas' ExcelFile class supports reading tabular data stored in Excel files. Since ExcelFile uses the xlrd and openpyxl packages, you have to install them first (https://pypi.python.org/pypi/xlrd). You create an ExcelFile instance by passing in the path of an xls or xlsx file; data stored in a sheet can then be read into a DataFrame with parse:
xls_file = pd.ExcelFile('data.xls')
table = xls_file.parse('Sheet1')
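For a one-off read, the pd.read_excel shortcut does the same thing in a single call (a sketch assuming the same hypothetical data.xls file and sheet name):

# equivalent one-liner; also needs xlrd/openpyxl installed
table = pd.read_excel('data.xls', 'Sheet1')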
V. Using HTML and web APIs
Many websites have public APIs that provide data in JSON or some other format. To access these APIs from Python, I recommend the requests package:
import requests

url = 'http://www.baidu.com'
resp = requests.get(url)
resp

import json
# json.loads expects the response body to be JSON (i.e. an API endpoint rather than a plain HTML page)
data = json.loads(resp.text)

After reading the response, you can go on to more advanced processing.
VI. Using databases
In real applications, data is rarely loaded from text files; far more often it comes from databases, both relational (SQL-based) and non-relational (NoSQL).
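As a minimal sketch, the standard-library sqlite3 module together with pandas.read_sql can load the result of a SQL query straight into a DataFrame (the table name and rows are made up for illustration):

import sqlite3
import pandas as pd

# an in-memory SQLite database with a hypothetical table
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a VARCHAR(20), b REAL)')
con.executemany('INSERT INTO test VALUES (?, ?)',
                [('Atlanta', 1.25), ('Tallahassee', 2.6)])
con.commit()

# read the query result back into a DataFrame
frame = pd.read_sql('SELECT * FROM test', con)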