Python Data Analysis Notes: Retrieval, Processing and Storage of Data


Data retrieval, processing and storage

1. Writing to a CSV file with NumPy and pandas

To write to a CSV file, use NumPy's savetxt() function. It is the counterpart of loadtxt() and can save an array in a delimited file format such as CSV:

np.savetxt('np.csv', a, fmt='%.2f', delimiter=',', header='#1, #2, #3, #4')

In this call we specify the file name, the array, an optional format string, the delimiter (the default is a space character), and an optional header.

Create a pandas DataFrame from the random array as follows:

df = pd.DataFrame(a)

Use the DataFrame's to_csv() method to write it to a CSV file:

df.to_csv('pd.csv', float_format='%.2f', na_rep='NAN!')

For this method we need to supply a file name, an optional format string analogous to the fmt parameter of NumPy's savetxt(), and an optional string that represents NaN values.
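Putting section 1 together, here is a minimal self-contained sketch of both approaches (the file names np.csv and pd.csv and the small 3x4 array are just for illustration):

import numpy as np
import pandas as pd

np.random.seed(42)             # fixed seed so the files are reproducible
a = np.random.randn(3, 4)      # small random array for demonstration

# NumPy: savetxt() is the counterpart of loadtxt()
np.savetxt('np.csv', a, fmt='%.2f', delimiter=',', header='#1, #2, #3, #4')

# pandas: write the same data through a DataFrame
df = pd.DataFrame(a)
df.to_csv('pd.csv', float_format='%.2f', na_rep='NAN!')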


2. The NumPy .npy format and pandas DataFrame

In most cases, saving data in CSV format is a good idea, because almost every programming language and application can handle it, which makes it very convenient for exchanging data. However, the format has a flaw: its storage efficiency is low, since CSV and other plain-text formats contain many redundant characters. File formats such as zip, bzip2 and gzip were invented later and offer significantly better compression.
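As a side note on compression, NumPy's savetxt() writes a gzip-compressed file transparently when the file name ends in .gz; a quick sketch (the file name compressed.csv.gz is illustrative):

import numpy as np

a = np.random.randn(365, 4)
# The .gz suffix makes savetxt() gzip the text output transparently
np.savetxt('compressed.csv.gz', a, delimiter=',')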

The complete example program for this section is shown below; the individual steps are explained afterwards:

import numpy as np
import pandas as pd
from tempfile import NamedTemporaryFile
from os.path import getsize

np.random.seed(42)
a = np.random.randn(365, 4)

tmpf = NamedTemporaryFile()
np.savetxt(tmpf, a, delimiter=',')
print("Size CSV file", getsize(tmpf.name))

tmpf = NamedTemporaryFile()
np.save(tmpf, a)
tmpf.seek(0)
loaded = np.load(tmpf)
print("Shape", loaded.shape)
print("Size .npy file", getsize(tmpf.name))

df = pd.DataFrame(a)
df.to_pickle(tmpf.name)
print("Size pickled dataframe", getsize(tmpf.name))
print("DF from pickle\n", pd.read_pickle(tmpf.name))


NumPy provides its own dedicated format, called .npy, for storing NumPy arrays. Before explaining this format further, let's build a 365x4 NumPy array and fill its elements with random values. The array can be viewed as a simulation of daily observations of four random variables over one year. Here we use Python's standard NamedTemporaryFile to store the data; the temporary files are deleted automatically afterwards.

First, save the array to a CSV file and check its size with the following code:

tmpf = NamedTemporaryFile()
np.savetxt(tmpf, a, delimiter=',')
print("Size CSV file", getsize(tmpf.name))


Next, save the array in NumPy's .npy format, load it back into memory, and check the array's shape and the size of the .npy file, as shown in the following code:

tmpf = NamedTemporaryFile()
np.save(tmpf, a)
tmpf.seek(0)
loaded = np.load(tmpf)
print("Shape", loaded.shape)
print("Size .npy file", getsize(tmpf.name))


To simulate closing and reopening the temporary file, we call the seek() function in the code above to rewind it. The array's shape and the file size are as follows:

Shape (365, 4)
Size .npy file 11760


As expected, the .npy file is only about one third the size of the CSV file: each float64 occupies 8 bytes, so the array needs 365 x 4 x 8 = 11,680 bytes plus a small header, while the CSV stores every number as roughly 25 characters of text. In fact, Python's pickle format can store arbitrarily complex data structures, so it can also be used to serialize pandas DataFrame or Series objects.

Tips:

In Python, pickle is a format for storing Python objects on disk or other media. Converting an object to this format is called serialization (pickling); reconstructing the Python object from the serialized data afterwards is called deserialization (unpickling).
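To make the terminology concrete, here is a minimal sketch using the standard pickle module directly on an arbitrary Python object (the object and file name are made up):

import pickle

obj = {'values': [1, 2, 3], 'label': 'demo'}   # an arbitrary Python object

# Serialization (pickling): write the object to disk
with open('obj.pkl', 'wb') as f:
    pickle.dump(obj, f)

# Deserialization (unpickling): reconstruct the object
with open('obj.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored == obj)   # True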

First create a DataFrame from the previously generated NumPy array, write it to a pickle file with the to_pickle() method, and then retrieve the DataFrame from the pickle with the read_pickle() function:

df = pd.DataFrame(a)
df.to_pickle(tmpf.name)
print("Size pickled DataFrame", getsize(tmpf.name))
print("DF from pickle\n", pd.read_pickle(tmpf.name))


The pickled DataFrame is slightly larger than the .npy file.

3. Storing data with PyTables

Hierarchical Data Format (HDF) is a specification for storing large numeric data. It originated in the supercomputing community and is now an open standard. The current version is HDF5, which organizes data using only two basic structures: groups and datasets. A dataset can be a multidimensional array of homogeneous type, and a group can contain other groups or datasets. A group here is very much like a directory in a hierarchical file system.
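To illustrate the group/dataset hierarchy, here is a short PyTables sketch (the file, group, and array names are made up):

import numpy as np
import tables

h5file = tables.open_file('demo.h5', mode='w')
# A group is like a directory in a file system
group = h5file.create_group('/', 'observations', 'Daily observations')
# A dataset is a homogeneous array stored inside a group
h5file.create_array(group, 'temps', np.random.randn(365), 'Daily temperatures')
print(h5file)   # shows the tree: /observations/temps
h5file.close()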

The two most common Python libraries for HDF5 are:

1. h5py

2. PyTables

In this example, PyTables is used. However, this library has several dependencies:

1. NumPy

2. numexpr: a package that evaluates array expressions containing multiple operations much faster than NumPy can

3. HDF5: if you use the parallel version of HDF5, you also need to install MPI.


According to the numexpr documentation, it is considerably faster than NumPy for some operations because it supports multithreading and implements its own virtual machine in C.
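A quick sketch of numexpr in action, assuming the numexpr package is installed:

import numpy as np
import numexpr as ne

a = np.random.randn(1000000)
# evaluate() compiles the whole expression and runs it in numexpr's
# multithreaded C virtual machine, avoiding NumPy temporary arrays
result = ne.evaluate('2 * a ** 3 + 4 * a')
print(result[:5])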


Again, we generate some random numbers and use them to fill a NumPy array. The following code creates an HDF5 file and attaches the NumPy array to its root node.

 fromtempfile import Namedtemporaryfileimport numpy asNP fromOs.path Import Getsizea=NP.RANDOM.RANDN (365,4) Tmpf=namedtemporaryfile () h5file=tables.openfile (tmpf.name,mode='W', title="Numpy Array") Root=H5file.rooth5file.createArray (Root,"Array", a) the H5file.close () #读取这个HDF5 and displays the file size H5file=tables.openfile (Tmpf.name,"R") Print (GetSize (tmpf.name) )


We can traverse the file to read the data inside:

for node in h5file.iter_nodes(h5file.root):
    b = node.read()
    print(type(b), b.shape)
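A node can also be addressed directly through PyTables' natural naming; a short sketch, assuming the array node is named "array" as in the code above:

arr = h5file.root.array.read()   # read the array stored at the root
print(arr.shape)
h5file.close()                   # close the file when done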


4. Read and write operations between pandas DataFrames and HDF5 stores

The HDFStore class can be seen as pandas' abstraction for handling HDF5 data. Using some random data and a temporary file, we can demonstrate its features in the following steps.

Pass the path of the temporary file to the HDFStore constructor to create a store:

store = pd.io.pytables.HDFStore(tmpf.name)
print(store)

The code above prints the store's file path and its contents, which are empty at the moment.


HDFStore provides a dictionary-like interface, so we can store values such as a pandas DataFrame under a lookup key. To store a DataFrame containing random data in the HDFStore, use the following code:

store['df'] = df
print(store)

We can access the DataFrame in three ways: with the get() method, with a dictionary-like lookup key, or with dotted attribute access:

Print ("Get", store.  Get('df'). Shape) Print ("Lookup", store[ ' DF ' ].shape) Print ("dotted", Store.df.shape)

All three ways return the shape of the same DataFrame.


To delete an item from the store, we can use either the remove() method or the del operator. Naturally, each item can only be deleted once.

del store['df']

The is_open property indicates whether the store is open. To close a store, call the close() method. The following code closes the store and then checks its status:

Print ("before close", Store.is_open) store.close () print ("after Close", Store.is_open)

For reading and writing HDF data, pandas also provides two shortcuts: the DataFrame's to_hdf() method and the top-level read_hdf() function.

df.to_hdf(tmpf.name, 'data', format='table')
print(pd.read_hdf(tmpf.name, 'data', where=['index>363']))

The read and write API takes a file path, an identifier for the group within the store, and an optional format string. There are two formats: fixed and table. The fixed format is faster, but data stored with it cannot be appended to or searched; the table format corresponds to a PyTables Table structure and allows searching and selecting subsets of the data.
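A sketch contrasting the two formats, continuing the example above (the keys fixed_df and table_df are illustrative):

# Fixed format: faster, but no appending or where-queries
df.to_hdf(tmpf.name, 'fixed_df', format='fixed')

# Table format: supports appending and searching
df.to_hdf(tmpf.name, 'table_df', format='table', append=True)
df.to_hdf(tmpf.name, 'table_df', format='table', append=True)  # appends the rows again
print(pd.read_hdf(tmpf.name, 'table_df', where=['index<2']))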


5. Read and write Excel files using pandas

A lot of important real-world data is stored in Excel files. You can of course convert such data to a more portable format like CSV, but it is often more convenient to manipulate Excel files directly from Python. In the Python world there is usually more than one project pursuing the same goal, and several projects provide Excel I/O functionality. As long as those modules are installed, pandas can read and write Excel files; the documentation in this area is not very complete, because the projects pandas depends on evolve independently and rapidly. The pandas Excel support requires the files to have an .xls or .xlsx suffix:

The openpyxl module, which originated from PHPExcel, provides read and write support for .xlsx files.

The XlsxWriter module is also needed, for writing .xlsx files.

The xlrd module can extract data from .xls and .xlsx files. Next, let's generate random numbers to populate a pandas DataFrame, create an Excel file from that DataFrame, recreate the DataFrame from the Excel file, and compute its mean with the mean() method. A worksheet in an Excel file can be specified either by a zero-based index or by its name.

import numpy as np
import pandas as pd
from tempfile import NamedTemporaryFile

np.random.seed(42)
a = np.random.randn(365, 4)

tmpf = NamedTemporaryFile(suffix='.xlsx')
df = pd.DataFrame(a)
print(tmpf.name)
df.to_excel(tmpf.name, sheet_name='Random Data')
print("Means\n", pd.read_excel(tmpf.name, 'Random Data').mean())


We create the Excel file with the to_excel() method and rebuild the DataFrame with the top-level read_excel() function.
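The worksheet can equally be addressed by its zero-based index; a small sketch continuing the example above:

# sheet_name accepts either a worksheet name or a 0-based index
print(pd.read_excel(tmpf.name, sheet_name=0).mean())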


6. Using pandas to read and write JSON

pandas provides a read_json() function that can create a pandas Series or pandas DataFrame data structure.

import pandas as pd

json_str = '{"country": "Netherlands", "dma_code": "0", "timezone": "Europe/Amsterdam", "area_code": "0", "ip": "46.19.37.108", "asn": "AS196752", "continent_code": "EU", "isp": "Tilaa V.O.F.", "longitude": 5.75, "latitude": 52.5, "country_code": "NL", "country_code3": "NLD"}'

data = pd.read_json(json_str, typ='series')
print("Series\n", data)

data["country"] = "Brazil"
print("New series\n", data.to_json())

When calling the read_json() function, you can pass it either a JSON string or the path to a JSON file. In the example above, we use a JSON string to create a pandas Series.

We then modify the country value and convert the Series back to a JSON string with the to_json() method.
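read_json() builds a DataFrame just as easily when the JSON describes tabular data; a minimal sketch with made-up records:

from io import StringIO
import pandas as pd

records = '[{"country": "Netherlands", "code": "NL"}, {"country": "Brazil", "code": "BR"}]'
df = pd.read_json(StringIO(records))    # typ defaults to 'frame'
print(df)
print(df.to_json(orient='records'))     # back to a JSON string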
