Motivation
We spend a lot of time migrating data from common interchange formats, such as CSV, to efficient computation formats like arrays, databases, or binary storage. Worse, many people never migrate their data to efficient formats at all, because they don't know how (or can't manage) the particular migration method for their tools.
The format you choose for your data matters a great deal: it strongly affects the performance of your programs (a rough empirical rule is a 10x gap) and how easily others can understand and use your data.
When advocating for the Blaze project I often say, "Blaze can help you query data in a variety of formats." This assumes that you are able to actually get your data into the format you need in the first place.
Enter the into project
The into function efficiently migrates data between many different data formats. These formats include in-memory data structures such as:
lists, sets, tuples, iterators, NumPy ndarrays, pandas DataFrames, DyND arrays, and streaming sequences of any of the above.
It also includes persistent data that lives outside of a Python program, such as:
CSV, JSON, line-delimited JSON, and remote versions of all of the above
HDF5 (both the standard and the pandas format), BColz, SAS, SQL databases (anything supported by SQLAlchemy), Mongo
The into project can efficiently migrate data between any pair of these formats, using a network of pairwise conversions (explained intuitively at the bottom of this post).
How to use it
The into function takes two arguments: a source and a target. It converts the data in the source into the target. The source and target can take the following forms:
Target / Source   Example
Object            A particular DataFrame or list
String            'file.csv', 'postgresql://hostname::tablename'
Type              like list or pd.DataFrame
So the following are all legitimate calls to the into function:
>>> into(list, df)  # Create new list from pandas DataFrame
>>> into([], df)  # Append onto existing list
>>> into('myfile.json', df)  # Dump DataFrame to line-delimited JSON
>>> into(Iterator, 'myfiles.*.csv')  # Stream through many CSV files
>>> into('postgresql://hostname::tablename', df)  # Migrate DataFrame to Postgres
>>> into('postgresql://hostname::tablename', 'myfile.*.csv')  # Load CSVs into Postgres
>>> into('myfile.json', 'postgresql://hostname::tablename')  # Dump Postgres to JSON
>>> into(pd.DataFrame, 'mongodb://hostname/db::collection')  # Dump Mongo to DataFrame
Note that into is a single function. We're used to doing this with various to_csv, from_sql methods on various types, but the into API is very small. Here is what you need in order to get started:
$ pip install into
>>> from into import into
View the into project on GitHub.
Examples
Now let's run through some examples in a bit more depth.
Turn a Python list into a NumPy array
>>> import numpy as np
>>> into(np.ndarray, [1, 2, 3])
array([1, 2, 3])
Load a CSV file into a Python list
>>> into(list, 'accounts.csv')
[(1, 'Alice', 100),
 (2, 'Bob', 200),
 (3, 'Charlie', 300),
 (4, 'Denis', 400),
 (5, 'Edith', 500)]
Convert a CSV file to line-delimited JSON
>>> into('accounts.json', 'accounts.csv')
$ head accounts.json
{"balance": 100, "id": 1, "name": "Alice"}
{"balance": 200, "id": 2, "name": "Bob"}
{"balance": 300, "id": 3, "name": "Charlie"}
{"balance": 400, "id": 4, "name": "Denis"}
{"balance": 500, "id": 5, "name": "Edith"}
Convert line-delimited JSON into a pandas DataFrame
>>> import pandas as pd
>>> into(pd.DataFrame, 'accounts.json')
   balance  id     name
0      100   1    Alice
1      200   2      Bob
2      300   3  Charlie
3      400   4    Denis
4      500   5    Edith
How does it work?
Format conversion is challenging. Robust, efficient conversions between any two data formats are full of special cases and strange libraries. A common solution is to convert through a common format, such as a DataFrame or streaming in-memory lists and dicts (see dat), or through a serialization format such as Protobuf or Thrift. These are good choices and are often what you want. Sometimes, though, such conversions are slow, especially if you are converting on a live computational system or against a demanding storage solution.
Consider migrating data between a numpy.recarray and a pandas.DataFrame. We can migrate this data very quickly in place. The bytes of the data do not need to change; only the metadata around them does. We don't need to serialize to an interchange format or convert to intermediate pure Python objects along the way.
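A minimal sketch of that idea, using plain NumPy and pandas calls directly rather than into (the sample values below are illustrative only):

import numpy as np
import pandas as pd

# A small structured (record) array
rec = np.array([(1, 'Alice', 100), (2, 'Bob', 200)],
               dtype=[('id', 'i8'), ('name', 'U10'), ('balance', 'i8')])

# Wrap the same columns in a DataFrame -- largely a change of metadata
df = pd.DataFrame.from_records(rec)

# And back to a record array again
rec2 = df.to_records(index=False)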
Now consider migrating data from a CSV file to a PostgreSQL database. Using Python iterators with SQLAlchemy (a database toolkit for Python), our migration speed is unlikely to exceed 2,000 records per second. Using PostgreSQL's own CSV loader, however, we can migrate at more than 50,000 records per second. That is the difference between spending a whole night and a cup of coffee migrating data. However, it requires that we be flexible enough to use special-case code for special cases.
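As a rough sketch of the difference (not into's own code), assuming a psycopg2 connection and an existing accounts(id, name, balance) table, the two routes look like this; in practice you would pick one:

import csv
import psycopg2

conn = psycopg2.connect('postgresql://hostname/dbname')  # placeholder connection string
cur = conn.cursor()

# Generic route: push Python tuples through INSERT statements (the slow path)
with open('accounts.csv') as f:
    rows = csv.reader(f)
    next(rows)  # skip the header row
    cur.executemany("INSERT INTO accounts VALUES (%s, %s, %s)",
                    ((int(i), name, int(b)) for i, name, b in rows))

# Specialized route: hand the file straight to PostgreSQL's own CSV loader (COPY)
with open('accounts.csv') as f:
    cur.copy_expert("COPY accounts FROM STDIN WITH CSV HEADER", f)

conn.commit()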
Specialized pairwise interchange tools tend to be an order of magnitude faster than general-purpose solutions.
The into project is a network of these pairwise data migrations. We visualize this network in the figure below:
Each node is a data format. Each directed edge is a function that converts data between two formats. A single call to into may traverse multiple edges and multiple intermediate formats. For example, when we migrate a CSV file to a Mongo database, we might take the following path:
- Load the CSV file into a DataFrame (using pandas.read_csv)
- Then convert to an np.recarray (using DataFrame.to_records)
- Then convert to a Python Iterator (using np.ndarray.tolist)
- Finally convert to Mongo (using pymongo.Collection.insert)
Alternatively, we could write a special function that uses MongoDB's own CSV loader, shortening the whole process with a direct edge from CSV to Mongo.
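A hedged sketch of that intermediate-format path, written against pandas and pymongo directly (the file name and Mongo URI are placeholders, and pymongo's modern insert_many is used in place of insert):

import pandas as pd
from pymongo import MongoClient

df = pd.read_csv('accounts.csv')                  # CSV -> DataFrame
records = df.to_records(index=False)              # DataFrame -> np.recarray
docs = [dict(zip(df.columns, row)) for row in records.tolist()]  # recarray -> Python dicts
MongoClient('mongodb://hostname')['db']['collection'].insert_many(docs)  # dicts -> Mongo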
To find the most efficient route, we weight the edges of this network with ad hoc relative costs. We then use networkx to find the shortest path before migrating the data. If an edge fails for some reason (raising NotImplementedError), we can automatically re-route around it. This makes our migrations both efficient and robust.
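Here is a toy illustration of that routing step (not into's actual graph or weights) using networkx:

import networkx as nx

g = nx.DiGraph()
g.add_edge('csv', 'DataFrame', cost=10.0)       # pandas.read_csv
g.add_edge('DataFrame', 'recarray', cost=1.0)   # DataFrame.to_records
g.add_edge('recarray', 'Iterator', cost=1.0)    # np.ndarray.tolist
g.add_edge('Iterator', 'mongo', cost=10.0)      # pymongo insert
g.add_edge('csv', 'mongo', cost=5.0)            # hypothetical specialized loader

path = nx.shortest_path(g, 'csv', 'mongo', weight='cost')
print(path)  # ['csv', 'mongo'] -- the direct edge wins when it exists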
Notice that some of the nodes are painted red. The data in these nodes can be larger than memory. When we migrate between two red nodes (where both input and output may be larger than memory), we restrict our path to the red subgraph to ensure that intermediate data along the way does not blow up memory. One format worth noting is chunks(...), such as chunks(DataFrame), which is an iterable of in-memory DataFrames. This handy meta-format lets us use compact data structures like NumPy arrays and pandas DataFrames on large data while keeping only a few tens of megabytes in memory at a time.
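To make the chunks(DataFrame) idea concrete, here is a sketch of the same pattern with plain pandas, streaming a large CSV as an iterator of modest in-memory DataFrames (the file name, column, and chunk size are illustrative):

import pandas as pd

def dataframe_chunks(path, chunksize=100_000):
    # Yield successive DataFrames of at most `chunksize` rows from a CSV file
    for chunk in pd.read_csv(path, chunksize=chunksize):
        yield chunk

# Aggregate over chunks without ever holding the whole file in memory
total_balance = sum(chunk['balance'].sum() for chunk in dataframe_chunks('accounts.csv'))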
This networked approach lets developers write specialized code for special situations, confident that this code will only be used in the right circumstances. It lets us handle a very complex problem in a modular, separable way, and the central dispatch system keeps us sane.
History
A long time ago I wrote about into in a post linking it to Blaze, and then promptly went silent. That was because the old implementation (before the network approach) was difficult to extend and maintain, and was not ready for its prime time.
I'm very happy with the network approach. Unexpected applications often just work, and the into project is now ready for its prime time. It is available via conda and pip, independently of Blaze. Its main dependencies are numpy, pandas, and networkx, so it's relatively lightweight for most people who read my blog. If you want to take advantage of some of the higher-performance formats, such as HDF5, you will also need to install those libraries (pro-tip: use conda).
How to get started with into
You should download a recent version of the into project.
$ pip install --upgrade git+https://github.com/continuumio/into
or
$ conda install into --channel blaze
Then you might want to go through the first half of this tutorial, or read the docs.
Or read nothing and just try it out. My hope is that the interface is simple enough (just one function!) that users can pick it up naturally. If you run into problems, I would love to hear about them at blaze-dev@continuum.io.