A tutorial on using the into package for clean data migration in Python

Motivation

We spend a lot of time migrating data from common interchange formats (such as CSV) to efficient computation formats like arrays, databases, or binary stores. Worse, many people never migrate their data to efficient formats at all, because they don't know how (or can't manage) the specific migration methods for their particular tools.

The data format you choose matters. It can strongly affect program performance (a rule of thumb suggests a 10x gap), as well as how easily others can use and understand your data.

When advocating for the Blaze project, I often say: "Blaze can help you query data in a variety of formats." This assumes that you are actually able to get your data into the format in question in the first place.

Enter the into project

The into function efficiently migrates data between many different data formats. These formats include in-memory data structures, such as:

lists, sets, tuples, iterators, NumPy ndarrays, pandas DataFrames, dynd arrays, and streaming sequences of any of the above.

They also include data persisted outside of the Python program, such as:

CSV, JSON, line-delimited JSON, and remote versions of all of the above

HDF5 (both standard and pandas formatting), BColz, SAS, SQL databases (anything supported by SQLAlchemy), Mongo

The into project can efficiently migrate data between any pair of these formats. It does so by routing through a network of pairwise conversions (there is an intuitive explanation at the bottom of this article).

How to use it

The into function takes two arguments, a target and a source, and converts data from the source into the target. The target and source can take the following forms:

Target    Source    Example

Object    Object    A particular DataFrame or list

String    String    'file.csv', 'postgresql://hostname::tablename'

Type                Like list or pd.DataFrame

So the following are all valid calls to the into function:

>>> into(list, df)  # Create new list from pandas DataFrame
>>> into([], df)  # Append onto existing list
>>> into('myfile.json', df)  # Dump DataFrame to line-delimited JSON
>>> into(Iterator, 'myfiles.*.csv')  # Stream through many CSV files
>>> into('postgresql://hostname::tablename', df)  # Migrate DataFrame to Postgres
>>> into('postgresql://hostname::tablename', 'myfile.*.csv')  # Load CSVs into Postgres
>>> into('myfile.json', 'postgresql://hostname::tablename')  # Dump Postgres to JSON
>>> into(pd.DataFrame, 'mongodb://hostname/db::collection')  # Dump Mongo to DataFrame

Note that into is a single function. We are used to doing these kinds of conversions with a variety of methods like to_csv and from_sql on various types, but the into API is very small. Here is all you need to get started:

$ pip install into
>>> from into import into

View the into project on GitHub.

Examples

Now let's walk through some of these same examples in a bit more depth.

Convert a Python list into a NumPy array

>>> import numpy as np
>>> into(np.ndarray, [1, 2, 3])
array([1, 2, 3])

Load a CSV file into a Python list

>>> into(list, 'accounts.csv')
[(1, 'Alice', 100), (2, 'Bob', 200), (3, 'Charlie', 300), (4, 'Denis', 400), (5, 'Edith', 500)]

Convert a CSV file to line-delimited JSON

>>> into('accounts.json', 'accounts.csv')
$ head accounts.json
{"balance": 100, "id": 1, "name": "Alice"}
{"balance": 200, "id": 2, "name": "Bob"}
{"balance": 300, "id": 3, "name": "Charlie"}
{"balance": 400, "id": 4, "name": "Denis"}
{"balance": 500, "id": 5, "name": "Edith"}

Convert line-delimited JSON into a pandas DataFrame

>>> import pandas as pd
>>> into(pd.DataFrame, 'accounts.json')
   balance  id     name
0      100   1    Alice
1      200   2      Bob
2      300   3  Charlie
3      400   4    Denis
4      500   5    Edith

How does it work?

Format conversion is challenging. Robust, efficient conversion between any two data formats is riddled with special cases and strange libraries. A common solution is to convert through a common format, such as a DataFrame or streaming in-memory lists and dicts (see dat), or through a serialization format such as Protobuf or Thrift. These are fine choices and are often what you want. Sometimes, though, such conversions are slow, especially when you are working on a live computational system or dealing with a finicky storage solution.

Consider migrating data between a numpy.recarray and a pandas.DataFrame. We can migrate this data very quickly and in place. The bytes of data do not need to change; only the metadata surrounding them does. We do not need to serialize to an interchange format or translate to intermediate pure Python objects.
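To make this concrete, here is a minimal sketch of the same round trip using pandas' own APIs rather than into itself; the column names and values are invented for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})

# DataFrame -> NumPy record array: no interchange format and no
# intermediate pure Python objects are involved.
records = df.to_records(index=False)

# And back again, by reinterpreting the same records as a DataFrame.
df2 = pd.DataFrame.from_records(records)

# With into installed, the equivalent calls would be roughly
# into(np.recarray, df) and into(pd.DataFrame, records).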

Consider migrating data from a CSV file to a PostgreSQL database. Using Python iterators together with SQLAlchemy (a database toolkit for Python), our migration speed is unlikely to exceed 2,000 records per second. Using PostgreSQL's built-in CSV loader, however, we can migrate at more than 50,000 records per second. That is the difference between running a migration overnight and finishing it over a cup of coffee. It does, however, require that we be flexible enough to use special code in special situations.
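The difference comes down to how the rows reach the database. The sketch below contrasts the two approaches; it assumes the psycopg2 driver, a reachable Postgres server, and a hypothetical accounts table whose columns match the CSV, none of which come from the original post:

import csv
import psycopg2  # assumed driver; any DB-API driver would look similar

conn = psycopg2.connect("postgresql://hostname/db")  # hypothetical DSN
cur = conn.cursor()

# Slow path: push rows one at a time through Python and the DB-API.
with open("accounts.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        cur.execute("INSERT INTO accounts VALUES (%s, %s, %s)", row)

# Fast path: hand the whole file to Postgres' own CSV loader via COPY.
with open("accounts.csv") as f:
    cur.copy_expert("COPY accounts FROM STDIN WITH CSV HEADER", f)

conn.commit()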

Specialized pairwise conversion tools tend to be an order of magnitude faster than generic solutions.

The into project is a network of these pairwise data migrations. That network works as follows:

Each node is a data format. Each directed edge is a function that converts data between two formats. A single call to into may traverse several edges and several intermediate formats. For example, when we migrate a CSV file to a Mongo database, we might take the following path (sketched in code after this list):

• Load the CSV file into a DataFrame (with pandas.read_csv)

• Convert that to an np.recarray (with DataFrame.to_records)

• Convert that to a Python iterator (with np.ndarray.tolist)

• Finally insert the data into Mongo (with pymongo.Collection.insert)
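Spelled out by hand, that default path looks roughly like the sketch below. It assumes a reachable MongoDB instance, uses pymongo's insert_many rather than the older insert call mentioned above, and the column, database, and collection names are illustrative:

import pandas as pd
import pymongo

# CSV -> DataFrame
df = pd.read_csv("accounts.csv")

# DataFrame -> NumPy record array
records = df.to_records(index=False)

# record array -> plain Python dicts
docs = [{"id": int(r["id"]), "name": str(r["name"]), "balance": int(r["balance"])}
        for r in records]

# dicts -> MongoDB
client = pymongo.MongoClient("mongodb://hostname")  # hypothetical host
client["db"]["collection"].insert_many(docs)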

Alternatively, we could use MongoDB's own CSV loader and write a specialized function that short-circuits this entire process with a single direct edge from CSV to Mongo.
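Such a specialized edge could simply shell out to MongoDB's mongoimport tool, along the lines of this sketch (host, database, collection, and file names are illustrative):

import subprocess

# Let mongoimport do the CSV parsing and loading in a single step.
subprocess.check_call([
    "mongoimport",
    "--host", "hostname",
    "--db", "db",
    "--collection", "collection",
    "--type", "csv",
    "--headerline",      # take field names from the CSV header row
    "--file", "accounts.csv",
])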

To find the most efficient route, we weight all the edges of the network with relative costs (ad hoc weights). We then use networkx to find the shortest path before migrating the data. If an edge fails for some reason (raising NotImplementedError), we automatically re-route around it. This way our migrations are both efficient and robust.
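The routing itself is ordinary weighted shortest-path search. Here is a toy sketch of the idea using networkx; the node names, costs, and the direct csv-to-mongo edge are invented for illustration and are not into's actual graph:

import networkx as nx

graph = nx.DiGraph()

# Each directed edge represents a conversion function with a rough relative cost.
graph.add_edge("csv", "DataFrame", cost=10.0)
graph.add_edge("DataFrame", "recarray", cost=1.0)
graph.add_edge("recarray", "iterator", cost=5.0)
graph.add_edge("iterator", "mongo", cost=20.0)
graph.add_edge("csv", "mongo", cost=15.0)  # a specialized direct edge

# Find the cheapest sequence of conversions from CSV to Mongo.
path = nx.shortest_path(graph, "csv", "mongo", weight="cost")
print(path)  # ['csv', 'mongo'] here, because the direct edge is cheapest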

Notice that some nodes of the network are marked red. The data at these nodes can be larger than memory. When we migrate between two red nodes (where both the input and the output may be larger than memory), we restrict our path to stay within the red subgraph, to ensure that the intermediate data along the migration path does not blow up either. One format to be aware of is chunks(...), e.g. chunks(DataFrame), which is an iterable of in-memory DataFrames. This handy meta-format lets us use compact data structures such as NumPy arrays and pandas DataFrames on large data while keeping only a few tens of megabytes in memory at a time.
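The chunks(...) wrapper itself belongs to the into library, but the underlying idea can be illustrated with pandas' own chunked CSV reader, which also yields one in-memory DataFrame at a time (the chunk size and file name here are illustrative):

import pandas as pd

total = 0
# Stream the file as a sequence of in-memory DataFrames, never holding
# the whole dataset in memory at once.
for chunk in pd.read_csv("accounts.csv", chunksize=100_000):
    total += chunk["balance"].sum()

print(total)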

This networked approach lets developers write specialized code for special situations and trust that the code will only be used in the right context. It lets us handle a very complex problem in an isolated, separable way, and the central dispatching system keeps our heads clear.

History

A long time ago I wrote into, linked to it from a Blaze article, and then immediately went silent. That was because the old implementation (before the network approach) was difficult to extend and maintain, and was not ready for prime time.

I am very happy with the network approach. Unanticipated applications often just work, and into is now ready for prime time. It is available via conda and pip, and it is independent of Blaze. Its main dependencies are NumPy, pandas, and networkx, so it is relatively lightweight for most people who read my blog. If you want to take advantage of some of the better-performing formats, such as HDF5, you will also need to install those libraries (pro-tip: use conda).

How to get started with the into function

You should download a recent version of the into project.

$ pip install --upgrade git+https://github.com/continuumio/into
or
$ conda install into --channel blaze

Then you may want to work through the examples in the first half of this tutorial, or read the documentation.

Or don't read anything and just try it. My hope is that the interface is simple enough (just one function!) that users can pick it up naturally. If you run into problems, I would love to hear about them at blaze-dev@continuum.io.
