Python Toolkits for Formatting and Cleaning Data

Source: Internet
Author: User
Tags: beautifier
The world is messy, and real-world data is just as messy. A recent survey found that data scientists spend 60% of their time cleaning and organizing data. Unfortunately, 57% of them also rate it as the least enjoyable part of the job.

Organizing data is time-consuming, but a number of tools have been developed to make this critical step a little more bearable. The Python community offers many libraries for getting data clean and orderly, from formatting DataFrames to anonymizing datasets.

Let us know which libraries you find useful. We're always looking to prioritize which libraries to add to Mode Python Notebooks.

Dora

Dora is designed for exploratory analysis. Specifically, it automates the most painful parts of that work, such as feature selection and extraction, visualization, and, as you might guess, data cleaning. Its data cleaning functions cover:

Reading data that has missing or poorly scaled values

Imputing missing values

Scaling the values of input variables
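
As a rough illustration of the last two steps, here is a standard-library sketch that fills gaps with the column mean and then rescales the column to zero mean and unit variance. This is not Dora's API: the helper names `impute_mean` and `standardize` are hypothetical, and mean imputation with z-score scaling is an assumption about what a reasonable default looks like.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread to scale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

heights_cm = [150.0, None, 170.0, 190.0]   # made-up sample column
filled = impute_mean(heights_cm)
scaled = standardize(filled)
print(filled)
print(scaled)
```

Imputing before scaling matters here: the mean and standard deviation used for scaling are computed on the filled column, so the result has no gaps.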

Developer: Nathan Epstein
More information: https://github.com/NathanEpstein/Dora

DataCleaner

Surprise, surprise: DataCleaner cleans your data, but only if it is already in a pandas DataFrame. As developer Randy Olson puts it: "DataCleaner is not magic, it can't magically parse your unstructured data."

It can drop rows that contain missing values, or fill missing values with the mode or median of the column, and it encodes non-numeric variables as numeric ones. The library is fairly new, but since the DataFrame is the fundamental data structure for data analysis in Python, it is worth trying.
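
To make those three operations concrete, here is a plain-Python sketch over a list of dict rows. It is not DataCleaner's code and does not use pandas. The function names (`drop_incomplete`, `fill_missing`, `encode_categories`) and the sample rows are hypothetical, and the choice of median for numeric columns and mode for everything else is an assumption based on the description above.

```python
from statistics import median, mode

def drop_incomplete(rows):
    """Keep only rows that have no missing (None) values."""
    return [row for row in rows if None not in row.values()]

def fill_missing(rows):
    """Fill None with the column median (numeric) or mode (everything else)."""
    filled = [dict(row) for row in rows]          # copy so the input is untouched
    for col in rows[0]:
        observed = [row[col] for row in rows if row[col] is not None]
        numeric = all(isinstance(v, (int, float)) for v in observed)
        fill = median(observed) if numeric else mode(observed)
        for row in filled:
            if row[col] is None:
                row[col] = fill
    return filled

def encode_categories(rows, col):
    """Replace each distinct label in `col` with an integer code."""
    labels = sorted({row[col] for row in rows})
    codes = {label: i for i, label in enumerate(labels)}
    return [{**row, col: codes[row[col]]} for row in rows]

rows = [
    {"age": 31, "city": "Oslo"},
    {"age": None, "city": "Lima"},
    {"age": 45, "city": None},
    {"age": 38, "city": "Oslo"},
]
complete = drop_incomplete(rows)
cleaned = encode_categories(fill_missing(rows), "city")
print(complete)
print(cleaned)
```

Dropping rows is the simpler option but discards half of this sample; filling keeps every row at the cost of inventing plausible values.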

Developer: Randy Olson
More information: https://github.com/rhiever/datacleaner

PrettyPandas

DataFrames are powerful, but they don't produce the kind of table you can show directly to your boss. PrettyPandas uses the pandas Style API to turn DataFrames into presentation-ready tables. You can generate summaries, set styles, and adjust the formatting of numbers, columns, and rows. Added bonus: robust, highly readable documentation.

Developer: Henry Hammond
More information: https://github.com/HHammond/PrettyPandas

Tabulate

Tabulate lets you generate small, good-looking tables with a single function call. It is ideal for making tables more readable by adjusting decimal alignment in columns, number formatting, headers, and more.

Another cool feature is that it can output tables in different formats, such as HTML or PHP Markdown Extra, so you can keep working with the formatted data in other tools or languages.
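
To show what a pipe-delimited (PHP Markdown Extra style) table with aligned decimals looks like, here is a hand-rolled sketch using only the standard library. It is not Tabulate's implementation; the function name `pipe_table`, its `floatfmt` default, and the sample data are assumptions for illustration.

```python
def pipe_table(headers, rows, floatfmt=".2f"):
    """Render rows as a pipe-delimited, Markdown-style table."""
    def cell(value):
        # Floats get a fixed number of decimals so the decimal points line up.
        return format(value, floatfmt) if isinstance(value, float) else str(value)

    body = [[cell(v) for v in row] for row in rows]
    columns = range(len(headers))
    widths = [max([len(headers[i])] + [len(r[i]) for r in body]) for i in columns]
    # A column is right-aligned when every value in it is a number.
    numeric = [all(isinstance(r[i], (int, float)) for r in rows) for i in columns]

    def line(cells):
        padded = [c.rjust(w) if n else c.ljust(w)
                  for c, w, n in zip(cells, widths, numeric)]
        return "| " + " | ".join(padded) + " |"

    # The colon marks the alignment side, as in Markdown pipe tables.
    rule = "|" + "|".join(
        "-" * (w + 1) + ":" if n else ":" + "-" * (w + 1)
        for w, n in zip(widths, numeric)
    ) + "|"
    return "\n".join([line(headers), rule] + [line(r) for r in body])

table = pipe_table(["planet", "radius_km"], [["Mercury", 2439.7], ["Earth", 6371.0]])
print(table)
```

With the library installed, the same kind of output should come from one call along the lines of `tabulate(rows, headers=headers, tablefmt="pipe", floatfmt=".2f")`, which is the point of using it instead of code like the above.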

Developer: Sergey Astanin
More information: https://pypi.python.org/pypi/tabulate

Scrubadub

Data scientists in the health and financial sectors often need to anonymize datasets. Scrubadub can remove personally identifiable information (PII) from free text, for example:

Names (proper nouns)

Email addresses

URLs

Phone numbers

Username/password combinations

Skype usernames

Social Security numbers

The documentation does a good job of showing how to customize Scrubadub's behavior, such as defining new types of PII or leaving specific PII in place.
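
The idea of replacing detected PII with placeholders, and of choosing which kinds to leave alone, can be sketched with a few regular expressions. This is a toy illustration and not Scrubadub's detection logic: the `PATTERNS` table, the `scrub` function, its `keep` parameter, and the `{{LABEL}}` placeholder style are all assumptions, and the patterns are deliberately simplistic.

```python
import re

# Deliberately simple patterns; real PII detection needs far more care.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"https?://[^\s,]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text, keep=()):
    """Replace each match with a {{LABEL}} placeholder, skipping labels in `keep`."""
    for label, pattern in PATTERNS.items():
        if label in keep:
            continue
        text = pattern.sub("{{" + label + "}}", text)
    return text

message = "Mail jane.doe@example.com or see https://example.com/profile, SSN 123-45-6789."
print(scrub(message))
print(scrub(message, keep=("URL",)))
```

Adding a new kind of PII in this sketch means adding one entry to `PATTERNS`; preserving a kind means passing its label in `keep`.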

Developer: Datascope Analytics
More information: http://scrubadub.readthedocs.io/en/stable/index.html

Arrow

Let's be honest: dealing with dates and times in Python is painful. Time zones are not recognized automatically, and converting time zones and timestamps takes several lines of unwieldy code.

Arrow aims to solve this problem and fill that functionality gap, so that you can work with dates and times using less code and fewer imports. Unlike Python's standard library, Arrow is timezone-aware and uses UTC by default. You can convert time zones or parse time strings in a single line of code.
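
For comparison, this is roughly what "parse a timestamp and convert its time zone" looks like with only the standard library. It is a sketch of the manual steps Arrow is meant to shorten, not Arrow code. The timestamp is made up, and a fixed UTC+9 offset stands in for a named time zone so that the example needs no time zone database.

```python
from datetime import datetime, timedelta, timezone

raw = "2016-07-01T12:30:00+00:00"

# Parse an ISO 8601 string; the offset in the string makes the result timezone-aware.
utc_time = datetime.fromisoformat(raw)

# Build the target zone by hand (fixed offset, no daylight saving rules).
tokyo = timezone(timedelta(hours=9), "JST")
local_time = utc_time.astimezone(tokyo)

# A datetime built without tzinfo is naive: it carries no time zone at all,
# so UTC has to be attached explicitly before it can be compared or converted.
naive = datetime(2016, 7, 1, 12, 30)
aware = naive.replace(tzinfo=timezone.utc)

print(local_time.isoformat())
print(naive.tzinfo, aware == utc_time)
```

With Arrow installed, the parse-and-convert steps should collapse to something like `arrow.get(raw).to("Asia/Tokyo")`.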

Developer: Chris Smith
More information: http://arrow.readthedocs.io/en/latest/

Beautifier

Beautifier has a simple job: clean up URLs and email addresses and make them look prettier. You can parse emails by domain and username, and URLs by domain and parameters (such as UTM tags or tokens).
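
The same kind of parsing can be sketched with the standard library's `urllib.parse`. This is not Beautifier's API: the helpers `split_email` and `clean_url`, the returned dictionary keys, and the rule of dropping every parameter that starts with `utm_` are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def split_email(address):
    """Split an email address into its username and domain parts."""
    username, _, domain = address.rpartition("@")
    return {"username": username, "domain": domain}

def clean_url(url):
    """Report a URL's domain and parameters, plus a copy without UTM tags."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query)
    kept = [(key, value) for key, value in params if not key.startswith("utm_")]
    cleaned = urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
    return {"domain": parts.netloc, "params": dict(params), "cleaned": cleaned}

email = split_email("sachin@example.com")
info = clean_url("https://example.com/pricing?plan=pro&utm_source=newsletter&utm_medium=email")
print(email)
print(info["cleaned"])
```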

Developer: Sachin Philip Mathew
More information: https://github.com/sachinvettithanam/beautifier

Ftfy

Ftfy ("fixes text for you") takes in bad Unicode and outputs good Unicode. Basically, it fixes all the junk characters: “quotesâ€\x9d becomes "quotes"; Ã¼ becomes ü; &lt;3 becomes <3. If you work with text on a daily basis, this library is, as one user says, "a handy piece of magic."
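
Where do junk characters like these come from? A common cause is UTF-8 bytes being decoded with the wrong encoding. The standard-library sketch below reproduces that one mistake and reverses it. It is not ftfy's algorithm, which has to detect the damage automatically; the helper name `undo_latin1_mojibake` is hypothetical and only handles the single case of UTF-8 read as Latin-1.

```python
import html

def undo_latin1_mojibake(text):
    """Reverse one specific mistake: UTF-8 bytes that were decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage, so leave the text alone

original = "Müller"
# Simulate the damage: encode correctly, then decode with the wrong codec.
mojibake = original.encode("utf-8").decode("latin-1")
repaired = undo_latin1_mojibake(mojibake)

# Escaped HTML entities are a separate kind of debris with a separate fix.
heart = html.unescape("&lt;3")

print(repr(mojibake), repr(repaired), repr(heart))
```

With the library installed, `ftfy.fix_text(text)` is the entry point that aims to handle these cases, and many others, without being told which mistake occurred.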

Developer: Luminoso
More information: https://github.com/LuminosoInsight/python-ftfy
