The world is messy, and data from the real world is just as messy. A recent survey found that data scientists spend 60% of their time cleaning and organizing data. Unfortunately, 57% of respondents said it is the most frustrating part of the job.
Organizing data is time-consuming, but a number of tools have been developed to make this critical step a little more bearable. The Python community offers many libraries for getting data clean and orderly, from formatting DataFrames to anonymizing datasets.
Tell us which libraries you find useful: we are always working to improve the set of libraries included in Mode Python Notebooks.
Dora
Dora is designed for exploratory analysis, specifically for automating its most painful parts, such as feature selection and extraction, visualization, and, as you might guess, data cleaning. Its data-cleaning functions can:
Read data that contains missing and poorly scaled values
Impute missing values
Scale (normalize) variables
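To make the last two items concrete, here is a minimal standard-library sketch of imputation and scaling. It does not use Dora's API: the function names impute_mean and standardize are hypothetical, and filling with the mean and rescaling to zero mean and unit variance are assumptions about what such steps typically look like.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


raw = [1.0, None, 3.0]
filled = impute_mean(raw)      # [1.0, 2.0, 3.0]
scaled = standardize(filled)   # centered on 0 with unit spread
print(filled, scaled)
```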
Developer: Nathan Epstein
More information: https://github.com/NathanEpstein/Dora
DataCleaner
Unsurprisingly, DataCleaner cleans your data, but only if that data is already in a pandas DataFrame. As developer Randy Olson puts it, DataCleaner is not magic and will not magically parse your unstructured data.
It can drop rows that contain missing values, or fill missing values with the mode or median of each column, and it can encode non-numeric variables as numeric ones. The library is new, but given that the DataFrame is the basic data structure for data analysis in Python, it is worth trying.
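The two operations described above can be sketched with the standard library alone. This is an illustration of the idea, not DataCleaner's API or implementation: the function names fill_missing and encode_labels are hypothetical, and a column is assumed to be a plain list with None marking a missing value.

```python
from statistics import median, mode


def fill_missing(column):
    """Fill None with the median (numeric column) or the mode (anything else)."""
    observed = [v for v in column if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in observed)
    fill = median(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in column]


def encode_labels(column):
    """Map each distinct label to an integer code, in sorted label order."""
    codes = {label: i for i, label in enumerate(sorted(set(column)))}
    return [codes[v] for v in column]


ages = fill_missing([1, None, 3, 10])            # [1, 3, 3, 10]
colors = fill_missing(["a", None, "b", "a"])     # ['a', 'a', 'b', 'a']
print(ages, encode_labels(colors))               # codes: [0, 0, 1, 0]
```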
Developer: Randy Olson
More information: https://github.com/rhiever/datacleaner
PrettyPandas
DataFrames are powerful, but they do not produce the kind of table you can put straight in front of your boss. PrettyPandas uses the pandas Style API to turn DataFrames into presentation-ready tables. It can generate summaries, set styles, and adjust the formatting of numbers, columns, and rows. Added bonus: robust, highly readable documentation.
Developer: Henry Hammond
More information: https://github.com/HHammond/PrettyPandas
Tabulate
Tabulate lets you print small, attractive tables with a single function call. It is well suited to making tables more readable, with column alignment by decimal point, number formatting, headers, and more.
One nice feature is that tables can be output in different formats, such as HTML or PHP Markdown Extra, so you can keep working with your already formatted data in other tools or languages.
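To show what a pipe-delimited, Markdown-style table looks like, here is a small hand-rolled renderer that uses only the standard library. It is not Tabulate and makes no claim about how Tabulate works internally; the function name pipe_table is hypothetical, and it only pads columns to a common width.

```python
def pipe_table(headers, rows):
    """Render rows as a pipe-delimited table with left-aligned, padded columns."""
    cells = [[str(h) for h in headers]] + [[str(c) for c in row] for row in rows]
    widths = [max(len(row[i]) for row in cells) for i in range(len(headers))]

    def line(row):
        padded = (cell.ljust(width) for cell, width in zip(row, widths))
        return "| " + " | ".join(padded) + " |"

    # Separator row between the header and the body.
    rule = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([line(cells[0]), rule] + [line(r) for r in cells[1:]])


print(pipe_table(["name", "qty"], [["apple", 3], ["kiwi", 12]]))
```

The real library handles much more in one call, including number alignment and multiple output formats.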
Developer: Sergey Astanin
More information: https://pypi.python.org/pypi/tabulate
Scrubadub
Data scientists in the health and financial sectors often need to anonymize datasets. Scrubadub removes personally identifiable information (PII) from free text, such as:
Names
Email addresses
URLs
Phone numbers
Username and password combinations
Skype usernames
Social Security numbers
The documentation does a good job of showing how you can customize Scrubadub's behavior, for example by defining new kinds of PII or leaving specific kinds of PII in place.
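As a rough illustration of what redacting PII from text involves, the sketch below replaces email addresses and US Social Security numbers with placeholders using regular expressions. This is a swapped-in, simplified technique and not Scrubadub's implementation: the patterns, the function name redact, and the placeholder strings are all assumptions, and names, phone numbers, and the other categories listed above are not handled.

```python
import re

# Deliberately simple patterns; real-world PII detection needs far more care.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text):
    """Replace email addresses and SSN-shaped numbers with placeholders."""
    text = EMAIL.sub("{{EMAIL}}", text)
    return SSN.sub("{{SSN}}", text)


print(redact("Mail jane.doe@example.com, SSN 123-45-6789."))
# Mail {{EMAIL}}, SSN {{SSN}}.
```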
Developer: Datascope Analytics
More information: http://scrubadub.readthedocs.io/en/stable/index.html
Arrow
Let's be honest: dealing with dates and times in Python is painful. Local time zones are not recognized automatically, and converting between time zones and timestamps takes several lines of unpleasant code.
Arrow aims to solve this problem and fill these functionality gaps, so that you can work with dates and times using less code and fewer imports. Unlike Python's standard date and time modules, Arrow is time zone aware and uses UTC by default. You can convert between time zones or parse a time string in a single line of code.
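For contrast, this is roughly what parsing a timestamp and converting it between zones looks like with only the standard library. It does not use Arrow; the helper names to_utc and shift_zone are hypothetical, and fixed UTC offsets stand in for named time zones to keep the sketch self-contained.

```python
from datetime import datetime, timedelta, timezone


def to_utc(timestamp):
    """Parse an ISO 8601 string that carries an explicit offset; return UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # A naive timestamp has no zone; refuse to guess one.
        raise ValueError("timestamp has no UTC offset")
    return parsed.astimezone(timezone.utc)


def shift_zone(moment, offset_hours):
    """Express an aware datetime in a fixed-offset zone."""
    return moment.astimezone(timezone(timedelta(hours=offset_hours)))


utc_moment = to_utc("2016-05-11T21:23:58+02:00")
print(utc_moment.isoformat())                   # 2016-05-11T19:23:58+00:00
print(shift_zone(utc_moment, -7).isoformat())   # 2016-05-11T12:23:58-07:00
```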
Developer: Chris Smith
More information: http://arrow.readthedocs.io/en/latest/
Beautifier
Beautifier's task is simple: clean up URLs and email addresses and make them look prettier. You can parse an email address by domain and username, and a URL by domain and parameters (such as UTM parameters or tokens).
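The kind of parsing described here can be sketched with the standard library. This is not Beautifier's API; the function names split_email and split_url and the dictionary keys are hypothetical, chosen only to mirror the "domain and username" and "domain and parameters" split.

```python
from urllib.parse import parse_qs, urlsplit


def split_email(address):
    """Split an email address into username and domain at the last '@'."""
    username, _, domain = address.rpartition("@")
    return {"username": username, "domain": domain}


def split_url(url):
    """Return the domain and the query parameters (e.g. UTM tags) of a URL."""
    parts = urlsplit(url)
    return {"domain": parts.netloc, "params": parse_qs(parts.query)}


print(split_email("jane@example.com"))
print(split_url("https://example.com/page?utm_source=news&utm_medium=email"))
```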
Developer: Sachin Philip Mathew
More information: https://github.com/sachinvettithanam/beautifier
Ftfy
Ftfy ("fixes text for you") takes in bad Unicode and outputs good Unicode. Simply put, it fixes all the junk characters: â€œquotesâ€\x9d becomes “quotes”, Ã¼ becomes ü, and &lt;3 becomes <3. If you work with text on a daily basis, this library is, as one user says, "a handy piece of magic."
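The most common kind of junk comes from UTF-8 bytes that were decoded with the wrong codec. The sketch below reverses that one mistake and also unescapes HTML entities, using only the standard library. It is not ftfy's algorithm; the function names repair_mojibake and clean are hypothetical, and the sketch assumes the damage was a single Windows-1252 misdecoding.

```python
import html


def repair_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or not damaged at all); leave it alone.
        return text


def clean(text):
    """Repair the encoding first, then turn HTML entities into characters."""
    return html.unescape(repair_mojibake(text))


broken_umlaut = "\u00c3\u00bc"                 # displays as the two characters Ã¼
broken_quote = "\u00e2\u20ac\u0153quotes"      # displays as â€œquotes
print(clean(broken_umlaut))                    # ü
print(clean(broken_quote))                     # “quotes
print(clean("&lt;3"))                          # <3
```

A blind round trip like this can also mangle text that was never broken, which is why a dedicated tool relies on heuristics to decide when a repair is warranted.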
Developer: Luminoso
More information: https://github.com/LuminosoInsight/python-ftfy