Big Data Cleaning Tools

Source: Internet
Author: User
Keywords: big data, big data cleaning, big data cleaning tools
Before analyzing or visualizing data, you often need to "clean" it. What does this mean? Some entries in a list may read "New York City" while others read "New York, NY", and you have to standardize such variants before any patterns become visible. There may also be numeric entry errors, typos, and similar problems.
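To make the "New York" example concrete, here is a minimal Python sketch of standardizing variant spellings with a lookup table. The mapping, the function name standardize, and the sample list are all made up for illustration; a real data set would need its own table.

```python
# Hypothetical mapping from variant spellings to one canonical label.
CANONICAL = {
    "new york city": "New York City",
    "new york, ny": "New York City",
    "nyc": "New York City",
}

def standardize(value: str) -> str:
    """Return the canonical label for a known variant, else the trimmed input."""
    key = " ".join(value.lower().split())  # lowercase and collapse whitespace
    return CANONICAL.get(key, value.strip())

# Made-up sample entries with inconsistent spelling and spacing.
cities = ["New York City", "New York, NY", " new york,  NY ", "Boston"]
cleaned = [standardize(c) for c in cities]
print(cleaned)
```

Values that are not in the table pass through unchanged (apart from trimming), so nothing is silently rewritten.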

Many tools can clean data, but most of them are paid. For professionals the cost is worthwhile, but for amateurs who only use them from time to time it is hard to justify. The great thing about the two tools described below is that they are free!

DataWrangler

What it does: This web-based service, designed by Stanford University's visualization group, cleans and rearranges data so that its format suits applications such as spreadsheets.

Click on a row or a column and DataWrangler suggests modifications. For example, if you click on a blank line, suggestions such as "Delete Line" or "Delete Blank Line" pop up.

DataWrangler also keeps a history of your changes, which makes it easy to undo them.
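For readers who want to see what a "delete blank lines" step amounts to outside the tool, here is a small Python sketch. It is not DataWrangler's code; the function name and the sample rows are invented for illustration.

```python
def delete_blank_lines(lines):
    """Keep only the lines that contain something other than whitespace."""
    return [line for line in lines if line.strip()]

# Made-up sample input with one empty line and one whitespace-only line.
raw = ["state,value", "", "a,1", "   ", "b,2"]
kept = delete_blank_lines(raw)
print(kept)
```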

Advantages: Text editing is very simple. For example, when I selected "Alabama" in a row of sample data headed "Reported crime in Alabama" and then selected "Alaska" in another group of data, it suggested extracting the name of each state. Hover your mouse over a suggestion and you will see the affected line highlighted in red.
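The kind of extraction described here can be approximated by hand with a regular expression. The Python sketch below assumes that every group of data starts with a headline of the form "Reported crime in <state>"; the pattern, the function name, and the sample rows are assumptions for illustration, not what DataWrangler runs internally.

```python
import re

# Assumed headline pattern, based on the sample headline quoted in the text.
HEADER = re.compile(r"^Reported crime in (?P<state>.+)$")

def extract_state(line: str):
    """Return the state name if the line is a headline, otherwise None."""
    m = HEADER.match(line.strip())
    return m.group("state") if m else None

# Made-up rows: two headlines separated by placeholder data and a blank line.
rows = ["Reported crime in Alabama", "some,data", "", "Reported crime in Alaska"]
states = [s for s in map(extract_state, rows) if s is not None]
print(states)
```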

Disadvantages: I found that some unexpected changes occurred when I tried to explore DataWrangler's options, and I often had to click "Empty" to reset. In addition, some suggestions are useless (promoting a blank line to the header row seems a strange thing to suggest), and some are difficult to understand ("fold split 1 using 2 as key").

DataWrangler is a web-based service, which makes it very convenient to use. The price of that convenience is that your data must be uploaded to an external website, so DataWrangler is not a suitable choice for sensitive internal data. A separate desktop version is planned for the future, however. Also bear in mind that DataWrangler is currently alpha-stage code, and its creators say that it is still being improved.

Skill level: Advanced novice

Operating environment: any web browser

Google Refine

What it does: At first glance, with its rows of text and numbers, Google Refine looks like a spreadsheet. Like Excel, it can import and export data in multiple formats, such as tab- or comma-separated text files, Excel, XML, and JSON files.

Refine has built-in algorithms that find text entries that are spelled differently but really belong in the same group. After importing your data, select Edit cells -> Cluster and edit, and then choose the algorithm to use.

After Refine runs, you decide whether to accept or reject each suggestion. For example, you can agree to merge Microsoft and Microsoft Inc into one group but refuse to merge Coach Inc and CQG Inc. If it offers too few or too many suggestions, you can adjust how aggressively it matches.
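To get a feel for how such suggestions and their "strength" setting work, here is a rough Python sketch. It does not reproduce Refine's own clustering algorithms; it swaps in the similarity ratio from Python's standard difflib module, and the function name, thresholds, and sample names are assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def suggest_merges(values, threshold=0.8):
    """Return (a, b, score) for each pair of distinct values whose
    case-insensitive similarity ratio is at least the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(values)), 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    return pairs

names = ["Microsoft", "Microsoft Inc", "Coach Inc", "CQG Inc"]
strict = suggest_merges(names, threshold=0.8)  # fewer suggestions
loose = suggest_merges(names, threshold=0.6)   # more suggestions
print(strict)
print(loose)
```

With the stricter threshold only the Microsoft pair is proposed; lowering it also proposes merging Coach Inc with CQG Inc, which is exactly the kind of suggestion a human reviewer should reject.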

There are also options that give a quick and easy overview of how the data is distributed. This can reveal anomalies that may be caused by input errors (for example, a salary recorded as $800,000 instead of $80,000) or point out inconsistencies (for example, payroll records in which some entries are hourly wages, some are weekly pay, and some are annual salaries).
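The salary example can be mimicked with a very simple check: compare each value with the median and flag anything that is off by a large factor. The Python sketch below is only an illustration of that idea, not how Refine computes its overview; the factor of 5 and the sample salaries are made up.

```python
from statistics import median

def flag_outliers(values, factor=5.0):
    """Return values at least `factor` times above or below the median."""
    mid = median(values)
    return [v for v in values if v >= mid * factor or v <= mid / factor]

# Made-up salary records; one entry has an extra zero.
salaries = [78_000, 80_000, 82_000, 800_000, 79_000]
suspects = flag_outliers(salaries)
print(suspects)
```

A flagged value is only a candidate for review: a real $800,000 salary would be flagged too, so the final decision still belongs to a person.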

In addition to its data cleaning functions, Google Refine also provides some useful analysis tools, such as sorting and filtering.

Advantages: Once you are familiar with Refine's commands and functions, it is a data processing and analysis tool that is both powerful and easy to use. The undo/redo list of every operation lets you return to an earlier state at any time. Text editing supports Java regular expressions, which let you find patterns (for example, 3 digits followed by 2 digits) as well as specific strings or numeric values.
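As an illustration of the "3 digits followed by 2 digits" pattern, here is a short sketch using Python's re module rather than Java; the \d{3} and \b syntax used here is written the same way in both. The separator (a space or a hyphen between the two groups) and the sample cells are assumptions added for the example.

```python
import re

# Three digits, an assumed space-or-hyphen separator, then two digits,
# with word boundaries so that longer digit runs are not matched.
PATTERN = re.compile(r"\b(\d{3})[ -](\d{2})\b")

# Made-up cell values.
cells = ["ID 123-45", "ref 678 90", "no match here", "1234-56"]
matches = {cell: bool(PATTERN.search(cell)) for cell in cells}
print(matches)
```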

Finally, although Refine is a browser-based application, it works with files on your desktop, so your data stays local.

Disadvantages: Although Refine looks like a spreadsheet, you cannot use it for typical spreadsheet calculations; for those, you must export the data to a regular spreadsheet application. If your data set is large, set aside time to check Refine's suggestions carefully; how much time you need depends on the data set. When you merge text entries, you will most likely get some wrong suggestions, miss some problems, or both.

Skill level: Advanced novice

Operating environment: Windows, Mac OS, Linux
