Nine powerful command line tools for Linux data analysis

Where can we start with data analysis?

For most people who are used to a graphical working environment, spreadsheet tools are undoubtedly the first choice. But command-line tools can often solve the problem faster and more efficiently, and you only need to learn a little to get started.

Most of these tools ship with Linux, and most of them also run on Unix or even Windows. In today's article, we will try several simple open-source data analysis tools and learn how they work together.



I. head and tail

First, let's start with file inspection. What is in the file? What is its format? You can use the cat command to display a file on the terminal, but that is obviously not suitable for files with long content.

Enter head and tail, which display a specified number of lines from the beginning or end of a file. If no number of lines is specified, 10 lines are displayed by default.

$ tail -n 3 jan2017articles.csv

02 Jan 2017,Article,Scott Nesbitt,3 tips for effectively using wikis for documentation,1,/article/17/1/tips-using-wiki-documentation,"Documentation, Wiki",710

02 Jan 2017,Article,Jen Wike Huger,The Opensource.com preview for January,0,/article/17/1/editorial-preview-January,,358

02 Jan 2017,Poll,Jason Baker,What is your open source New Year's resolution?,1,/poll/17/1/what-your-open-source-new-years-resolution,,186

In the last three lines, I can see the date, author name, title, and other information. However, because there are no column headers, I don't know what each column means. Let's look at the headers for each column:

$ head -n 1 jan2017articles.csv

Post date,Content type,Author,Title,Comment count,Path,Tags,Word count

Now everything is clear: we can see the publication date, content type, author, title, comment count, relative URL, article tags, and word count.

II. wc

But what should we do if we need to analyze hundreds or even thousands of articles? Here the wc command comes in - the abbreviation of "word count". wc can count the number of bytes, characters, words, or lines in a file. In this example, we want to know the number of lines in the file.

$ wc -l jan2017articles.csv

93 jan2017articles.csv

This file has 93 lines. Given that the first line contains the column headers, we can infer that it is a list of 92 articles.
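Counting lines is only one of wc's modes. As a quick sketch (these are standard wc flags, applied to the same file purely for illustration):

$ wc -w jan2017articles.csv # count words instead of lines

$ wc -c jan2017articles.csv # count bytes

$ wc -m jan2017articles.csv # count characters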

III. grep

Here is a new question: how many articles are related to security topics? To answer it, we assume that the relevant articles mention the word "security" in the title, tags, or elsewhere. In that case, the grep tool can search a file for a specific string or other search pattern. This is an extremely powerful tool, because we can even use regular expressions to build very precise matching patterns. But here, we only need to find a simple string.

$ Grep-I "security" January articles.csv

30 Jan 2017,Article,Tiberius Hefflin,4 ways to improve your security online right now,3,/article/17/1/4-ways-improve-your-online-security,Security and encryption,1242

28 Jan 2017,Article,Subhashish Panigrahi,How communities in India support privacy and software freedom,0,/article/17/1/how-communities-india-support-privacy-software-freedom,Security and encryption,453

27 Jan 2017,Article,Alan Smithee,Data Privacy Day 2017: Solutions for everyday privacy,5,/article/17/1/every-day-privacy,"Big data, Security and encryption",1424

04 Jan 2017,Article,Daniel J Walsh,50 ways to avoid getting hacked in 2017,14,/article/17/1/yearbook-50-ways-avoid-getting-hacked,"Yearbook, 2016 Open Source Yearbook, Security and encryption, Containers, Docker, Linux",2143

The format we use is grep, plus the -i flag (telling grep to ignore case), plus the pattern we want to search for, and finally the location of the target file. We find four security-related articles. If we only need the count rather than the matching lines themselves, we can use a pipe to combine grep with the wc command and see how many lines mention security.

$ Grep-I "security" January articles.csv | wc-l4

In this way, wc takes the output of the grep command and uses it as its input. Obviously, combining this with a shell script immediately gives you a powerful data analysis tool, as sketched below.
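Here is a minimal sketch of such a script; the script name count-topic.sh and its two arguments are made up for illustration, not part of the original example:

#!/bin/sh
# count-topic.sh: count how many lines of FILE mention PATTERN, ignoring case
# Usage: ./count-topic.sh PATTERN FILE
grep -i "$1" "$2" | wc -l

Running ./count-topic.sh security jan2017articles.csv would reproduce the count of 4 shown above.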

IV. tr

In most analysis scenarios, we work with CSV files - but what if we need to convert the data to a different format for another application? The tr command can help you here: it translates one set of characters into another. As with the earlier examples, you can use a pipe to connect its input and output.

Next, let's try a slightly more involved example: creating a TSV (tab-separated values) file that contains only the articles published on January 20.

$ Grep "20Jan2017" January articles.csv | tr ', ''/t'> jan20only. tsv

First, we use grep to select the rows for that date. We pipe the result to tr, which replaces every comma with a tab ('\t'). But where does the result go? Here the ">" character redirects the output into a new file instead of printing it on the screen. After this, the jan20only.tsv file contains the data we expect.

$ cat jan20only.tsv

20 Jan 2017	Article	Kushal Das	5 ways to expand your project's contributor base	2	/article/17/1/expand-project-contributor-base	Getting started	690

20 Jan 2017	Article	D Ruth Bavousett	How to write web apps in R with Shiny	2	/article/17/1/writing-new-web-apps-shiny	Web development	218

20 Jan 2017	Article	Jason Baker	"Top 5: ..."	0	/article/17/1/top-5-january-20	Top 5	214

20 Jan 2017	Article	Tracy Miranda	How is your community promoting diversity?	1	/article/17/1/take-action-diversity-tech	Diversity and inclusion	1007
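tr is not limited to swapping commas for tabs; it translates or deletes any set of characters. A few generic sketches (these commands are illustrations, not part of the original walkthrough):

$ tr '\t' ',' < jan20only.tsv > jan20only.csv # convert the TSV back to CSV

$ tr -d '"' < jan2017articles.csv # delete every double quote

$ tr '[:lower:]' '[:upper:]' < jan20only.tsv # upper-case the whole file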

V. sort

What should we do if we need to find the rows with the largest values in a particular column? For example, if we want to know which article in the January 20 list is the longest, we can use the sort command to sort on the word-count column. In this case we don't need an intermediate file; we could keep using pipes, but splitting a long command chain into shorter steps often simplifies the whole process.

$ sort -nr -t$'\t' -k8 jan20only.tsv | head -n 1

20 Jan 2017	Article	Tracy Miranda	How is your community promoting diversity?	1	/article/17/1/take-action-diversity-tech	Diversity and inclusion	1007

That is a long command, so let's break it down. First, we use sort to do the sorting. The -nr options tell sort to sort numerically and in reverse order (from largest to smallest). -t$'\t' tells sort that the delimiter is the tab character ('\t'); the $ asks the shell to interpret the quoted string, turning '\t' into a real tab instead of two literal characters. The -k8 part tells sort to use column 8, which is the word-count column in this example.

Finally, the output is piped to head, which shows only the first line of the sorted result: the row for the article with the largest word count.
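If you want the whole ranked list rather than just the top entry, you can redirect the sorted output into a new file instead of piping it to head (the file name below is just an example):

$ sort -nr -t$'\t' -k8 jan20only.tsv > jan20-by-wordcount.tsv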

VI. sed

You may also need to select specific lines in a file. sed can be used for that. If you want to merge multiple files that all contain header rows, and keep only one set of headers in the combined file, you need to strip the extra ones; if you want to extract only a specific range of lines, sed also works. In addition, sed handles batch search-and-replace tasks well.

Next, we will create a new file without a header row, based on the previous article list, so that it can be merged with other files (for example, if we generate such a file every month and need to combine the months).

$ sed '1 d' jan2017articles.csv > jan17no_headers.csv

The "1 d" option requires sed to delete the first line.

VII. cut

Now that we know how to delete rows, how do we delete columns? Or select just one column? Next, let's build a list of authors from the file we just generated.

$ cut -d',' -f3 jan17no_headers.csv > authors.txt

Here, cut uses -d',' to split each line on commas, -f3 selects the third column (the author), and the result is sent to a new file named authors.txt.
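cut can also pull several columns at once, for example the date and author columns together (the output file name below is just an example). Note that cut splits on every comma, so quoted fields that themselves contain commas, like the Tags column, can shift the later column numbers:

$ cut -d',' -f1,3 jan17no_headers.csv > dates-authors.csv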

VIII. uniq

The author list is done, but how do we know how many different authors it contains, and how many articles each author wrote? This is where uniq comes in. Next we sort the file with sort, collapse the duplicates with uniq while counting how many articles each author has, and write the result to a new file.

$ sort authors.txt | uniq -c > authors-sorted.txt

Now we can see the number of articles for each author. Let's check the last three lines to make sure the result looks right.

$ tail -n 3 authors-sorted.txt

1 Tracy Miranda

1 Veer Muchandi

3 VM (Vicky) Brasseur
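Because uniq -c puts the count at the front of each line, one more sort pass ranks the authors by how many articles they wrote. A quick sketch:

$ sort -nr authors-sorted.txt | head -n 3 # the three most prolific authors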

IX. awk

Finally, let's take a look at the last tool, awk. awk is excellent for replacing text, but it can do far more than that. Next, let's return to the TSV file of January 20 articles and use awk to create a new list showing the author of each article along with the number of words each author wrote.

$ Awk-F "/t" '{print $3 "" $ NF} 'jan20only. tsv

Kushal Das 690

D Ruth Bavousett 218

Jason Baker 214

Tracy Miranda 1007

The-F "/t" indicates that awk currently processes data separated by tabs. In braces, we provide awk with code for execution. $3 indicates to output the third row, while $ NF indicates to output the last row (abbreviated as 'number of field ), then, add two spaces between the two results to define the division.

The examples listed here are small and would not strictly require these tools, but expand the scope to a file with 93,000 rows and a spreadsheet program becomes very difficult to work with.

Using these simple tools and small scripts, you can avoid reaching for database tools and easily produce a large amount of statistics from your data. Whether you are a professional or an amateur, their usefulness should not be ignored. The sketch below pulls several of them together.
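As a closing sketch, the steps above chain together into one small script. It only reuses commands already shown in this article; the script name security-report.sh is hypothetical:

#!/bin/sh
# security-report.sh: a rough summary report built from the tools in this article
# Usage: ./security-report.sh jan2017articles.csv
file="$1"
# number of articles = number of lines minus the header row
echo "Articles: $(($(wc -l < "$file") - 1))"
# how many lines mention security, case-insensitively
echo "Security-related lines: $(grep -ic 'security' "$file")"
# the three most prolific authors (column 3), skipping the header row
sed '1 d' "$file" | cut -d',' -f3 | sort | uniq -c | sort -nr | head -n 3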
