Nine Command-Line Tools for Data Analysis in the Linux Environment


"51cto.com fast translation" to the data analysis, we will start from where?

For most people accustomed to a graphical working environment, a spreadsheet tool is undoubtedly the first choice. But command-line tools can often solve the same problems faster and more efficiently, at the cost of a little extra learning.

Few of these tools are strictly limited to Linux; most also run on other UNIX systems and even Windows. In today's article, we'll try out several simple open source data analysis tools and see how they work together.

First, head and tail

Let's start with file inspection. What's in the file? What's its format? You can use the cat command to display a file in the terminal, but that's clearly not suitable for a long file.

Enter head and tail, which display the first or last lines of a file, respectively. If you do not specify a line count, both display 10 lines by default.

$ tail -n 3 jan2017articles.csv
2017,article,Scott Nesbitt,3 tips for effectively using wikis for documentation,1,/article/17/1/tips-using-wiki-documentation,"documentation, wiki",710
2017,article,Jen Wike Huger,The Opensource.com preview for January,0,/article/17/1/editorial-preview-january,,358
2017,poll,Jason Baker,What is your open source New Year's resolution?,1,/poll/17/1/what-your-open-source-new-years-resolution,,186

In these last three lines, I can see a date, an author name, a title, and some other information. However, without column headers, I don't know what each column means. Let's look at the headers:

$ head -n 1 jan2017articles.csv
Post date,Content type,Author,Title,Comment count,Path,Tags,Word count

Now everything is clear: each row contains the publication date, content type, author, title, comment count, relative URL, article tags, and word count.

Second, wc

But what if you need to analyze hundreds or even thousands of articles? That's where the wc command comes in, an abbreviation of "word count". wc can count the number of bytes, characters, words, or lines in a file. In this example, we want to know the number of lines in the file.

$ wc -l jan2017articles.csv
93 jan2017articles.csv

This file has 93 lines; since the first line contains the header, we can infer that it is a list of 92 articles.

Third, grep

Here's a new question: how many of these articles are related to security? We'll assume the relevant articles mention the word "security" in the title, tags, or elsewhere. The grep tool can search a file for specific characters or other search patterns. It's an extremely powerful tool, because we can even use regular expressions to build extremely precise matching patterns. But here, we just need to find a simple string.

 
$ grep -i "security" jan2017articles.csv
30 Jan 2017,article,Tiberius Hefflin,4 ways to improve your security online right now,3,/article/17/1/4-ways-improve-your-online-security,security and encryption,1242
28 Jan 2017,article,Subhashish Panigrahi,How communities in India support privacy and software freedom,0,/article/17/1/how-communities-india-support-privacy-software-freedom,security and encryption,453
27 Jan 2017,article,Alan Smithee,Data Privacy Day 2017: solutions for everyday privacy,5,/article/17/1/every-day-privacy,"Big data, security and encryption",1424
04 Jan 2017,article,Daniel J Walsh,50 ways to avoid getting hacked in 2017,14,/article/17/1/yearbook-50-ways-avoid-getting-hacked,"Yearbook, 2016 open source yearbook, security and encryption, containers, docker, linux",2143

The format here is grep, followed by the -i flag (which tells grep to match case-insensitively), then the pattern we want to search for, and finally the target file to search. We find four security-related articles. If we only need the count, we can use a pipe to combine grep with the wc command and see how many lines match.

$ grep -i "security" jan2017articles.csv | wc -l
4

In this way, wc takes the output of the grep command as its input. Clearly, this kind of combination, plus a bit of shell scripting, turns the terminal into a powerful data analysis tool.
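Incidentally, grep can also count matching lines on its own with its -c flag, skipping the wc step entirely. A minimal sketch on a small made-up file (not the real article list):

```shell
# Build a tiny sample file (hypothetical data for illustration only)
printf 'one,security note\ntwo,general\nthree,Security fix\n' > sample.csv

# -c prints the count of matching lines; -i keeps the match case-insensitive
grep -ic "security" sample.csv
# prints: 2
```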

Fourth, tr

In most analysis scenarios we're faced with a CSV file, but how do we convert it to another format for a different application, for example HTML for presenting the data in a table? The tr command can help here: it translates one set of characters into another. Like the other tools, it can be combined with pipes for input/output chaining.

Next, let's try a more involved example: creating a TSV (tab-separated values) file containing only the articles published on January 20.

$ grep "2017" jan2017articles.csv | Tr ', ' \ t ' > JAN20ONLY.TSV

First, we use grep to perform the date query. We pipe the result to tr and use the latter to replace every comma with a tab (written as '\t'). But where does the result go? Here the > character redirects the output into a new file instead of onto the screen. We can then verify that jan20only.tsv contains the expected data.
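As a quick, self-contained illustration of just the translation step (using a made-up header line rather than the real file):

```shell
# tr reads stdin and writes stdout; here every comma becomes a tab
printf 'date,author,title\n' | tr ',' '\t'
# prints the three words separated by tabs instead of commas
```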

$ cat jan20only.tsv
20 Jan 2017	article	Kushal Das	5 ways to expand your project's contributor base	2	/article/17/1/expand-project-contributor-base	getting started	690
20 Jan 2017	article	D Ruth Bavousett	How to write web apps in R with Shiny	2	/article/17/1/writing-new-web-apps-shiny	web development	218
20 Jan 2017	article	Jason Baker	"Top 5: Shell scripting the Cinnamon Linux desktop environment and more"	0	/article/17/1/top-5-january-20	top 5	214
20 Jan 2017	article	Tracy Miranda	How is your community promoting diversity?	1	/article/17/1/take-action-diversity-tech	diversity and inclusion	1007

Fifth, sort

What if we want to find the row with the largest value in a particular column? Suppose we need to know which of the January 20 articles is the longest; we can use the sort command to sort our January 20 list by the word-count column. In this case we don't need an intermediate file, since we can keep using pipes. However, splitting long command chains into shorter steps often simplifies the whole process.

$ sort -nr -t$'\t' -k8 jan20only.tsv | head -n 1
20 Jan 2017	article	Tracy Miranda	How is your community promoting diversity?	1	/article/17/1/take-action-diversity-tech	diversity and inclusion	1007

That's a long command, so let's break it down. First, we use sort to sort by word count. The -nr option tells sort to perform a numeric sort and to reverse the results (largest first). The -t$'\t' part tells sort that the delimiter is the tab character ('\t'); the $ syntax tells the shell to process the quoted string and turn \t into an actual tab. The -k8 section tells sort to use the eighth column, which is the word-count column in our example.

Finally, the output is piped to head, which displays the row for the article with the highest word count in the file.
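The same sort options can be tried out on a tiny hypothetical TSV, where the word count sits in the second column rather than the eighth:

```shell
# Two-column TSV: author <TAB> word count (hypothetical data)
printf 'alice\t300\nbob\t1200\ncarol\t700\n' > words.tsv

# -n numeric sort, -r reverse (largest first), -t$'\t' tab delimiter, -k2 second column
sort -nr -t$'\t' -k2 words.tsv | head -n 1
# prints the bob/1200 row
```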

Sixth, sed

You may also want to select specific lines of a file. Here you can use sed. You can also use sed when merging multiple files that each contain a header and you want only one set of headers in the combined file, that is, when you need to delete the redundant ones, or when you want to extract only a specific range of lines. In addition, sed is very good at bulk find-and-replace tasks.

Below we take the earlier article list and create a new file without the header, ready to be merged with other files (for example, if we generate such a file every month and now need to combine the months).

$ sed '1 d' jan2017articles.csv > jan17no_headers.csv

The "1 d" option requires SED to delete the first line.

Seventh, cut

Now that we know how to delete a row, how do we delete a column? Or select only one column? Let's try to build a new author list from the list we generated earlier.
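The actual command is not preserved in this copy of the article, but a minimal sketch of the idea on a small hypothetical CSV looks like this (with the real data you would point cut at the article list file instead):

```shell
# Small hypothetical CSV with the author in the third column,
# mirroring the column order of the article list
printf '20 Jan 2017,article,Kushal Das,Title one\n20 Jan 2017,article,Tracy Miranda,Title two\n' > mini.csv

# -d',' sets the field delimiter; -f3 keeps only the third field (the author)
cut -d',' -f3 mini.csv
# prints:
# Kushal Das
# Tracy Miranda
```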
