Use the wc command to count a file's lines, words, and characters. Use sort to sort and deduplicate; combined with uniq, it can produce word-frequency statistics.
cat file.txt
sort hello.c | uniq -c | sort -nr | head -5
Use the cat command to view the file's format and content. Sort the file first, then use uniq -c to count the distinct words and the number of occurrences of each; every output line is the count followed by the word. Then use sort -nr to sort by count in descending order, and finally use head -5 to display the first 5 lines of the result.
Similar to the SQL statement:
select word, count(1) cnt from file group by word order by cnt desc limit 5;
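As a rough illustration of the pipeline described above, the following sketch runs it on a tiny sample. The sample words and the variable names (words_file, top_words) are made up for the example, and it assumes the input has one word per line, since uniq compares whole lines.

```shell
# Build a small sample file with one word per line (hypothetical data).
words_file=$(mktemp)
printf '%s\n' apple pear apple fig pear apple > "$words_file"

# Sort so identical words are adjacent, count each run with uniq -c,
# sort numerically in descending order, and keep the top 2.
# The final awk only normalizes the leading padding that uniq -c adds.
top_words=$(sort "$words_file" | uniq -c | sort -nr | head -2 | awk '{print $1, $2}')
printf '%s\n' "$top_words"

rm -f "$words_file"
```

For running text rather than a word list, something like tr -s '[:space:]' '\n' < file could first put each word on its own line.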
Exploratory analysis
Common commands:
gzip/tar: Compression/decompression
cat/zcat: View files
less/more: Page through files; gzip-compressed files can be viewed directly
head/tail: View the first/last 10 lines of a file
wc: Count lines, words, and characters
du -h -c -s: View disk space usage
awk: Command-line tool for database-style operations
join/cut/paste: Join files / extract fields / merge files
fgrep/grep/egrep: Global regular expression search
find: Find files and run tasks in bulk on the results
sed: Stream editor for batch modification and replacement in files
split: Split a file by number of lines or number of bytes
rename: Batch rename (a Perl script on Ubuntu; must be installed on other systems); use -n to test a command first
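To make a few of the entries above concrete, here is a minimal sketch that combines join, cut, and awk. The file names (users.txt, scores.txt) and their contents are invented for the example, and both files are written already sorted on the join field, which join expects.

```shell
# Two small sample files keyed by a numeric id (hypothetical data),
# both already sorted on the first field.
tmpdir=$(mktemp -d)
printf '1 alice\n2 bob\n3 carol\n' > "$tmpdir/users.txt"
printf '1 10\n3 30\n' > "$tmpdir/scores.txt"

# join: pair up lines whose first field matches in both files.
joined=$(join "$tmpdir/users.txt" "$tmpdir/scores.txt")

# cut: keep only the second space-delimited field (the name).
names=$(printf '%s\n' "$joined" | cut -d ' ' -f 2)

# awk: sum the third field, much like SELECT sum(score) in SQL.
total=$(printf '%s\n' "$joined" | awk '{sum += $3} END {print sum}')

printf '%s\n' "$joined" "$names" "$total"
rm -rf "$tmpdir"
```

Only ids present in both files survive the join, so bob (id 2) is dropped here.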
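gzip -d a.gz # decompress the log
tar zcvf one.tar.gz one # create a gzip-compressed archive
tar jcvf one.tar.bz2 one # create a bzip2-compressed archive
less a.gz # view the compressed log directly, no need to decompress first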
Several commands that start with z can do simple processing of gzip-compressed files: zcat prints a compressed file directly, and zgrep/zfgrep/zegrep search directly inside compressed files.
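The z-prefixed commands can be tried on a throwaway file, as in the sketch below. The log content and the names access.log, line_count, and post_lines are made up for the example, and it assumes the zcat and zgrep that ship with GNU gzip (on some systems gzip -dc is the safer way to spell zcat).

```shell
# Create a small gzip-compressed log (hypothetical content).
tmpdir=$(mktemp -d)
printf 'GET /index.html\nGET /logo.png\nPOST /login\n' > "$tmpdir/access.log"
gzip "$tmpdir/access.log"   # produces access.log.gz and removes the original

# zcat prints the decompressed content without writing a file.
line_count=$(zcat "$tmpdir/access.log.gz" | wc -l | tr -d ' ')

# zgrep searches inside the compressed file directly.
post_lines=$(zgrep 'POST' "$tmpdir/access.log.gz")

echo "$line_count"
echo "$post_lines"
rm -rf "$tmpdir"
```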
# Search for a string and show the 3 lines before and the 3 lines after each matching line
fgrep 'Yunjie-talk' -A 3 -B 3 log.txt
# Search all log files in the current directory (and subdirectories) for the string "hacked by":
find . -name "*.log" | xargs fgrep "hacked by"
Some differences between fgrep, grep, and egrep: fgrep matches the string literally and treats regular-expression metacharacters as ordinary characters. For example, fgrep "1.2.3.4" matches only the IP address 1.2.3.4; the . does not match an arbitrary character. fgrep can be considerably faster than grep. grep uses only basic regular expressions, while egrep or grep -E uses extended regular expressions.
egrep "one|two" # matches one or two
grep -E -v "\.jpg|\.png|\.gif|\.css|\.js" log.txt | wc -l # count lines that do not mention these file extensions
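The difference between literal and regular-expression matching is easy to check on two sample lines. The sketch below uses grep -F and grep -E, which are the option forms of fgrep and egrep; the sample strings and variable names are invented for the example.

```shell
sample=$(printf '1.2.3.4\n1a2b3c4\n')

# Fixed-string match: the dots are literal, so only the real IP matches.
fixed=$(printf '%s\n' "$sample" | grep -F '1.2.3.4')

# Basic regex: each dot matches any single character, so both lines match.
regex=$(printf '%s\n' "$sample" | grep '1.2.3.4')

# Extended regex: | means alternation without needing a backslash.
either=$(printf 'one\ntwo\nthree\n' | grep -E 'one|two')

printf '%s\n' "$fixed" "$regex" "$either"
```

Recent GNU grep releases print an obsolescence warning for the fgrep and egrep spellings, so the option forms may be the safer habit in scripts.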
To find all requests coming from Japanese IP addresses, first extract all the source IPs, deduplicate them, pick out the Japanese ones, and save them to the file japan.ip. Then use the command:
cat log.gz | gzip -d | fgrep -f japan.ip > japan.log
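The whole workflow might look roughly like the sketch below. The log layout (source IP as the first field), the sample addresses, and the file all.ip are assumptions made for the example. The geolocation step is only a stand-in: a real run would look each IP up in a geolocation database, which is not shown here.

```shell
tmpdir=$(mktemp -d)
# Hypothetical access log: the source IP is the first field.
printf '%s\n' \
  '1.2.3.4 GET /a' \
  '5.6.7.8 GET /b' \
  '1.2.3.4 GET /c' \
  '9.9.9.9 GET /d' | gzip > "$tmpdir/log.gz"

# Step 1: extract the source IPs and deduplicate them.
gzip -dc "$tmpdir/log.gz" | awk '{print $1}' | sort -u > "$tmpdir/all.ip"

# Step 2 (stand-in): the list of Japanese IPs is written by hand here.
printf '1.2.3.4\n' > "$tmpdir/japan.ip"

# Step 3: keep only the log lines containing one of the listed IPs.
gzip -dc "$tmpdir/log.gz" | grep -F -f "$tmpdir/japan.ip" > "$tmpdir/japan.log"

unique_ips=$(wc -l < "$tmpdir/all.ip" | tr -d ' ')
japan_hits=$(wc -l < "$tmpdir/japan.log" | tr -d ' ')
echo "$unique_ips unique IPs, $japan_hits matching lines"
rm -rf "$tmpdir"
```

Fixed-string matching is still substring matching, so 1.2.3.4 would also match a line containing 11.2.3.4; adding -w reduces such false positives.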
For files exported from Hive, replace the \001 field delimiter:
cat 0000* | sed 's/\x01//g' > log.txt
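As a portable alternative to the sed form, the same kind of delimiter substitution can be sketched with tr, which accepts octal escapes. This swaps in tr rather than reproducing the original command, and the sample row, the choice of a tab as the replacement, and the variable names are assumptions for the example.

```shell
# A hypothetical Hive-style row: three fields separated by \001 (Ctrl-A),
# converted to tab-separated fields with tr.
converted=$(printf 'alice\001tokyo\001admin\n' | tr '\001' '\t')

# With tabs as the delimiter, cut can address the fields with its defaults.
second=$(printf '%s\n' "$converted" | cut -f 2)

printf '%s\n' "$converted" "$second"
```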
Other common commands
date: Command-line date and time manipulation
sort/uniq: Sorting, deduplication, and counting
comm: Compare two sorted files line by line (common lines, lines only in the left file, lines only in the right file)
diff: Show the differences between files line by line; pairs well with cdiff, which gives a display similar to GitHub's
curl/w3m/httpie: Make network requests from the command line
iconv: File encoding conversion, for example: iconv -f GB2312 -t UTF-8 1.csv > 2.csv
seq: Generate a sequence of numbers, typically used with a for loop
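Two of these, comm and seq, can be illustrated with a short sketch. The file names left.txt and right.txt and their contents are invented for the example; both files are written in sorted order, which comm requires.

```shell
tmpdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmpdir/left.txt"
printf 'b\nc\nd\n' > "$tmpdir/right.txt"

# comm prints three columns: left-only, right-only, and common lines.
# -12 suppresses columns 1 and 2, leaving only the common lines.
common=$(comm -12 "$tmpdir/left.txt" "$tmpdir/right.txt")
# -23 suppresses columns 2 and 3, leaving lines found only in the left file.
left_only=$(comm -23 "$tmpdir/left.txt" "$tmpdir/right.txt")

# seq generates the numbers 1 to 5 for the loop to iterate over.
total=0
for i in $(seq 1 5); do
  total=$((total + i))
done

printf '%s\n' "$common" "$left_only" "$total"
rm -rf "$tmpdir"
```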
Copyright© Wu Hua Jin