Hadoop Engineer Interview: Using Linux Commands for Text Processing

Source: Internet
Author: User

A directory contains two files, A.txt and B.txt. Each line has the format (ip, username), with the two columns separated by a space.
For example:
A.txt

127.0.0.1 Zhangsan
127.0.0.1 Wangxiaoer
127.0.0.2 Lisi
127.0.0.3 Wangwu

B.txt

127.0.0.4 Lixiaolu
127.0.0.1 Lisi

Each file has at least 1 million lines. Use Linux commands to do the following:

1) Count the number of distinct IPs in each file

2) Find the IPs that appear in B.txt but do not appear in A.txt

3) Count the number of times each user appears and the number of IPs associated with each user

Answer: 1)
cat A.txt | awk -F " " '{print $1}' | sort -u > Ipa.txt
wc -l Ipa.txt

cat B.txt | awk -F " " '{print $1}' | sort -u > Ipb.txt
wc -l Ipb.txt

2) diff Ipa.txt Ipb.txt | grep \> | awk -F " " '{print $2}'

3) cat A.txt B.txt | awk -F " " '{u[$1 " " $2]+=1} END{for (i in u) print i,u[i]}' \
| awk -F " " '{u[$2]+=$3;ip[$2]+=1} END{for (i in u) print i,u[i],ip[i]}'
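
As a quick sanity check, the three answers can be run against the six sample lines from the question. This is only a sketch: the scratch directory and the variable names (workdir, count_a, count_b, only_b, user_stats) are assumptions made for illustration, and the final sort is added only so that the per-user lines come out in a stable order.

```shell
# Recreate the sample files from the question in a scratch directory (assumed location).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '127.0.0.1 Zhangsan' '127.0.0.1 Wangxiaoer' '127.0.0.2 Lisi' '127.0.0.3 Wangwu' > A.txt
printf '%s\n' '127.0.0.4 Lixiaolu' '127.0.0.1 Lisi' > B.txt

# 1) Distinct IPs per file: keep column 1, sort, drop duplicates, count lines.
awk -F " " '{print $1}' A.txt | sort -u > Ipa.txt
awk -F " " '{print $1}' B.txt | sort -u > Ipb.txt
count_a=$(wc -l < Ipa.txt | tr -d ' ')
count_b=$(wc -l < Ipb.txt | tr -d ' ')

# 2) IPs in B.txt but not in A.txt: lines that diff marks with ">".
only_b=$(diff Ipa.txt Ipb.txt | grep \> | awk -F " " '{print $2}')

# 3) First awk: count each (ip, user) pair. Second awk: per user, sum the
#    pair counts (occurrences) and count the pairs (distinct IPs).
user_stats=$(cat A.txt B.txt \
  | awk -F " " '{pair[$1 " " $2] += 1} END {for (k in pair) print k, pair[k]}' \
  | awk -F " " '{n[$2] += $3; ip[$2] += 1} END {for (u in n) print u, n[u], ip[u]}' \
  | sort)

echo "$count_a $count_b"
echo "$only_b"
echo "$user_stats"
```

On the sample data this prints 3 2, then 127.0.0.4, then one line per user; Lisi appears twice with two different IPs, so the first user line is Lisi 2 2.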

Introduction to Linux commands:
1. awk
awk -F " " 'BEGIN{} {} END{}'
For example: cat A.txt | awk -F " " '{u[$1]+=1} END{for (i in u) print i,u[i]}'
Notes: 1) -F " ": -F sets the field separator; the separator follows it, enclosed in double quotes. If no separator is specified, fields are split on whitespace by default.
2) The program in single quotes, 'BEGIN{} {} END{}', can be divided into three parts:
The BEGIN block is executed once, before any input is read;
The statements in {} are executed once for each line of the file;
The END block is executed once, after all lines have been processed.
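
The three parts can be seen working together in a small sketch. The input lines are taken from the sample A.txt; the variable names (summary, lines, seen, n) are assumptions chosen for illustration.

```shell
# Count total lines and distinct IPs in one awk program.
summary=$(printf '%s\n' '127.0.0.1 Zhangsan' '127.0.0.1 Wangxiaoer' '127.0.0.2 Lisi' \
  | awk -F " " '
      BEGIN { lines = 0 }                                    # runs once, before any input is read
      { lines += 1; seen[$1] = 1 }                           # runs once per input line
      END { n = 0; for (i in seen) n += 1; print lines, n }  # runs once, after the last line
    ')
echo "$summary"
```

Here the output is 3 2: three input lines and two distinct IPs.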

2. grep
grep str file: finds all lines in file that contain the string str and prints those lines
grep -r update /etc/acpi: searches all files under the directory /etc/acpi and its subdirectories (if any) for the string "update", and prints each line that contains it
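
A minimal sketch of both forms, using a throwaway directory instead of /etc/acpi so that it can be run anywhere. The directory layout and file names (a.conf, sub/b.conf) are made up for the example.

```shell
# Build a small directory tree to search (hypothetical files).
dir=$(mktemp -d)
mkdir -p "$dir/sub"
printf '%s\n' 'first line' 'please update me' > "$dir/a.conf"
printf '%s\n' 'update again' 'nothing here' > "$dir/sub/b.conf"

# grep str file: print the matching lines of a single file.
single=$(grep update "$dir/a.conf")

# grep -r str dir: search every file under dir; each hit is prefixed with its file name.
recursive=$(grep -r update "$dir" | sort)

echo "$single"
echo "$recursive"
```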

3. The vertical bar '|' is the pipe: the output of the preceding command becomes the input of the following command

4. sort
sort file: sorts the file contents in ascending order (without removing duplicates); the source file is not modified
sort -r file: sorts the file contents in descending order (without removing duplicates); the source file is not modified
sort -u file > NewFile: sorts the file contents and removes duplicates; the source file is not modified, and the result is written to NewFile
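
The three variants can be compared on a small input. This is a sketch only: the temporary file and the variable names are assumptions, and paste is used just to put each result on one line for easier reading.

```shell
# A small unsorted input with one duplicate.
f=$(mktemp)
printf '%s\n' '127.0.0.3' '127.0.0.1' '127.0.0.3' '127.0.0.2' > "$f"

asc=$(sort "$f" | paste -sd ' ' -)       # ascending, duplicates kept
desc=$(sort -r "$f" | paste -sd ' ' -)   # descending, duplicates kept
dedup=$(sort -u "$f" | paste -sd ' ' -)  # ascending, duplicates removed

echo "$asc"
echo "$desc"
echo "$dedup"
```

Note that sort compares lines as text unless told otherwise, which is sufficient here because the goal is only to group and deduplicate identical IPs.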

5. '>'
'>' file: redirects the output to file
To use '>' as a literal character rather than as a redirection, escape it as \>.
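
The difference between the two uses can be sketched as follows; the temporary file and the variable names are assumptions for illustration.

```shell
out=$(mktemp)

# Unescaped >: the shell redirects printf's output into the file.
printf '%s\n' '< 127.0.0.2' '> 127.0.0.4' > "$out"

# Escaped \>: the shell removes the backslash and passes ">" to grep as the pattern.
matched=$(grep \> "$out")

echo "$matched"
```

Only the second line of the file contains a '>' character, so grep prints > 127.0.0.4.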

6. wc
wc -l (--lines) file: counts the number of lines in the file
wc -c (--bytes) file: counts the number of bytes in the file (to count characters instead, use wc -m (--chars))
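
A short sketch of both counts on a three-line file. The file contents and variable names are assumptions; reading the file through < makes wc print only the number, and tr strips the padding that some wc implementations add.

```shell
f=$(mktemp)
printf 'one\ntwo\nthree\n' > "$f"

line_count=$(wc -l < "$f" | tr -d ' ')   # number of newline-terminated lines
byte_count=$(wc -c < "$f" | tr -d ' ')   # number of bytes, newlines included

echo "$line_count $byte_count"
```

The file holds 3 lines and 14 bytes (4 + 4 + 6, counting each newline).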

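7. diff
diff fileA fileB
Note: 1) if fileA and fileB are both sorted, matching lines can be paired up even when they sit at different line positions; 2) if the files are not sorted, lines are compared strictly in the order in which they appear.
Suppose Ipa.txt contains the following (sorted):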
127.0.0.1
127.0.0.2
127.0.0.3
Suppose Ipb.txt contains the following (sorted):
127.0.0.1
127.0.0.3
127.0.0.4
diff Ipa.txt Ipb.txt -y -W 50 (side-by-side output, with matching lines aligned)
127.0.0.1                 127.0.0.1
127.0.0.2               <
127.0.0.3                 127.0.0.3
                        > 127.0.0.4

Then use grep \> to extract the IPs that are in Ipb.txt but not in Ipa.txt
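
As an alternative to the diff approach described here (this is a different technique, not the one used in the answer), comm can compare two sorted files directly. The sketch below assumes both inputs are already sorted, which sort -u guarantees; the temporary files and variable names are made up for the example.

```shell
# Same contents as the Ipa.txt and Ipb.txt example, in temporary files.
a=$(mktemp)
b=$(mktemp)
printf '%s\n' '127.0.0.1' '127.0.0.2' '127.0.0.3' > "$a"
printf '%s\n' '127.0.0.1' '127.0.0.3' '127.0.0.4' > "$b"

# comm prints three columns: only in the first file, only in the second, in both.
# -1 hides the first column and -3 hides the third, leaving lines only in the second file.
only_in_b=$(comm -13 "$a" "$b")

echo "$only_in_b"
```

This prints 127.0.0.4, the same IP that the diff and grep \> combination extracts, without needing to parse the ">" markers.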
