UNIX® has its origins in text processing, and its command-line environment still contains some of the most powerful text-processing tools available. By combining a series of simple commands to carry out complex text transformations, UNIX lets you build almost any text-processing engine you need.
Introduction
In the early days of UNIX®, few people were familiar with the new operating system, but it soon found the right niche: university researchers needed a decent text-processing environment. Because processing speed and memory capacity were limited at the time, programs had to be small and relatively simple. This gave rise to the famous UNIX design philosophy: a set of small tools working together to accomplish a task. Several small but powerful text-processing utilities can be connected through UNIX pipelines to convert and manipulate text in a wide variety of ways.
In this article, you'll learn about getting text from files and programs, making simple transformations with the tr command, and using the sed command for more complex search-and-replace operations. You will then repeat these tasks with the Perl programming language, to see that Perl is powerful enough to take the place of both tr and sed.
Before you start
If you want to experiment with the examples in this article, make sure that you have access to a UNIX command-line environment. This might be a terminal emulator on your local computer (usually just called a terminal on a modern desktop; if you normally use Windows®, you can use Cygwin), or a remote system accessed over SSH.
The shell syntax used in this article is that of GNU Bash. If you use a different shell, refer to its manual (or consider switching to Bash) for the equivalent syntax.
Perform various actions on text
Before you can use the various UNIX utilities to manipulate text, you need to know how to get text into them, and before you can do that, you need to understand UNIX's standard input/output (I/O) streams.
The standard C library (and therefore every UNIX program) defines three standard streams: input, output, and error. They are usually referred to as stdin, stdout, and stderr, the names of the global variables that represent them in C programs.
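A quick way to see that stdout and stderr are separate streams is to run a command that produces both normal output and an error message. The directory names below are only illustrative; any existing and non-existent path will do.

    # The listing of /etc goes to stdout; the complaint about the
    # missing directory goes to stderr. By default, both streams
    # appear on the terminal.
    ls /etc /no-such-directory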
When you redirect a program's output to a file with the shell's > operator, you send its standard output (stdout) stream to that file. For example, ls > this-dir sends the output of ls to a file named this-dir.
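As a minimal sketch of what this looks like in practice (the file name this-dir is just the one used in the example above):

    # Send the directory listing to a file instead of the terminal.
    ls > this-dir

    # Nothing was printed by ls; its output is now in the file.
    cat this-dir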
When you redirect a program's input from a file with the shell's < operator, the contents of that file become the program's standard input (stdin) stream. For example, sort < this-dir reads the contents of the file named this-dir and uses them as input to the sort command.
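The two redirection operators can also be combined on a single command line. The sketch below assumes the this-dir file created in the previous example; sorted-dir is an illustrative output file name.

    # Read the listing from this-dir and write the sorted result
    # to a new file named sorted-dir.
    sort < this-dir > sorted-dir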
Another commonly used operator for redirecting the standard streams is the | (pipe) operator, which connects the standard output of the program on its left to the standard input of the program on its right. For example, ls | sort performs the same task as the previous two examples combined, feeding the output of ls directly into the sort command without the need for a temporary file.
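Pipes can be chained, so longer pipelines are just as easy to build. The commands below are standard utilities; the exact output depends on the contents of your own directory.

    # Equivalent to the two earlier redirection examples,
    # but with no temporary file.
    ls | sort

    # Pipelines can be extended: count how many entries ls produced.
    ls | sort | wc -l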
You may have noticed that the standard error (stderr) stream does not appear in the preceding examples. Like standard output, stderr can be redirected or piped, but you need to tell the shell that you want to operate on stderr rather than stdout.
You can use the 2> operator to redirect the standard error stream to a file. You will often see this operator used with commands that generate useful error output, such as the make tool used to compile UNIX programs: make 2> build-errors.
This command runs make and sends any error messages to the build-errors file. Similarly, you can send stderr through a pipe to another program by first merging it into stdout with 2>&1 and then using the pipe operator.
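A short sketch of this in Bash, assuming a project that builds with make:

    # Send both normal output and error messages from make
    # through the pipe into less for paging.
    make 2>&1 | less

    # Bash also offers |& as a shorthand for 2>&1 |.
    make |& less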