Python Language and Standard Library (Chapter 7: Text Processing)

Source: Internet
Author: User

7.1 Use of text processing

Overall, the idea behind text processing is to find things in text. In some cases the data is organized in a structured way, in what is called a database. However, some data sources hold information that is not ordered or structured, such as a directory tree containing hundreds of files. Text processing is useful when you need to find data of this kind or process it in some way. It can also be used together with an RDBMS (relational database management system).

The two main tools in the field of text processing are directory navigation and a magical technique called regular expressions.

Directory navigation: This is an area in which different operating systems can cause a lot of trouble for a simple program, because the three main operating system families organize their directories in different ways and, trickiest of all, use different characters to separate subdirectories. Python addresses this issue by providing a set of cross-platform tools for directory and path operations.

Regular expressions: A regular expression is a way of specifying a simple text parser. The parser has low overhead even when processing many lines of text, which means that it is very fast.

First, look at some of the reasons you might need to write a text-processing script, and then experiment with the new knowledge.

The most common reasons for using regular expressions include:

Searching for files

Extracting useful data from program logs, such as web server logs

Searching email

7.1.1 Searching for files

When you are faced with a task that requires a lot of manual work to handle the data on your computer, consider using Python to write one or two scripts that can save a few hours of tedious work.

A related situation arises from today's large hard disks: many users scatter files across the disk without organizing them. When you face a disk full of files and need to extract some information that you know exists but cannot locate exactly, things get difficult. This is why many companies offer desktop search tools.

You can think of Python as an enhanced desktop search tool.

7.1.2 Log Clipping

Another common text-processing task in system administration is filtering logs to extract various kinds of information. A log-filtering script can be a one-off written to answer a specific question, such as "When was that email sent?" or "When did my program last record a specific message?" Or it can be a permanent part of a data processing system that handles ongoing tasks over time, for example as part of a system management and performance monitoring setup. A script that filters logs by rule to obtain a specific subset of the information is often called a log clipper. The idea is that, just as a polygon is clipped to fit the screen, the log is clipped to fit whatever view of the system you need.

7.1.3 Message Filtering

The last text-processing task you will probably find useful is processing mailbox files to find things that a normal inbox search cannot find.

Text processing combined with other techniques can be very useful here: you can pull data from an external source, such as a web page or a database, to cross-reference it against your mail, or carry out other searches that a regular mail client cannot perform, turning up information that would otherwise be hard to find.

7.2 Navigating the file system using the os module

For everyday tasks that must be performed on many different platforms, the os module and its submodule os.path are among the most useful tools.

One of the difficulties in writing cross-platform scripts is that Windows uses a backslash (\) to separate directory names, whereas Unix uses a forward slash (/). In addition, Python uses the backslash to introduce special characters in strings, which complicates scripts that build file paths on Windows.

However, Python's os.path module provides ready-made functions that let you split and join path names with the correct separator and that work on any operating system that runs Python. You can also walk a directory structure with a single function and call a function of your own choosing for each file found in the hierarchy. You will see many of these functions in the examples below, starting with some useful functions from the os and os.path modules.

List files and process paths

(1) Import the os and os.path modules in the Python interpreter:

>>> import os, os.path

(2) First, look at where in the file system you are running:

>>> os.getcwd()
'c:\\python31'

(3) If you want to work with this path in a program, you can split it into a tuple of path components:
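The listing for this step is missing from the text. A minimal sketch, assuming os.path.split is the intended call; the variable names are illustrative only:

```python
import os
import os.path

# Split the current working directory into a (head, tail) tuple:
# head is everything before the last separator, tail is the final component.
cwd = os.getcwd()
head, tail = os.path.split(cwd)
print(head, tail)

# Joining the two pieces with os.path.join rebuilds an equivalent path.
rebuilt = os.path.join(head, tail)
print(rebuilt)
```

Because both calls use the separator of the platform the script runs on, the same code should behave sensibly on Windows and on Unix-like systems.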

(4) To find information about the directory, or about any file of interest, use os.stat:

Note: The directory named "." is shorthand for the current directory.
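The listing for this step is also missing. The sketch below shows one plausible form of the call; which fields the original example printed is not known, so the choice of size and modification time is an assumption:

```python
import os
import stat
import time

# os.stat returns an os.stat_result, which can be read by attribute
# name or indexed like a tuple.
info = os.stat(".")
print(info.st_size)                  # size in bytes
print(time.ctime(info.st_mtime))     # last modification time, readable form
print(stat.S_ISDIR(info.st_mode))    # True, because "." is a directory

# The stat module names the tuple positions, e.g. stat.ST_SIZE.
print(info[stat.ST_SIZE] == info.st_size)
```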

(5) If you actually want to list the files in the directory, you can use the following code:
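The code for this step does not appear in the text. A minimal sketch, assuming os.listdir is the function meant; the "dir"/"file" labels are added here purely for illustration:

```python
import os
import os.path

# List the entries (files and subdirectories) of the current directory.
entries = sorted(os.listdir("."))
for name in entries:
    path = os.path.join(".", name)
    kind = "dir " if os.path.isdir(path) else "file"
    print(kind, name)
```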

Note two things: 1. You could combine listdir, split, and stat into a script that iterates over a directory tree, but there is no need to, because the os module provides the walk function for exactly this purpose. 2. The stat call is fairly opaque because it comes straight from the system call: it returns a tuple-like object that corresponds to the structure returned by the POSIX C library function of the same name.

Search for special types of files

If you have used other programming languages, you will appreciate how easy it is to search for files in Python. os.walk does all the heavy lifting of iterating over the file system; you simply write a small function that does something with the results it returns.

(1) Using your favorite text editor, create a script named scan_pdf.py in the directory that you want to scan for PDF files, and enter the following code:
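The script itself is missing at this point. The sketch below is a reconstruction modeled on the improved version that appears later in this section, so treat it as an approximation rather than the original listing. The helper name pdf_paths is hypothetical:

```python
import os
import os.path
import re

def pdf_paths(top):
    """Yield the path of every file under top whose name ends in .pdf."""
    for root, dirs, files in os.walk(top):
        for name in files:
            path = os.path.normcase(os.path.join(root, name))
            # Very simple regular expression: the path must end in ".pdf".
            if re.search(r"\.pdf$", path):
                yield path

if __name__ == "__main__":
    # Scan the directory the script is run from.
    for path in pdf_paths("."):
        print(path)
```

Returning the paths from a generator rather than printing inside the loop is a design choice made here so that the search can be reused or tested; a print-only version would work equally well for a quick script.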

(2) Run the script:

This is a nice little script. Python does all the work, and you get a list of the PDF files under the directory, with their locations and full names. It even handles locations that contain spaces, which can be awkward to deal with on Unix and Linux.

Note that the code uses a very simple regular expression to check the end of the file name. Alternatively, you could use os.path.splitext to get a tuple containing the file's base name and its extension, and compare the extension with ".pdf", which may be clearer.
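The os.path.splitext alternative can be sketched as follows. The helper name is_pdf is hypothetical, and the case-insensitive comparison is an added assumption:

```python
import os.path

def is_pdf(path):
    """Return True if the file extension of path is .pdf, ignoring case."""
    base, ext = os.path.splitext(path)
    return ext.lower() == ".pdf"

print(os.path.splitext("report.final.pdf"))   # ('report.final', '.pdf')
print(is_pdf("notes.txt"))                    # False
```

Only the last dot-separated part counts as the extension, so a name such as report.final.pdf is still recognized.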

The form r"<string constant>" simply tells Python to suppress any special handling of backslashes in the string constant. So "\n" is a string of length one containing a newline, whereas r"\n" is a string of length two: a backslash followed by the letter "n". Raw strings are used here because regular expressions tend to contain a lot of backslashes.
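A short check of the difference between the two forms; the sample file name is made up for illustration:

```python
import re

plain = "\n"    # one character: a newline
raw = r"\n"     # two characters: a backslash followed by "n"
print(len(plain), len(raw))   # 1 2

# In a pattern, the raw form passes the backslash through to the re module.
print(bool(re.search(r"\.pdf$", "manual.pdf")))   # True
```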

Improved search

You may also want to exclude all PDF files that have a space in the name. For example, suppose the files you are looking for were downloaded from the web and contain no spaces, whereas many of your other PDFs arrived as email attachments from other people's file systems and often do contain spaces. In that case, this refinement is appropriate.

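(1) Modify scan_pdf.py as follows:

import os, os.path
import re

def print_pdf(arg, dir, files):
    # arg is unused; dir is the directory that contains files
    for file in files:
        path = os.path.join(dir, file)
        path = os.path.normcase(path)
        # skip anything that does not end in .pdf
        if not re.search(r".*\.pdf$", path):
            continue
        # skip paths that contain a space
        if re.search(r" ", path):
            continue
        print(path)

for root, dirs, files in os.walk('.'):
    # assumed call: the listing breaks off after the for line in the source
    print_pdf(None, root, files)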

(2) Now run the modified script. Your output will differ, depending on your system.

This code uses a slightly different layout that works well for quick text-processing filter scripts. Look at the print_pdf function: it first builds and normalizes the path name, and then runs a series of tests on it to make sure that it is a path name you actually want.

7.3 Using regular expressions and the re module

A regular expression defines a simple parser that can match strings in text. Regular expressions work in essentially the same way as the wildcards you use to specify multiple files on the command line.

There are two major differences between regular expressions and simple wildcard characters:

A regular expression can match more than once, at any position in a long string.

Regular expressions are much more complex and much more expressive than wildcards.

Practice Regular Expressions

The examples use the filter function, which applies a one-parameter function to each member of its input list. Because re.match and re.search take two parameters, you have to wrap them in a function definition or an anonymous lambda form.

(1) Open the Python interpreter and import the re module:

$ python
>>> import re

(2) Now define a list of strings to filter with a variety of regular expressions:

>>> s = ('xxx', 'abcxxxabc', 'xyx', 'abc', 'x.x', 'axa', 'axxxa', 'axxya')

(3) The simplest regular expression is executed first:

>>> a = filter((lambda s: re.match(r"xxx", s)), s)
>>> print(*a)
xxx

(4) Why were "abcxxxabc" and "axxxa" not found? Because in Python, the re.match function only looks for a match at the start of its input. To find a match anywhere in the string, use re.search:
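The interpreter session for this step is missing. As a script, the same filter with re.search might look like this; the lambda parameter is named t here only to avoid confusion with the list s:

```python
import re

s = ('xxx', 'abcxxxabc', 'xyx', 'abc', 'x.x', 'axa', 'axxxa', 'axxya')

# re.search looks for the pattern anywhere in each string.
found = list(filter((lambda t: re.search(r"xxx", t)), s))
print(*found)   # xxx abcxxxabc axxxa
```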

(5) Now search for a pattern that contains a period:

(6) The following code shows how to match only the period (by escaping special characters):
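The code for these two steps is not shown. A sketch, assuming the patterns were x.x and x\.x (the exact patterns in the original are not known):

```python
import re

s = ('xxx', 'abcxxxabc', 'xyx', 'abc', 'x.x', 'axa', 'axxxa', 'axxya')

# An unescaped period matches any single character.
dot_any = list(filter((lambda t: re.search(r"x.x", t)), s))
print(*dot_any)       # xxx abcxxxabc xyx x.x axxxa

# An escaped period matches only a literal ".".
dot_literal = list(filter((lambda t: re.search(r"x\.x", t)), s))
print(*dot_literal)   # x.x
```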

(7) You can also use the asterisk:

(8) Why does "axxya" also match? Because * means zero or more repetitions of the preceding item. If you want to make sure that there is at least one character between the two x's, use a plus sign instead.
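The asterisk and plus examples are missing from the text. The sketch below assumes the patterns x.*x and x.+x, which are consistent with the behavior described for "axxya":

```python
import re

s = ('xxx', 'abcxxxabc', 'xyx', 'abc', 'x.x', 'axa', 'axxxa', 'axxya')

# x.*x: two x's with zero or more characters between them.
star = list(filter((lambda t: re.search(r"x.*x", t)), s))
print(*star)   # xxx abcxxxabc xyx x.x axxxa axxya

# x.+x: two x's with at least one character between them.
plus = list(filter((lambda t: re.search(r"x.+x", t)), s))
print(*plus)   # xxx abcxxxabc xyx x.x axxxa
```

"axxya" drops out of the second result because its only two x's are adjacent.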

(9) To match any string that has a "c" in it:
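A sketch of this step, assuming the pattern is simply the letter c:

```python
import re

s = ('xxx', 'abcxxxabc', 'xyx', 'abc', 'x.x', 'axa', 'axxxa', 'axxya')

# A single ordinary character matches itself anywhere in the string.
with_c = list(filter((lambda t: re.search(r"c", t)), s))
print(*with_c)   # abcxxxabc abc
```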

(10) How do you match any string that does not contain a "c"? Regular expressions use square brackets to denote a set of characters to match, and a ^ at the start of the set negates it.

(11) This matches the entire list, because [^c] only requires one character that is not a "c" somewhere in the string. To see this clearly, filter a list that contains a lot of c's:
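Neither listing survives in the text. The sketch below assumes the pattern [^c]; the second list, lots_of_c, is a hypothetical stand-in for the original "list with a lot of c's":

```python
import re

s = ('xxx', 'abcxxxabc', 'xyx', 'abc', 'x.x', 'axa', 'axxxa', 'axxya')

# [^c] matches any single character other than "c", so every string
# that contains at least one non-c character passes the filter.
not_c = list(filter((lambda t: re.search(r"[^c]", t)), s))
print(len(not_c) == len(s))   # True

lots_of_c = ('c', 'cc', 'ccx', 'x')
survivors = list(filter((lambda t: re.search(r"[^c]", t)), lots_of_c))
print(*survivors)   # ccx x
```

Only strings made up entirely of c's are rejected, which is not the same as rejecting every string that contains a c.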

(12) To really match only strings that contain no "c", anchor the pattern with the special characters ^ and $, which match the start and end of the string, and tell re that you want nothing but non-c characters from start to finish:
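A sketch of the anchored version, assuming the pattern ^[^c]*$:

```python
import re

s = ('xxx', 'abcxxxabc', 'xyx', 'abc', 'x.x', 'axa', 'axxxa', 'axxya')

# ^ and $ anchor the match to the whole string, and [^c]* allows only
# non-c characters in between.
no_c = list(filter((lambda t: re.search(r"^[^c]*$", t)), s))
print(*no_c)   # xxx xyx x.x axa axxxa axxya
```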

To add a test:

(1) Use your favorite text editor again to open scan_pdf.py and make the following changes:

(2) Now run the modified script. Your output will differ, depending on your system:

In this example, the added test keeps only files that have .hu in the name (which here means the full path). The assumption is that .hu in a file name is the Hungarian country code, so the example shows how to narrow the search to files translated from Hungarian.
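The modified listing is missing. One way the extra test could look is sketched below as a standalone predicate; the function name wanted and the sample paths are hypothetical, and the pattern \.hu is an assumption based on the description of the test:

```python
import os.path
import re

def wanted(path):
    """Return True for PDF paths without spaces that contain ".hu"."""
    path = os.path.normcase(path)
    if not re.search(r"\.pdf$", path):
        return False
    if re.search(r" ", path):
        return False
    # Added test: keep only paths that contain ".hu" somewhere.
    if not re.search(r"\.hu", path):
        return False
    return True

print(wanted("docs/manual.hu.pdf"))   # True
print(wanted("docs/manual.en.pdf"))   # False
```

In the script itself, the same test would sit alongside the existing ones in print_pdf, skipping any path that does not pass.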
