Python Cookbook, Third Edition, Study Notes 4: Text Processing and String Tokenizing

Source: Internet
Author: User
Tags: aliases, tagname

Text Processing:
Suppose a directory holds files of various types, such as txt and csv, and you only want to find and open the files of one or more of those types. First pick out the file names that meet the criterion, then join each name with the directory path so the files can be opened one by one.
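import os
path = r'D:\test_source'
filenames = os.listdir(path)
print(filenames)
ret = [name for name in filenames if name.endswith('.txt')]
print(ret)
direct_path = [os.path.join(path, r) for r in ret]
print(direct_path[0])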
The results of the operation are as follows:
['1.csv', 'info.txt', 'pycharm2.jpg']
['info.txt']
D:\test_source\info.txt
os.listdir in this code lists all the file names in that directory; you can see files with the txt, csv, and jpg extensions among them.
[name for name in filenames if name.endswith('.txt')] picks out all the txt files. endswith returns True when the string ends with the given suffix, so the list comprehension keeps only the names that meet the suffix criterion.
[os.path.join(path, r) for r in ret] joins the directory with each matching name, which finally gives the complete path of every file that satisfies the condition.
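The listing and joining steps above can be tried without the D:\test_source directory. The following is a minimal sketch that builds a throwaway directory first; the helper name find_by_suffix and the temporary directory are assumptions made for illustration, not part of the original notes.

```python
import os
import shutil
import tempfile


def find_by_suffix(directory, suffix):
    """Return full paths of the files in `directory` whose names end with `suffix`."""
    names = sorted(os.listdir(directory))
    return [os.path.join(directory, n) for n in names if n.endswith(suffix)]


# Hypothetical stand-in for D:\test_source, filled with the same three names.
demo_dir = tempfile.mkdtemp()
for fname in ('1.csv', 'info.txt', 'pycharm2.jpg'):
    open(os.path.join(demo_dir, fname), 'w').close()

txt_paths = find_by_suffix(demo_dir, '.txt')
print(txt_paths)

# Remove the throwaway directory again.
shutil.rmtree(demo_dir)
```

Wrapping the two list comprehensions in one function keeps the directory and the suffix as parameters, so the same code can be reused for csv or jpg files.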
Since there is a test for the ending, is there one for the beginning? Yes: startswith tests how a string begins.
ret = [name for name in filenames if name.startswith('1')]
This finds the files whose names start with 1.
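To match "one or more of these formats" in a single test, both endswith and startswith also accept a tuple of candidates. A small sketch follows, using an in-memory list so it runs anywhere; the name 1_notes.txt is a hypothetical addition to the file list shown earlier.

```python
# File names from the example above, plus one hypothetical extra entry.
filenames = ['1.csv', 'info.txt', 'pycharm2.jpg', '1_notes.txt']

# A tuple of suffixes matches if the name ends with any one of them.
text_like = [n for n in filenames if n.endswith(('.txt', '.csv'))]

# startswith works the same way for prefixes.
starts_with_1 = [n for n in filenames if n.startswith('1')]

print(text_like)
print(starts_with_1)
```

The argument must be a tuple; passing a list raises a TypeError.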
Going a step further: suppose the directory holds a larger mix of files and I only want the txt files whose names start with a digit. How do I find them?

This calls for pattern matching, which endswith and startswith do not support. The more powerful fnmatch module is described below. Note that fnmatch uses shell-style wildcard patterns (*, ?, [seq]) rather than full regular expressions.
Here's how. The wildcard pattern passed to fnmatch picks out the txt file whose name is a single digit; use '[0-9]*.txt' instead to match any txt file whose name starts with a digit.
from fnmatch import fnmatch
ret = [name for name in filenames if fnmatch(name, '[0-9].txt')]
The following matches every file whose name starts with py:
ret = [name for name in filenames if fnmatch(name, 'py*')]
The results are as follows:
['py_log.txt', 'py_result.jpg']
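The difference between the wildcard patterns is easier to see side by side. The sketch below uses fnmatchcase, the case-sensitive variant from the same module, so the result does not depend on the operating system; the file list is made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names chosen to exercise each pattern.
filenames = ['1.txt', '12.txt', '2_report.txt',
             'py_log.txt', 'py_result.jpg', 'info.txt']

# '[0-9].txt' means exactly one digit followed by '.txt'.
single_digit = [n for n in filenames if fnmatchcase(n, '[0-9].txt')]

# '[0-9]*.txt' means one digit, then anything, then '.txt'.
digit_start = [n for n in filenames if fnmatchcase(n, '[0-9]*.txt')]

# 'py*' means any name beginning with 'py'.
py_files = [n for n in filenames if fnmatchcase(n, 'py*')]

print(single_digit)
print(digit_start)
print(py_files)
```

Plain fnmatch normalizes case according to the platform, so on Windows it would also accept a name such as PY_LOG.TXT for the pattern 'py*'.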
If we have a block of text and want to change its output format, for example indenting the first line, indenting every line after the first, or limiting the number of characters per line, textwrap can do it.
As in the following example:
textwrap.fill(s, 110) wraps the text at 110 characters per line.
textwrap.fill(s, 80, initial_indent='   ') wraps at 80 characters per line, with the first line starting with 3 spaces.
textwrap.fill(s, 80, subsequent_indent='   ') wraps at 80 characters per line, with every line from the second onward indented.
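import textwrap

def text_wrap_try():
    s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."
    # Wrap at 110 characters per line
    print(textwrap.fill(s, 110))
    print('\n')
    # Wrap at 80 characters, indenting only the first line
    print(textwrap.fill(s, 80, initial_indent='   '))
    print('\n')
    # Wrap at 80 characters, indenting every line after the first
    print(textwrap.fill(s, 80, subsequent_indent='   '))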
The results are as follows:
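The effect of the three calls can also be checked in code. The following is a minimal sketch that uses a narrower width of 40, an assumption chosen only so that the wrapping is visible on a short string, and then inspects the resulting lines.

```python
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# Width 40 is an illustrative choice, not the value used in the notes.
wrapped = textwrap.fill(s, 40)
first_indented = textwrap.fill(s, 40, initial_indent='   ')
rest_indented = textwrap.fill(s, 40, subsequent_indent='   ')

for block in (wrapped, first_indented, rest_indented):
    print(block)
    print('')

# No line is longer than the requested width, indentation included.
longest = max(len(line) for line in rest_indented.splitlines())
print(longest <= 40)
```

The indent strings count toward the width, so indented lines hold slightly less text than unindented ones.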

String token parsing:
Before discussing tokenizing, we first introduce two features of regular expressions: grouping, and named groups.
First, grouping. A group is the part of a regular expression enclosed in parentheses; each parenthesized subexpression forms one group.
Looking at the following code, the string is
<h1 class="h1user">crifan</h1>
and (\S+) and (.+?) are the two groups to match.

import re

def re_group():
    s = '<h1 class="h1user">crifan</h1>'
    pattern = re.compile(r'<(\S+) class="h1user">(.+?)<\/h1>')
    print(pattern.search(s).group(0))
    print(pattern.search(s).group(1))
    print(pattern.search(s).group(2))

The results show that group(0) outputs the entire matched string, group(1) outputs h1, which corresponds to (\S+), and group(2) outputs crifan, which corresponds to (.+?).

group(1) is the tag name in the page markup and group(2) is the tag's content. Looking values up by index is not intuitive. Can we give each group a name, so that a value is found by name, much like a dictionary lookup? Yes, using named groups as follows.

The code is changed to the following:
def re_group():
    s = '<h1 class="h1user">crifan</h1>'
    pattern = re.compile(r'<(?P<tag>\S+) class="h1user">(?P<text>.+?)<\/h1>')
    print(pattern.search(s).group(0))
    print(pattern.search(s).group('tag'))
    print(pattern.search(s).group('text'))

(\S+) and (.+?) were changed to (?P<tag>\S+) and (?P<text>.+?). The syntax (?P<name>...) gives the group an alias, so the group can be looked up by that alias instead of by index. The two groups above use tag and text as aliases, which makes printing the grouped content much easier than using indexes.
Named groups also allow a more advanced usage. Look at the following string: we want to match python study in the href, and the same python study text appears again later in the content. Can we refer directly to the earlier matching group?
s1 = '<a href="/tag/python study/">python study</a>'
The code is as follows. (?P=tagname) refers back to whatever the earlier tagname group matched:
pattern1 = re.compile(r'<a href="/tag/(?P<tagname>.+?)/">(?P=tagname)<\/a>')
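Both features can be exercised together in a short sketch. The strings s and s1 follow the examples above; s2 is a hypothetical counterexample added to show that the backreference only matches when the link text repeats the tag name.

```python
import re

# Named groups: look values up by alias, or get them all as a dict.
s = '<h1 class="h1user">crifan</h1>'
pattern = re.compile(r'<(?P<tag>\S+) class="h1user">(?P<text>.+?)</h1>')
m = pattern.search(s)
print(m.group('tag'))
print(m.group('text'))
print(m.groupdict())

# Backreference: (?P=tagname) must repeat what (?P<tagname>...) matched.
pattern1 = re.compile(r'<a href="/tag/(?P<tagname>.+?)/">(?P=tagname)</a>')
s1 = '<a href="/tag/python study/">python study</a>'
m1 = pattern1.search(s1)
print(m1.group('tagname'))

# Hypothetical string whose link text differs from the tag in the href.
s2 = '<a href="/tag/python study/">something else</a>'
m2 = pattern1.search(s2)
print(m2)
```

groupdict() returns every named group at once, which is the closest match to the "just like a dictionary" idea described above.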
With these two features introduced, we now look at tokenizing.
Suppose we have a string like this:
text = 'foo = 23 + 42 * 10'
We want the following result, which breaks the expression down into its parts, such as the name, the equals sign, the plus sign, and the numbers:
tokens = [('NAME', 'foo'), ('EQ', '='), ('NUM', '23'), ('PLUS', '+'),
          ('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]
The code we tried is as follows:
import re

def pattern_try():
    # First define each matching pattern as a named group
    NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
    NUM = r'(?P<NUM>\d+)'
    PLUS = r'(?P<PLUS>\+)'
    TIMES = r'(?P<TIMES>\*)'
    EQ = r'(?P<EQ>=)'
    WS = r'(?P<WS>\s+)'
    # Then combine all the patterns into one alternation
    master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))
    # Scan the string with a scanner
    scanner = master_pat.scanner('foo = 23 + 42 * 10')
    first = scanner.match()
    print('%s %s' % (first.lastgroup, first.group()))
    first = scanner.match()
    print('%s %s' % (first.lastgroup, first.group()))
    first = scanner.match()
    print('%s %s' % (first.lastgroup, first.group()))
    first = scanner.match()
    print('%s %s' % (first.lastgroup, first.group()))
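The results are as follows. Each call to match() finds the next token: lastgroup gives the alias of the group that matched, and group() gives the matched text. The scanner keeps its position between calls, so every match() continues where the previous one stopped. Because the pattern includes WS, the spaces between tokens are matched as WS tokens too: the four calls match foo, a space, =, and another space, and only the NAME and EQ lines are shown here.
E:\python2.7.11\python.exe e:/py_prj/python_cookbook.py
NAME foo
EQ =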
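The code above can be simplified with a loop. iter(scanner.match, None) calls scanner.match repeatedly until it returns None; whitespace tokens are skipped so that only the meaningful tokens are printed:
for m in iter(scanner.match, None):
    if m.lastgroup != 'WS':
        print('%s %s' % (m.lastgroup, m.group()))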
The resulting output is as follows:
E:\python2.7.11\python.exe e:/py_prj/python_cookbook.py
NAME foo
EQ =
NUM 23
PLUS +
NUM 42
TIMES *
NUM 10
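To collect the tokens into the list form shown earlier instead of printing them, the loop can be wrapped in a generator. The sketch below is one possible arrangement, not the code from the notes: it swaps the scanner for the documented finditer method, and the names Token and generate_tokens are assumptions made for illustration.

```python
import re
from collections import namedtuple

# Hypothetical container for one token: its type (group alias) and its text.
Token = namedtuple('Token', ['type', 'value'])

NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))


def generate_tokens(pat, text):
    """Yield a Token for every non-whitespace match of `pat` in `text`."""
    for m in pat.finditer(text):
        if m.lastgroup != 'WS':
            yield Token(m.lastgroup, m.group())


tokens = list(generate_tokens(master_pat, 'foo = 23 + 42 * 10'))
for tok in tokens:
    print('%s %s' % (tok.type, tok.value))
```

One difference worth knowing: finditer silently skips characters that no pattern matches, whereas scanner.match returns None at the first such character, so the scanner version stops early on unexpected input. The order of the alternatives also matters when one pattern is a prefix of another, because the first alternative that matches wins.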
