Python Cookbook, Third Edition, Study Note 4: Text Processing and String Token Parsing

Last Update:2017-07-02 Source: Internet

Author: User

Text Processing:

Suppose a directory contains files of various types, such as txt and csv, and you only want to find and open the files in one or more of these formats. First find the file names that meet the criteria, then join each name with the directory path so that the files can be opened one by one.

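import os

path = r'D:\test_source'
filenames = os.listdir(path)  # names of all entries in the directory
print(filenames)
ret = [name for name in filenames if name.endswith('.txt')]  # keep only the txt files
print(ret)
direct_path = [os.path.join(path, name) for name in ret]  # join directory and file name
print(direct_path[0])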

The results of the operation are as follows:

['1.csv', 'info.txt', 'pycharm2.jpg']

['info.txt']

D:\test_source\info.txt

In this code, os.listdir lists the names of all entries in the directory. You can see that it contains txt, csv, and jpg files.

The condition name.endswith('.txt') selects all the txt files. endswith returns True when a string ends with the given suffix, so only the names that meet the suffix criterion are kept.

Each matching name is then merged with the directory path using os.path.join, which gives the complete path of every file that satisfies the condition.

Since there is a test for the ending, is there one for the beginning? Yes: startswith tests how a string begins.

ret = [name for name in filenames if name.startswith('1')]

This finds the files whose names start with 1.
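Both methods also accept a tuple of strings, which covers the "one or more formats" case from the start of this section in a single call. The following is a minimal sketch and not part of the original note: the helper name pick_files is hypothetical, and the file names are borrowed from the example output above.

```python
def pick_files(filenames, prefixes=None, suffixes=None):
    """Return the names that match the given prefixes and suffixes.

    prefixes and suffixes are tuples of strings; None means "no filter".
    """
    result = []
    for name in filenames:
        # str.startswith / str.endswith accept a tuple and test each entry
        if prefixes is not None and not name.startswith(prefixes):
            continue
        if suffixes is not None and not name.endswith(suffixes):
            continue
        result.append(name)
    return result


filenames = ['1.csv', 'info.txt', 'pycharm2.jpg']
text_like = pick_files(filenames, suffixes=('.txt', '.csv'))
starts_with_1 = pick_files(filenames, prefixes=('1',))
print(text_like)
print(starts_with_1)
```

Passing a tuple keeps the comparison to plain strings, which is usually enough when the set of extensions is fixed in advance.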

Going a step further, suppose the directory contains more files and I only want the txt files whose names start with a digit. How do I find them?

This calls for pattern matching, but endswith and startswith only compare against fixed strings and do not accept patterns. The more powerful fnmatch module, described below, handles this case. Note that fnmatch uses shell-style wildcards (*, ?, [seq]) rather than full regular expressions.

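Here's how:

The fnmatch function matches a file name against a wildcard pattern. [0-9] matches a single digit, so the pattern below finds the txt files whose name is one digit followed by .txt. To allow more characters after the leading digit, use '[0-9]*.txt'.

from fnmatch import fnmatch
ret = [name for name in filenames if fnmatch(name, '[0-9].txt')]

The following pattern matches every file whose name starts with py:

ret = [name for name in filenames if fnmatch(name, 'py*')]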

The results are as follows:

['py_log.txt', 'py_result.jpg']
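To make the difference between the wildcard forms concrete, here is a small sketch. The directory listing is hypothetical and used only for illustration. It uses fnmatch.fnmatchcase, which always compares case-sensitively, whereas fnmatch.fnmatch follows the case rules of the operating system, and fnmatch.filter, which applies one pattern to a whole list.

```python
import fnmatch

# Hypothetical directory listing, used only for illustration.
filenames = ['1.txt', '23.txt', 'info.txt', 'py_log.txt', 'py_result.jpg']

# '[0-9].txt' allows exactly one digit before the extension.
single_digit = [n for n in filenames if fnmatch.fnmatchcase(n, '[0-9].txt')]

# '[0-9]*.txt' allows a digit followed by anything, so '23.txt' matches too.
digit_first = [n for n in filenames if fnmatch.fnmatchcase(n, '[0-9]*.txt')]

# fnmatch.filter() returns the names in the list that match the pattern.
py_files = fnmatch.filter(filenames, 'py*')

print(single_digit)
print(digit_first)
print(py_files)
```

If the matching really needs regular expression features such as alternation or repetition counts, the re module is the better tool. fnmatch is aimed at the simpler wildcard case.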

If we have a block of text and want to change its output format, for example to indent the first line or to limit the number of characters per line, we can use textwrap.

As in the following example:

textwrap.fill(s, 110) sets the maximum number of characters per line to 110

textwrap.fill(s, 80, initial_indent='   ') sets 80 characters per line, with the first line starting with 3 spaces

textwrap.fill(s, 80, subsequent_indent=' ') sets 80 characters per line, with every line from the second onward starting with a space

import textwrap

def text_wrap_try():
    s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."
    print(textwrap.fill(s, 110))  # at most 110 characters per line
    print('\n')
    print(textwrap.fill(s, 80, initial_indent='   '))  # first line indented
    print('\n')
    print(textwrap.fill(s, 80, subsequent_indent=' '))  # later lines indented

The results are as follows:
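The effect of each parameter is easiest to see on the wrapped lines themselves. The sketch below is an illustration rather than the output of the original run: it uses a width of 40, which is an assumption chosen so that the text wraps onto several lines, and textwrap.wrap, which returns the same lines as fill but as a list instead of one joined string.

```python
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# wrap() returns the list of lines; fill() joins the same lines with '\n'.
lines = textwrap.wrap(s, 40, initial_indent='   ')
hanging = textwrap.wrap(s, 40, subsequent_indent='   ')

print('\n'.join(lines))      # only the first line is indented
print()
print('\n'.join(hanging))    # every line except the first is indented
```

The indent counts toward the width, so an indented line holds fewer characters of text than an unindented one.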

String token parsing:

Before discussing tokenizing, we first introduce two regular expression features. The first is grouping, and the second is named groups.

First look at grouping. In a regular expression, each subexpression enclosed in parentheses forms a group.

In the following code, (\S+) and (.+?) are the two groups matched against the string.

import re

def re_group():
    s = '<h1 class="h1user">crifan</h1>'
    pattern = re.compile(r'<(\S+) class="h1user">(.+?)<\/h1>')
    print(pattern.search(s).group(0))  # the whole match
    print(pattern.search(s).group(1))  # first group: the tag name
    print(pattern.search(s).group(2))  # second group: the tag content

When the code runs, group(0) outputs the entire matched string, group(1) outputs h1, which corresponds to (\S+), and group(2) outputs crifan, which corresponds to (.+?).
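As a quick check of the group numbering described above, the sketch below runs the pattern against a small sample string. The sample string is an assumption inferred from the h1 and crifan values mentioned above. The match object is stored in a variable so that the search runs only once, and it is tested against None because search returns None when nothing matches.

```python
import re

# Sample input, assumed from the values described in the text.
s = '<h1 class="h1user">crifan</h1>'
pattern = re.compile(r'<(\S+) class="h1user">(.+?)</h1>')

m = pattern.search(s)       # search once and reuse the match object
if m is not None:           # search() returns None when nothing matches
    print(m.group(0))       # the whole match
    print(m.group(1))       # first parenthesized group
    print(m.group(2))       # second parenthesized group
    print(m.groups())       # all groups as a tuple
```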

group(1) corresponds to the tag in the page code, and group(2) corresponds to the content. Looking values up by index is not intuitive. Can we give each group a name, so that a value is found by name, much like a dictionary lookup? Yes, with the following regular expression.

The code is changed to the following:

def re_group():
    s = '<h1 class="h1user">crifan</h1>'
    pattern = re.compile(r'<(?P<tag>\S+) class="h1user">(?P<text>.+?)<\/h1>')
    print(pattern.search(s).group(0))
    print(pattern.search(s).group('tag'))
    print(pattern.search(s).group('text'))

(\S+) and (.+?) were changed to (?P<tag>\S+) and (?P<text>.+?). The (?P<name>...) syntax gives a group an alias, so the group can be looked up by that alias instead of by index. The two groups above use tag and text as their aliases, and printing group contents by alias is much easier to read than printing by index.

Now for a more advanced usage. In the following string, python study appears in the URL and again in the link text. Can the later part of the pattern refer directly to the group matched earlier?

s1 = '<a href="/tag/python study/">python study</a>'

The code is as follows. (?P=tagname) refers back to the text matched by the earlier group named tagname:

pattern1 = re.compile(r'<a href="/tag/(?P<tagname>.+?)/">(?P=tagname)<\/a>')
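The note does not show this pattern being run, so here is a minimal sketch of what the backreference does. The second string s2 is a hypothetical counterexample added for illustration: its link text differs from the tag in the URL, so the backreference cannot match.

```python
import re

s1 = '<a href="/tag/python study/">python study</a>'
pattern1 = re.compile(r'<a href="/tag/(?P<tagname>.+?)/">(?P=tagname)</a>')

m = pattern1.search(s1)
print(m.group('tagname'))   # text captured by the named group
print(m.groupdict())        # all named groups as a dictionary

# Hypothetical counterexample: the link text differs from the URL tag,
# so (?P=tagname) fails and search() returns None.
s2 = '<a href="/tag/python study/">java study</a>'
print(pattern1.search(s2))
```

A backreference matches the exact text captured earlier, not the pattern that captured it, which is why s2 is rejected even though java study would satisfy .+? on its own.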

With these two features introduced, we now look at tokenizing:

Suppose we have a string like this:

text = 'foo = 23 + 42 * 10'

We want the following result, which breaks the expression down into its parts, such as the name, the equals sign, the plus sign, and the numbers:

tokens = [('NAME', 'foo'), ('EQ', '='), ('NUM', '23'), ('PLUS', '+'),
          ('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]

The code we tried is as follows:

import re

def pattern_try():
    # First define each matching pattern as a named group
    NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
    NUM = r'(?P<NUM>\d+)'
    PLUS = r'(?P<PLUS>\+)'
    TIMES = r'(?P<TIMES>\*)'
    EQ = r'(?P<EQ>=)'
    WS = r'(?P<WS>\s+)'

    # Then combine all the patterns into one alternation
    master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

    # Scan the string with scanner(), an undocumented method of compiled patterns
    scanner = master_pat.scanner('foo = 23 + 42 * 10')
    first = scanner.match()
    print('{} {}'.format(first.lastgroup, first.group()))
    first = scanner.match()
    print('{} {}'.format(first.lastgroup, first.group()))
    first = scanner.match()
    print('{} {}'.format(first.lastgroup, first.group()))
    first = scanner.match()
    print('{} {}'.format(first.lastgroup, first.group()))

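The results are as follows. Each call to match() finds the next match, continuing from where the previous one ended. lastgroup gives the alias of the group that matched, and group() gives the exact text that was matched. The four calls above match NAME, WS, EQ, and WS in turn; the listing shows only the non-whitespace matches. As this shows, the scanner steps through the string one match at a time.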

E:\python2.7.11\python.exe e:/py_prj/python_cookbook.py

NAME foo

EQ =

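The repeated calls can be replaced with a loop:

# iter(f, None) calls f() repeatedly until it returns None
for m in iter(scanner.match, None):
    print('{} {}'.format(m.lastgroup, m.group()))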

The resulting output is as follows (the WS matches for the spaces between tokens are omitted):

E:\python2.7.11\python.exe e:/py_prj/python_cookbook.py

NAME foo

EQ =

NUM 23

PLUS +

NUM 42

TIMES *

NUM 10
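A natural next step is to wrap the loop in a generator that skips whitespace and yields (type, value) pairs, which produces the token list shown at the start of this section. The following is a sketch under stated assumptions rather than the code from the original note: the names Token and generate_tokens are illustrative, it is written for Python 3, and it relies on scanner(), which is an undocumented method of compiled patterns in CPython.

```python
import re
from collections import namedtuple

# Illustrative container for one token: its group alias and matched text.
Token = namedtuple('Token', ['type', 'value'])

NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))


def generate_tokens(pat, text):
    """Yield a Token for every match in text, skipping whitespace."""
    scanner = pat.scanner(text)
    # match() returns None at the end of the text or at the first
    # character that fits none of the patterns, which ends the loop.
    for m in iter(scanner.match, None):
        if m.lastgroup != 'WS':
            yield Token(m.lastgroup, m.group())


tokens = list(generate_tokens(master_pat, 'foo = 23 + 42 * 10'))
for tok in tokens:
    print(tok.type, tok.value)
```

The order of the alternatives matters, because the first pattern that matches at a position wins. If one pattern is a prefix of another, the longer one should be listed first.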

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
