Python Challenge 3: urllib & re

Source: Internet
Author: User
Tags: character classes, locale setting

Level 3 address: http://www.pythonchallenge.com/pc/def/ocr.html
Hint 1: recognize the characters. Maybe they are in the book, but maybe they are in the page source.
Hint 2: A comment in the page source reads "find rare characters in the mess below:", followed by a large block of characters.
The task is evidently to find the characters that occur least often in that block, ignoring whitespace. Characters with the same number of occurrences are kept in their order of appearance.


    import re
    import urllib

    # urllib opens the website
    response = urllib.urlopen("http://www.pythonchallenge.com/pc/def/ocr.html")
    source = response.read()
    response.close()

    # the entire HTML source has now been fetched
    print source

    # get the content of every comment in the page
    data = re.findall(r'<!--(.+?)-->', source, re.S)

    # get the letters from the second comment
    charList = re.findall(r'([a-zA-Z])', data[1])
    print charList
    print ''.join(charList)

Finally, the result is

    ['e', 'q', 'u', 'a', 'l', 'i', 't', 'y']
    equality
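The code above picks out the letters with a regular expression. The counting idea from the hint (find the characters that occur least often, ignoring whitespace) could also be sketched directly with collections.Counter. This is only an illustration written for Python 3: the sample string is a made-up stand-in for the real comment block, and the helper name rare_characters and its max_count threshold are assumptions, not part of the original solution.

```python
from collections import Counter

# Made-up stand-in for the block of characters in the page comment.
sample = "#%#e%#%q#%\n%#u#a%#l%\n#i%#t%y#%"


def rare_characters(text, max_count=1):
    """Return characters seen at most max_count times, in order of first appearance."""
    # Count every character except whitespace.
    counts = Counter(ch for ch in text if not ch.isspace())
    rare = []
    for ch in text:
        if not ch.isspace() and counts[ch] <= max_count and ch not in rare:
            rare.append(ch)
    return "".join(rare)


result = rare_characters(sample)
print(result)
```

On the real page the rare characters may not occur exactly once, so the threshold would probably need to be chosen after inspecting the counts.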


############################################################################################################### #####################


Python's urllib library fetches web page data from a given URL so that the data can then be analyzed.

    import urllib

    google = urllib.urlopen('http://www.google.com')
    print 'HTTP header:\n', google.info()
    print 'HTTP status:', google.getcode()
    print 'URL:', google.geturl()

    # result
    HTTP header:
    Date: Tue, Oct 20 19:30:35 GMT
    Expires: -1
    Cache-Control: private, max-age=0
    Content-Type: text/html; charset=ISO-8859-1
    Set-Cookie: PREF=ID=521BC5021BB6E976:FF=0:TM=1413919835:LM=1413919835:S=7CBCQWNHLCPJFOIW; expires=Thu, 20-Oct-2016 19:30:35 GMT; path=/; domain=.google.com
    Set-Cookie: NID=67=Mzfycxobc3d9vaqc6-cxkicbxt4eekorve6lon1zhqhlevxasd2oerkeg2in90zraqnpq1xlfzr_Ha1ife0jqdjankdexwafjziqn2mlgjavwcfmbyetbffist08intr; expires=Wed, 22-Apr-2015 19:30:35 GMT; path=/; domain=.google.com; HttpOnly
    P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
    Server: gws
    X-XSS-Protection: 1; mode=block
    X-Frame-Options: SAMEORIGIN
    Alternate-Protocol: 80:quic,p=0.01

    HTTP status: 200
    URL: http://www.google.com

We can fetch the page with urlopen and then call its read() method to get the full content.

info() returns the HTTP headers as an httplib.HTTPMessage object, which represents the header information sent back by the remote server.

getcode() returns the HTTP status code. For an HTTP request, 200 indicates success and 404 means the URL was not found.

geturl() returns the URL of the resource that was actually retrieved.

Environment variables are handled by the os module rather than urllib: os.getenv reads an environment variable and os.putenv sets one, and so on.

    help(urllib.urlopen)

    # result
    Help on function urlopen in module urllib:

    urlopen(url, data=None, proxies=None)
        Create a file-like object for the specified URL to read from.

From the above we can see that urlopen creates a file-like object for reading from the specified URL.

The url parameter is the path to the remote data, usually an HTTP or FTP address.

The data parameter holds the data to submit to the URL. When it is given, the request is sent with the POST method; when it is omitted, the request uses GET.

The proxies parameter specifies the proxy settings.
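The effect of the data parameter can be checked without touching the network. The sketch below is written for Python 3, where the functionality of urllib.urlopen lives in urllib.request, and it uses the Request class rather than urlopen itself so that nothing is actually sent. The endpoint URL and the form fields are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; the request objects are built but never sent.
url = "http://www.example.com/form"

# Without data, the request would be sent as GET.
get_request = Request(url)

# With data (URL-encoded and converted to bytes), it would be sent as POST.
form = urlencode({"name": "ocr", "level": 3}).encode("ascii")
post_request = Request(url, data=form)

print(get_request.get_method())
print(post_request.get_method())
print(form)
```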


urlopen returns a file-like object.

It provides read(), readline(), readlines(), fileno() and close(), which are used in the same way as the methods of a file object.
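A minimal sketch of the file-like interface follows, assuming Python 3 (urllib.request.urlopen instead of Python 2's urllib.urlopen). To keep it runnable offline, a data: URL stands in for a remote page; that choice is an assumption made for illustration, not something the challenge uses.

```python
from urllib.request import urlopen

# A data: URL stands in for a remote page, so no network access is needed.
url = "data:text/plain,first%0Asecond%0A"

with urlopen(url) as response:
    first_line = response.readline()                 # one line, like file.readline()
    rest = response.read()                           # everything that is left
    content_type = response.info()["Content-Type"]   # header lookup via info()
    final_url = response.geturl()                    # URL that was retrieved

print(first_line)
print(rest)
print(content_type)
print(final_url)
```

In Python 3 the body is returned as bytes, so it usually needs to be decoded before being passed to string-based regular expressions.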



############################################################################################################### #####################


The re (regular expression) module in Python

re.match: matching a pattern against a string

    import re

    line = "Cats is smarter than dogs"
    matchObj = re.match(r'(.*) is (.*?) .*', line, re.M|re.I)
    if matchObj:
        print "matchObj.group():", matchObj.group()
        print "matchObj.group(1):", matchObj.group(1)
        print "matchObj.group(2):", matchObj.group(2)
    else:
        print "No match!!"

The result of the above code is

    matchObj.group(): Cats is smarter than dogs
    matchObj.group(1): Cats
    matchObj.group(2): smarter

As we can see, group() returns the entire match, while group(n) returns the n-th submatch (the text captured by the n-th pair of parentheses). The pattern above has two such groups.

The function signature is re.match(pattern, string, flags=0).

pattern is the regular expression to match.

string is the value to be matched against the pattern.

flags is optional. Several flags can be combined with |.

re.I or re.IGNORECASE: matches uppercase and lowercase letters alike.

(Performs case-insensitive matching.)

re.S or re.DOTALL: changes the behavior of '.' so that it also matches \n.

(Makes a period (dot) match any character, including a newline.)

re.M or re.MULTILINE: multiline mode. Changes the behavior of '^' and '$'.

(Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).)

re.L or re.LOCALE: makes the predefined character classes \w, \W, \b, \B, \s, \S depend on the current locale setting.

(Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B).)

re.U or re.UNICODE: makes the predefined character classes \w, \W, \b, \B, \s, \S depend on the Unicode character properties.

(Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B.)

re.X or re.VERBOSE: verbose mode. The pattern can span multiple lines, whitespace in it is ignored, and comments can be added.

(Permits "cuter" regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.)
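The most commonly used flags can be compared side by side. The following is a small sketch for Python 3; the sample text and the patterns are made up for illustration.

```python
import re

text = "First line\nsecond LINE"

# re.I: case-insensitive matching finds both spellings.
ignore_case = re.findall(r"line", text, re.I)

# re.S: by default '.' stops at a newline; with re.S it crosses it.
dot_default = re.search(r"First.*LINE", text)
dot_all = re.search(r"First.*LINE", text, re.S)

# re.M: '^' matches at the start of every line, not only the string start.
first_only = re.findall(r"^\w+", text)
line_starts = re.findall(r"^\w+", text, re.M)

# re.X: whitespace and # comments inside the pattern are ignored.
verbose = re.search(r"""
    (\w+)   # first word
    \s+     # whitespace between the words
    (\w+)   # second word
    """, text, re.X)

print(ignore_case)
print(dot_default, dot_all is not None)
print(first_only, line_starts)
print(verbose.groups())
```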


re.search vs. re.match

    import re

    line = "Cats is smarter than dogs"

    matchObj = re.match(r'dogs', line, re.M|re.I)
    if matchObj:
        print "match --> matchObj.group():", matchObj.group()
    else:
        print "No match!!"

    searchObj = re.search(r'dogs', line, re.M|re.I)
    if searchObj:
        print "search --> searchObj.group():", searchObj.group()
    else:
        print "Nothing found!!"

    # result
    No match!!
    search --> searchObj.group(): dogs

We can see that match checks only at the beginning of the string and fails if the pattern is not found there.

search scans through the entire string and returns the first match found anywhere in it.



re.sub

The signature is as follows:

re.sub(pattern, repl, string, count=0)

Replaces every part of string that matches pattern with repl. All matches are replaced unless count is given, in which case at most count replacements are made.

It then returns the modified string.

    import re

    phone = "2004-959-559 # This is Phone number"

    # Delete Python-style comments
    num = re.sub(r'#.*$', "", phone)
    print "Phone num:", num

    # Remove anything other than digits
    num = re.sub(r'\D', "", phone)
    print "Phone num:", num

    # result
    Phone num: 2004-959-559
    Phone num: 2004959559
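The optional limit on the number of replacements (the count parameter of re.sub) is not exercised by the example above. A short Python 3 sketch, reusing the same phone number, shows its effect:

```python
import re

phone = "2004-959-559"

# Default count=0: every match is replaced.
all_removed = re.sub(r"-", "", phone)

# count=1: only the first match is replaced.
first_removed = re.sub(r"-", "", phone, count=1)

print(all_removed)
print(first_removed)
```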


re.split(pattern, string, maxsplit=0)

re.split can be used to split strings. maxsplit is the maximum number of splits, so maxsplit=1 splits once. The default is 0, which places no limit on the number of splits.

    import re

    print re.split(r'\W+', 'Words, words, words.')
    print re.split(r'(\W+)', 'Words, words, words.')
    print re.split(r'\W+', 'Words, words, words.', 1)

    # result
    ['Words', 'words', 'words', '']
    ['Words', ', ', 'words', ', ', 'words', '.', '']
    ['Words', 'words, words.']

If a match is made at the beginning or end of a string, the returned list starts or ends with an empty string.

    import re

    print re.split(r'(\W+)', '...words, words...')

    # result
    ['', '...', 'words', ', ', 'words', '...', '']

If the pattern does not match anywhere in the string, a list containing the whole string is returned.

    import re

    print re.split('a', '...words, words...')

    # result
    ['...words, words...']

####

str.split('\s') and re.split('\s', str) both split a string and return a list, but there is a difference.

1. str.split('\s') splits literally on the two-character sequence '\s'.

2. re.split('\s', str) splits on whitespace, because '\s' in a regular expression means a whitespace character.
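The difference between the two kinds of split is easy to see on a string that contains both a space and a literal backslash followed by s. This is a small Python 3 sketch; the sample string is made up, and raw strings are used so that the backslash reaches both functions unchanged.

```python
import re

# Contains one real space and one literal backslash-s sequence.
text = r"a b\sc"

# str.split treats r"\s" as two ordinary characters.
literal_parts = text.split(r"\s")

# re.split treats r"\s" as "any whitespace character".
regex_parts = re.split(r"\s", text)

print(literal_parts)
print(regex_parts)
```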



re.findall(pattern, string, flags=0)

Finds all substrings matched by the pattern and returns them as a list. The matches are returned in order, scanning the string from left to right. An empty list is returned if there is no match.

    import re

    print re.findall('a', 'bcdef')
    print re.findall(r'\d+', '12a34b56c789e')

    # result
    []
    ['12', '34', '56', '789']
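The challenge solution relies on one more detail of re.findall: when the pattern contains parentheses, the list holds the captured groups rather than the whole matches. The following Python 3 sketch illustrates this with a made-up HTML snippet and a made-up key=value string.

```python
import re

# Made-up snippet with two comments, one of them spanning two lines.
html = "<html><!-- first\nnote --><body><!-- second --></body></html>"

# One group: the list holds only the text inside the parentheses.
comments = re.findall(r"<!--(.+?)-->", html, re.S)

# No group: the list holds the whole matches.
whole = re.findall(r"\w+=\d+", "a=1, b=22")

# Several groups: the list holds one tuple per match.
pairs = re.findall(r"(\w+)=(\d+)", "a=1, b=22")

print(comments)
print(whole)
print(pairs)
```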




re.compile(pattern, flags=0)

Compiles the regular expression and returns a RegexObject, whose match or search method can then be called.


    prog = re.compile(pattern)
    result = prog.match(string)

is equivalent to

    result = re.match(pattern, string)

The first form allows the compiled regular expression to be reused.
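Reuse is the main reason to compile a pattern. Below is a minimal Python 3 sketch in which one compiled pattern is applied to several strings; the sample strings are made up.

```python
import re

# Compile once, then reuse the same object for every call.
number = re.compile(r"\d+")

lines = ["level 3", "no digits here", "pc/def/ocr 2014"]
found = [number.findall(line) for line in lines]

# The compiled object offers the same methods as the module-level functions.
first = number.search("abc 42 def").group()
at_start = number.match("abc 42")  # match only looks at the start of the string

print(found)
print(first)
print(at_start)
```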
