1. Python web crawler code to get Taobao commodity prices:
# -*- coding: utf-8 -*-
'''
Created on March 17, 2017
@author: Lavi
'''
import re

import requests


def getHTMLText(url):
    # Fetch a page and return its text, or an empty string on any request error
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def parserPage(goodslist, html):
    plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
    tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)  # the question mark makes the match non-greedy (shortest match)
    for i in range(len(plt)):
        # eval() evaluates a string as a Python expression and returns the result;
        # here it only strips the quotes around each value
        price = eval(plt[i].split(':')[1])
        title = eval(tlt[i].split(':')[1])
        goodslist.append([price, title])


def printPage(goodslist):
    tplt = "{:6}\t{:8}\t{:16}"
    print(tplt.format("Serial number", "Price", "Product name"))
    for i in range(len(goodslist)):
        goods = goodslist[i]
        print(tplt.format(i + 1, goods[0], goods[1]))


def main():
    goods = "schoolbag"
    depth = 2
    url = "https://s.taobao.com/search?q="
    goodslist = []
    for i in range(depth):
        html = getHTMLText(url + goods + "&s=" + str(i * 44))
        parserPage(goodslist, html)
    printPage(goodslist)


main()
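The paging logic in main() can be pulled out into a small helper. This is only a sketch: build_search_urls and page_size are hypothetical names that do not appear in the original code, and the value 44 is simply taken from the i*44 offset used above. It builds the URLs without sending any request, and uses urlencode from the standard library so that a keyword containing spaces or non-ASCII characters is escaped properly.

```python
from urllib.parse import urlencode


def build_search_urls(keyword, depth, page_size=44):
    """Return one search URL per result page (hypothetical helper)."""
    base = "https://s.taobao.com/search?"
    # The "s" parameter is the offset of the first item on each page.
    return [base + urlencode({"q": keyword, "s": i * page_size})
            for i in range(depth)]


for url in build_search_urls("schoolbag", 2):
    print(url)
```

Each returned URL could then be passed to the page-fetching function in place of the string concatenation.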
2. Extended uses of the eval() function
The eval() function is powerful. As the official documentation explains, it evaluates a string as a valid Python expression and returns the result. Combined with the math module, this makes it handy for building a calculator. It can also convert the string form of a list, tuple, or dict back into the corresponding object.
a = "[[1,2], [3,4], [5,6], [7,8], [9,0]]"
b = eval(a)
b
Out[3]: [[1, 2], [3, 4], [5, 6], [7, 8], [9, 0]]
type(b)
Out[4]: list
a = "{1: 'a', 2: 'b'}"
b = eval(a)
b
Out[7]: {1: 'a', 2: 'b'}
type(b)
Out[8]: dict
a = "([1,2], [3,4], [5,6], [7,8], (9,0))"
b = eval(a)
b
Out[11]: ([1, 2], [3, 4], [5, 6], [7, 8], (9, 0))
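The calculator use mentioned earlier might look like the sketch below. The names calculate and ALLOWED_NAMES are hypothetical and not from the original post. The expression only sees the public names of the math module, and the built-in functions are hidden from it. That restriction narrows what an expression can reach, but it is not a real sandbox and does not make eval() safe for untrusted input.

```python
import math

# Names the expression may use: only the public names from the math module.
ALLOWED_NAMES = {name: getattr(math, name)
                 for name in dir(math) if not name.startswith("_")}


def calculate(expression):
    """Evaluate an arithmetic expression string with math functions available."""
    # An empty __builtins__ hides open(), __import__() and similar functions,
    # but this is NOT a sandbox; only use it on input you trust.
    return eval(expression, {"__builtins__": {}}, ALLOWED_NAMES)


print(calculate("sqrt(16) + 2 ** 3"))  # 12.0
print(calculate("floor(7 / 2)"))       # 3
```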
The eval() function is clearly powerful, but its lack of safety is a fatal weakness. Consider this scenario: a program asks the user to enter an expression and then evaluates it. If the user maliciously enters:
__import__('os').system('dir')
then, after eval() runs, the files in the current directory are listed for the user. The user can then enter:
open('filename').read()
and the contents of the file have been read. After that, a single delete command makes the file disappear.
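When the goal is only to turn the string form of a list, tuple, dict, number, or string back into an object, the standard library's ast.literal_eval is a commonly used alternative. It is not mentioned in the original post. It accepts only Python literals and raises an error for anything else, such as a function call. The wrapper name safe_parse below is hypothetical; this is a minimal sketch, not a complete input validator.

```python
import ast


def safe_parse(text):
    """Parse a Python literal; return None if text is not a plain literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None


nested = safe_parse("[[1,2], [3,4], [5,6], [7,8], [9,0]]")
mapping = safe_parse("{1: 'a', 2: 'b'}")
rejected = safe_parse("__import__('os').system('dir')")

print(type(nested), type(mapping), rejected)
```

The malicious expression from the example above is refused instead of executed, so rejected is None.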
3. Minimum (non-greedy) matching in Python regular expressions
By default, Python's regular expression module re uses greedy matching, also called maximum matching. Sometimes, though, we need the minimum match. Below we look at what minimum matching is and how to achieve it:
Shortest matching applies when you have a piece of text and want to match the shortest possible string rather than the longest.
Example
For example, given the HTML fragment '<a>this is a label</a><a>the second label</a>', how do we match the contents of each a tag? The difference between the shortest and the longest match is shown below.
Code
>>> import re
>>> str = '<a>this is a label</a><a>the second label</a>'
>>> print(re.findall(r'<a>(.*?)</a>', str))  # shortest match
['this is a label', 'the second label']
>>> print(re.findall(r'<a>(.*)</a>', str))  # longest match
['this is a label</a><a>the second label']
Explanation
In this example, the pattern r'<a>(.*)</a>' is intended to match the text enclosed by each pair of a tags, but the * operator is greedy in regular expressions, so the matching operation finds the longest possible match.
Adding the ? operator after the * operator switches the match to non-greedy mode, which produces the shortest match.
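The same effect can be seen on data shaped like the search results parsed in section 1. The sample string below is made up for illustration and is not real Taobao output; only the raw_title key comes from the crawler code.

```python
import re

# Hypothetical sample data in the same "key":"value" shape the crawler parses.
sample = ('"raw_title":"red schoolbag","view_price":"59.00",'
          '"raw_title":"blue schoolbag","view_price":"128.50"')

# Greedy: .* runs to the last quote in the string, so there is a single match.
greedy = re.findall(r'"raw_title":".*"', sample)

# Non-greedy: .*? stops at the first closing quote, giving one match per title.
lazy = re.findall(r'"raw_title":".*?"', sample)

print(len(greedy))
print(lazy)
```

Without the question mark, the title pattern would swallow every field between the first title and the end of the text, which is why the crawler uses the non-greedy form.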
Resources:
1, China University MOOC course: Python Web Crawler and Information Extraction
2, HTTP://WWW.TUICOOL.COM/ARTICLES/BBVNQBQ
3, http://www.cnblogs.com/jhao/p/5989241.html