Web Crawler Chapter
1. How Python accesses the Internet:
URL (web address) + lib => urllib
2. Check the documentation: the official Python docs.
3. response = urllib.request.urlopen("http://www.baidu.com")
html = response.read().decode("utf-8")  # decode the binary response into text
4. To read web images, save the file in "wb" (write binary) mode.
urlopen = Request + open: urlopen() accepts either a URL string or a Request object.
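The decode step and the "wb" mode above can be sketched offline; the sample bytes below stand in for what urlopen().read() would return, so the demo runs without network access:

```python
import os
import tempfile

# Stand-in for urlopen("http://...").read() -- the server hands back raw bytes.
raw = "<html>hello, fishc!</html>".encode("utf-8")

# decode("utf-8") converts the binary response into a Python string.
html = raw.decode("utf-8")
print(html)  # <html>hello, fishc!</html>

# Images are binary too, so they are written with mode "wb" (write binary).
img_bytes = b"\x89PNG\r\n\x1a\n fake image payload"  # fake payload for the demo
path = os.path.join(tempfile.gettempdir(), "demo.png")  # hypothetical filename
with open(path, "wb") as f:
    f.write(img_bytes)

# Reading it back in "rb" mode returns exactly the same bytes.
with open(path, "rb") as f:
    assert f.read() == img_bytes
```
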
- Browser -> Inspect Element -> Network tab (to watch the POST form Python submits)
(the traffic between the browser/client and the server)
GET: request data from the server
POST: submit processed data to the specified server
At about the 8-minute mark, click the POST request (translate?smartresult...).
Open it; the Preview tab shows the translated content.
(1) Then analyze the contents of Headers:
① Status Code: 200 means a normal response; 404 indicates an abnormal one.
② Request Headers: what the server checks; typically the User-Agent field tells it whether the access comes from a browser or from code.
③ Form Data: the main content of the POST submission.
(2) POST requires data in a specific format, which can be produced via urllib.parse.
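The conversion can be seen in isolation: urlencode() flattens a dict into a query string, and encode() turns that into the bytes urlopen() expects for a POST body (the field values here are a subset of the Form Data used below):

```python
import urllib.parse

data = {}
data['type'] = 'AUTO'
data['i'] = 'I love fishc.com!'
data['doctype'] = 'json'

# Step 1: dict -> "key=value&key=value" query string (percent-encoded).
qs = urllib.parse.urlencode(data)
print(qs)  # type=AUTO&i=I+love+fishc.com%21&doctype=json

# Step 2: str -> bytes; urlopen()'s data argument must be bytes, not str.
payload = qs.encode('utf-8')
print(type(payload))  # <class 'bytes'>
```
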
(3) Code for the POST translation function:
import urllib.request
import urllib.parse
url = ''   # the Request URL shown under Headers
data = {}  # the Form Data shown under Headers:
    type: AUTO
    i: I love fishc.com!
    doctype: json
    xmlversion: 1.6
    keyfrom: fanyi.web
    ue: utf-8
    typoResult: true
response = urllib.request.urlopen(url, data)
Modify:
data['type'] = 'AUTO'
data['i'] = 'I love fishc.com!'
data['doctype'] = 'json'
data['xmlversion'] = '1.6'
data['keyfrom'] = 'fanyi.web'
data['ue'] = 'UTF-8'  # UTF-8 is a variable-length encoding of Unicode
data['typoResult'] = 'true'
data = urllib.parse.urlencode(data).encode('utf-8')  # encode the form (Python strings are Unicode by default)
response = urllib.request.urlopen(url, data)
html = response.read().decode('utf-8')  # convert the encoded bytes back into Unicode
print(html)
The output is a JSON structure:
import json
target = json.loads(html)  # loading it reveals a dictionary
type(target)
Output: <class 'dict'>  (dictionary form)
target['translateResult']
Output: [[{'tgt': 'I love fishc.com!', 'src': 'I love fishc.com!'}]]
target['translateResult'][0][0]['tgt']
Output: 'I love fishc.com!'
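The indexing chain can be replayed on a canned response string, so the structure is clear without calling the server (the JSON literal below merely mimics the shape of the translation reply):

```python
import json

# Stand-in for the html string returned by the translation server.
html = '{"translateResult": [[{"tgt": "I love fishc.com!", "src": "I love fishc.com!"}]]}'

target = json.loads(html)  # str -> dict
print(type(target))        # <class 'dict'>

# translateResult is a list of lists of dicts, hence [0][0]['tgt'].
result = target['translateResult'][0][0]['tgt']
print(result)              # I love fishc.com!
```
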
After the above analysis, the code is as follows:
import urllib.request
import urllib.parse
import json

content = input("Please enter what you want to translate: ")
url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=xxxxx'
data = {}
data['type'] = 'AUTO'
data['i'] = content
data['doctype'] = 'json'
data['xmlversion'] = '1.6'
data['keyfrom'] = 'fanyi.web'
data['ue'] = 'UTF-8'  # UTF-8 is a variable-length encoding of Unicode
data['typoResult'] = 'true'
data = urllib.parse.urlencode(data).encode('utf-8')  # encode the form
response = urllib.request.urlopen(url, data)
html = response.read().decode('utf-8')
target = json.loads(html)
print("Translation result: %s" % (target['translateResult'][0][0]['tgt']))
- Modify Headers to emulate browser access
(1) Method one: modify via the headers parameter of Request. Add the code:
head = {}
head['User-Agent'] = '...'  # copy the User-Agent value from Request Headers
req = urllib.request.Request(url, data, head)
(2) Method two: modify via the Request.add_header() method:
req = urllib.request.Request(url, data)
req.add_header('User-Agent', '...')  # the User-Agent string from that site
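Both methods can be checked offline by inspecting the Request object before it is ever opened (the URL and the User-Agent string below are shortened stand-ins):

```python
import urllib.request

url = 'http://www.example.com'  # placeholder URL
ua = 'Mozilla/5.0'              # stand-in; copy a real User-Agent from DevTools

# Method one: pass a headers dict to Request().
head = {'User-Agent': ua}
req1 = urllib.request.Request(url, None, head)

# Method two: call add_header() after constructing the Request.
req2 = urllib.request.Request(url)
req2.add_header('User-Agent', ua)

# urllib normalizes header names to capitalized form internally.
print(req1.get_header('User-agent'))  # Mozilla/5.0
print(req2.get_header('User-agent'))  # Mozilla/5.0
```
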
- Ways around websites that restrict repeated access from the same IP within a short period
(1) Delay between accesses (not recommended)
(2) Use a proxy
① The parameter is a dictionary: {'type': 'proxy IP:port'}
proxy_support = urllib.request.ProxyHandler({})
② Customize/create an opener:
opener = urllib.request.build_opener(proxy_support)
③ A. Install the opener:
urllib.request.install_opener(opener)
B. Or call the opener directly:
opener.open(url)
(A direct web search will turn up free proxy IPs.)
import urllib.request
url = 'http://www.whatismyip.com.tw'
proxy_support = urllib.request.ProxyHandler({'http': '119.6.144.73:81'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
print(html)
④ Modify the opener's headers as follows:
import urllib.request
url = 'http://www.whatismyip.com.tw'
proxy_support = urllib.request.ProxyHandler({'http': '119.6.144.73:81'})
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-Agent', '...')]  # the User-Agent string from that webpage
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
print(html)
⑤ Rotating among multiple IPs:
import urllib.request
import random  # pick a proxy IP at random
url = 'http://www.whatismyip.com.tw'
iplist = ['119.6.144.73:81', '183.203.208.166:8118']
proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-Agent', '...')]  # the User-Agent string from that webpage
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
print(html)
- Crawl the first 10 pages of the site and save the images (modular)
import urllib.request
import os  # for creating a folder in the current directory

def get_page(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent', '...')  # the User-Agent string from Headers
    response = urllib.request.urlopen(req)
    ...

def find_imgs(url):
    pass

def save_imgs(folder, img_addrs):
    pass

def download_mm(folder='ooxx', pages=10):  # pages = number of pages to download
    os.mkdir(folder)  # name of the folder to create
    os.chdir(folder)
    url = "http://jandan.net/ooxx"
    page_num = int(get_page(url))
    for i in range(pages):
        page_num -= i
        page_url = url + '/page-' + str(page_num) + '#comments'  # the page address is now assembled
        # next, open the page to collect the image addresses
        img_addrs = find_imgs(page_url)
        save_imgs(folder, img_addrs)  # get familiar with modular thinking

if __name__ == '__main__':
    download_mm()

Additional: the if __name__ == '__main__' guard runs the code only when the file is executed directly, not when it is imported — handy for testing a module's availability.
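A minimal illustration of the guard (the module and function names here are made up for the demo):

```python
# Imagine this file saved as greet.py (hypothetical module name).

def greet(name):
    """Reusable function; available to any module that does `import greet`."""
    return "Hello, %s!" % name

if __name__ == '__main__':
    # Runs only under `python greet.py`, not under `import greet`,
    # so a quick self-test can live here without bothering importers.
    print(greet("fishc"))  # Hello, fishc!
```
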
9. If you want to invoke a function, simply call it by name.
For example, define:
def url_open(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent', '...')  # the User-Agent string from Headers
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

Then, to use it inside def get_page(url), just write:
html = url_open(url).decode('utf-8')
In other words, this works the same as a mathematical function reference.
- Regular expressions
(1) The search method finds the position of the first successful match in the content.
(2) . : matches any character (excluding line breaks); a wildcard.
(3) To match a literal ".", use \. instead.
(4) \d: matches any digit.
(5) Matching an IP address: \d\d\d\.\d\d\d\.\d\d\d\.\d\d\d
(6) re.search(r'[aeiou]', 'I love fishc.com!')
(7) A hyphen indicates a range: [a-z], [0-9]
(8) Limit the number of repetitions: re.search(r'ab{3}c', 'abbbc')
(9) Give the repetition count a range: re.search(r'ab{3,10}c', 'abbbbbbbc')
(10) Regular expressions have no concept of numeric values, only of digit characters:
re.search(r'[01]\d\d|2[0-4]\d|25[0-5]', '188')  # matches 0-255
{0,1} repeats 0 or 1 times; 0 times means the character may be absent, i.e. it is optional.
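The rules above can be verified directly in the interpreter; each pattern here restates one of the numbered points:

```python
import re

# (2) "." matches any character except a newline.
assert re.search(r'f.shc', 'fishc') is not None

# (3) "\." matches a literal dot.
assert re.search(r'\.com', 'fishc.com') is not None

# (6) A character class matches any one of its members.
assert re.search(r'[aeiou]', 'I love fishc.com!').group() == 'o'

# (8)/(9) {m} and {m,n} bound the repetition count.
assert re.search(r'ab{3}c', 'abbbc') is not None
assert re.search(r'ab{3,10}c', 'abbbbbbbc') is not None

# (10) Digits, not numeric values: 0-255 needs explicit alternatives.
pattern = r'[01]\d\d|2[0-4]\d|25[0-5]'
assert re.search(pattern, '188') is not None
assert re.search(pattern, '255') is not None
assert re.search(pattern, '256') is None  # 256 is out of range
```
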
- Scrapy
(1) Scrapy is a full web-crawling framework — the art of crawling.