"Reptile" python crawler

Source: Internet
Author: User
Tags: urlencode

Web crawler chapter

1. How Python accesses the Internet

URL (web address) + lib (library) = urllib

2. Check the documentation: the official Python docs.

3. response = urllib.request.urlopen("http://www.baidu.com")

html = response.read().decode("utf-8") decodes the binary response into text

4. To save a web image, write the raw bytes to a file opened in "wb" (write binary) mode

urlopen = Request + urlopen: urlopen() accepts either a URL string or a Request object
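A minimal sketch of points 3 and 4 together (the image URL is only a placeholder, not from the original notes):

import urllib.request

# fetch a page and decode the raw bytes into text
response = urllib.request.urlopen("http://www.baidu.com")
html = response.read().decode("utf-8")
print(html[:300])

# fetch an image and save the raw bytes with mode "wb" (write binary)
img_data = urllib.request.urlopen("http://example.com/sample.jpg").read()  # placeholder URL
with open("sample.jpg", "wb") as f:
    f.write(img_data)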

5. Browser -- Inspect Element -- Network tab (to see how a POST form is submitted)

(the content exchanged between the browser and the server)

GET: request data from the server

POST: submit data to the specified server for processing

In the list of captured requests, click the POST request (translate?smartresult)

Click to open it; the Preview tab shows the translated content.

(1) Then analyze the contents of Headers:

① Status Code: 200 means a normal response; 404 indicates an abnormal response

② Request Headers: what is sent to the server. The User-Agent field is typically used to tell whether the site is being accessed by a browser or by code

③ Form Data: the main content of the POST submission

(2) POST data must be in a specific format, which can be built with urllib.parse

(3) Code for the POST translation request:

import urllib.request
import urllib.parse

url = ''     # the Request URL shown under Headers
data = {}    # the Form Data shown under Headers, which contains:

# type: AUTO
# i: I love fishc.com!
# doctype: json
# xmlVersion: 1.6
# keyfrom: fanyi.web
# ue: UTF-8
# typoResult: true

# written out as a dictionary:
data['type'] = 'AUTO'
data['i'] = 'I love fishc.com!'
data['doctype'] = 'json'
data['xmlVersion'] = '1.6'
data['keyfrom'] = 'fanyi.web'
data['ue'] = 'UTF-8'          # UTF-8 is a variable-length encoding of Unicode
data['typoResult'] = 'true'

# urlencode() builds the form string, encode() turns it into bytes
# (Python strings are Unicode by default, so they must be encoded before being sent)
data = urllib.parse.urlencode(data).encode('utf-8')

response = urllib.request.urlopen(url, data)
html = response.read().decode('utf-8')   # decode the response bytes back into Unicode text
print(html)

The output is a JSON structure.

import json

target = json.loads(html)   # loading it shows the result is a dictionary
type(target)
# output: <class 'dict'> (a dictionary)

target['translateResult']
# output: [[{'tgt': 'I love fishc.com!', 'src': 'I love fishc.com!'}]]

>>> target['translateResult'][0][0]['tgt']
# output: 'I love fishc.com!'

After the above analysis, the code is as follows:

import urllib.request
import urllib.parse
import json

content = input("Please enter what you want to translate: ")

url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=xxxxx'

data = {}
data['type'] = 'AUTO'
data['i'] = content              # the text the user typed in
data['doctype'] = 'json'
data['xmlVersion'] = '1.6'
data['keyfrom'] = 'fanyi.web'
data['ue'] = 'UTF-8'             # UTF-8 is a variable-length encoding of Unicode
data['typoResult'] = 'true'
data = urllib.parse.urlencode(data).encode('utf-8')   # encode the form data

response = urllib.request.urlopen(url, data)
html = response.read().decode('utf-8')

target = json.loads(html)
print("Translation result: %s" % (target['translateResult'][0][0]['tgt']))

6. Modify Headers to emulate a browser

(1) Method one: pass a headers dictionary to Request

Add the following code:

head = {}
head['User-Agent'] = '<copy the User-Agent string from the request headers>'

and build the request with req = urllib.request.Request(url, data, head)

(2) Method two: use the Request.add_header() method

req = urllib.request.Request(url, data)
req.add_header('User-Agent', '<the User-Agent string from that site>')
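A minimal sketch putting this together end to end (the URL and User-Agent string below are placeholders, not values from the original notes):

import urllib.request

url = 'http://www.example.com'                      # placeholder URL
head = {'User-Agent': 'Mozilla/5.0'}                # placeholder; copy the real string from your browser

req = urllib.request.Request(url, headers=head)     # method one: pass the headers dictionary
req.add_header('Accept-Language', 'en-US')          # method two also works on the same Request object
response = urllib.request.urlopen(req)              # urlopen() accepts a Request object as well as a URL
html = response.read().decode('utf-8')
print(html[:300])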

7. How to deal with sites that restrict many accesses from the same IP in a short period

(1) Delay between accesses (not recommended)

(2) Use a proxy

① The parameter is a dictionary: {'type': 'proxy IP:port'}

proxy_support = urllib.request.ProxyHandler({})

② Customize and create an opener

opener = urllib.request.build_opener(proxy_support)

③ a. Install the opener

urllib.request.install_opener(opener)

b. Or call the opener directly

opener.open(url)

(searching the web for free proxy IPs works)

import urllib.request

url = 'http://www.whatismyip.com.tw'
proxy_support = urllib.request.ProxyHandler({'http': '119.6.144.73:81'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
print(html)

④ Modify the opener's headers as follows:

import urllib.request

url = 'http://www.whatismyip.com.tw'
proxy_support = urllib.request.ProxyHandler({'http': '119.6.144.73:81'})
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-Agent', '<the User-Agent string copied from the browser>')]
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
print(html)

⑤ Use a pool of multiple proxy IPs

import urllib.request
import random                      # pick a proxy IP at random

url = 'http://www.whatismyip.com.tw'
iplist = ['119.6.144.73:81', '183.203.208.166:8118']
proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-Agent', '<the User-Agent string copied from the browser>')]
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
print(html)

8. Crawl the first 10 pages of the site and save the images (modular design)

import urllib.request
import os                      # used to create a folder in the current directory


def get_page(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent', '<the User-Agent string from the request headers>')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    # to be filled in: parse the current page number out of html and return it
    return html


def find_imgs(url):
    pass


def save_imgs(folder, img_addrs):
    pass


def download_mm(folder='ooxx', pages=10):   # pages = number of pages to download
    os.mkdir(folder)                        # name of the folder to create
    os.chdir(folder)

    url = "http://jandan.net/ooxx"
    page_num = int(get_page(url))

    for i in range(pages):
        page_num -= i
        # at this point the page address is known
        page_url = url + '/page-' + str(page_num) + '#comments'
        # next, open that page to get the addresses of the pictures
        img_addrs = find_imgs(page_url)
        save_imgs(folder, img_addrs)        # get used to thinking in modules


if __name__ == '__main__':
    download_mm()

Note: the purpose of if __name__ == '__main__' is to test whether the file is being run directly; it lets the module be imported elsewhere without running its test code.
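A tiny sketch of that behaviour (the file name mymodule.py is just an example):

# mymodule.py
def double(x):
    return x * 2

if __name__ == '__main__':
    # this block runs only when the file is executed directly: python mymodule.py
    print(double(21))

# "import mymodule" from another file does not trigger the print above,
# so the module can be reused without running its test code.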


9. If you want to invoke a function, just call it by name.

For example, define:

def url_open(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent', '<the User-Agent string>')
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

If you want to use it inside def get_page(url),

just write html = url_open(url).decode('utf-8')

In other words, this works just like referencing a function in mathematics.
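Building on url_open(), here is a hedged sketch of how the find_imgs() and save_imgs() stubs from the previous section might be filled in; the regular expression and the file-naming rule are assumptions for illustration, not the original author's code:

import re
import urllib.request

def url_open(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0')        # placeholder User-Agent
    return urllib.request.urlopen(req).read()

def find_imgs(url):
    # assumption: image addresses appear in the page as <img src="....jpg">
    html = url_open(url).decode('utf-8')
    return re.findall(r'img src="([^"]+\.jpg)"', html)

def save_imgs(folder, img_addrs):
    for each in img_addrs:
        filename = each.split('/')[-1]                  # last part of the URL as the file name
        with open(filename, 'wb') as f:                 # 'wb': write the image bytes in binary mode
            f.write(url_open(each))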

10. Regular expressions

(1) The search() method finds the position of the first successful match in the content;

(2) . : matches any single character (excluding the newline); it is the wildcard character

(3) To match a literal dot, just use \.

(4) \d: matches any digit

(5) Roughly matching an IP address: \d\d\d\.\d\d\d\.\d\d\d\.\d\d\d

(6) re.search(r'[aeiou]', 'I love fishc.com!') (square brackets define a set of characters to match)

(7) A hyphen inside the brackets indicates a range: [a-z], [0-9]

(8) Limit the number of repetitions: re.search(r'ab{3}c', 'abbbc')

(9) Give the repetition count a range: re.search(r'ab{3,10}c', 'abbbbbbbc')

(10) Regular expressions have no concept of numeric value, only of digit characters

re.search(r'[01]\d\d|2[0-4]\d|25[0-5]', '188') matches a component in the range 0-255

{0,1} means repeat 0 or 1 times; 0 times means the part may be absent, i.e. it is optional.
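A small sketch tying these points together (the test strings are just examples):

import re

# '.' matches any character except a newline; '\.' matches a literal dot
print(re.search(r'fishc\.com', 'I love fishc.com!'))

# character class with a range: matches the first lowercase letter, here 'i'
print(re.search(r'[a-z]', 'FishC'))

# repetition counts
print(re.search(r'ab{3}c', 'abbbc'))          # exactly three b's
print(re.search(r'ab{3,10}c', 'abbbbbbbc'))   # between three and ten b's

# one 0-255 component of an IP address: digits are matched as characters, not as a numeric value
print(re.search(r'[01]\d\d|2[0-4]\d|25[0-5]', '188'))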

11. Scrapy

(1) Scrapy is the work of art of the crawler world.

(2)

"Reptile" python crawler

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.