Python: About crawlers (1)

Source: Internet
Author: User
Tags: file transfer protocol, server hosting

To write crawler code using Python, we need to solve the first problem:

How does Python access the Internet?

The answer is urllib, whose name is simply a combination of two parts: URL + lib.

URL: the web address we are all familiar with
lib: short for library

The general format of a URL is (parts in square brackets [] are optional):
protocol://hostname[:port]/path/[;parameters][?query]#fragment

The URL consists of three parts:

    • The first part is the protocol: http, https, ftp, file, ed2k, ...

      HTTP: Hypertext Transfer Protocol, the protocol used by World Wide Web browsers.

HTTPS: Secure Hypertext Transfer Protocol, which adds data security on top of HTTP.

FTP: File Transfer Protocol, the protocol for transferring files.

file: the file protocol is mainly used to access files on the local computer, much like opening a file in Windows Explorer. The basic format is file:///file path. For example, to open the file 1.swf in the flash folder on drive F, you can type file:///f:/flash/1.swf in the Explorer or IE address bar and press Enter.

ed2k: the eDonkey2000 network, a file-sharing network originally used to share music, movies, and software. Like most file-sharing networks, it is distributed and peer-to-peer: files are stored on users' computers rather than on a central server. Put simply, ed2k links originated with eMule, so downloading such files used to require eMule, although Thunder (Xunlei) now also supports the ed2k protocol.

    • The second part is the domain name or IP address of the server hosting the resource (sometimes including a port number; each transport protocol has a default port, e.g. HTTP's default port is 80).
    • The third part is the specific address of the resource, such as a directory or file name. This third part can often be omitted.
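
As a preview of the urllib package introduced below, Python's standard library can split an address into exactly these parts. A minimal sketch, using a made-up example URL:

from urllib.parse import urlparse

# example URL chosen purely for illustration
parts = urlparse("http://www.example.com:80/path/index.html;params?key=value#fragment")

print(parts.scheme)    # 'http'                -> the protocol (first part)
print(parts.netloc)    # 'www.example.com:80'  -> host and port (second part)
print(parts.path)      # '/path/index.html'    -> the resource path (third part)
print(parts.params)    # 'params'
print(parts.query)     # 'key=value'
print(parts.fragment)  # 'fragment'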

Python 2.x provides two separate modules, urllib and urllib2. Python 3.x reorganized them into a single package called urllib, which contains four modules:

urllib.request - for opening and reading URLs
urllib.error - contains the exceptions raised by urllib.request
urllib.parse - for parsing URLs
urllib.robotparser - for parsing robots.txt files
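
The examples below demonstrate urllib.request and urllib.parse; for completeness, here is a small hedged sketch of the other two modules (example.com is a placeholder domain, not a site from this article):

import urllib.error
import urllib.request
import urllib.robotparser

# urllib.robotparser: check whether robots.txt allows us to crawl a page
rp = urllib.robotparser.RobotFileParser("http://www.example.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://www.example.com/index.html"))

# urllib.error: catch failures raised by urllib.request
try:
    urllib.request.urlopen("http://www.example.com/does-not-exist")
except urllib.error.HTTPError as e:
    print("HTTP error:", e.code)              # e.g. 404
except urllib.error.URLError as e:
    print("could not reach the server:", e.reason)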

For example, to read the contents of a web page:

import urllib.request

response = urllib.request.urlopen("http://www.fishc.com")
html = response.read()
print(html)

The result is a bytes object. To decode it into a string:

html = html.decode('utf-8')

After decoding, the output is readable text.
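
A minimal sketch of what decoding changes (assuming, as the example does, that the page is served as UTF-8):

import urllib.request

response = urllib.request.urlopen("http://www.fishc.com")
html = response.read()
print(type(html))            # <class 'bytes'> before decoding

html = html.decode('utf-8')  # assumes the page is UTF-8 encoded
print(type(html))            # <class 'str'> after decoding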

Hands-on practice

A recommended website: placekitten.com

"A quick and simple service for getting pictures of kittens for use as placeholders in your designs or code. Just put your image size (width & height) after our URL and you'll get a placeholder."

You can get a picture of the cat you want in the following way:

Like this: http://placekitten.com/200/300
or: http://placekitten.com/g/200/300

First example: grabbing a picture

We create a new downloadcat.py file and execute the following code:

import urllib.request

response = urllib.request.urlopen("http://placekitten.com/g/320/240")
cat_img = response.read()

with open("cat_320_240.jpg", "wb") as f:
    f.write(cat_img)

In the same directory as downloadcat.py, you will find an image of size 320x240:

Breaking the code down: the parameter to urlopen() can be either a string or a Request object. In fact, the response line can be decomposed into:

req = urllib.request.Request("http://placekitten.com/g/320/240")
response = urllib.request.urlopen(req)

urlopen() actually returns an object, and we can read its contents with read(). We can also:
get the URL of the image via response.geturl();
get the header details via response.info();
get the status code via response.getcode().

print(response.geturl())
print(response.info())
print(response.getcode())
Second example: translating text with the Youdao dictionary

When we open the Youdao dictionary and type in English text, it is automatically translated into Chinese.

Open the browser's developer tools (Inspect Element), switch to the Network tab, and find the entry whose name begins with translate_o; clicking it shows its Preview information:

That is the sentence we translated. Switching to Headers, we can see the basic header information:

Among them:

Remote Address is the IP address of the server and the port it listens on
Request URL is the address that actually performs the translation
Request Method is the HTTP method used
Status Code is the status code; 200 indicates success

Request Headers are the headers sent by the client. A server typically uses them to judge whether the visitor is a human, mainly via the User-Agent field, which distinguishes browser access from programmatic access. It is easy to customize.

Form Data is the main body of the POST request; the field i holds the text to be translated.
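
As a quick illustration of what actually travels in the POST body, urllib.parse.urlencode turns such a Form Data dictionary into a key=value&key=value string (the fields below are a trimmed-down subset of the ones shown in the developer tools, used here only as an example):

import urllib.parse

# a trimmed-down subset of the Form Data fields, for illustration only
form = {
    'i': 'I love you',     # the text to translate
    'from': 'AUTO',
    'to': 'AUTO',
    'doctype': 'json',
}
body = urllib.parse.urlencode(form)
print(body)                  # i=I+love+you&from=AUTO&to=AUTO&doctype=json
print(body.encode('utf-8'))  # the bytes actually sent as the POST body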

You can see the common status codes in the notes list at this URL:
http://www.runoob.com/ajax/ajax-xmlhttprequest-onreadystatechange.html
We create a new py file and execute the following code:

import urllib.request as ur
import urllib.parse

# the Request URL taken from the Headers tab
request_url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&sessionFrom='

# the fields taken from Form Data
data = {
    'i': 'I love you',
    'from': 'AUTO',
    'to': 'AUTO',
    'smartresult': 'dict',
    'client': 'fanyideskweb',
    'salt': '1505653077725',
    'sign': '467d88b4cdc9c6adca72855020b6a1e8',
    'doctype': 'json',
    'version': '2.1',
    'keyfrom': 'fanyi.web',
    'action': 'FY_BY_CLICKBUTTION',
    'typoResult': 'true'
}
data = urllib.parse.urlencode(data).encode('utf-8')

response = ur.urlopen(request_url, data)
html = response.read().decode('utf-8')
print(html)

Take note:
Remember to remove the "_o" suffix (translate_o becomes translate) in the request URL, otherwise the server returns the error {"errorCode": 50}.

The results obtained:

From the output we can see that the data is actually in JSON format, a lightweight data-interchange format. For JSON usage, refer to:
http://www.runoob.com/json/json-tutorial.html

Add the following code:

import json

target = json.loads(html)
print(target)

The result is:

target's data type is dict, and we can access it layer by layer:

print(target['translateResult'])
print(target['translateResult'][0][0])
print(target['translateResult'][0][0]['tgt'])

This gives us the target information we want:
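
Since the service can also answer with an error payload (the {"errorCode": 50} mentioned in the note above), a slightly more defensive way of pulling out the translation might look like the following sketch. The sample string here is hypothetical; it only mimics the translateResult shape seen above:

import json

# hypothetical sample with the same shape as the real response, for illustration
html = '{"translateResult": [[{"src": "I love you", "tgt": "我爱你"}]]}'

target = json.loads(html)

if target.get('errorCode', 0) != 0:
    # assumed failure shape: a non-zero errorCode field, as in {"errorCode": 50}
    print("Translation failed, errorCode:", target['errorCode'])
else:
    print(target['translateResult'][0][0].get('tgt', ''))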

We can make simple changes to the code for a better user experience (the ellipses stand for code that is the same as before); a full consolidated version is sketched after the snippet below:

content = input("Enter the text to translate: ")
......
'i': content
......
print("Translation: %s" % target['translateResult'][0][0]['tgt'])
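
Putting the unchanged parts back in, the full interactive version might look like the sketch below; everything except the prompt and the final print line is taken verbatim from the earlier example:

import urllib.request as ur
import urllib.parse
import json

request_url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&sessionFrom='

content = input("Enter the text to translate: ")

data = {
    'i': content,
    'from': 'AUTO',
    'to': 'AUTO',
    'smartresult': 'dict',
    'client': 'fanyideskweb',
    'salt': '1505653077725',
    'sign': '467d88b4cdc9c6adca72855020b6a1e8',
    'doctype': 'json',
    'version': '2.1',
    'keyfrom': 'fanyi.web',
    'action': 'FY_BY_CLICKBUTTION',
    'typoResult': 'true'
}
data = urllib.parse.urlencode(data).encode('utf-8')

response = ur.urlopen(request_url, data)
html = response.read().decode('utf-8')
target = json.loads(html)

print("Translation: %s" % target['translateResult'][0][0]['tgt'])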

Test results:


***

Hiding the crawler

Translating in the way shown above puts a heavy burden on the server, and frequent requests will certainly get blocked. To keep the code working, we need to disguise it so that its requests look more like a browser's.

Based on the features of the developer tools described above, we already mentioned the role of User-Agent; now we need to modify it to simulate browser access.
We can find this information in the developer tools:

First method: modify the headers through the headers parameter of Request. The response line is split into two lines, and head is defined before building req.
head = {}
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
......
req = ur.Request(request_url, data, head)
response = ur.urlopen(req)
......
Second method: modify it with the Request.add_header() method. The response line is split into two lines, and the header is added after creating req.
......
req = ur.Request(request_url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')
response = ur.urlopen(req)
......

To verify: print(req.headers)

Take note:
ur is the alias declared when importing the package:
import urllib.request as ur

Modifying the User-Agent is the simplest way to hide a crawler, but if such a crawler is used to scrape pages at scale, a single IP address making continuous requests in a short period looks abnormal, puts heavy pressure on the server, and the server may respond with a CAPTCHA page that the crawler cannot solve; at that point the User-Agent trick no longer helps. There are two ways to deal with this problem.

First: delay the submission rate so the crawler behaves more like a normal user

We can do the following with the above code:

......
import time
......
# indent the main body of the code and put it inside a while loop
while True:
    content = input("Enter the text to translate (type 'q!' to quit): ")
    if content == 'q!':
        break
    ......
    time.sleep(5)   # wait five seconds before the next round

Five seconds after the first translation completes, the second translation can be submitted.

Drawback of the first approach: it is inefficient.

Second: use a proxy

Steps:
1. Build the parameter, a dictionary of the form {'type': 'proxy IP:port'}
   proxy_support = urllib.request.ProxyHandler({})
2. Customize and create an opener
   opener = urllib.request.build_opener(proxy_support)
3.1 Install the opener
   urllib.request.install_opener(opener)
3.2 Or call the opener directly (when it is only needed occasionally, you can skip installing it and call it directly)
   opener.open(url)

import urllib.request

# this site shows the visitor's source IP when accessed
url = 'http://www.whatismyip.com.tw'

# follow the steps above; proxy IP addresses can be found online
proxy_support = urllib.request.ProxyHandler({'http': '118.193.107.131:80'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
print(html)

If accessing through the proxy IP gives a 404 error, the request may have been filtered out. A simple fix is to add headers to the opener:

opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')]

To avoid access errors caused by unstable proxy IPs, we can build an ip_list and randomly pick one of its IPs for each request (a retry variant is sketched after the snippet below):

import random

ip_list = [ ]
proxy_support = urllib.request.ProxyHandler({'http': random.choice(ip_list)})
# the rest stays the same
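
Building on that idea, here is a hedged sketch that also retries with a different randomly chosen proxy when one of them fails; the addresses in ip_list are placeholders and would need to be replaced with working proxies:

import random
import urllib.error
import urllib.request

# placeholder proxies, for illustration only; replace with working 'ip:port' strings
ip_list = ['118.193.107.131:80', '127.0.0.1:8888']

url = 'http://www.whatismyip.com.tw'

for attempt in range(len(ip_list)):
    proxy = random.choice(ip_list)
    proxy_support = urllib.request.ProxyHandler({'http': proxy})
    opener = urllib.request.build_opener(proxy_support)
    opener.addheaders = [('User-Agent',
                          'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')]
    try:
        response = opener.open(url)   # call the opener directly, no install needed
        print(response.read().decode('utf-8'))
        break
    except urllib.error.URLError as e:
        print("Proxy %s failed: %s" % (proxy, e))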
