Python Crawler Tutorial - 32 - Scrapy Crawler Framework Project settings.py Introduction

Source: Internet
Author: User

This article introduces the project development process and the configuration and use of the settings.py file.

    • Use of the settings.py file
    • For more details on the settings.py file, see the Chinese documentation:
      • https://scrapy-chs.readthedocs.io/zh_CN/latest/topics/settings.html
Configuring USER_AGENTS in Settings
  • Much of the settings.py file is commented out by default; when we need one of those settings, we write in our own content, following the hints in the comments.
  • For example, suppose we want to set up a user-agent list:
      • Find USER_AGENT in the settings.py file and add the common user-agent values below it
      • In settings.py it is only a single commented line with no actual values; to use it, we have to fill in the list ourselves
      • That means collecting common browser User-Agent strings from the Internet. I have already gathered some; to use them, just copy the list below:
    User_agents = ["mozilla/5.0 (compatible; MISE 9.0; Windows NT 6.1; Win64; x64; trident/5.0;. NET CLR 3.5.30729;. NET CLR 3.0.30729;. NET CLR 2.0.5.727; Media Center PC 6.0) "," mozilla/5.0 (Windows NT 6.1; WOW64) applewebkit/537.36 (khtml, like Gecko) chrome/39.0.2171.71 safari/537.36 "," mozilla/5.0 (X11; Linux x86_64) applewebkit/537.11 (khtml, like Gecko) chrome/23.0.1271.64 safari/537.11 "," mozilla/5.0 (Windows; U Windows NT 6.1; En-US) applewebkit/534.16 (khtml, like Gecko) chrome/10.0.648.133 safari/534.16 "," mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) gecko/20100101 firefox/34.0 "," mozilla/5.0 (X11; U Linux x86_64; ZH-CN; rv:1.9.2.10) gecko/20100922 ubuntu/10.10 (Maverick) firefox/3.6.10 "," mozilla/5.0 (Windows NT 6.1; WOW64) applewebkit/537.36 (khtml, like Gecko) chrome/39.0.2171.95 safari/537.36 opr/26.0.1656.60 "," opera/8.0 (Windows N T 5.1; U EN) "," mozilla/5.0 (Windows NT 5.1; U En rv:1.8.1) gecko/20061208 firefox/2.0.0 Opera 9.50 "," mozilla/4.0 (compatible; MSIE 6.0; WindowS NT 5.1; EN) Opera 9.50 "," mozilla/5.0 (Windows NT 6.1; WOW64) applewebkit/534.57.2 (khtml, like Gecko) version/5.1.7 safari/534.57.2 "," mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; trident/5.0; SLCC2;. NET CLR 2.0.50727;. NET CLR 3.5.30729;. NET CLR 3.0.30729; Media Center PC 6.0;. net4.0c;. net4.0e; qqbrowser/7.0.3698.400) "," mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Qqdownload 732;. net4.0c;. NET4.0E) ",]
  • Copy this list directly into your settings.py file and it is ready to use.
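  • The list by itself does nothing until something reads it. Below is a minimal sketch of a downloader middleware that picks a random entry from USER_AGENTS for each request; the class name RandomUserAgentMiddleware and the module path are my own for illustration, not from this tutorial:

      import random

      class RandomUserAgentMiddleware(object):
          def __init__(self, user_agents):
              self.user_agents = user_agents

          @classmethod
          def from_crawler(cls, crawler):
              # read the USER_AGENTS list defined above in settings.py
              return cls(crawler.settings.getlist('USER_AGENTS'))

          def process_request(self, request, spider):
              # present a different browser identity on every request
              request.headers['User-Agent'] = random.choice(self.user_agents)

  • Enable it by adding 'myproject.middlewares.RandomUserAgentMiddleware' to DOWNLOADER_MIDDLEWARES in settings.py (replace myproject with your own project name).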
Configuring PROXIES in Settings
    • For more information about proxy IPs, see: Python Crawler Tutorial - 11 - Proxy IP, hiding your address (Maoyan Movie)
    • Websites where you can get proxy IPs:
      • www.goubanjia.com
      • www.xicidaili.com
    • Find working IPs on those sites, copy them, and then put code like the following into settings.py:
    # Proxy IPs are generally valid for about 20 days; get fresh ones from the sites above
    PROXIES = [
        {'ip_port': '177.136.120.174:80', 'user_passwd': 'user1:pass1'},
        {'ip_port': '218.60.8.99:3129', 'user_passwd': 'user2:pass2'},
        {'ip_port': '206.189.204.62:8080', 'user_passwd': 'user3:pass3'},
        {'ip_port': '125.62.26.197:3128', 'user_passwd': 'user4:pass4'},
    ]
    • Settings like these are written once and can then be reused across the project.
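    • As with the user-agent list, PROXIES takes effect only through a downloader middleware. Here is a minimal sketch; the class name RandomProxyMiddleware is my own for illustration, and the tutorial itself does not show this code:

      import base64
      import random

      class RandomProxyMiddleware(object):
          def process_request(self, request, spider):
              # pick a random proxy from the PROXIES list in settings.py
              proxy = random.choice(spider.settings.getlist('PROXIES'))
              request.meta['proxy'] = 'http://' + proxy['ip_port']
              if proxy.get('user_passwd'):
                  # proxies that need a login expect HTTP Basic credentials
                  creds = base64.b64encode(proxy['user_passwd'].encode()).decode()
                  request.headers['Proxy-Authorization'] = 'Basic ' + creds

    • Like the user-agent middleware, it has to be registered in DOWNLOADER_MIDDLEWARES before Scrapy will call it.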
About deduplication
    • Many sites carry the same content; introductions to Python crawlers, for example, exist in large numbers. When we crawl into such pages, we want Scrapy's built-in duplicate filtering to keep the crawler from fetching the same URLs over and over without limit.
    • To keep the crawler out of this kind of dead loop:
      • In the spider's parse function, when yielding a Request, pass the parameter dont_filter=False (this is also the default), so that already-seen URLs are filtered out:
      class MySpider(scrapy.Spider):
          def parse(self, response):
              ...
              yield scrapy.Request(url=url, callback=self.parse, dont_filter=False)
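    • For a fuller picture, here is a minimal self-contained spider that follows every link it finds; the spider name and start URL are placeholders of mine, not from the tutorial. Because dont_filter stays False, each URL is downloaded only once even if pages link to each other in cycles:

      import scrapy

      class QuotesSpider(scrapy.Spider):
          name = 'quotes'                               # hypothetical spider name
          start_urls = ['http://quotes.toscrape.com/']  # hypothetical start page

          def parse(self, response):
              for href in response.css('a::attr(href)').getall():
                  # dont_filter=False (the default): the scheduler's duplicate
                  # filter silently drops any URL it has already seen
                  yield response.follow(href, callback=self.parse, dont_filter=False)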
    • How to use Selenium in Scrapy
      • Put the call in the process_request function of a downloader middleware
      • Call Selenium in that function and, after the page has finished loading, return the result as a Response
      from scrapy.http import HtmlResponse
      from selenium import webdriver

      class MyMiddleWare(object):
          def process_request(self, request, spider):
              driver = webdriver.Chrome()
              driver.get(request.url)      # load the page in a real browser
              html = driver.page_source    # grab the rendered HTML
              driver.quit()
              # returning a Response here skips the normal Scrapy download
              return HtmlResponse(url=request.url, encoding='utf-8',
                                  body=html, request=request)
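    • For Scrapy to actually call this middleware, register it in settings.py. A sketch, assuming the project module is called myproject (that path is my assumption):

      DOWNLOADER_MIDDLEWARES = {
          # lower numbers sit closer to the engine and see requests first
          'myproject.middlewares.MyMiddleWare': 543,
      }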
    • Next link: Python Crawler Tutorial - 33 - scrapy shell usage and Scrapy crawler framework examples
    • Bye
    • This note may not be reprinted by any person or organization.
