Python Crawler Tutorial -32- Scrapy Crawler Framework Project settings.py Introduction

Source: Internet
Author: User

This article introduces the project development process and the configuration and use of the settings.py file

    • Use of settings.py files
    • For more details on the settings.py file, see the Chinese documentation:
      • https://scrapy-chs.readthedocs.io/zh_CN/latest/topics/settings.html
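As a quick orientation before the specific examples below, the settings.py file generated by scrapy startproject is mostly commented-out hints; a few of the most commonly adjusted options look roughly like this (the setting names are real Scrapy settings, but the values shown are illustrative, not Scrapy's defaults):

```python
# settings.py -- a few commonly adjusted options (illustrative values)
BOT_NAME = "myspider"

# Whether to obey robots.txt rules
ROBOTSTXT_OBEY = True

# Throttle requests to be polite to the target site (seconds between requests)
DOWNLOAD_DELAY = 1

# Cap how many requests Scrapy performs concurrently
CONCURRENT_REQUESTS = 16
```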
Configuring USER_AGENTS in Settings
  • Most entries in the settings.py file are commented out by default; when we need one, we uncomment it and fill in our own values, following the hints in the comments.
  • For example:
  • We want to set up a USER_AGENTS list
      • Find USER_AGENT in the settings.py file and look at the commented-out example below it
      • The default file contains only a single line with no actual values, so to use it we have to fill them in ourselves
      • This means finding common browser User-Agent strings on the Internet; I collected some below, which you can copy and use directly
    USER_AGENTS = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Opera/8.0 (Windows NT 5.1; U; en)",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    ]
  • Copy this code directly into the settings.py file to use it
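To actually rotate through the USER_AGENTS list on each request, a common approach is a small downloader middleware. The sketch below is a minimal, hypothetical example (the class name and USER_AGENTS setting are this tutorial's conventions, not Scrapy built-ins); it would still need to be enabled in DOWNLOADER_MIDDLEWARES:

```python
import random

# Minimal sketch of a downloader middleware that sets a random
# User-Agent from the USER_AGENTS list defined in settings.py.
class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENTS list from the project settings
        return cls(crawler.settings.getlist("USER_AGENTS"))

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        request.headers["User-Agent"] = random.choice(self.user_agents)
```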
Configuring PROXIES in Settings
    • For more information about proxy IP, see: Python crawler tutorial -11-proxy proxy IP, hidden address (cat's eye movie)
    • Websites that list proxy IPs:
      • www.goubanjia.com
      • www.xicidaili.com
    • Find an available IP on those sites, copy it, and then add the following code to settings.py:
    # Proxy IPs are generally valid for about 20 days; get fresh IPs from the sites above
    PROXIES = [
        {'ip_port': '177.136.120.174:80', 'user_passwd': 'user1:pass1'},
        {'ip_port': '218.60.8.99:3129', 'user_passwd': 'user2:pass2'},
        {'ip_port': '206.189.204.62:8080', 'user_passwd': 'user3:pass3'},
        {'ip_port': '125.62.26.197:3128', 'user_passwd': 'user4:pass4'},
    ]
    • Settings like these only need to be made once and can then be reused across the project.
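A PROXIES list on its own does nothing; a downloader middleware has to pick one and attach it to each request. The sketch below is a minimal, hypothetical example assuming the PROXIES list above is defined in settings.py (class and setting names are illustrative):

```python
import base64
import random

# Minimal sketch of a downloader middleware that assigns a random
# proxy from the PROXIES list defined in settings.py.
class RandomProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist("PROXIES"))

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        # Tell Scrapy's HTTP downloader which proxy to route through
        request.meta["proxy"] = "http://" + proxy["ip_port"]
        if proxy.get("user_passwd"):
            # Proxies that require auth expect Basic credentials
            creds = base64.b64encode(proxy["user_passwd"].encode()).decode()
            request.headers["Proxy-Authorization"] = "Basic " + creds
```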
About deduplication
    • Many sites host the same content — for example, there are countless pages introducing Python crawlers — and when crawling them we need a way to keep the spider from re-fetching pages it has already seen
    • To prevent the crawler from getting into an infinite loop, we use Scrapy's built-in duplicate filter
      • That is, in the spider's parse function, when yielding a Request, pass the dont_filter = False parameter (this is the default, and it keeps the duplicate filter enabled)
      class MySpider(scrapy.Spider):
          def parse(self, response):
              ...
              yield scrapy.Request(url=url, callback=self.parse, dont_filter=False)
    • How to use Selenium in Scrapy
      • It can be placed in the process_request function of a downloader middleware
      • Call Selenium in that function and return a Response once the page has loaded
      # requires: from selenium import webdriver; from scrapy.http import HtmlResponse
      class MyMiddleware(object):
          def process_request(self, request, spider):
              driver = webdriver.Chrome()
              driver.get(request.url)
              html = driver.page_source
              driver.quit()
              return HtmlResponse(url=request.url, encoding='utf-8', body=html, request=request)
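A middleware like this only takes effect once it is registered in settings.py; the module path below is illustrative and must match your own project layout:

```python
# settings.py -- register a custom downloader middleware
# (the dotted path is hypothetical; the number sets its position in the chain)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.MyMiddleware": 543,
}
```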
    • Next Link: Python crawler tutorial -33-scrapy shell usage and scrapy crawler framework Examples
    • Bye
    • No person or organization may reprint this note without permission

