I. Main ideas
- Scrapy crawls the course addresses and names
- multiprocessing downloads the videos
- Only a small batch of videos needed crawling, so the code is deliberately simple
- This is mainly to share the approach; it has not been polished through much real-world use
II. File descriptions
- items.py defines the Scrapy fields (sketch after this list)
- pipelines.py stores the results in the database (sketch after this list)
- settings.py holds the Scrapy configuration; pay attention to the DEFAULT_REQUEST_HEADERS setting, which must impersonate a logged-in session (sketch after this list)
- mz.py is the main spider with the basic crawling logic, built on CSS and XPath selectors (sketch after this list)
- start_urls = ["http://www.maiziedu.com/course/web/"] crawls only the web category; adjust it as needed, or crawl all categories
- I originally wanted to skip the database and download directly in mz.py, but that would drag down Scrapy's crawling performance, so downloading is done separately
- down.py downloads with multiprocessing; the original plan was to dynamically monitor Scrapy's results in the database and share state between processes, but several rounds of debugging still had problems, so I fell back on the cruder Pool.map() approach (sketch after this list)
- mz.json held the results as JSON at first, but reading the JSON file back and forth hurt efficiency, so it was replaced with the database
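A minimal sketch of what items.py might look like; the field names (name, url, video_url) are assumptions for illustration, not the original project's exact schema:

```python
# items.py -- field definitions for the crawled data.
# NOTE: the field names below are assumed for illustration.
import scrapy

class CourseItem(scrapy.Item):
    name = scrapy.Field()       # course name
    url = scrapy.Field()        # course page address
    video_url = scrapy.Field()  # resolved video download address
```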
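A sketch of the storage pipeline; the notes do not say which database was used, so SQLite and the table/column names are assumed here purely for illustration:

```python
# pipelines.py -- writes each crawled item into a database.
# NOTE: SQLite and the courses table schema are assumptions for this sketch.
import sqlite3

class MzPipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("mz.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS courses (name TEXT, url TEXT, video_url TEXT)"
        )

    def process_item(self, item, spider):
        # insert one row per crawled course
        self.conn.execute(
            "INSERT INTO courses VALUES (?, ?, ?)",
            (item.get("name"), item.get("url"), item.get("video_url")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```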
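The DEFAULT_REQUEST_HEADERS point might look like the following; the User-Agent and Cookie values are placeholders to be copied from a real logged-in browser session:

```python
# settings.py -- the part relevant to impersonating a logged-in session.
# NOTE: the header values are placeholders; copy real ones from your
# browser after logging in to maiziedu.com.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Cookie": "sessionid=<your-session-cookie>",
}

# enable the storage pipeline (module path assumed for this sketch)
ITEM_PIPELINES = {"mz.pipelines.MzPipeline": 300}
```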
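The spider's overall shape, assuming a standard Scrapy project layout; the CSS/XPath selectors are illustrative, since the real maiziedu.com page structure is not reproduced in these notes:

```python
# mz.py -- the main spider: CSS to locate course links, XPath to extract
# text and attributes. Selectors are illustrative placeholders.
import scrapy
from ..items import CourseItem  # assumes the standard Scrapy project layout

class MzSpider(scrapy.Spider):
    name = "mz"
    start_urls = ["http://www.maiziedu.com/course/web/"]  # web category only

    def parse(self, response):
        # the real spider would also follow each course page to resolve
        # the actual video address before yielding the item
        for course in response.css("a.course-link"):
            item = CourseItem()
            item["name"] = course.xpath("string(.)").get()
            item["url"] = response.urljoin(course.xpath("./@href").get())
            yield item
```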
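And the Pool.map() download step might look like this; the table and column names follow the assumed schema above, and urllib handles the actual transfer:

```python
# down.py -- downloads the crawled video addresses with a process pool.
# NOTE: the database schema and pool size are assumptions for this sketch.
import os
import sqlite3
from multiprocessing import Pool
from urllib.request import urlretrieve

def download(task):
    name, video_url = task
    filename = "%s.mp4" % name.replace("/", "_")  # sanitize the file name
    if not os.path.exists(filename):              # skip finished downloads
        urlretrieve(video_url, filename)

if __name__ == "__main__":
    conn = sqlite3.connect("mz.db")
    tasks = conn.execute("SELECT name, video_url FROM courses").fetchall()
    conn.close()
    with Pool(4) as pool:      # the "rough" Pool.map() approach from the notes
        pool.map(download, tasks)
```

Pool.map() simply blocks until every download finishes, which is exactly why no cross-process state sharing or monitoring is needed.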
III. Results
- Source code: https://yunpan.cn/crjn7J97xUD8F (access password: 6219)
- Videos: https://yunpan.cn/crjXKLGnkpzPk (access password: 6C15)
Python: crawl and download all the Maizi Academy (maiziedu.com) video tutorials