This walkthrough scrapes the Top 250 movie list from Douban.
First, open PyCharm.
In the terminal at the bottom of the PyCharm window, enter scrapy startproject Douban
This generates the following files: a spiders directory (containing an __init__.py), plus __init__.py, items.py, middlewares.py, pipelines.py, and settings.py.
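As a quick sanity check on the layout described above, here is a minimal sketch that reports which of the expected files are missing. The helper name missing_project_files and the path Douban/Douban are assumptions for illustration; adjust the path to wherever you ran startproject.

```python
from pathlib import Path

# Files that `scrapy startproject Douban` is expected to create inside the
# inner package directory (assumed here to be Douban/Douban/).
EXPECTED_FILES = [
    "__init__.py",
    "items.py",
    "middlewares.py",
    "pipelines.py",
    "settings.py",
    "spiders/__init__.py",
]


def missing_project_files(package_dir):
    """Return the expected files that are not present under package_dir."""
    root = Path(package_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical path: run this from the directory where startproject was run.
    print(missing_project_files("Douban/Douban"))
```

An empty list means every file named above is in place.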
From the first article, we know that only three parts of a Scrapy project need our attention: items.py, settings.py, and a spider file that we create ourselves.
First, open items.py.
items.py is where we define the data structure, that is, the fields we are going to store.
We need the serial number, movie name, movie introduction, star rating, movie comment, and movie description.
The generated items.py contains a commented-out example, # name = scrapy.Field(); define each field you want in that same format.
Next, we edit settings.py.
In settings.py, first find ROBOTSTXT_OBEY = True.
Our crawl is not going to follow this rule, so the first step is to change True to False.
The second step is to change DOWNLOAD_DELAY = 3 to DOWNLOAD_DELAY = 0.5 (uncomment the line if it is commented out).
This shortens the wait between requests, so the crawl runs faster.
The most important setting is USER_AGENT.
Go to the target website: https://movie.douban.com/top250
Press F12 to open the developer tools, then press F5 to refresh the page. In the list of requests, find the top250 entry, which is the request for the page HTML.
Click top250 and scroll down through its request headers to find User-Agent.
Copy its value into USER_AGENT in settings.py. With that, settings.py is complete.
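Put together, the changed lines in settings.py might look like the sketch below. The USER_AGENT string is a placeholder, not a real value; replace it with the one copied from your own browser.

```python
# settings.py (only the lines changed in this walkthrough)

# Do not follow robots.txt for this crawl.
ROBOTSTXT_OBEY = False

# Wait 0.5 seconds between requests instead of 3.
DOWNLOAD_DELAY = 0.5

# Placeholder: paste the User-Agent value copied from the browser here.
USER_AGENT = "Mozilla/5.0 (replace with the value copied from your browser)"
```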
Create the spider file with the following command:
scrapy genspider <spider name> <domain name>
This generates a spider file.