The first thing to understand is what crawling web pages actually involves:
1. Find a list of URLs that contain the information we need
2. Download each page over the HTTP protocol
3. Parse the required information out of the page's HTML
4. Extract further URLs from the page and go back to step 2 to continue
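The loop above can be sketched in Python. This is only an illustration of the control flow: `fetch()` and `parse()` are stand-ins for the HTTP download and HTML parsing steps, and the tiny in-memory `SITE` data is invented so the snippet runs on its own.

```python
from collections import deque

# SITE simulates a small paginated movie list so the crawl loop
# can run without network access. In a real crawler, fetch() would
# issue an HTTP GET (step 2) and parse() would extract data and
# links from the HTML (step 3).
SITE = {
    "/list?page=1": {"movies": ["Movie A", "Movie B"], "next": ["/list?page=2"]},
    "/list?page=2": {"movies": ["Movie C"], "next": []},
}

def fetch(url):
    return SITE[url]  # stand-in for downloading the page (step 2)

def parse(page):
    return page["movies"], page["next"]  # stand-in for HTML parsing (step 3)

def crawl(start_urls):
    seen, queue, movies = set(start_urls), deque(start_urls), []
    while queue:                       # step 1: URLs that hold the data
        page = fetch(queue.popleft())  # step 2: download the page
        found, links = parse(page)     # step 3: extract the information
        movies.extend(found)
        for url in links:              # step 4: new URLs, back to step 2
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return movies

print(crawl(["/list?page=1"]))  # → ['Movie A', 'Movie B', 'Movie C']
```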
Second, we must understand what makes a good list page:
- its URLs contain enough movies
- all movies can be traversed by following its pages
- it is sorted by update time, so the most recently updated movies are caught faster
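The last criterion deserves a sketch: when the list is sorted by update time, the crawler can stop paging as soon as it reaches a movie it has already stored, because every later page is older still. The `PAGES` data below is made up for illustration, with the newest movies first.

```python
# PAGES simulates a list sorted by update time, newest first.
PAGES = [
    ["Movie E", "Movie D"],  # page 1: most recently updated
    ["Movie C", "Movie B"],  # page 2
    ["Movie A"],             # page 3: oldest
]

def crawl_updates(known):
    """Collect movies newer than anything in `known`, stopping early."""
    new = []
    for page in PAGES:
        fresh = [m for m in page if m not in known]
        new.extend(fresh)
        if len(fresh) < len(page):  # hit an already-seen movie: all
            break                   # remaining pages are older, stop
    return new

print(crawl_updates({"Movie A", "Movie B", "Movie C"}))  # → ['Movie E', 'Movie D']
```

With an unsorted list, the crawler would have to visit every page on every run just to find the handful of new movies.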
Simulating the process shows that the Douban site cannot be crawled in full at once; it can only be crawled category by category.
Use the tool pyspider
Analyze the implementation code, test it with a simulated run, and crawl the latest movie information for each category from its time-sorted list
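A sketch of what such a per-category handler might look like is below. Everything in it is illustrative: the category tags, the URL pattern, and the `sort=time` parameter are assumptions, not Douban's real markup, and tiny stubs stand in for pyspider's `BaseHandler` and `@every` decorator so the snippet runs without pyspider installed. In real pyspider code, `on_start` would call `self.crawl(url, callback=self.index_page)` for each URL instead of returning them.

```python
class BaseHandler:        # stub for pyspider.libs.base_handler.BaseHandler
    pass

def every(minutes=0):     # stub for pyspider's @every scheduling decorator
    def wrap(func):
        return func
    return wrap

CATEGORIES = ["drama", "comedy", "sci-fi"]  # hypothetical category tags

class Handler(BaseHandler):
    @every(minutes=24 * 60)  # re-run daily to pick up updates
    def on_start(self):
        # One entry URL per category; "?sort=time" is an assumed
        # parameter meaning "sorted by update time", so the newest
        # movies in each category surface first.
        return ["https://movie.example.com/tag/%s?sort=time" % tag
                for tag in CATEGORIES]

print(Handler().on_start())
```

Crawling one time-sorted list per category is what lets the crawler cover the whole site despite only being able to crawl it category by category.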
Decompose the code so that team members who join later can understand it easily
Team members: Zhang Wenjan, Repeti
Team - Zhang Wenjan - Requirements Analysis - Python crawler for category-based crawling of Douban movie information