Who hasn't read a few novels! Using Python to crawl an entire novel site!

Source: Internet
Author: User
Tags: xpath

Then send the request, store the result in the variable response, and take a look at it with the read() method. Note that the bytes must be decoded as UTF-8 to save yourself from garbled text:
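A minimal sketch of that step (the URL here is a placeholder, since the article never shows its target site):

```python
from urllib.request import urlopen

# hypothetical chapter page; substitute the real novel site's URL
url = "http://example.com/novel/chapter-1.html"
response = urlopen(url)

# read() returns raw bytes; decoding them as UTF-8 avoids garbled text
html = response.read().decode("utf-8")
```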

Print html and take a look at the result:

You'll see a big block like this; compare it with the web page's source and you'll find they match.

Failure status

Success status

After adding the data path, a successful start shows mongod waiting for connections on port 27017, which means the database is up. Do not close that terminal afterwards: the database stays connected only while it runs, and MongoDB must keep running (or you'll have no idea how things died).
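If you'd rather confirm the connection from Python than stare at the terminal, a quick pymongo check looks something like this (pymongo itself is my assumption here; the article doesn't show this step):

```python
from pymongo import MongoClient

# connect to the mongod instance listening on the default port 27017
client = MongoClient("localhost", 27017)

# server_info() raises ServerSelectionTimeoutError if mongod isn't running
print(client.server_info()["version"])
```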

OK, once the database is running, let's wire it up to the editor so we can work with it interactively. The setting is well hidden: add the Mongo Plugin component under File >> Settings >> Plugins, and download it if you don't have it yet:

If you decide to use XPath, we need to import the etree module from lxml, after which we can parse the page with etree's HTML() method. Copy the path of the data we need from the page via Inspect Element (F12); I chose the title and the content of each chapter of the novel:

Path: //div[@class="readAreaBox content"]/h1/text()

Path: /html/body/div[4]/div[2]/div[2]/div[1]/div[2]/text()

Note another pitfall: when you copy an XPath from the browser, you get this:

//div[@class="readAreaBox content"]/h1

and this:

/html/body/div[4]/div[2]/div[2]/div[1]/div[2]

But what you need is the text inside that path, so you have to append /text() to it yourself, which gives the paths shown above. On to the code to check the data:
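Putting the pieces together, a sketch of the parsing step might look like this (html is the UTF-8 string fetched earlier; the two paths are the ones copied above, and the class name's exact casing may differ on your page):

```python
from lxml import etree

# build an element tree from the raw HTML string fetched earlier
page = etree.HTML(html)

# note the trailing /text() on both paths: without it xpath() returns
# elements rather than the strings inside them
title = page.xpath('//div[@class="readAreaBox content"]/h1/text()')
content = page.xpath('/html/body/div[4]/div[2]/div[2]/div[1]/div[2]/text()')

print(title)
print(content)
```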

For the full code, see the Baidu Netdisk share: private-message Small 02 to get the cloud disk address.

The novel is a bit big, 3,500 chapters in all, so the run takes about 4-7 minutes. Open the novel's folder and you can see the complete downloaded novel, with no page-flipping needed. On the database side, the link for each chapter is backed up, numbered automatically from zero: to read chapter 30, open the link with serial number 29. That's just the download order; the author was lazy, so readers who want something friendlier can change it themselves.
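The zero-based bookkeeping described above might look roughly like this; the database, collection, and field names are invented for illustration, not taken from the author's code:

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
links = client["novel"]["chapter_links"]  # made-up database/collection names

chapter_urls = [
    "http://example.com/chapter/a.html",  # placeholder chapter links
    "http://example.com/chapter/b.html",
]

# enumerate() starts at 0, which is why chapter 30 ends up at number 29
for num, link in enumerate(chapter_urls):
    links.insert_one({"num": num, "link": link})
```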

Novel texts

Database connection

Look at that, feels pretty good. With the small example finished, we're ready to get to the main point.

Scrapy installed successfully

Then the usual routine: since you don't want to hunt down the path bit by bit in every terminal, add the root directory to your environment variables. Then open a terminal and test whether the installation succeeded:
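Typing scrapy on its own in the terminal prints its usage and available commands; an alternative quick check from inside Python (assuming Scrapy was installed into the active interpreter) is:

```python
# if this prints a version number, Scrapy is importable and installed
import scrapy

print(scrapy.__version__)
```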

Scrapy installation successful

OK, after installing, open a terminal and create a new Scrapy project (e.g. with scrapy startproject reading; running bare scrapy lists its various subcommands, which I won't go through here). A fully scaffolded Scrapy project folder now appears on the D drive:

Open the folder and we'll see that the Scrapy framework has automatically placed all the raw materials we need inside the reading folder:

Open the inner reading folder; the crawler's .py code files are added under the spiders folder:

We're crawling the novel leaderboards here. Besides the spider file we write ourselves, we also define the set of things we want to crawl in items.py; it works a bit like a dictionary. The names can be anything you like, but the inherited class scrapy.Item must not be changed: it's a class Scrapy defines internally, and if you change it, Scrapy can't find it. The spider is just our single-novel crawl from above with a for loop added, very simple. Without further ado, a sketch:
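The field names, XPaths, and URLs below are invented for illustration; the author's actual code is only available through the links further down.

```python
# items.py -- the dictionary-like container described above; the field
# names are free to choose, but the scrapy.Item base class must stay
import scrapy

class ReadingItem(scrapy.Item):
    name = scrapy.Field()  # novel title
    link = scrapy.Field()  # link to the novel's page
```

```python
# spiders/reading_spider.py -- the earlier single-page crawl wrapped in
# a for loop; the start URL and XPath expressions are placeholders
import scrapy
from reading.items import ReadingItem

class ReadingSpider(scrapy.Spider):
    name = "reading"
    start_urls = ["http://example.com/rank/list.html"]  # hypothetical leaderboard

    def parse(self, response):
        # one <li> per novel on the leaderboard, roughly 20 per page
        for row in response.xpath('//ul[@class="rank"]/li'):
            item = ReadingItem()
            item["name"] = row.xpath("./a/text()").get()
            item["link"] = row.xpath("./a/@href").get()
            yield item
```

Running scrapy crawl reading -o novels.json would then write the collected items out as the .json files shown below.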

Crawler files

Crawling the novel leaderboards

About 20 novels on each leaderboard

The crawled data for each novel (in .json format)

Join the group: 125240963 to get the source code
