[Code] Python crawler practice: crawling a whole-site novel ranking list
Anyone who likes reading web novels knows that some of them are simply addictive: whether xianxia or xuanhuan, within a few dozen chapters they attract a huge fan base and climb the rankings. Below are a few examples of such hot lists:
The new Biquge is a popular online novel reading site among many readers. The site hosts current ...... I will not advertise it here (any other site with a similar structure works just as well). I wrote a simple chapter-crawling example before, but the result was not ideal and left a lot of unwanted markup behind; see http://python.jobbole.com/88560/. This article crawls thousands of novels from that site, and I would like to share some crawling ideas and a few common pitfalls.
Outline of this article:
1. First build a single-book crawler as practice;
2. Briefly discuss several error-prone points when installing the MongoDB database;
3. Use the Scrapy framework to crawl all the ranking lists on the new Biquge site.
I. Crawling a single book
Crawling this site is relatively easy. Open your editor (PyCharm is recommended; it is very powerful). First import the urllib.request module (under Python 2.x you would import urllib and urllib2 instead; that version is also written out for reference), supply the site's URL, and build the request. Then add the request headers. This site does not block crawlers, but I suggest getting into the habit of writing request headers every time; if you ever hit a site like Douban, you will be banned without them:
Then send the request and store the result in a variable response. Inspect it with the read() method, and note that the bytes must be decoded as UTF-8 to avoid garbled characters:
Print the result:
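The original code was shown as screenshots; a minimal sketch of these steps, assuming a placeholder URL and an ordinary browser User-Agent (neither taken from the original), might look like this:

from urllib import request

# Placeholder URL of a chapter page on the target site; replace it with a real one.
url = 'http://www.example.com/novel/chapter-1.html'
# An ordinary browser User-Agent so the request looks like a normal visit.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Build the request with headers attached, send it, then read and decode
# the body as UTF-8 to avoid garbled characters.
req = request.Request(url, headers=headers)
response = request.urlopen(req)
html = response.read().decode('utf-8')
print(html)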
The print produces a large wall of HTML; compare it with the page source in the browser and you will find that they match.
This step matters because it shows that the site does not load its content asynchronously with AJAX; otherwise we would have to start capturing packets, which we will cover when analyzing dynamic sites and which I only recommend when there is no better option. I remember there are more direct ways to tell the two cases apart, but I have forgotten them; please let me know if you do.
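As a rough sanity check (a sketch only; the marker string below is hypothetical), you can test whether a piece of text that is visible on the rendered page actually appears in the raw HTML you just fetched. If it does, the page is rendered server-side; if not, the content is probably loaded asynchronously:

# 'html' is the decoded response from the previous step.
# Replace the marker with any text you can see on the rendered page,
# for example the heading of chapter one.
marker = '第一章'  # hypothetical example

if marker in html:
    print('Found in raw HTML: content is rendered server-side.')
else:
    print('Not found: content is probably loaded via AJAX.')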
Now that we have the site's response, the next step is to parse it and extract the data we want. Since we are going to crawl a large number of novels, it would be a real shame not to store them in a database. I recommend MongoDB, a NoSQL database focused on document storage, which is a great fit for storing novels. Installation does take a few steps; for the download and installation tutorial, see the link: mongodb. One pitfall: mongod cannot simply be started right after installation; you must set the dbpath correctly first, otherwise starting it is very likely to fail:
Failure status
Successful status
Once the dbpath is set correctly, the connection succeeds. If you see "waiting for connections on port 27017", the database is up. Do not close that terminal afterwards: the database must stay running for MongoDB to be used (otherwise you will get errors and have no idea why).
With the database running, we link it to the editor. The setting is rather well hidden: under File > Settings > Plugins, add the Mongo Plugin component, and download it if it is not already installed:
In the editor, we import pymongo, the Python module for interacting with MongoDB, and connect to the MongoDB port at the top of the script (the default is 27017). We then create a database named reading and, inside it, a collection named sheet_words. The code is as follows:
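A minimal sketch of that code, assuming MongoDB is running locally on the default port and using the database and collection names mentioned above:

import pymongo

# Connect to the local MongoDB instance on the default port 27017.
client = pymongo.MongoClient('localhost', 27017)

# Create (or reuse) a database called 'reading' and, inside it,
# a collection called 'sheet_words'.
reading = client['reading']
sheet_words = reading['sheet_words']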
First let's practice on a single novel, "Shura Wushen". Personally I hate turning pages while reading: ads keep popping up and I have to go back and flip through the pages again. Being lazy, I figured it would be great to have the whole novel in one document, and copy-pasting it chapter by chapter is out of the question; this is where a crawler shines. So our task now is to crawl the complete "Shura Wushen" and back it up in the database. Back to where we left off: once we have the response, we need a way to parse the page. The usual options are re, xpath, and CSS selectors; I recommend xpath over re. First, regular expressions are easy to get wrong ("When you decide to solve a problem with a regular expression, you now have two problems"), whereas xpath is clear and safe step by step. Second, you can copy the xpath directly from Firefox, Chrome, and other browsers, which greatly reduces the workload:
Having decided on xpath, we import the etree module from lxml and parse the page with etree.HTML(). Then, in the browser's element inspector (F12), copy the paths of the data we need; I chose the title and the body of each chapter:
Path: //div[@class="readAreaBox content"]/h1/text()
Path: /html/body/div[4]/div[2]/div[2]/div[1]/div[2]/text()
Note another pitfall here: what you get when you copy the xpath from the browser is this:
//div[@class="readAreaBox content"]/h1
and this:
/html/body/div[4]/div[2]/div[2]/div[1]/div[2]
But what we want is the text inside those nodes, so we must append /text() to each path, as shown above. Now on to the code to check the data:
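Putting the pieces together, here is a hedged sketch of the single-book crawler. The two XPath expressions are the ones copied above; the chapter URL, the output file name, and the helper functions are illustrative assumptions, not the author's exact code:

from urllib import request
from lxml import etree
import pymongo

# Database connection as set up earlier.
client = pymongo.MongoClient('localhost', 27017)
sheet_words = client['reading']['sheet_words']

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def fetch(url):
    # Download a page and return it as a decoded string.
    req = request.Request(url, headers=headers)
    return request.urlopen(req).read().decode('utf-8')

def parse_chapter(url):
    # Extract the chapter title and body text with the XPaths copied above.
    html = etree.HTML(fetch(url))
    title = html.xpath('//div[@class="readAreaBox content"]/h1/text()')
    content = html.xpath('/html/body/div[4]/div[2]/div[2]/div[1]/div[2]/text()')
    return ''.join(title).strip(), '\n'.join(t.strip() for t in content)

# Hypothetical chapter URL, for illustration only.
chapter_url = 'http://www.example.com/shura-wushen/chapter-1.html'
title, text = parse_chapter(chapter_url)

# Back the chapter up in MongoDB and append it to one text file,
# so the whole novel ends up in a single document with no page turning.
sheet_words.insert_one({'url': chapter_url, 'title': title, 'content': text})
with open('shura_wushen.txt', 'a', encoding='utf-8') as f:
    f.write(title + '\n' + text + '\n\n')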
The complete code is on the Baidu netdisk:
Link: https://pan.baidu.com/s/1jhynf86  Password: ho9d
The novel is fairly big, three thousand five hundred chapters in total, so expect to wait about 4-7 minutes. Open the "Shura Wushen novel" folder and you will see the entire novel we downloaded, no page turning required. The link of each chapter is also backed up in the database, arranged automatically from the start, which means you have to check the record with serial number 29 to see the download order. The author is lazy; readers who want to change this can modify it themselves.
Novel text
Database Connection
Now that a single novel is done, let's move on to something bigger.
To crawl the whole site in the same way as above, an ordinary script run from the editor is clearly not up to the job. Sharp-eyed readers will have noticed that a single novel took about four minutes; never mind thousands of books, just the hundred or so titles on the ranking lists would take quite a while. This is where the Scrapy framework comes in: writing an engineering-grade crawler with a dedicated framework like Scrapy is fast and saves effort, an essential tool for anyone writing crawlers.
II. Crawling all novels on the ranking lists
First install all the Scrapy components. Everything except pywin32 can be installed with pip without trouble; for pywin32 you need to download the installer that matches your Python version.
Download link: https://sourceforge.net/projects/pywin32/
Scrapy plug-in installed successfully
Then, as usual, if you don't want to hunt for the path every time you run the terminal, add the root directory to your environment variables. Open a terminal and test whether the installation succeeded:
Scrapy installed successfully
After installation, open the terminal and create a Scrapy project; the prompts list the various Scrapy commands available. The newly created project folder now appears on drive D:
Open it and you will see that the Scrapy framework has automatically generated all the required scaffolding inside the reading folder:
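Assuming the project was created with the standard scrapy startproject reading command, the generated layout looks roughly like this (details vary slightly between Scrapy versions):

reading/
    scrapy.cfg            # deployment configuration
    reading/
        __init__.py
        items.py          # definitions of the data to collect
        pipelines.py      # post-processing / storage of scraped items
        settings.py       # project settings (headers, delays, pipelines, ...)
        spiders/
            __init__.py   # our crawler files go into this folder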
Open the inner reading folder; our crawler .py files go into the spiders folder:
Besides the spider file we write ourselves, we also need to define the fields to be crawled in items.py. It behaves a bit like a dictionary: the field names can be anything you like, but the inherited class scrapy.Item must not be changed, because it is a class defined inside Scrapy and the framework will not find your items if you rename it. The spider itself reuses the earlier single-book code with a for loop added around it; it really is that simple, see the sketch below:
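As a hedged sketch (the class names, field names, URLs, and XPaths below are illustrative assumptions, not the author's actual code), items.py and a spider that wraps the single-book logic in for loops might look roughly like this:

# items.py -- declare the fields we want to collect. Only the scrapy.Item
# base class is fixed; the field names are up to us.
import scrapy

class ReadingItem(scrapy.Item):
    novel_name = scrapy.Field()
    chapter_name = scrapy.Field()
    chapter_content = scrapy.Field()

# spiders/novel_spider.py -- a skeleton spider; the ranking URL and the
# XPaths are placeholders to be copied from the browser as before.
import scrapy
from reading.items import ReadingItem

class NovelSpider(scrapy.Spider):
    name = 'reading'
    start_urls = ['http://www.example.com/ranking.html']  # hypothetical ranking page

    def parse(self, response):
        # Follow each of the roughly 20 novels on the ranking list.
        for novel_url in response.xpath('//div[@class="ranking"]//a/@href').getall():
            yield response.follow(novel_url, callback=self.parse_novel)

    def parse_novel(self, response):
        # Reuse the single-book logic: loop over the chapter links.
        for chapter_url in response.xpath('//div[@class="chapter-list"]//a/@href').getall():
            yield response.follow(chapter_url, callback=self.parse_chapter)

    def parse_chapter(self, response):
        # Fill one item per chapter; Scrapy's feed export can then write
        # the results out, e.g. as .json files.
        item = ReadingItem()
        item['novel_name'] = response.xpath('//div[@class="book-title"]/text()').get()
        item['chapter_name'] = response.xpath('//div[@class="readAreaBox content"]/h1/text()').get()
        item['chapter_content'] = response.xpath(
            '/html/body/div[4]/div[2]/div[2]/div[1]/div[2]/text()').getall()
        yield item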
Crawler files
Ranking lists
About 20 novels in each ranking list
Crawled content of each novel (in .json format)
Novel content display
For the complete code, reply with the keyword "novels" in the official account chat window to get the Baidu netdisk link:
At this point all the data we need has been crawled, neatly placed in the corresponding folders and directories, which makes it easy to browse.