Since it's a Python programmer looking for xxx sisters, we should do it the Python programmer's way.
Today our goal is to crawl Zhihu for xxx sisters~ and this time we're going to use a new posture (ahem)~ the Scrapy crawler framework~
1 How Scrapy Works
After writing a few crawlers, we know roughly the steps involved in getting data: request the web page, receive the page, match and extract the information, download the data, clean the data, and store it in a database.
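Done by hand, each of those steps has to be wired together yourself. A minimal sketch of that manual approach (using the requests and BeautifulSoup libraries as stand-ins; the URL and CSS selector are placeholders):

```python
# A minimal "do-it-by-hand" crawler sketch; URL and selector are placeholders.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/page",                      # 1. request the page
                    headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(resp.text, "html.parser")                        # 2. parse the page
names = [tag.get_text(strip=True) for tag in soup.select(".name")]    # 3. match the info
# 4./5. clean the data and write it to a database ...
```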
Scrapy is a well-known crawler framework that makes it easy to crawl web information. So how does Scrapy actually work? I'd seen plenty of Scrapy tutorials online before, and most of the getting-started ones come with this picture.
_(:зゝ∠)_ I don't know whether the picture is just that classic or programmers are too lazy to draw a new one, but the first time I saw it, my mood was roughly this:
After digging a bit deeper I more or less understood what the picture means. Let me give you an example (yes, another strange analogy is coming):
When we want to eat, we go out, walk down the street, look for a place to eat, order a dish, the waiter tells the kitchen to make it, and in the end the food arrives at the table or gets packed to go. That's what a plain crawler does: you write out, by hand, every operation needed to get the data.
Scrapy, on the other hand, is like a food-delivery app: in the menu (spiders) you pick the restaurant and the dishes you want (items), at checkout (pipeline) you fill in your delivery address (storage), the ordering system (Scrapy engine) passes the order to the restaurant's kitchen (downloader) out on the internet, the dispatcher (scheduler) decides how the delivery orders (requests) are distributed, and the delivery guy picks up the food from the kitchen (request) and brings it to your door (response). All this talk is making me hungry...
What does that mean for us? When using Scrapy we only need to set up the spiders (what to crawl) and the pipeline (data cleaning and storage); there are also the middlewares, which hold settings for how the pieces talk to each other. Everything else we don't need to worry about; it's all handled by the Scrapy framework.
2 Creating a Scrapy Project
After installing Scrapy, create a new project
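A sketch of the usual commands (the project name zhihu is my assumption, since the post doesn't show it; the spider name zhihuxjj matches the file used in the next section):

```bash
scrapy startproject zhihu          # create the project skeleton
cd zhihu
scrapy genspider zhihuxjj www.zhihu.com   # generate spiders/zhihuxjj.py
```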
If you use PyCharm you'll know how to do this!!
I'm using the PyCharm IDE; the spider file zhihuxjj.py is created under the spiders folder.
3 Writing the Crawl Rules (spider)
With the project created, let's look at the restaurant and the dishes we want to eat... oh no, I mean the website and the data to crawl.
I chose Zhihu as the platform to crawl. Zhihu users don't have sequential IDs from 1 to n; everyone sets their own profile ID, which is unique. So the plan is to pick a seed user and crawl outwards from there. One could crawl both the people he follows and his followers, but since follower lists contain a lot of empty throwaway accounts, I only crawl the followee list, then go through each followee's homepage to crawl their followees, and so on recursively.
The program design looks like this.
The start URL has a symbolic role in Scrapy: it sets where the crawler begins. By our design, starting from the seed user's homepage would be the righteous choice, but since personal-page links get reused later, I set the start URL to Zhihu's homepage instead.
Next comes the seed user's personal homepage. On Zhihu there are plenty of big Vs with huge follower counts, but users who themselves follow a lot of people are harder to find, so I picked a Zhihu co-founder; the people he follows are presumably mostly high-quality users (???).
Analyzing the personal page: it is made up of 'https://www.zhihu.com/people/' + user ID. The information we want is fetched with callback functions (knock on the blackboard!! key point!!). Two callbacks are designed here: one for the user's followee list, and one for a followee's personal information.
Inspecting the page in the Chrome browser shows the URL of the followee list and the user IDs of the followees.
Hover the mouse over a user name and you can get the URL for that user's personal information. Analyzing this URL shows:
So we write the following code in the zhihuxjj.py file created in the previous section.
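The original post shows the code as a screenshot, so here is only a sketch reconstructed from the description above; the Zhihu API URLs, the field names, and the seed user's url token are my assumptions and may well have changed:

```python
# zhihuxjj.py -- a reconstructed sketch of the spider; URLs, fields and the
# seed user's token are assumptions, not the author's original code.
import json
import scrapy
from ..items import ZhihuxjjItem   # item class sketched in items.py below

class ZhihuxjjSpider(scrapy.Spider):
    name = 'zhihuxjj'
    allowed_domains = ['www.zhihu.com']
    seed_user = 'seed-user-token'   # url token of the seed user (the co-founder in the post)
    user_api = ('https://www.zhihu.com/api/v4/members/{token}'
                '?include=answer_count,follower_count,gender,locations')
    followees_api = ('https://www.zhihu.com/api/v4/members/{token}/followees'
                     '?offset={offset}&limit=20')
    start_urls = ['https://www.zhihu.com']

    def parse(self, response):
        # start from the seed user's followee list
        yield scrapy.Request(self.followees_api.format(token=self.seed_user, offset=0),
                             callback=self.parse_followees)

    def parse_followees(self, response):
        data = json.loads(response.text)
        for user in data.get('data', []):
            token = user['url_token']
            # callback 1: the followee's personal information
            yield scrapy.Request(self.user_api.format(token=token),
                                 callback=self.parse_user)
            # callback 2: recurse into the followee's own followee list
            yield scrapy.Request(self.followees_api.format(token=token, offset=0),
                                 callback=self.parse_followees)

    def parse_user(self, response):
        user = json.loads(response.text)
        item = ZhihuxjjItem()
        item['name'] = user.get('name')        # assign the crawled data to the item
        item['id'] = user.get('url_token')
        item['gender'] = user.get('gender')
        item['address'] = ','.join(loc['name'] for loc in user.get('locations', []))
        yield item                              # hand the item to the pipeline
```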
Here, pay attention to the use of yield and to item['name']: assigning the crawled results to the item is how we tell the system "these are the dishes we picked"... ahem... the target data to crawl.
4 Setting Additional Information
In the items.py file, add the fields corresponding to the target data set in the spider.
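Again the post shows this as a screenshot; a sketch of what items.py might look like, matching the fields yielded by the spider sketch above:

```python
# items.py -- fields matching the data targeted in the spider sketch.
import scrapy

class ZhihuxjjItem(scrapy.Item):
    name = scrapy.Field()      # display name
    id = scrapy.Field()        # url token
    gender = scrapy.Field()    # 0 = female, 1 = male, -1 = not set (assumed Zhihu convention)
    address = scrapy.Field()   # locations, comma separated
```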
In pipelines.py, add the code that writes to the database (the database setup was covered in the previous article~).
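A sketch of the pipeline, assuming a MySQL database accessed with pymysql (as in that earlier article); the connection settings and table layout here are placeholders:

```python
# pipelines.py -- sketch: write each item into MySQL with pymysql.
# Connection settings and the table/column names are placeholders.
import pymysql

class ZhihuxjjPipeline(object):
    def open_spider(self, spider):
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='password', db='zhihu',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = ('INSERT INTO user(name, id, gender, address) '
               'VALUES (%s, %s, %s, %s)')
        self.cursor.execute(sql, (item['name'], item['id'],
                                  item['gender'], item['address']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```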
Because we are using pipelines.py, we also need to go into the settings.py file and uncomment ITEM_PIPELINES; this is what connects the two files.
It feels like we've forgotten something... right, the headers. Headers are also commonly set in settings.py: uncomment the DEFAULT_REQUEST_HEADERS block and set a browser-like header. Zhihu expects a simulated login; if you crawl as a guest you also need to add an authorization value. As for how to get that authorization, I'm, just, not, telling, you (flees
To reduce load on the server & avoid getting banned, also uncomment DOWNLOAD_DELAY and set the download delay to 3 (the robots rules ask for 10, but 10 is just too slow _(:зゝ∠)_ hopefully the Zhihu engineers never see this sentence).
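Putting the three settings.py changes from this section together, a sketch (the pipeline path, header values, and authorization token are placeholders):

```python
# settings.py -- the settings discussed above; values are placeholders.
ITEM_PIPELINES = {
    'zhihu.pipelines.ZhihuxjjPipeline': 300,   # path depends on your project name
}

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # pretend to be a browser
    'authorization': 'oauth <your token here>',                 # guest authorization (see above)
}

DOWNLOAD_DELAY = 3   # seconds between requests, to go easy on the server
```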
By this point you'll notice that Scrapy has already written most of what we need; we just remove the comments and tweak things slightly to get each feature working. The Scrapy framework has many more capabilities, which you can read about in the official documentation.
5 Running the Scrapy file
Once the Scrapy program is written, we can enter a command in the terminal to run the spider.
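The command is roughly (using the spider name zhihuxjj from the sketch above):

```bash
scrapy crawl zhihuxjj
```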
Alternatively, you can add a main.py to the project folder with the following code,
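a typical main.py for this looks like the following sketch:

```python
# main.py -- placed in the project root so the spider can be run from PyCharm.
from scrapy import cmdline

cmdline.execute('scrapy crawl zhihuxjj'.split())
```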
then just run main.py from PyCharm, and we can happily crawl Zhihu users~ (xxx sisters, here I come~
6 Finding the xxx Sisters
After running for x days, _(:зゝ∠)_ it had crawled 7w (70,000) user records at a crawl depth of 5. (This crawl speed makes me feel I really need to look into distributed crawling... but that's a chat for another day.)
With the data in hand, we can pick out users in the same city to study...
First, as per international practice, let's analyze the data.
Of the 7w users, clearly more than half are men; female users account for only about 30%, and some users didn't specify a gender. High-quality xxx sisters really are a scarce resource~
Next, let's see which cities the xxx sisters are in (filtered from the 7w users: gender is female and the location field is not empty).
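For reference, a sketch of that filtering step, assuming the crawled users were exported to a CSV with the gender and address fields used earlier (gender coding as in the item sketch):

```python
# Sketch: gender distribution and top cities among female users.
# users.csv and the 0 = female convention are assumptions.
import pandas as pd

users = pd.read_csv('users.csv')
print(users['gender'].value_counts(normalize=True))        # gender distribution

girls = users[(users['gender'] == 0) & (users['address'].notna())]
print(girls['address'].value_counts().head(10))            # top cities for xxx sisters
```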
It seems the xxx sisters are still concentrated in Beijing, Shanghai, Guangzhou and Shenzhen, so boys hoping to find high-quality xxx sisters should probably head for the first-tier cities; of course, it's also possible that xxx sisters in second- and third-tier cities simply didn't fill in their location.
Emmmmm... that's it for this round of analysis; you can now go chat up the xxx sisters. (Escapes
7 Studying an xxx Sister
Surprised? Delighted? There's one more chapter. As they say, better to teach a man to fish than to give him a fish; having ladled out the soul chicken soup, let me also throw in a chicken drumstick. Having found the xxx sisters, we also want to get to know them...
Let me give another example~ and study one xxx sister. (Zhihu name: Dynamic times; she has given permission to be used as the example here.)
Let's crawl her activity feed. I won't go over the Chrome right-click → Inspect → Network routine again; straight to the research targets.
I won't post that code here either; it will go up on GitHub. Let's look at the output.
And also!! Across what she follows, upvotes, and writes, food-related words show up throughout (ω). (Could it be possible to win over the xxx sister with good food...)
Here is another picture: the word cloud of her answers, set against a mountain background.
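For anyone who wants to reproduce the word clouds, a minimal sketch using the jieba and wordcloud libraries (the answer text file and the font path are placeholders; the original code, again, isn't in the post):

```python
# Minimal word-cloud sketch: answers.txt and the font path are placeholders
# for the crawled answer text and a Chinese font on your machine.
import jieba
from wordcloud import WordCloud

text = open('answers.txt', encoding='utf-8').read()
words = ' '.join(jieba.cut(text))                       # segment the Chinese text
wc = WordCloud(font_path='msyh.ttc', width=800, height=600,
               background_color='white').generate(words)
wc.to_file('wordcloud.png')
```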
Everyone thinks using Python to find your other half is unreliable... but that's only because they haven't read this divine tutorial yet!