In many cases we can crawl pages or request interfaces without logging in, because a site that cares about SEO will not put login restrictions on every page.
However, crawling without logging in has two main drawbacks.
Pages behind a login wall cannot be crawled. For example, a forum may require login to view resources, or a blog may require login to read the full text; such pages can only be viewed and crawled after logging in.
Some pages and interfaces can be requested directly, but once requests become frequent, access is easily restricted or the IP is banned outright. After logging in this problem largely disappears, so a logged-in crawler is less likely to be blocked.
Let's run a simple experiment on the second case. First we find an Ajax interface, for example the information interface of the Sina Finance official Weibo account: https://m.weibo.cn/api/container/getIndex?uid=1638782947&luicode=20000174&type=uid&value=1638782947&containerid=1005051638782947. If we open it directly in a browser, the returned data is in JSON format, as shown in the figure. It contains some information about the Sina Finance official Weibo account, which we can extract by parsing the JSON.
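As a quick check, we can request this interface with requests; this is a minimal sketch, assuming the interface is still publicly accessible and the JSON structure has not changed:

```python
import requests

url = ('https://m.weibo.cn/api/container/getIndex'
       '?uid=1638782947&luicode=20000174&type=uid'
       '&value=1638782947&containerid=1005051638782947')

response = requests.get(url)
data = response.json()  # parse the JSON body
# the account info typically sits under data -> userInfo (structure may vary over time)
print(data.get('data', {}).get('userInfo', {}).get('screen_name'))
```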
However, this interface applies request-frequency detection to users who are not logged in. If it is accessed too often in a short period, for example by opening the link and refreshing constantly, a "requests too frequent" prompt appears, as shown in the figure.
If we open a new browser window, visit https://passport.weibo.cn/signin/login?entry=mweibo&r=https://m.weibo.cn/, log in to a Weibo account, and then reopen the link, the page displays the interface result normally, while the window that is not logged in still shows the "requests too frequent" prompt, as shown in the figure.
In the figure, the left side shows the result of requesting the interface after logging in, and the right side shows the result without logging in; the interface link is exactly the same. The request fails in the non-logged-in state but returns the correct result in the logged-in state.
Therefore, logging in to an account can reduce the probability of being banned.
We can try crawling after logging in; the probability of being banned is much smaller, but the risk cannot be eliminated entirely. If the same account sends requests too frequently, it may still hit the "requests too frequent" limit or even have the account banned.
If we need to crawl at large scale, we need many accounts and should pick one at random for each request. This reduces the access frequency of any single account, so the probability of being blocked drops greatly.
So how do we maintain the login information of multiple accounts? This is where a cookie pool comes in. Next, let's look at how to build one.
I. Goals of This Section
We will use Sina Weibo as an example to walk through building a cookie pool. The pool stores many Sina Weibo accounts and their cookies. It also needs to periodically check the validity of each cookie: if a cookie has expired, it is deleted and a new one is generated via simulated login. Finally, the pool needs one very important interface, an API that returns a random cookie; when the crawler runs, it only needs to request this interface to obtain a random cookie to crawl with.
So the cookie pool needs several core functions: automatic cookie generation, scheduled cookie validation, and random cookie retrieval.
II. Preparatory Work
Before building the pool, you will need some Weibo accounts. You also need to install the Redis database and make sure it runs properly, and install Python's redis-py, requests, Selenium, and Flask libraries. In addition, you need to install Chrome and configure ChromeDriver.
III. Cookie Pool Architecture
The architecture of the cookie pool is similar to that of the proxy pool: it also consists of four core modules, as shown in the figure.
The cookie pool is divided into four basic modules: the storage module, the generation module, the detection module, and the interface module. Their functions are as follows.
The storage module is responsible for storing the username and password of each account and the cookies corresponding to each account, and it provides some methods for convenient access.
The generation module is responsible for producing new cookies. It fetches an account's username and password from the storage module, performs a simulated login on the target site, and, if the login succeeds, hands the resulting cookies back to the storage module for storage.
The detection module checks the cookies in the database on a schedule. We set up a detection link (each site has its own); the module requests that link with each account's cookies. If the returned status is valid, the cookies have not expired; otherwise they are considered invalid and removed, and the generation module will regenerate them later.
The interface module provides an API for external services. Since there may be many usable cookies, the interface returns one at random, so that every cookie has a chance of being picked. The more cookies there are, the smaller the probability that any one of them is picked, which reduces the risk of an account being banned.
The design of the cookie pool follows the same basic ideas as the proxy pool described earlier. Next we design the overall architecture and then implement the cookie pool in code.
IV. Implementing the Cookie Pool
Let's look at the implementation of each module in turn.
1. Storage Module
The content to store is just account information and cookie information. An account consists of a username and a password, so we can store them as a username-to-password mapping. Cookies can be stored as JSON strings. We also need to generate cookies from the accounts later, and when doing so we must know which accounts already have cookies and which do not, so we need to keep the username associated with each cookie as well, i.e. a username-to-cookies mapping. With two sets of mappings, the natural choice is the Redis hash, so we create two hashes structured as shown in the figure.
In each hash, the key is the account username and the value is the password or the cookies. Note also that the cookie pool needs to be extensible: the stored accounts and cookies are not necessarily limited to Weibo in this example; other sites can be connected to the same pool. So the hash names use a two-level naming scheme, for example accounts:weibo for the account hash and cookies:weibo for the cookie hash. To extend the pool to Zhihu we would use accounts:zhihu and cookies:zhihu, which is convenient.
Next we create a storage module class that provides these basic hash operations.
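Below is a minimal sketch consistent with the description that follows, assuming redis-py's StrictRedis client and connection constants (REDIS_HOST, REDIS_PORT, REDIS_PASSWORD) that would normally live in a config file:

```python
import random
from redis import StrictRedis

# assumed connection settings; adjust to your environment
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_PASSWORD = None

class RedisClient(object):
    def __init__(self, type, website, host=REDIS_HOST, port=REDIS_PORT,
                 password=REDIS_PASSWORD):
        """type: hash type, e.g. 'accounts' or 'cookies'; website: site name, e.g. 'weibo'."""
        self.db = StrictRedis(host=host, port=port, password=password,
                              decode_responses=True)
        self.type = type
        self.website = website

    def name(self):
        """Concatenate type and website into the hash name, e.g. 'accounts:weibo'."""
        return '{type}:{website}'.format(type=self.type, website=self.website)

    def set(self, username, value):
        """Set a key-value pair (username -> password or cookies)."""
        return self.db.hset(self.name(), username, value)

    def get(self, username):
        """Get the value stored for a username."""
        return self.db.hget(self.name(), username)

    def delete(self, username):
        """Delete the key-value pair for a username."""
        return self.db.hdel(self.name(), username)

    def count(self):
        """Return the number of entries in the hash."""
        return self.db.hlen(self.name())

    def random(self):
        """Return a random value from the hash; used by the API module."""
        return random.choice(self.db.hvals(self.name()))

    def usernames(self):
        """Return all usernames (hash keys)."""
        return self.db.hkeys(self.name())

    def all(self):
        """Return the whole hash as a dict of username -> value."""
        return self.db.hgetall(self.name())
```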
Here we create a RedisClient class. Its __init__() method takes two key parameters, type and website, representing the hash type and the site name; they are used to stitch together the hash name. For the hash that stores accounts, type is accounts and website is weibo; for the hash that stores cookies, type is cookies and website is weibo.
The remaining initialization parameters are the Redis connection settings. A StrictRedis object is created during initialization, establishing the Redis connection.
The name() method concatenates type and website to form the hash name. The set(), get(), and delete() methods respectively set, get, and delete a key-value pair of the hash, and count() returns the length of the hash.
The most important method is random(), which randomly selects a cookie from the hash and returns it. Every call to random() yields a random cookie; it is wired to the interface module so that the request interface can serve random cookies.
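Under these assumptions, the class can be used like this (the credentials are the ones that appear in the console output later in this section):

```python
conn = RedisClient('accounts', 'weibo')
conn.set('14747223314', 'asdf1129')  # store username -> password
print(conn.count())                  # number of stored accounts
print(conn.get('14747223314'))       # look up one account's password
```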
2. Generation Module
The generation module is responsible for fetching the information of each account, performing a simulated login, and then generating cookies and saving them. We first read both hashes and compare the account hash with the cookie hash to find the accounts that do not yet have cookies; then we traverse those remaining accounts and generate cookies for each of them.
The main logic is to identify the accounts that have no corresponding cookies yet and retrieve cookies for them one by one. The code is as follows:
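A minimal sketch of the generation module, reusing the RedisClient class from above; new_cookies() and the process_result() helper (a name of our choosing) are filled in below:

```python
import json

class CookiesGenerator(object):
    def __init__(self, website='default'):
        """Connect to the account hash and the cookie hash of one site."""
        self.website = website
        self.cookies_db = RedisClient('cookies', self.website)
        self.accounts_db = RedisClient('accounts', self.website)

    def new_cookies(self, username, password):
        """Simulate a login and return a result dict; implemented per site."""
        raise NotImplementedError

    def run(self):
        """Generate cookies for every account that does not have any yet."""
        for username in self.accounts_db.usernames():
            if username in self.cookies_db.usernames():
                continue  # this account already has cookies
            password = self.accounts_db.get(username)
            print('Generating cookies for account', username)
            result = self.new_cookies(username, password)
            self.process_result(result, username)  # defined below
```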
Since we are targeting Sina Weibo, and we already cracked Sina Weibo's four-grid pattern captcha earlier, we can reuse that work directly here. What we still need to add is a method that obtains cookies and returns different results for different situations. The logic is as follows:
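In sketch form, the site-specific subclass simply delegates to the Selenium-based login helper from the earlier captcha section (here called WeiboCookies, an assumed name) and passes through its status dict:

```python
class WeiboCookiesGenerator(CookiesGenerator):
    def new_cookies(self, username, password):
        """Simulate a Weibo login and classify the outcome.
        WeiboCookies is the Selenium-based helper from the earlier
        captcha section (assumed); its main() returns one of:
          {'status': 1, 'content': cookies_dict}  # success
          {'status': 2, 'content': message}       # wrong username/password
          {'status': 3, 'content': message}       # other login failure
        """
        return WeiboCookies(username, password).main()
```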
The returned result is a dictionary carrying a status code status, so the generation module can act differently depending on the code. If the status code is 1, cookies were obtained successfully and we only need to save them to the database. If the status code is 2, the username or password is wrong, and we should delete the account from the database. If the status code is 3, the login failed for some other reason; we cannot tell whether the credentials are wrong, and no cookies were obtained, so we simply print a message and move on to the next account. The handling code looks like this:
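Continuing the sketch, the dispatch on the status code can live in the process_result() helper of CookiesGenerator referenced above (again, the helper name is ours):

```python
    def process_result(self, result, username):
        """Act on the status code returned by new_cookies()."""
        status = result.get('status')
        if status == 1:
            # success: serialize the cookies and store them
            cookies = result.get('content')
            if isinstance(cookies, dict):
                cookies = json.dumps(cookies)
            self.cookies_db.set(username, cookies)
            print('Cookies saved successfully')
        elif status == 2:
            # wrong username or password: remove the account
            print(result.get('content'))
            self.accounts_db.delete(username)
            print('Account deleted')
        elif status == 3:
            # other login failure: just report it and move on
            print(result.get('content'))
```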
If you want to extend the pool to another site, you only need to implement its new_cookies() method and follow the same convention for the returned result of the simulated login, e.g. 1 for success and 2 for a wrong username or password.
When this code runs, it iterates over the accounts that do not yet have cookies and performs simulated logins to generate new cookies for them.
3. Detection Module
We can now generate cookies with the generation module, but we still cannot avoid cookie expiry: cookies can lapse simply because too much time has passed, or become unusable because they were used too frequently. Cookies like these must not be kept in the database.
So we also need a scheduled detection module that traverses all the cookies in the pool. For each site we configure a detection link; the module requests that link with each account's cookies. If the request succeeds with a valid status code, the cookie is still effective. If the request fails, or no normal data can be obtained, for example the response redirects straight to the login page or to a verification page, the cookie is invalid and we remove it from the database.
Once a cookie has been removed, its username exists in the account hash but no longer in the cookie hash, so the generation module will treat the account as having no cookies, log in with it again, and refresh the cookies for that account.
In short, the detection module's job is to detect expired cookies and remove them from the database.
To keep things generic and extensible, we first define a detector parent class that declares the common components, as follows:
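A minimal sketch, reusing the RedisClient class from the storage module:

```python
class ValidTester(object):
    def __init__(self, website='default'):
        """Connect to the cookie hash and the account hash of one site."""
        self.website = website
        self.cookies_db = RedisClient('cookies', self.website)
        self.accounts_db = RedisClient('accounts', self.website)

    def test(self, username, cookies):
        """Test one cookie; implemented by site-specific subclasses."""
        raise NotImplementedError

    def run(self):
        """Entry point: traverse all stored cookies and test each one."""
        cookies_groups = self.cookies_db.all()
        for username, cookies in cookies_groups.items():
            self.test(username, cookies)
```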
Here we define a parent class called ValidTester. Its __init__() method takes the site name website and creates two storage module connection objects, cookies_db and accounts_db, which operate on the cookie hash and the account hash respectively. The run() method is the entry point: it traverses all the cookies and calls the test() method on each one. The test() method itself is not implemented here; we write a subclass to override it, one subclass per site. For Weibo we can define a WeiboValidTester that implements its own test() method to check whether a cookie is still valid and handle it accordingly. So we add a subclass that inherits from ValidTester and overrides its test() method, implemented as follows:
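A sketch of the Weibo tester, assuming the TEST_URL_MAP configuration introduced just below and using requests for the HTTP check:

```python
import json
import requests
from requests.exceptions import ConnectionError

class WeiboValidTester(ValidTester):
    def __init__(self, website='weibo'):
        ValidTester.__init__(self, website)

    def test(self, username, cookies):
        print('Testing cookies for user', username)
        try:
            cookies = json.loads(cookies)  # cookies are stored as JSON strings
        except (TypeError, ValueError):
            # malformed cookies: delete them directly
            print('Cookies are malformed', username)
            self.cookies_db.delete(username)
            return
        try:
            test_url = TEST_URL_MAP[self.website]
            # forbid redirects so a 302 to the login page is detectable
            response = requests.get(test_url, cookies=cookies,
                                    timeout=5, allow_redirects=False)
            if response.status_code == 200:
                print('Cookies valid', username)
            else:
                print(response.status_code)
                print('Cookies expired', username)
                self.cookies_db.delete(username)
        except ConnectionError:
            print('Exception during test, skipping', username)
```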
The test() method first converts the cookies into a dictionary and checks their format; if the format is wrong, it deletes them directly. If the format is fine, it requests the detection URL with the cookies. The test here targets Weibo, and the detected URL can be an Ajax interface. To keep this configurable, we define the test URLs as a dictionary as well:
```python
TEST_URL_MAP = {
    'weibo': 'https://m.weibo.cn/'
}
```
To extend to other sites, we just add entries to this dictionary. For Weibo, we request the target site with the cookies while forbidding redirects and setting a timeout, then check the status code of the response. If a 200 status code comes back, the cookies are valid; otherwise we may hit a 302 redirect, usually to the login page, which means the cookies are invalid. Invalid cookies are removed from the cookie hash.
4. Interface Module
If the generation module and the detection module run on a schedule, they can keep the cookies updated and validated in real time. But the cookies ultimately have to be used by crawlers, and one cookie pool may serve many crawlers, so we also need to define a web interface from which crawlers can fetch random cookies. We use Flask to build this interface, as shown in the code below:
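A minimal sketch of the API, assuming the same RedisClient class; the /&lt;website&gt;/random route serves a random cookie for each connected site:

```python
from flask import Flask, g

app = Flask(__name__)

@app.route('/')
def index():
    return '<h2>Welcome to Cookies Pool System</h2>'

def get_conn(website):
    """Cache one RedisClient per site on Flask's application context."""
    attr = website + '_cookies'
    if not hasattr(g, attr):
        setattr(g, attr, RedisClient('cookies', website))
    return getattr(g, attr)

@app.route('/<website>/random')
def random_cookies(website):
    """Return a random cookie for the given site, e.g. /weibo/random."""
    return get_conn(website).random()

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```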
To keep the interface generic across sites, the first segment of the URL is the site name and the second is the action. For example, /weibo/random returns a random Weibo cookie, and /zhihu/random would return a random Zhihu cookie.
5. Scheduler Module
Finally, we add a scheduler to make these modules run together. Its main job is to drive the modules on a schedule, with each module running in its own process. The implementation is as follows:
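A sketch of the scheduler, assuming the class maps and module switches from the configuration shown below, plus a CYCLE constant (the interval in seconds between rounds); the generator, tester, and Flask app classes are assumed to be importable into this scope:

```python
import time
from multiprocessing import Process

CYCLE = 120  # assumed: seconds between generation/validation rounds

class Scheduler(object):
    @staticmethod
    def valid_cookie(cycle=CYCLE):
        """Periodically run every configured tester class."""
        while True:
            for website, cls_name in TESTER_MAP.items():
                tester = eval(cls_name + '(website="' + website + '")')
                tester.run()
            time.sleep(cycle)

    @staticmethod
    def generate_cookie(cycle=CYCLE):
        """Periodically run every configured generator class."""
        while True:
            for website, cls_name in GENERATOR_MAP.items():
                generator = eval(cls_name + '(website="' + website + '")')
                generator.run()
            time.sleep(cycle)

    @staticmethod
    def api():
        """Start the Flask API from the interface module."""
        app.run(host='0.0.0.0', port=5000)

    def run(self):
        # start each enabled module in its own process
        if VALID_PROCESS:
            Process(target=Scheduler.valid_cookie).start()
        if GENERATOR_PROCESS:
            Process(target=Scheduler.generate_cookie).start()
        if API_PROCESS:
            Process(target=Scheduler.api).start()
```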
There are two important configurations here: dictionaries that map site names to the generator class and the tester class, as follows:
```python
# Generator classes; to extend to another site, configure it here
GENERATOR_MAP = {
    'weibo': 'WeiboCookiesGenerator'
}

# Tester classes; to extend to another site, configure it here
TESTER_MAP = {
    'weibo': 'WeiboValidTester'
}
```
This configuration exists for easy dynamic extension: the key is the site name and the value is the class name. If you need to connect another site, you just add an entry to the dictionary; for example, to add a generation module for Zhihu, the configuration becomes:
```python
GENERATOR_MAP = {
    'weibo': 'WeiboCookiesGenerator',
    'zhihu': 'ZhihuCookiesGenerator',
}
```
The Scheduler traverses these dictionaries and uses eval() to dynamically create an object of each class, then calls its run() entry method to start the module. Each module runs in its own process using the Process class from multiprocessing, started via its start() method.
In addition, each module has a switch that we can freely turn on or off in the configuration file, as follows:
```python
# Generation module switch
GENERATOR_PROCESS = True
# Validation module switch
VALID_PROCESS = False
# API module switch
API_PROCESS = True
```
Setting a switch to True enables the module; setting it to False disables it.
At this point, our cookie pool is complete. If we turn all the modules on and start the scheduler, the console output resembles the following:
```
API interface started running
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
Cookies generation process started running
Cookies detection process started running
Generating cookies for account 14747223314, password asdf1129
Testing cookies for user 14747219309
Cookies valid 14747219309
Testing cookies for user 14740626332
Cookies valid 14740626332
Testing cookies for user 14740691419
Cookies valid 14740691419
Testing cookies for user 14740618009
Cookies valid 14740618009
Testing cookies for user 14740636046
Cookies valid 14740636046
Testing cookies for user 14747222472
Cookies valid 14747222472
Cookies detection complete
Captcha position 420 580 384 544
Drag order matched successfully [1, 4, 2, 3]
Successfully obtained cookies {'SUHB': '08j77uij4w5n_t', 'SCF': 'aimcucuvvhjswsbmtswkh0g4knj4k7_u9k57yzxbqft4sfbhxq3lx4ysno9vubv841bmhfiah4ipnfqznk7w6qs.', 'ssologinstate': '1501439488', '_T_WM': '99b7d656220aeb9207b5db97743adc02', 'm_weibocn_params': 'uicode%3d20000174', 'SUB': '_2a250elzqderhgebm6var8ifeztuihxvxhxoyrdv6pujbkdbelxtxkw17zoyhhj92n_rgcjmhpfv9tb8ojq.'}
Cookies saved successfully
```
The above is the console output of the running program. We can see that each module starts normally: the tester module checks the cookies, the generation module obtains cookies for the accounts that do not yet have them, and the modules run in parallel without interfering with each other.
We can then access the interface to get a random cookie, as shown in the figure.
A crawler only needs to request this interface to obtain a random cookie.