How to implement automatic acquisition of Web Crawler cookies and automatic update of expired cookies

Source: Internet
Author: User


This document shows how to automatically acquire cookies for a web crawler and automatically refresh them once they expire.

A lot of information on social networking sites can only be obtained after logging in. Taking Weibo as an example, if you do not log in to an account, you can only view the first ten posts of a verified (big V) account. To maintain the login state, you must use cookies. Take logging in to www.weibo.cn as an example:

Open http://login.weibo.cn/login/ in Chrome,

then analyze the request and response headers in the developer console. Several groups of cookies returned by weibo.cn are displayed there.
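For reference, a minimal sketch of how the same response cookies can be listed programmatically with requests (the URL is the login page above):

import requests

# Request the login page and list the cookies the server sets,
# mirroring what the browser console shows in the response headers.
r = requests.get('http://login.weibo.cn/login/')
for name, value in r.cookies.items():
    print(name, '=', value)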

Steps:

1. Use selenium to log in automatically, obtain the cookies, and save them to files;

2. Read the cookies from the files and check their validity period; if they have expired, perform Step 1 again;

3. Fill in the cookies when requesting other webpages to maintain the login state (an overall flow is sketched below).
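A minimal sketch of how the three steps fit together; the name get_cookie() matches the helper defined in section 3 below, and requests is assumed to be installed:

import requests

def crawl(url):
    # Steps 1-2: reuse cached cookies, or log in again if they are missing or expired.
    cookies = get_cookie()
    # Step 3: attach the cookies to the request to stay logged in.
    r = requests.get(url, cookies=cookies, timeout=5)
    return r.text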

1. Obtain the cookie online

Use selenium + PhantomJS to simulate a browser login and obtain the cookies.

There are usually several cookies; each one is stored in its own file with the .weibo suffix.

def get_cookie_from_network():
    import pickle
    from selenium import webdriver

    url_login = 'http://login.weibo.cn/login/'
    driver = webdriver.PhantomJS()
    driver.get(url_login)
    # fill in the login form
    driver.find_element_by_xpath('//input[@type="text"]').send_keys('your_weibo_account')      # change to your Weibo account
    driver.find_element_by_xpath('//input[@type="password"]').send_keys('your_weibo_password') # change to your Weibo password
    driver.find_element_by_xpath('//input[@type="submit"]').click()  # click Login

    # obtain the cookie information
    cookie_list = driver.get_cookies()
    print(cookie_list)

    cookie_dict = {}
    for cookie in cookie_list:
        # write each cookie to its own .weibo file
        with open(cookie['name'] + '.weibo', 'wb') as f:
            pickle.dump(cookie, f)
        if 'name' in cookie and 'value' in cookie:
            cookie_dict[cookie['name']] = cookie['value']
    return cookie_dict
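A short usage sketch, assuming PhantomJS is on the PATH and the account/password placeholders above have been replaced with real credentials:

# Logs in once, writes one .weibo file per cookie into the current directory,
# and returns the cookies as a plain name -> value dict.
cookies = get_cookie_from_network()
print(cookies)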

2. Obtain the cookies from the files

Traverse the files ending with .weibo in the current directory (the cookie files), use pickle to unpack each one into a dict, compare its expiry value with the current time, and return an empty dict if any cookie has expired:

def get_cookie_from_cache():
    import os
    import pickle
    import time

    cookie_dict = {}
    for parent, dirnames, filenames in os.walk('./'):
        for filename in filenames:
            if filename.endswith('.weibo'):
                print(filename)
                with open(os.path.join(parent, filename), 'rb') as f:
                    d = pickle.load(f)
                if 'name' in d and 'value' in d and 'expiry' in d:
                    expiry_date = int(d['expiry'])
                    if expiry_date > int(time.time()):
                        cookie_dict[d['name']] = d['value']
                    else:
                        # at least one cookie has expired: treat the whole cache as invalid
                        return {}
    return cookie_dict
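To inspect what a single cached file contains, a small sketch that loads one .weibo file directly (the filename 'SUB.weibo' is only an example of what step 1 might have written):

import pickle
import time

with open('SUB.weibo', 'rb') as f:   # example filename; use any .weibo file present
    cookie = pickle.load(f)          # the dict saved by Selenium in step 1
print(cookie['name'], cookie['value'])
print('expired:', int(cookie.get('expiry', 0)) <= int(time.time()))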

3. If the cached cookies have expired, retrieve them from the network again

def get_cookie():
    cookie_dict = get_cookie_from_cache()
    if not cookie_dict:
        cookie_dict = get_cookie_from_network()
    return cookie_dict

4. Attach the cookies when requesting other Weibo pages

def get_weibo_list(self, user_id):
    # 'self' and 'user_id' come from the surrounding crawler class; they are not used in this snippet
    import requests
    from bs4 import BeautifulSoup as bs

    cookdic = get_cookie()
    url = 'http://weibo.cn/stocknews88'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'}
    timeout = 5
    r = requests.get(url, headers=headers, cookies=cookdic, timeout=timeout)
    soup = bs(r.text, 'lxml')
    ...  # use BeautifulSoup to parse the page
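When several pages are fetched in a row, the same cookies can also be attached once to a requests.Session instead of being passed on every call; a minimal sketch under that assumption:

import requests

session = requests.Session()
session.cookies.update(get_cookie())                    # attach the saved cookies once
session.headers.update({'User-Agent': 'Mozilla/5.0'})   # a minimal User-Agent header

r = session.get('http://weibo.cn/stocknews88', timeout=5)
print(r.status_code)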

Summary

The above describes how to implement automatic acquisition of web crawler cookies and automatic update of expired cookies. I hope it is helpful to you; if you have any questions, leave a comment and I will reply in a timely manner.
