A multi-threaded Zhihu user crawler in Python (a Python multithreading example)
Required packages:
beautifulsoup4
html5lib
image
requests
redis
PyMySQL
Install all dependencies with pip:
pip install Image requests beautifulsoup4 html5lib redis PyMySQL
The runtime environment must support Chinese characters.
Tested under Python 3.5; other runtime environments are not guaranteed to work perfectly.
Install MySQL and Redis.
Configure config.ini: set the MySQL and Redis connections and fill in your Zhihu account.
Import init.sql into the database.
Run
Start capturing data: python get_user.py
Check the number of crawled users: python check_redis.py
Effect
General idea
1. First, simulate logging in to Zhihu and save the login cookies.
2. Fetch the HTML of a Zhihu page and hold it for the next step, where the information is parsed and extracted.
3. Parse the page and extract the users' personalized URLs, then put them into Redis. (On the Redis usage: each extracted personalized URL is stored in a hash named already_get_user, which marks users that have already been crawled; before crawling a user we check whether it already exists in already_get_user, which avoids repeated captures. The personalized URL is also pushed onto the user_queue list, and when a new user needs to be crawled it is popped from that queue. A minimal sketch of this pattern follows the list.)
4. Get the user's followee list and follower list and insert them into Redis.
5. Pop a new user from Redis's user_queue and repeat from step 3.
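To make the Redis bookkeeping concrete, here is a minimal, self-contained sketch of the dedup-plus-queue pattern described above. It reuses the already_get_user and user_queue names from the crawler, but is illustrative rather than the project's actual code, and the connection parameters are placeholders:

import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=0)

def add_wait_user(name_url):
    # Only enqueue users we have never seen before
    if not r.hexists('already_get_user', name_url):
        r.hset('already_get_user', name_url, 1)   # mark as seen
        r.lpush('user_queue', name_url)           # queue for crawling

def next_user():
    # Pop the next user to crawl; returns None when the queue is empty
    raw = r.rpop('user_queue')
    return raw.decode('utf-8') if raw else None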
Simulating login to Zhihu
First comes login. The login code is wrapped in a login package so it is easy to call from the rest of the project.
The request headers: it is best to set Connection to close here, otherwise a "max retries exceeded" error may occur.
The reason is that connections default to keep-alive and are never closed.
# HTTP request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    "Origin": "https://www.zhihu.com/",
    "Upgrade-Insecure-Requests": "1",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Pragma": "no-cache",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "close"
}

# Check whether we are logged in
def check_login(self):
    check_url = 'https://www.zhihu.com/settings/profile'
    try:
        login_check = self._session.get(check_url, headers=self.headers, timeout=35)
    except Exception as err:
        print(traceback.print_exc())
        print(err)
        print("Failed to verify login, please check the network")
        sys.exit()
    print("HTTP status code of the login check: " + str(login_check.status_code))
    if int(login_check.status_code) == 200:
        return True
    else:
        return False
Request the page and check the HTTP status code to verify whether we are logged in: 200 means logged in, while 304 generally indicates a redirect, i.e. not logged in.
# Get the captcha
def get_captcha(self):
    t = str(time.time() * 1000)
    captcha_url = 'http://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = self._session.get(captcha_url, headers=self.headers, timeout=35)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
        f.close()
    # Use Pillow's Image to display the captcha
    # If Pillow is not installed, find captcha.jpg in the source directory and enter it manually
    '''try:
        im = Image.open('captcha.jpg')
        im.show()
        im.close()
    except:
    '''
    print(u'Please find captcha.jpg in %s and enter it manually' % os.path.abspath('captcha.jpg'))
    captcha = input("Enter the captcha\n>  ")
    return captcha
Get the captcha: if you log in too many times, Zhihu may require a verification code, and this function handles entering it.
# Get the xsrf token
def get_xsrf(self):
    index_url = 'http://www.zhihu.com'
    # Fetch the page that carries _xsrf
    try:
        index_page = self._session.get(index_url, headers=self.headers, timeout=35)
    except:
        print('Failed to fetch the Zhihu page, please check the network connection')
        sys.exit()
    html = index_page.text
    # _xsrf comes back as a list here
    BS = BeautifulSoup(html, 'html.parser')
    xsrf_input = BS.find(attrs={'name': '_xsrf'})
    pattern = r'value=\"(.*?)\"'
    print(xsrf_input)
    self._xsrf = re.findall(pattern, str(xsrf_input))
    return self._xsrf[0]
Why xsrf? The xsrf token is used to prevent cross-site request forgery; see CSRF for background.
After the xsrf token is obtained it is stored in the cookie, and it must also be sent in the header when calling the API; otherwise the server returns 403.
# Simulate login
def do_login(self):
    try:
        # Check whether we are already logged in
        if self.check_login():
            print('You are already logged in')
            return
        else:
            if self.config.get("zhihu_account", "username") and self.config.get("zhihu_account", "password"):
                self.username = self.config.get("zhihu_account", "username")
                self.password = self.config.get("zhihu_account", "password")
            else:
                self.username = input('Enter your username\n>  ')
                self.password = input("Enter your password\n>  ")
    except Exception as err:
        print(traceback.print_exc())
        print(err)
        sys.exit()
    if re.match(r"^1\d{10}$", self.username):
        print("Phone number login\n")
        post_url = 'http://www.zhihu.com/login/phone_num'
        postdata = {
            '_xsrf': self.get_xsrf(),
            'password': self.password,
            'remember_me': 'true',
            'phone_num': self.username,
        }
    else:
        print("Email login\n")
        post_url = 'http://www.zhihu.com/login/email'
        postdata = {
            '_xsrf': self.get_xsrf(),
            'password': self.password,
            'remember_me': 'true',
            'email': self.username,
        }
    try:
        login_page = self._session.post(post_url, postdata, headers=self.headers, timeout=35)
        login_text = json.loads(login_page.text.encode('latin-1').decode('unicode-escape'))
        print(postdata)
        print(login_text)
        # A captcha is required; r == 0 means login succeeded
        if login_text['r'] == 1:
            sys.exit()
    except:
        postdata['captcha'] = self.get_captcha()
        login_page = self._session.post(post_url, postdata, headers=self.headers, timeout=35)
        print(json.loads(login_page.text.encode('latin-1').decode('unicode-escape')))
    # Save the login cookies
    self._session.cookies.save()
This is the core login function. The key is the requests library, which makes it very convenient to keep the login state in a session.
We use a singleton pattern globally: every request goes through the same requests.session object, so the login state stays consistent.
The main login code is:
# Create a login object
lo = login.login.Login(self.session)
# Simulate login
if lo.check_login():
    print('You are already logged in')
else:
    if self.config.get("zhihu_account", "username") and self.config.get("zhihu_account", "password"):
        username = self.config.get("zhihu_account", "username")
        password = self.config.get("zhihu_account", "password")
    else:
        username = input('Enter your username\n>  ')
        password = input("Enter your password\n>  ")
    lo.do_login(username, password)
This completes the simulated login process.
Zhihu user capture
def __init__(self, threadID=1, name=''):
    # Multithreading
    print("Thread " + str(threadID) + " initializing")
    threading.Thread.__init__(self)
    self.threadID = threadID
    self.name = name
    try:
        print("Thread " + str(threadID) + " initialized successfully")
    except Exception as err:
        print(err)
        print("Thread " + str(threadID) + " failed to start")
    self.threadLock = threading.Lock()
    # Load the configuration
    self.config = configparser.ConfigParser()
    self.config.read("config.ini")
    # Initialize the session
    requests.adapters.DEFAULT_RETRIES = 5
    self.session = requests.session()
    self.session.cookies = cookielib.LWPCookieJar(filename='cookie')
    self.session.keep_alive = False
    try:
        self.session.cookies.load(ignore_discard=True)
    except:
        print('Failed to load cookies')
    finally:
        pass
    # Create a login object and log in
    lo = Login(self.session)
    lo.do_login()
    # Initialize the redis connection
    try:
        redis_host = self.config.get("redis", "host")
        redis_port = self.config.get("redis", "port")
        self.redis_con = redis.Redis(host=redis_host, port=redis_port, db=0)
        # Flush the redis database
        # self.redis_con.flushdb()
    except:
        print("Please install redis or check the redis connection configuration")
        sys.exit()
    # Initialize the database connection
    try:
        db_host = self.config.get("db", "host")
        db_port = int(self.config.get("db", "port"))
        db_user = self.config.get("db", "user")
        db_pass = self.config.get("db", "password")
        db_db = self.config.get("db", "db")
        db_charset = self.config.get("db", "charset")
        self.db = pymysql.connect(host=db_host, port=db_port, user=db_user, passwd=db_pass, db=db_db, charset=db_charset)
        self.db_cursor = self.db.cursor()
    except:
        print("Check the database configuration")
        sys.exit()
    # Initialize system settings
    self.max_queue_len = int(self.config.get("sys", "max_queue_len"))
This is the constructor in get_user.py. It initializes the MySQL connection and the Redis connection, verifies login, creates the global session object, loads the system configuration, and sets up multithreading.
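config.ini itself is not reproduced in this article; based on the keys the constructor reads, a compatible file might look roughly like this (all values are placeholders):

[zhihu_account]
username = your_account
password = your_password

[redis]
host = 127.0.0.1
port = 6379

[db]
host = 127.0.0.1
port = 3306
user = root
password = your_db_password
db = zhihu
charset = utf8

[sys]
max_queue_len = 100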
# Fetch the homepage html
def get_index_page(self):
    index_url = 'https://www.zhihu.com/'
    try:
        index_html = self.session.get(index_url, headers=self.headers, timeout=35)
    except Exception as err:
        # On error, report and retry on the next round
        print("Failed to fetch the page, retrying......")
        print(err)
        traceback.print_exc()
        return None
    finally:
        pass
    return index_html.text

# Fetch a single user's detail page
def get_user_page(self, name_url):
    user_page_url = 'https://www.zhihu.com' + str(name_url) + '/about'
    try:
        index_html = self.session.get(user_page_url, headers=self.headers, timeout=35)
    except Exception as err:
        print("Failed name_url: " + str(name_url) + ", could not fetch the page, skipping this user")
        print(err)
        traceback.print_exc()
        return None
    finally:
        pass
    return index_html.text

# Fetch the followers page
def get_follower_page(self, name_url):
    user_page_url = 'https://www.zhihu.com' + str(name_url) + '/followers'
    try:
        index_html = self.session.get(user_page_url, headers=self.headers, timeout=35)
    except Exception as err:
        print("Failed name_url: " + str(name_url) + ", could not fetch the page, skipping this user")
        print(err)
        traceback.print_exc()
        return None
    finally:
        pass
    return index_html.text

# Fetch the followees page
def get_following_page(self, name_url):
    user_page_url = 'https://www.zhihu.com' + str(name_url) + '/followees'
    try:
        index_html = self.session.get(user_page_url, headers=self.headers, timeout=35)
    except Exception as err:
        print("Failed name_url: " + str(name_url) + ", could not fetch the page, skipping this user")
        print(err)
        traceback.print_exc()
        return None
    finally:
        pass
    return index_html.text

# Get the user list on the homepage and save it to redis
def get_index_page_user(self):
    index_html = self.get_index_page()
    if not index_html:
        return
    BS = BeautifulSoup(index_html, "html.parser")
    self.get_xsrf(index_html)
    user_a = BS.find_all("a", class_="author-link")  # get the users' <a> tags
    for a in user_a:
        if a:
            self.add_wait_user(a.get('href'))
        else:
            continue
This part of the code fetches the HTML of each page we need: the homepage, a user's about page, and the follower/followee pages.
# Add a user to the waiting queue; use redis to check whether the user has already been crawled
def add_wait_user(self, name_url):
    # Check whether the user has already been captured
    self.threadLock.acquire()
    if not self.redis_con.hexists('already_get_user', name_url):
        self.counter += 1
        print(name_url + " joining queue")
        self.redis_con.hset('already_get_user', name_url, 1)
        self.redis_con.lpush('user_queue', name_url)
        print("Added user " + name_url + " to the queue")
    self.threadLock.release()

# Remove a user from redis when fetching its page failed
def del_already_user(self, name_url):
    self.threadLock.acquire()
    if not self.redis_con.hexists('already_get_user', name_url):
        self.counter -= 1
        self.redis_con.hdel('already_get_user', name_url)
    self.threadLock.release()
A user is only added to Redis if it has not been crawled yet; del_already_user is called to remove a user from Redis when the database insert for that user fails, so it can be crawled again later.
# Analyze the followers page and fetch all of the user's followers
# @param follower_page comes from get_follower_page(); here we extract the user's hash_id and request the follower API
def get_all_follower(self, name_url):
    follower_page = self.get_follower_page(name_url)
    # Check whether the page was fetched
    if not follower_page:
        return
    BS = BeautifulSoup(follower_page, 'html.parser')
    # Get the number of followers
    follower_num = int(BS.find('span', text='关注者').find_parent().find('strong').get_text())
    # Get the user's hash_id
    hash_id = json.loads(BS.select("#zh-profile-follows-list")[0].select(".zh-general-list")[0].get('data-init'))['params']['hash_id']
    # Get the follower list
    self.get_xsrf(follower_page)  # get xsrf
    post_url = 'https://www.zhihu.com/node/ProfileFollowersListV2'
    # Fetch all followers, 20 per request
    for i in range(0, math.ceil(follower_num / 20) * 20, 20):
        post_data = {
            'method': 'next',
            'params': json.dumps({"offset": i, "order_by": "created", "hash_id": hash_id})
        }
        try:
            j = self.session.post(post_url, params=post_data, headers=self.headers, timeout=35).text.encode('latin-1').decode('unicode-escape')
            pattern = re.compile(r"class=\"zm-item-link-avatar\"[^\"]*\"([^\"]*)", re.DOTALL)
            j = pattern.findall(j)
            for user in j:
                user = user.replace('\\', '')
                self.add_wait_user(user)  # save to redis
        except Exception as err:
            print("Failed to get the followers")
            print(err)
            traceback.print_exc()
            pass

# Get the followee list
def get_all_following(self, name_url):
    following_page = self.get_following_page(name_url)
    # Check whether the page was fetched
    if not following_page:
        return
    BS = BeautifulSoup(following_page, 'html.parser')
    # Get the number of followees
    following_num = int(BS.find('span', text='关注了').find_parent().find('strong').get_text())
    # Get the user's hash_id
    hash_id = json.loads(BS.select("#zh-profile-follows-list")[0].select(".zh-general-list")[0].get('data-init'))['params']['hash_id']
    # Get the followee list
    self.get_xsrf(following_page)  # get xsrf
    post_url = 'https://www.zhihu.com/node/ProfileFolloweesListV2'
    # Fetch all followees, 20 per request
    for i in range(0, math.ceil(following_num / 20) * 20, 20):
        post_data = {
            'method': 'next',
            'params': json.dumps({"offset": i, "order_by": "created", "hash_id": hash_id})
        }
        try:
            j = self.session.post(post_url, params=post_data, headers=self.headers, timeout=35).text.encode('latin-1').decode('unicode-escape')
            pattern = re.compile(r"class=\"zm-item-link-avatar\"[^\"]*\"([^\"]*)", re.DOTALL)
            j = pattern.findall(j)
            for user in j:
                user = user.replace('\\', '')
                self.add_wait_user(user)  # save to redis
        except Exception as err:
            print("Failed to get the followees")
            print(err)
            traceback.print_exc()
            pass
Call the Zhihu API to retrieve the follower list and the followee list, and recursively collect users.
Note that the xsrf token must be set in the request header, otherwise the server returns 403.
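The crawler's own get_xsrf(html) helper is not shown in this excerpt. A minimal sketch of what it needs to do is below; the function name extract_xsrf, the hidden input named _xsrf, and the X-Xsrftoken header name are assumptions, so adjust them to the site's actual requirements:

from bs4 import BeautifulSoup

def extract_xsrf(html, headers):
    # Pull the _xsrf token out of a hidden input and attach it to the shared headers dict
    bs = BeautifulSoup(html, 'html.parser')
    xsrf_input = bs.find(attrs={'name': '_xsrf'})
    if xsrf_input:
        token = xsrf_input.get('value')
        headers['X-Xsrftoken'] = token  # assumed header name; without the token the API returns 403
        return token
    return None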
# Analyze the about page and obtain the user's detailed information
def get_user_info(self, name_url):
    about_page = self.get_user_page(name_url)
    # Check whether the page was fetched
    if not about_page:
        print("Failed to fetch the user detail page, skipping, name_url: " + name_url)
        return
    self.get_xsrf(about_page)
    BS = BeautifulSoup(about_page, 'html.parser')
    # Extract the page data
    try:
        nickname = BS.find("a", class_="name").get_text() if BS.find("a", class_="name") else ''
        user_type = name_url[1:name_url.index('/', 1)]
        self_domain = name_url[name_url.index('/', 1) + 1:]
        gender = 2 if BS.find("i", class_="icon-profile-female") else (1 if BS.find("i", class_="icon-profile-male") else 3)
        follower_num = int(BS.find('span', text='关注者').find_parent().find('strong').get_text())
        following_num = int(BS.find('span', text='关注了').find_parent().find('strong').get_text())
        agree_num = int(re.findall(r'<strong>(.*)</strong>.*赞同', about_page)[0])
        appreciate_num = int(re.findall(r'<strong>(.*)</strong>.*感谢', about_page)[0])
        star_num = int(re.findall(r'<strong>(.*)</strong>.*收藏', about_page)[0])
        share_num = int(re.findall(r'<strong>(.*)</strong>.*分享', about_page)[0])
        browse_num = int(BS.find_all("span", class_="zg-gray-normal")[2].find("strong").get_text())
        trade = BS.find("span", class_="business item").get('title') if BS.find("span", class_="business item") else ''
        company = BS.find("span", class_="employment item").get('title') if BS.find("span", class_="employment item") else ''
        school = BS.find("span", class_="education item").get('title') if BS.find("span", class_="education item") else ''
        major = BS.find("span", class_="education-extra item").get('title') if BS.find("span", class_="education-extra item") else ''
        job = BS.find("span", class_="position item").get_text() if BS.find("span", class_="position item") else ''
        location = BS.find("span", class_="location item").get('title') if BS.find("span", class_="location item") else ''
        description = BS.find("div", class_="bio ellipsis").get('title') if BS.find("div", class_="bio ellipsis") else ''
        ask_num = int(BS.find_all("a", class_='item')[1].find("span").get_text()) if BS.find_all("a", class_='item')[1] else int(0)
        answer_num = int(BS.find_all("a", class_='item')[2].find("span").get_text()) if BS.find_all("a", class_='item')[2] else int(0)
        article_num = int(BS.find_all("a", class_='item')[3].find("span").get_text()) if BS.find_all("a", class_='item')[3] else int(0)
        collect_num = int(BS.find_all("a", class_='item')[4].find("span").get_text()) if BS.find_all("a", class_='item')[4] else int(0)
        public_edit_num = int(BS.find_all("a", class_='item')[5].find("span").get_text()) if BS.find_all("a", class_='item')[5] else int(0)

        replace_data = \
            (pymysql.escape_string(name_url), nickname, self_domain, user_type, gender, follower_num, following_num,
             agree_num, appreciate_num, star_num, share_num, browse_num, trade, company, school, major, job, location,
             pymysql.escape_string(description), ask_num, answer_num, article_num, collect_num, public_edit_num)

        replace_sql = '''REPLACE INTO
                         user(url, nickname, self_domain, user_type,
                         gender, follower, following, agree_num, appreciate_num, star_num, share_num, browse_num,
                         trade, company, school, major, job, location, description,
                         ask_num, answer_num, article_num, collect_num, public_edit_num)
                         VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''

        try:
            print("Fetched data:")
            print(replace_data)
            self.db_cursor.execute(replace_sql, replace_data)
            self.db.commit()
        except Exception as err:
            print("Database insert error")
            print("Fetched data:")
            print(replace_data)
            print("Insert statement: " + self.db_cursor._last_executed)
            self.db.rollback()
            print(err)
            traceback.print_exc()
    except Exception as err:
        print("Failed to parse the data, skipping this user")
        self.redis_con.hdel("already_get_user", name_url)
        self.del_already_user(name_url)
        print(err)
        traceback.print_exc()
        pass
Finally, go to the user's about page, inspect the page elements, and extract the data with regular expressions and BeautifulSoup.
The SQL statement uses REPLACE INTO instead of INSERT INTO, which effectively prevents duplicate rows (it deduplicates on the table's primary or unique key).
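init.sql is not included in this excerpt; a rough sketch of a compatible user table, with column types assumed and url as the primary key (which is what makes REPLACE INTO deduplicate), could be created from Python like this:

import pymysql

# Hypothetical schema sketch: column names follow the REPLACE INTO statement above,
# but the types, lengths, and connection parameters are assumptions.
create_sql = '''CREATE TABLE IF NOT EXISTS user (
    url VARCHAR(255) NOT NULL PRIMARY KEY,
    nickname VARCHAR(255), self_domain VARCHAR(255), user_type VARCHAR(64),
    gender TINYINT, follower INT, following INT,
    agree_num INT, appreciate_num INT, star_num INT, share_num INT, browse_num INT,
    trade VARCHAR(255), company VARCHAR(255), school VARCHAR(255), major VARCHAR(255),
    job VARCHAR(255), location VARCHAR(255), description TEXT,
    ask_num INT, answer_num INT, article_num INT, collect_num INT, public_edit_num INT
) DEFAULT CHARSET=utf8'''

db = pymysql.connect(host='127.0.0.1', user='root', passwd='your_db_password', db='zhihu', charset='utf8')
with db.cursor() as cursor:
    cursor.execute(create_sql)
db.commit()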
# Start capturing users: the main entry point of the program
def entrance(self):
    while 1:
        if int(self.redis_con.llen("user_queue")) < 1:
            self.get_index_page_user()
        else:
            # Pop a user from the queue; redis returns bytes, so decode it to utf-8
            name_url = str(self.redis_con.rpop("user_queue").decode('utf-8'))
            print("Processing name_url: " + name_url)
            self.get_user_info(name_url)
            if int(self.redis_con.llen("user_queue")) <= int(self.max_queue_len):
                self.get_all_follower(name_url)
                self.get_all_following(name_url)
        self.session.cookies.save()

def run(self):
    print(self.name + " is running")
    self.entrance()
Finally, the entry point:
if __name__ == '__main__':
    login = GetUser(999, "login thread")
    threads = []
    for i in range(0, 4):
        m = GetUser(i, "thread" + str(i))
        threads.append(m)
    for i in range(0, 4):
        threads[i].start()
    for i in range(0, 4):
        threads[i].join()
This is where the threads are started; the number of threads is set to 4 here and can be changed as needed (a sketch of making it configurable follows).
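If you want the thread count to be configurable rather than hard-coded, one option is a sketch like the following; the sys/thread_num key is my own addition and not part of the original config.ini, and GetUser is the thread class defined above:

import configparser

# Read the thread count from config.ini, falling back to 4 if the key is absent
config = configparser.ConfigParser()
config.read("config.ini")
thread_num = int(config.get("sys", "thread_num", fallback="4"))

threads = [GetUser(i, "thread" + str(i)) for i in range(thread_num)]
for t in threads:
    t.start()
for t in threads:
    t.join()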
Docker
If setting all of this up is too troublesome, you can refer to the simple base environment I built with Docker.
MySQL and Redis both use the official images:
docker run --name mysql -itd mysql:latest
docker run --name redis -itd redis:latest
The Python image runs under docker-compose; my docker-compose.yml for the python service:
python:
  container_name: python
  build: .
  ports:
    - "84:80"
  external_links:
    - memcache:memcache
    - mysql:mysql
    - redis:redis
  volumes:
    - /docker_containers/python/www:/var/www/html
  tty: true
  stdin_open: true
  extra_hosts:
    - "python:192.168.102.140"
  environment:
    PYTHONIOENCODING: utf-8
Full source code: GitHub https://github.com/kong36088/ZhihuSpider
Site: http://xiazai.jb51.net/201612/yuanma/ZhihuSpider (jb51.net).zip