Packages that need to be used:
beautifulsoup4
html5lib
image
requests
redis
PyMySQL
Install all dependencies with pip:
pip install \
    Image \
    requests \
    beautifulsoup4 \
    html5lib \
    redis \
    pymysql
The runtime environment needs to support Chinese.
Tested under Python 3.5; other environments are not guaranteed to run perfectly.
MySQL and Redis need to be installed.
Configure config.ini: set up MySQL and Redis and fill in your Zhihu account (a sample config.ini sketch follows these steps).
Import init.sql into the database.
Run
Start crawling data: python get_user.py
Check how many users have been crawled: python check_redis.py
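For reference, a minimal config.ini sketch might look like the following. The section and key names mirror what the constructor code below reads (Zhihu_account, Redis, DB, sys), but check the config.ini shipped in the repository for the exact names, and replace every value with your own:

[Zhihu_account]
username = your_zhihu_account
password = your_zhihu_password

[Redis]
host = 127.0.0.1
port = 6379

[DB]
host = 127.0.0.1
port = 3306
user = root
password = your_mysql_password
db = zhihu
charset = utf8

[sys]
max_queue_len = 100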
Effect
General Ideas
1. First, simulate a login to Zhihu and save the login cookies.
2. Fetch the HTML of a page and hand it to the next step for parsing and information extraction.
3. Parse the page and extract the users' personalized URLs, then put them into Redis. (A note on how Redis is used here: each extracted personalized URL is stored in a Redis hash named already_get_user, which marks users that have already been crawled; before crawling a user we check already_get_user to avoid duplicate crawls. At the same time the URL is pushed onto the user_queue list, and whenever a new user is needed one is popped from that queue. A minimal sketch of this idea follows the list.)
4. Get the user's following list and follower list and keep inserting them into Redis.
5. Pop a new user from the Redis user_queue and repeat from step 3.
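To make step 3 concrete, here is a minimal sketch of the dedup-plus-queue idea using redis-py. The hash and list names follow the description above; the project's actual implementation is the add_wait_user method shown later.

import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=0)

def enqueue_user(name_url):
    # Only queue users we have not seen before.
    if not r.hexists('already_get_user', name_url):
        r.hset('already_get_user', name_url, 1)  # mark as seen/crawled
        r.lpush('user_queue', name_url)          # schedule for crawling

def next_user():
    # Pop the next user to crawl; redis-py returns bytes, so decode them.
    raw = r.rpop('user_queue')
    return raw.decode('utf-8') if raw else None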
Simulated login to Zhihu
First comes the login. The login functionality is encapsulated in a login package so it can be conveniently imported and called.
In the header section, Connection is best set to close, otherwise you may run into a "max retries exceeded" error.
The reason is that the default connections are keep-alive, but they are never closed, so they pile up.
# header
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    "Origin": "https://www.zhihu.com/",
    "Upgrade-Insecure-Requests": "1",
    "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
    "Pragma": "no-cache",
    "Accept-Encoding": "gzip, deflate, br",
    'Connection': 'close'
}

# Verify login
def check_login(self):
    check_url = 'https://www.zhihu.com/settings/profile'
    try:
        login_check = self.__session.get(check_url, headers=self.headers, timeout=35)
    except Exception as err:
        print(traceback.print_exc())
        print(err)
        print("Failed to verify login, please check the network")
        sys.exit()
    print("HTTP status code of the login check: " + str(login_check.status_code))
    if int(login_check.status_code) == 200:
        return True
    else:
        return False
We request the profile settings page and check the HTTP status code to verify the login: 200 means logged in, while 304 (a redirect) generally means not logged in.
# Get the captcha
def get_captcha(self):
    t = str(time.time() * 1000)
    captcha_url = 'http://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = self.__session.get(captcha_url, headers=self.headers, timeout=35)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
        f.close()
    # Display the captcha with Pillow's Image;
    # if Pillow is not installed, go to the source directory, find captcha.jpg and enter it manually
    try:
        im = Image.open('captcha.jpg')
        im.show()
        im.close()
    except:
        print(u'Please go to %s, find captcha.jpg and enter it manually' % os.path.abspath('captcha.jpg'))
    captcha = input("Please enter the captcha\n> ")
    return captcha
This is the method that fetches the captcha. When there have been too many login attempts, Zhihu may require a captcha to be entered; this function implements that.
# Get xsrf
def get_xsrf(self):
    index_url = 'http://www.zhihu.com'
    # Get the _xsrf needed for login
    try:
        index_page = self.__session.get(index_url, headers=self.headers, timeout=35)
    except:
        print('Failed to fetch the Zhihu home page, please check the network connection')
        sys.exit()
    html = index_page.text
    # _xsrf is returned here as a list
    bs = BeautifulSoup(html, 'html.parser')
    xsrf_input = bs.find(attrs={'name': '_xsrf'})
    pattern = r'value=\"(.*?)\"'
    print(xsrf_input)
    self.__xsrf = re.findall(pattern, str(xsrf_input))
    return self.__xsrf[0]
Get the xsrf token. Why get xsrf? Because xsrf is a defence against cross-site request forgery; a detailed introduction can be found under CSRF.
After acquiring the xsrf, save it into a cookie, and send it as a header when calling Zhihu's APIs, otherwise the server will return 403.
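A minimal sketch of that idea follows. The header name X-Xsrftoken and the placeholder api_url/payload are assumptions for illustration only; the actual handling is in the login code below.

import requests

session = requests.Session()
xsrf = '...token scraped from the home page...'  # placeholder value

session.cookies.set('_xsrf', xsrf)     # keep the token in the cookie jar
headers = {'X-Xsrftoken': xsrf}        # and send it as a request header when calling the API
# session.post(api_url, data=payload, headers=headers, timeout=35)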
# Simulated login
def do_login(self):
    try:
        # Already logged in?
        if self.check_login():
            print('You are already logged in')
            return
        else:
            if self.config.get("Zhihu_account", "username") and self.config.get("Zhihu_account", "password"):
                self.username = self.config.get("Zhihu_account", "username")
                self.password = self.config.get("Zhihu_account", "password")
            else:
                self.username = input('Please enter your username\n> ')
                self.password = input("Please enter your password\n> ")
    except Exception as err:
        print(traceback.print_exc())
        print(err)
        sys.exit()
    if re.match(r"^1\d{10}$", self.username):
        print("Logging in with a phone number\n")
        post_url = 'http://www.zhihu.com/login/phone_num'
        postdata = {
            '_xsrf': self.get_xsrf(),
            'password': self.password,
            'remember_me': 'true',
            'phone_num': self.username,
        }
    else:
        print("Logging in with an email address\n")
        post_url = 'http://www.zhihu.com/login/email'
        postdata = {
            '_xsrf': self.get_xsrf(),
            'password': self.password,
            'remember_me': 'true',
            'email': self.username,
        }
    try:
        login_page = self.__session.post(post_url, postdata, headers=self.headers, timeout=35)
        login_text = json.loads(login_page.text.encode('latin-1').decode('unicode-escape'))
        print(postdata)
        print(login_text)
        # A captcha may be required; r = 0 means login succeeded
        if login_text['r'] == 1:
            sys.exit()
    except:
        postdata['captcha'] = self.get_captcha()
        login_page = self.__session.post(post_url, postdata, headers=self.headers, timeout=35)
        print(json.loads(login_page.text.encode('latin-1').decode('unicode-escape')))
    # Save the login cookies
    self.__session.cookies.save()
This is the core login function. It relies heavily on the requests library, which makes it very convenient to keep everything in a session.
We use a singleton-style pattern here: the same requests.Session object is shared by all functions, which keeps the login state consistent.
The final code that calls the login looks roughly like this:
# Create the login object
lo = login.login.Login(self.session)
# Simulated login
if lo.check_login():
    print('You are already logged in')
else:
    if self.config.get("Zhihu_account", "username") and self.config.get("Zhihu_account", "password"):
        username = self.config.get("Zhihu_account", "username")
        password = self.config.get("Zhihu_account", "password")
    else:
        username = input('Please enter your username\n> ')
        password = input("Please enter your password\n> ")
    lo.do_login(username, password)
That completes the simulated login to Zhihu.
Crawling Zhihu users
def __init__(self, threadID=1, name=''):
    # Multithreading
    print("Thread " + str(threadID) + " initializing")
    threading.Thread.__init__(self)
    self.threadID = threadID
    self.name = name
    try:
        print("Thread " + str(threadID) + " initialized successfully")
    except Exception as err:
        print(err)
        print("Thread " + str(threadID) + " failed to start")
    self.threadLock = threading.Lock()
    # Read the configuration
    self.config = configparser.ConfigParser()
    self.config.read("config.ini")
    # Initialize the session
    requests.adapters.DEFAULT_RETRIES = 5
    self.session = requests.Session()
    self.session.cookies = cookielib.LWPCookieJar(filename='cookie')
    self.session.keep_alive = False
    try:
        self.session.cookies.load(ignore_discard=True)
    except:
        print('Failed to load cookies')
    finally:
        pass
    # Create the login object and log in
    lo = login(self.session)
    lo.do_login()
    # Initialize the Redis connection
    try:
        redis_host = self.config.get("Redis", "host")
        redis_port = self.config.get("Redis", "port")
        self.redis_con = redis.Redis(host=redis_host, port=redis_port, db=0)
        # Flush the Redis database
        # self.redis_con.flushdb()
    except:
        print("Please install Redis or check the Redis connection configuration")
        sys.exit()
    # Initialize the database connection
    try:
        db_host = self.config.get("DB", "host")
        db_port = int(self.config.get("DB", "port"))
        db_user = self.config.get("DB", "user")
        db_pass = self.config.get("DB", "password")
        db_db = self.config.get("DB", "db")
        db_charset = self.config.get("DB", "charset")
        self.db = pymysql.connect(host=db_host, port=db_port, user=db_user,
                                  passwd=db_pass, db=db_db, charset=db_charset)
        self.db_cursor = self.db.cursor()
    except:
        print("Please check the database configuration")
        sys.exit()
    # Initialize system settings
    self.max_queue_len = int(self.config.get("sys", "max_queue_len"))
This is the constructor of get_user.py. Its main jobs are to initialize the MySQL connection and the Redis connection, verify the login, create the global session object, import the system configuration and start the threads.
# Get the home page HTML
def get_index_page(self):
    index_url = 'https://www.zhihu.com/'
    try:
        index_html = self.session.get(index_url, headers=self.headers, timeout=35)
    except Exception as err:
        # Retry on failure
        print("Failed to fetch the page, retrying...")
        print(err)
        traceback.print_exc()
        return None
    finally:
        pass
    return index_html.text

# Get a single user's detail page
def get_user_page(self, name_url):
    user_page_url = 'https://www.zhihu.com' + str(name_url) + '/about'
    try:
        index_html = self.session.get(user_page_url, headers=self.headers, timeout=35)
    except Exception as err:
        print("Failed name_url: " + str(name_url) + ", could not fetch the page, discarding this user")
        print(err)
        traceback.print_exc()
        return None
    finally:
        pass
    return index_html.text

# Get the followers page
def get_follower_page(self, name_url):
    user_page_url = 'https://www.zhihu.com' + str(name_url) + '/followers'
    try:
        index_html = self.session.get(user_page_url, headers=self.headers, timeout=35)
    except Exception as err:
        print("Failed name_url: " + str(name_url) + ", could not fetch the page, discarding this user")
        print(err)
        traceback.print_exc()
        return None
    finally:
        pass
    return index_html.text

# Get the following page
def get_following_page(self, name_url):
    user_page_url = 'https://www.zhihu.com' + str(name_url) + '/followees'
    try:
        index_html = self.session.get(user_page_url, headers=self.headers, timeout=35)
    except Exception as err:
        print("Failed name_url: " + str(name_url) + ", could not fetch the page, discarding this user")
        print(err)
        traceback.print_exc()
        return None
    finally:
        pass
    return index_html.text

# Get the list of users on the home page and put them into Redis
def get_index_page_user(self):
    index_html = self.get_index_page()
    if not index_html:
        return
    bs = BeautifulSoup(index_html, "html.parser")
    self.get_xsrf(index_html)
    user_a = bs.find_all("a", class_="author-link")  # Get the users' <a> tags
    for a in user_a:
        if a:
            self.add_wait_user(a.get('href'))
        else:
            continue
This part of the code is what fetches the HTML of each page.
# Add a user to the to-be-crawled queue, first using Redis to check whether they have already been crawled
def add_wait_user(self, name_url):
    # Check whether the user has already been crawled
    self.threadLock.acquire()
    if not self.redis_con.hexists('already_get_user', name_url):
        self.counter += 1
        print(name_url + " joining the queue")
        self.redis_con.hset('already_get_user', name_url, 1)
        self.redis_con.lpush('user_queue', name_url)
        print("Added user " + name_url + " to the queue")
    self.threadLock.release()
# Remove a user from Redis when fetching their page failed
def del_already_user(self, name_url):
    self.threadLock.acquire()
    if not self.redis_con.hexists('already_get_user', name_url):
        self.counter -= 1
        self.redis_con.hdel('already_get_user', name_url)
    self.threadLock.release()
These are the Redis operations for adding users; when a database insert fails we call del_already_user to remove the user that caused the error.
# Parse the followers page and get all of the user's followers
# The page comes from get_follower_page(); here we extract the user's hash_id
# and request the follower interface to get the follower information
def get_all_follower(self, name_url):
    follower_page = self.get_follower_page(name_url)
    # Check that we actually got the page
    if not follower_page:
        return
    bs = BeautifulSoup(follower_page, 'html.parser')
    # Get the number of followers
    follower_num = int(bs.find('span', text='关注者').find_parent().find('strong').get_text())
    # Get the user's hash_id
    hash_id = \
        json.loads(bs.select("#zh-profile-follows-list")[0].select(".zh-general-list")[0].get('data-init'))[
            'params']['hash_id']
    # Get the follower list
    self.get_xsrf(follower_page)  # Get xsrf
    post_url = 'https://www.zhihu.com/node/ProfileFollowersListV2'
    # Fetch all followers in pages of 20, up to math.ceil(follower_num / 20) * 20
    for i in range(0, math.ceil(follower_num / 20) * 20, 20):
        post_data = {
            'method': 'next',
            'params': json.dumps({"offset": i, "order_by": "created", "hash_id": hash_id})
        }
        try:
            j = self.session.post(post_url, params=post_data, headers=self.headers,
                                  timeout=35).text.encode('latin-1').decode('unicode-escape')
            pattern = re.compile(r"class=\"zm-item-link-avatar\"[^\"]*\"([^\"]*)", re.DOTALL)
            j = pattern.findall(j)
            for user in j:
                user = user.replace('\\', '')
                self.add_wait_user(user)  # Save to Redis
        except Exception as err:
            print("Failed to get followers")
            print(err)
            traceback.print_exc()
            pass

# Get the following list
def get_all_following(self, name_url):
    following_page = self.get_following_page(name_url)
    # Check that we actually got the page
    if not following_page:
        return
    bs = BeautifulSoup(following_page, 'html.parser')
    # Get the number of users being followed
    following_num = int(bs.find('span', text='关注').find_parent().find('strong').get_text())
    # Get the user's hash_id
    hash_id = \
        json.loads(bs.select("#zh-profile-follows-list")[0].select(".zh-general-list")[0].get('data-init'))[
            'params']['hash_id']
    # Get the following list
    self.get_xsrf(following_page)  # Get xsrf
    post_url = 'https://www.zhihu.com/node/ProfileFolloweesListV2'
    # Fetch everyone the user follows in pages of 20, up to math.ceil(following_num / 20) * 20
    for i in range(0, math.ceil(following_num / 20) * 20, 20):
        post_data = {
            'method': 'next',
            'params': json.dumps({"offset": i, "order_by": "created", "hash_id": hash_id})
        }
        try:
            j = self.session.post(post_url, params=post_data, headers=self.headers,
                                  timeout=35).text.encode('latin-1').decode('unicode-escape')
            pattern = re.compile(r"class=\"zm-item-link-avatar\"[^\"]*\"([^\"]*)", re.DOTALL)
            j = pattern.findall(j)
            for user in j:
                user = user.replace('\\', '')
                self.add_wait_user(user)  # Save to Redis
        except Exception as err:
            print("Failed to get the following list")
            print(err)
            traceback.print_exc()
            pass
Here we invoke Zhihu's interface to get the full following list and follower list of a user, and recursively fetch new users from them.
Note that the request headers must carry the xsrf token, otherwise the server will throw a 403.
# Parse the about page and get the user's detailed information
def get_user_info(self, name_url):
    about_page = self.get_user_page(name_url)
    # Check that we actually got the page
    if not about_page:
        print("Failed to get the user detail page, skipping, name_url: " + name_url)
        return
    self.get_xsrf(about_page)
    bs = BeautifulSoup(about_page, 'html.parser')
    # Extract the concrete data from the page
    try:
        nickname = bs.find("a", class_="name").get_text() if bs.find("a", class_="name") else ''
        user_type = name_url[1:name_url.index('/', 1)]
        self_domain = name_url[name_url.index('/', 1) + 1:]
        gender = 2 if bs.find("i", class_="icon icon-profile-female") else (
            1 if bs.find("i", class_="icon icon-profile-male") else 3)
        follower_num = int(bs.find('span', text='关注者').find_parent().find('strong').get_text())
        following_num = int(bs.find('span', text='关注').find_parent().find('strong').get_text())
        agree_num = int(re.findall(r'<strong>(.*)</strong>.*赞同', about_page)[0])
        appreciate_num = int(re.findall(r'<strong>(.*)</strong>.*感谢', about_page)[0])
        star_num = int(re.findall(r'<strong>(.*)</strong>.*收藏', about_page)[0])
        share_num = int(re.findall(r'<strong>(.*)</strong>.*分享', about_page)[0])
        browse_num = int(bs.find_all("span", class_="zg-gray-normal")[2].find("strong").get_text())
        trade = bs.find("span", class_="business item").get('title') \
            if bs.find("span", class_="business item") else ''
        company = bs.find("span", class_="employment item").get('title') \
            if bs.find("span", class_="employment item") else ''
        school = bs.find("span", class_="education item").get('title') \
            if bs.find("span", class_="education item") else ''
        major = bs.find("span", class_="education-extra item").get('title') \
            if bs.find("span", class_="education-extra item") else ''
        job = bs.find("span", class_="position item").get_text() \
            if bs.find("span", class_="position item") else ''
        location = bs.find("span", class_="location item").get('title') \
            if bs.find("span", class_="location item") else ''
        description = bs.find("div", class_="bio ellipsis").get('title') \
            if bs.find("div", class_="bio ellipsis") else ''
        ask_num = int(bs.find_all("a", class_='item')[1].find("span").get_text()) if \
            bs.find_all("a", class_='item')[1] else int(0)
        answer_num = int(bs.find_all("a", class_='item')[2].find("span").get_text()) if \
            bs.find_all("a", class_='item')[2] else int(0)
        article_num = int(bs.find_all("a", class_='item')[3].find("span").get_text()) if \
            bs.find_all("a", class_='item')[3] else int(0)
        collect_num = int(bs.find_all("a", class_='item')[4].find("span").get_text()) if \
            bs.find_all("a", class_='item')[4] else int(0)
        public_edit_num = int(bs.find_all("a", class_='item')[5].find("span").get_text()) if \
            bs.find_all("a", class_='item')[5] else int(0)

        replace_data = \
            (pymysql.escape_string(name_url), nickname, self_domain, user_type,
             gender, follower_num, following_num, agree_num, appreciate_num,
             star_num, share_num, browse_num,
             trade, company, school, major, job, location,
             pymysql.escape_string(description),
             ask_num, answer_num, article_num, collect_num, public_edit_num)

        replace_sql = '''REPLACE INTO
                         user(url, nickname, self_domain, user_type,
                              gender, follower, following, agree_num, appreciate_num,
                              star_num, share_num, browse_num,
                              trade, company, school, major, job, location, description,
                              ask_num, answer_num, article_num, collect_num, public_edit_num)
                         VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''

        try:
            print("Fetched data:")
            print(replace_data)
            self.db_cursor.execute(replace_sql, replace_data)
            self.db.commit()
        except Exception as err:
            print("Database insert error")
            print("Fetched data:")
            print(replace_data)
            print("Insert statement: " + self.db_cursor._last_executed)
            self.db.rollback()
            print(err)
            traceback.print_exc()
    except Exception as err:
        print("Failed to extract data, skipping this user")
        self.redis_con.hdel("already_get_user", name_url)
        self.del_already_user(name_url)
        print(err)
        traceback.print_exc()
        pass
Finally, we go to the user's about page, analyze the page elements, and use regular expressions or BeautifulSoup to extract the data we want.
Note that the SQL statement uses REPLACE INTO instead of INSERT INTO, which is a convenient way to avoid duplicate records. A short sketch of the difference follows.
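A minimal sketch of the idea, assuming a user table where url is a primary or unique key (the column list and connection values here are trimmed placeholders for illustration, not the project's full schema):

import pymysql

db = pymysql.connect(host='127.0.0.1', user='root', passwd='secret', db='zhihu', charset='utf8')
cursor = db.cursor()

# INSERT INTO would raise a duplicate-key error the second time the same url is written.
# REPLACE INTO deletes the old row for that url and inserts the new one,
# so re-crawling a user simply refreshes their record.
sql = "REPLACE INTO user(url, nickname) VALUES (%s, %s)"
cursor.execute(sql, ('/people/example-user', 'Example User'))
db.commit()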
# Start crawling users; the main entry point of the crawler
def entrance(self):
    while 1:
        if int(self.redis_con.llen("user_queue")) < 1:
            self.get_index_page_user()
        else:
            # Pop a user's name_url from the queue; Redis returns bytes, which must be decoded as utf-8
            name_url = str(self.redis_con.rpop("user_queue").decode('utf-8'))
            print("Processing name_url: " + name_url)
            self.get_user_info(name_url)
            if int(self.redis_con.llen("user_queue")) <= int(self.max_queue_len):
                self.get_all_follower(name_url)
                self.get_all_following(name_url)
            self.session.cookies.save()
def run(self):
    print(self.name + " is running")
    self.entrance()
Finally, the entry point:
if __name__ == '__main__':
    login = GetUser(999, "Login thread")
    threads = []
    for i in range(0, 4):
        m = GetUser(i, "Thread" + str(i))
        threads.append(m)
    for i in range(0, 4):
        threads[i].start()
    for i in range(0, 4):
        threads[i].join()
This is where the multithreading is started; to change how many threads run, just change the number 4. A sketch of making that count configurable follows.
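As a small optional variation (a sketch only; it assumes a thread_num key were added under the sys section of config.ini, which the original project does not have, and reuses the GetUser class from above), the thread count could be read from the configuration instead of being hard-coded:

import configparser

config = configparser.ConfigParser()
config.read("config.ini")
# Fall back to 4 threads when the key is absent.
thread_num = int(config.get("sys", "thread_num", fallback="4"))

threads = [GetUser(i, "Thread" + str(i)) for i in range(thread_num)]
for t in threads:
    t.start()
for t in threads:
    t.join()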
Docker
If setting all this up is too much trouble, you can refer to how I built a basic environment with Docker:
Both MySQL and Redis use the official images.
docker run --name mysql -itd mysql:latest
docker run --name redis -itd redis:latest
Then use docker-compose to run the Python image; my Python docker-compose.yml:
python:
  container_name: python
  build: .
  ports:
    - "84:80"
  external_links:
    - memcache:memcache
    - mysql:mysql
    - redis:redis
  volumes:
    - /docker_containers/python/www:/var/www/html
  tty: true
  stdin_open: true
  extra_hosts:
    - "python:192.168.102.140"
  environment:
    PYTHONIOENCODING: utf-8
Finally, the source code: GitHub https://github.com/kong36088/ZhihuSpider
Site download address: http://xiazai.jb51.net/201612/yuanma/ZhihuSpider(jb51.net).zip