Python implements multithreaded crawling of Zhihu users

Source: Internet
Author: User
Tags: redis, docker run, install redis

Packages that need to be used:

beautifulsoup4
html5lib
image
requests
redis
PyMySQL

Install all dependent packages with pip:

pip install \
    Image \
    requests \
    beautifulsoup4 \
    html5lib \
    redis \
    pymysql

The operating environment needs to support Chinese.

Tested under Python 3.5; other environments are not guaranteed to run perfectly.

MySQL and Redis need to be installed.

Configure the config.ini file: set up MySQL and Redis, and fill in your Zhihu account. An illustrative example follows.
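An illustrative config.ini, assuming the section and key names that the code in this article reads (Zhihu_account, Redis, db, sys); all values are placeholders:

[Zhihu_account]
username = your_zhihu_account
password = your_zhihu_password

[Redis]
host = 127.0.0.1
port = 6379

[db]
host = 127.0.0.1
port = 3306
user = root
password = secret
db = zhihu
charset = utf8

[sys]
max_queue_len = 200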

Import init.sql into the database.

Run

Start crawling data: python get_user.py
View the number of crawled users: python check_redis.py

Effect

General Ideas

1. First, simulate logging in to Zhihu and save the login cookies.
2. Fetch the HTML of a page and hand it to the next step for parsing and information extraction.
3. Parse the page and extract the users' personalized URLs, then put them into Redis. (A note on how Redis is used here: each extracted personalized URL is stored in a Redis hash named already_get_user, which marks users that have already been crawled; before crawling a user we check already_get_user to avoid duplicate crawls. At the same time the URL is pushed onto the user_queue list, and when a new user is needed it is popped from that queue. A minimal sketch of this pattern follows the list.)
4. Fetch the user's following list and followers list and keep inserting them into Redis.
5. Pop a new user from the Redis user_queue and repeat from step 3.
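A minimal sketch of the already_get_user / user_queue pattern described in step 3, assuming a local Redis instance; the key names follow the article, everything else is illustrative:

import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=0)

def add_wait_user(name_url):
    # Only enqueue users we have not seen before
    if not r.hexists('already_get_user', name_url):
        r.hset('already_get_user', name_url, 1)   # mark as already crawled
        r.lpush('user_queue', name_url)           # queue for crawling

def next_user():
    # Pop the next user to crawl; redis-py returns bytes, so decode
    raw = r.rpop('user_queue')
    return raw.decode('utf-8') if raw else None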

Simulated login to Zhihu

First comes the login. The login logic is encapsulated in a login package so it can be integrated and called conveniently.

In the headers, Connection is best set to close, otherwise you may run into a "max retries exceeded" error.
The reason is that the default connection is keep-alive, but the connections are not being closed.

# headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    "Origin": "https://www.zhihu.com/",
    "Upgrade-Insecure-Requests": "1",
    "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
    "Pragma": "no-cache",
    "Accept-Encoding": "gzip, deflate, br",
    'Connection': 'close'
}

# Verify login
def check_login(self):
    check_url = 'https://www.zhihu.com/settings/profile'
    try:
        login_check = self.__session.get(check_url, headers=self.headers, timeout=35)
    except Exception as err:
        print(traceback.print_exc())
        print(err)
        print("Login verification failed, check the network")
        sys.exit()
    print("HTTP status code for the login check: " + str(login_check.status_code))
    if int(login_check.status_code) == 200:
        return True
    else:
        return False

We request the profile settings page and check the HTTP status code to verify the login: 200 means logged in, while 304 generally means a redirect, i.e. not logged in.

# Get the captcha
def get_captcha(self):
    t = str(time.time() * 1000)
    captcha_url = 'http://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = self.__session.get(captcha_url, headers=self.headers, timeout=35)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
        f.close()
    # Display the captcha with Pillow's Image;
    # if Pillow is not installed, go to the source directory, find captcha.jpg and enter it manually
    try:
        im = Image.open('captcha.jpg')
        im.show()
        im.close()
    except:
        print(u'Please go to %s, find captcha.jpg and enter it manually' % os.path.abspath('captcha.jpg'))
    captcha = input("Please enter the captcha\n> ")
    return captcha

This is the method for getting the captcha. When there have been too many login attempts, a captcha may be required, so this function implements that.

# Get xsrf
def get_xsrf(self):
    index_url = 'http://www.zhihu.com'
    # Get the _xsrf needed for logging in
    try:
        index_page = self.__session.get(index_url, headers=self.headers, timeout=35)
    except:
        print('Failed to get the Zhihu page, check the network connection')
        sys.exit()
    html = index_page.text
    # _xsrf is returned as a list here
    bs = BeautifulSoup(html, 'html.parser')
    xsrf_input = bs.find(attrs={'name': '_xsrf'})
    pattern = r'value=\"(.*?)\"'
    print(xsrf_input)
    self.__xsrf = re.findall(pattern, str(xsrf_input))
    return self.__xsrf[0]

Get the xsrf. Why do we need it? Because xsrf is a mechanism against cross-site request forgery; for details see CSRF.
After acquiring the xsrf, save it into a cookie, and carry the xsrf when calling the API, otherwise a 403 will be returned. A minimal illustration of carrying the token follows; the real do_login code comes after it.
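The sketch below only illustrates attaching the token to a request with requests; the 'X-Xsrftoken' header name is an assumption not confirmed by the article, while the '_xsrf' form field matches the login code below:

import requests

session = requests.Session()
xsrf = 'token-extracted-by-get_xsrf'       # placeholder value
headers = {'X-Xsrftoken': xsrf}            # assumed header name
postdata = {'_xsrf': xsrf, 'method': 'next'}
resp = session.post('https://www.zhihu.com/node/ProfileFollowersListV2',
                    data=postdata, headers=headers, timeout=35)
print(resp.status_code)                    # 403 here usually means the token was missing or invalid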

# Simulate login
def do_login(self):
    try:
        # Already logged in?
        if self.check_login():
            print('You are already logged in')
            return
        else:
            if self.config.get("Zhihu_account", "username") and self.config.get("Zhihu_account", "password"):
                self.username = self.config.get("Zhihu_account", "username")
                self.password = self.config.get("Zhihu_account", "password")
            else:
                self.username = input('Please enter your username\n> ')
                self.password = input("Please enter your password\n> ")
    except Exception as err:
        print(traceback.print_exc())
        print(err)
        sys.exit()
    if re.match(r"^1\d{10}$", self.username):
        print("Mobile number login\n")
        post_url = 'http://www.zhihu.com/login/phone_num'
        postdata = {
            '_xsrf': self.get_xsrf(),
            'password': self.password,
            'remember_me': 'true',
            'phone_num': self.username,
        }
    else:
        print("Email login\n")
        post_url = 'http://www.zhihu.com/login/email'
        postdata = {
            '_xsrf': self.get_xsrf(),
            'password': self.password,
            'remember_me': 'true',
            'email': self.username,
        }
    try:
        login_page = self.__session.post(post_url, postdata, headers=self.headers, timeout=35)
        login_text = json.loads(login_page.text.encode('latin-1').decode('unicode-escape'))
        print(postdata)
        print(login_text)
        # A captcha may be required; r == 0 is the success code
        if login_text['r'] == 1:
            sys.exit()
    except:
        postdata['captcha'] = self.get_captcha()
        login_page = self.__session.post(post_url, postdata, headers=self.headers, timeout=35)
        print(json.loads(login_page.text.encode('latin-1').decode('unicode-escape')))
    # Save the login cookies
    self.__session.cookies.save()

This is the core of the login functionality. The requests library is used, which makes it very convenient to keep everything in a session.
We use a singleton-style approach here: all functions share the same requests.Session object, which keeps the login state consistent. A minimal sketch of the idea follows.
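The sketch below only illustrates sharing one session (and persisting its cookies) across helper functions; the names are illustrative, not the article's:

import requests
import http.cookiejar as cookielib

session = requests.Session()
session.cookies = cookielib.LWPCookieJar(filename='cookie')   # persist cookies to disk

def fetch(url, headers=None):
    # Every request goes through the same session, so cookies set at login
    # are reused automatically by later requests.
    return session.get(url, headers=headers, timeout=35)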

Finally, the code that calls the login is:

# Create the login object
lo = login.login.Login(self.session)
# Simulate login
if lo.check_login():
    print('You are already logged in')
else:
    if self.config.get("Zhihu_account", "username") and self.config.get("Zhihu_account", "password"):
        username = self.config.get("Zhihu_account", "username")
        password = self.config.get("Zhihu_account", "password")
    else:
        username = input('Please enter your username\n> ')
        password = input("Please enter your password\n> ")
    lo.do_login(username, password)

With that, the Zhihu simulated login is done.

Crawling Zhihu users

def __init__(self, threadID=1, name=''):
    # Multithreading
    print("Thread " + str(threadID) + " initializing")
    threading.Thread.__init__(self)
    self.threadID = threadID
    self.name = name
    try:
        print("Thread " + str(threadID) + " initialized successfully")
    except Exception as err:
        print(err)
        print("Thread " + str(threadID) + " failed to start")
    self.threadLock = threading.Lock()
    # Read the configuration
    self.config = configparser.ConfigParser()
    self.config.read("config.ini")
    # Initialize the session
    requests.adapters.DEFAULT_RETRIES = 5
    self.session = requests.Session()
    self.session.cookies = cookielib.LWPCookieJar(filename='cookie')
    self.session.keep_alive = False
    try:
        self.session.cookies.load(ignore_discard=True)
    except:
        print('Cookie failed to load')
    finally:
        pass
    # Create the login object
    lo = login(self.session)
    lo.do_login()
    # Initialize the Redis connection
    try:
        redis_host = self.config.get("Redis", "host")
        redis_port = self.config.get("Redis", "port")
        self.redis_con = redis.Redis(host=redis_host, port=redis_port, db=0)
        # Flush the Redis database
        # self.redis_con.flushdb()
    except:
        print("Please install Redis or check the Redis connection configuration")
        sys.exit()
    # Initialize the database connection
    try:
        db_host = self.config.get("db", "host")
        db_port = int(self.config.get("db", "port"))
        db_user = self.config.get("db", "user")
        db_pass = self.config.get("db", "password")
        db_db = self.config.get("db", "db")
        db_charset = self.config.get("db", "charset")
        self.db = pymysql.connect(host=db_host, port=db_port, user=db_user,
                                  passwd=db_pass, db=db_db, charset=db_charset)
        self.db_cursor = self.db.cursor()
    except:
        print("Please check the database configuration")
        sys.exit()
    # Initialize system settings
    self.max_queue_len = int(self.config.get("sys", "max_queue_len"))

This is the get_user.py constructor. Its main job is to initialize the MySQL connection, the Redis connection, verify the login, create the global session object, import the system configuration, and set up multithreading.

# Get the home page HTML
def get_index_page(self):
    index_url = 'https://www.zhihu.com/'
    try:
        index_html = self.session.get(index_url, headers=self.headers, timeout=35)
    except Exception as err:
        # Retry on unexpected errors
        print("Failed to get the page, retrying...")
        print(err)
        traceback.print_exc()
        return None
    finally:
        pass
    return index_html.text

# Get a single user's detail page
def get_user_page(self, name_url):
    user_page_url = 'https://www.zhihu.com' + str(name_url) + '/about'
    try:
        index_html = self.session.get(user_page_url, headers=self.headers, timeout=35)
    except Exception as err:
        print("Failed name_url: " + str(name_url) + ", page fetch failed, giving up this user")
        print(err)
        traceback.print_exc()
        return None
    finally:
        pass
    return index_html.text

# Get the followers page
def get_follower_page(self, name_url):
    user_page_url = 'https://www.zhihu.com' + str(name_url) + '/followers'
    try:
        index_html = self.session.get(user_page_url, headers=self.headers, timeout=35)
    except Exception as err:
        print("Failed name_url: " + str(name_url) + ", page fetch failed, giving up this user")
        print(err)
        traceback.print_exc()
        return None
    finally:
        pass
    return index_html.text

def get_following_page(self, name_url):
    user_page_url = 'https://www.zhihu.com' + str(name_url) + '/followers'
    try:
        index_html = self.session.get(user_page_url, headers=self.headers, timeout=35)
    except Exception as err:
        print("Failed name_url: " + str(name_url) + ", page fetch failed, giving up this user")
        print(err)
        traceback.print_exc()
        return None
    finally:
        pass
    return index_html.text

# Get the list of users on the home page and put them into Redis
def get_index_page_user(self):
    index_html = self.get_index_page()
    if not index_html:
        return
    bs = BeautifulSoup(index_html, "html.parser")
    self.get_xsrf(index_html)
    user_a = bs.find_all("a", class_="author-link")  # Get the users' <a> tags
    for a in user_a:
        if a:
            self.add_wait_user(a.get('href'))
        else:
            continue

This part of the code fetches the HTML of each page to be crawled.

# Add to the waiting-to-crawl queue; first use Redis to check whether the user has already been crawled
def add_wait_user(self, name_url):
    # Check whether it has already been crawled
    self.threadLock.acquire()
    if not self.redis_con.hexists('already_get_user', name_url):
        self.counter += 1
        print(name_url + " joining the queue")
        self.redis_con.hset('already_get_user', name_url, 1)
        self.redis_con.lpush('user_queue', name_url)
        print("Added user " + name_url + " to the queue")
    self.threadLock.release()

# Remove from Redis when fetching the page fails
def del_already_user(self, name_url):
    self.threadLock.acquire()
    if not self.redis_con.hexists('already_get_user', name_url):
        self.counter -= 1
        self.redis_con.hdel('already_get_user', name_url)
    self.threadLock.release()

These are the Redis operations for adding a user; when a database insert fails we call del_already_user to delete the incorrectly inserted user.

# Parse the followers page and get all of a user's followers
# @param follower_page  the page returned by get_follower_page(); here we get the user's
#                       hash_id and request the followers API to get follower information
def get_all_follower(self, name_url):
    follower_page = self.get_follower_page(name_url)
    # Check whether the page was fetched
    if not follower_page:
        return
    bs = BeautifulSoup(follower_page, 'html.parser')
    # Get the number of followers
    follower_num = int(bs.find('span', text='followers').find_parent().find('strong').get_text())
    # Get the user's hash_id
    hash_id = \
        json.loads(bs.select("#zh-profile-follows-list")[0].select(".zh-general-list")[0].get('data-init'))[
            'params']['hash_id']
    # Get the followers list
    self.get_xsrf(follower_page)  # get xsrf
    post_url = 'https://www.zhihu.com/node/ProfileFollowersListV2'
    # Start getting all followers, math.ceil(follower_num / 20) * 20
    for i in range(0, math.ceil(follower_num / 20) * 20):
        post_data = {
            'method': 'next',
            'params': json.dumps({"offset": i, "order_by": "created", "hash_id": hash_id})
        }
        try:
            j = self.session.post(post_url, params=post_data, headers=self.headers,
                                  timeout=35).text.encode('latin-1').decode('unicode-escape')
            pattern = re.compile(r"class=\"zm-item-link-avatar\"[^\"]*\"([^\"]*)\"", re.DOTALL)
            j = pattern.findall(j)
            for user in j:
                user = user.replace('\\', '')
                self.add_wait_user(user)  # Save to Redis
        except Exception as err:
            print("Failed to get followers")
            print(err)
            traceback.print_exc()
            pass

# Get the following list
def get_all_following(self, name_url):
    following_page = self.get_following_page(name_url)
    # Check whether the page was fetched
    if not following_page:
        return
    bs = BeautifulSoup(following_page, 'html.parser')
    # Get the number of followees
    following_num = int(bs.find('span', text='attention').find_parent().find('strong').get_text())
    # Get the user's hash_id
    hash_id = \
        json.loads(bs.select("#zh-profile-follows-list")[0].select(".zh-general-list")[0].get('data-init'))[
            'params']['hash_id']
    # Get the followees list
    self.get_xsrf(following_page)  # get xsrf
    post_url = 'https://www.zhihu.com/node/ProfileFolloweesListV2'
    # Start getting all followees, math.ceil(following_num / 20) * 20
    for i in range(0, math.ceil(following_num / 20) * 20):
        post_data = {
            'method': 'next',
            'params': json.dumps({"offset": i, "order_by": "created", "hash_id": hash_id})
        }
        try:
            j = self.session.post(post_url, params=post_data, headers=self.headers,
                                  timeout=35).text.encode('latin-1').decode('unicode-escape')
            pattern = re.compile(r"class=\"zm-item-link-avatar\"[^\"]*\"([^\"]*)\"", re.DOTALL)
            j = pattern.findall(j)
            for user in j:
                user = user.replace('\\', '')
                self.add_wait_user(user)  # Save to Redis
        except Exception as err:
            print("Failed to get followees")
            print(err)
            traceback.print_exc()
            pass

Call the Zhihu API to get the complete following list and follower list, and recursively obtain users.
Note here that the request must carry the xsrf, otherwise a 403 will be thrown.

# Parse the about page to get the user's details
def get_user_info(self, name_url):
    about_page = self.get_user_page(name_url)
    # Check whether the page was fetched
    if not about_page:
        print("Failed to get the user detail page, skipping, name_url: " + name_url)
        return
    self.get_xsrf(about_page)
    bs = BeautifulSoup(about_page, 'html.parser')
    # Get the concrete data on the page
    try:
        nickname = bs.find("a", class_="name").get_text() if bs.find("a", class_="name") else ''
        user_type = name_url[1:name_url.index('/', 1)]
        self_domain = name_url[name_url.index('/', 1) + 1:]
        gender = 2 if bs.find("i", class_="icon icon-profile-female") else (
            1 if bs.find("i", class_="icon icon-profile-male") else 3)
        follower_num = int(bs.find('span', text='followers').find_parent().find('strong').get_text())
        following_num = int(bs.find('span', text='attention').find_parent().find('strong').get_text())
        agree_num = int(re.findall(r'<strong>(.*)</strong>.*approval', about_page)[0])
        appreciate_num = int(re.findall(r'<strong>(.*)</strong>.*thanks', about_page)[0])
        star_num = int(re.findall(r'<strong>(.*)</strong>.*collection', about_page)[0])
        share_num = int(re.findall(r'<strong>(.*)</strong>.*sharing', about_page)[0])
        browse_num = int(bs.find_all("span", class_="zg-gray-normal")[2].find("strong").get_text())
        trade = bs.find("span", class_="business item").get('title') \
            if bs.find("span", class_="business item") else ''
        company = bs.find("span", class_="employment item").get('title') \
            if bs.find("span", class_="employment item") else ''
        school = bs.find("span", class_="education item").get('title') \
            if bs.find("span", class_="education item") else ''
        major = bs.find("span", class_="education-extra item").get('title') \
            if bs.find("span", class_="education-extra item") else ''
        job = bs.find("span", class_="position item").get_text() \
            if bs.find("span", class_="position item") else ''
        location = bs.find("span", class_="location item").get('title') \
            if bs.find("span", class_="location item") else ''
        description = bs.find("div", class_="bio-ellipsis").get('title') \
            if bs.find("div", class_="bio-ellipsis") else ''
        ask_num = int(bs.find_all("a", class_='item')[1].find("span").get_text()) if \
            bs.find_all("a", class_='item')[1] else int(0)
        answer_num = int(bs.find_all("a", class_='item')[2].find("span").get_text()) if \
            bs.find_all("a", class_='item')[2] else int(0)
        article_num = int(bs.find_all("a", class_='item')[3].find("span").get_text()) if \
            bs.find_all("a", class_='item')[3] else int(0)
        collect_num = int(bs.find_all("a", class_='item')[4].find("span").get_text()) if \
            bs.find_all("a", class_='item')[4] else int(0)
        public_edit_num = int(bs.find_all("a", class_='item')[5].find("span").get_text()) if \
            bs.find_all("a", class_='item')[5] else int(0)

        replace_data = \
            (pymysql.escape_string(name_url), nickname, self_domain, user_type,
             gender, follower_num, following_num, agree_num, appreciate_num,
             star_num, share_num, browse_num, trade, company, school, major, job,
             location, pymysql.escape_string(description), ask_num, answer_num,
             article_num, collect_num, public_edit_num)
        replace_sql = '''REPLACE INTO
                         user(url, nickname, self_domain, user_type,
                              gender, follower, following, agree_num, appreciate_num,
                              star_num, share_num, browse_num, trade, company, school,
                              major, job, location, description, ask_num, answer_num,
                              article_num, collect_num, public_edit_num)
                         VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''
        try:
            print("Fetched data:")
            print(replace_data)
            self.db_cursor.execute(replace_sql, replace_data)
            self.db.commit()
        except Exception as err:
            print("Database insert error")
            print("Fetched data:")
            print(replace_data)
            print("INSERT statement: " + self.db_cursor._last_executed)
            self.db.rollback()
            print(err)
            traceback.print_exc()
    except Exception as err:
        print("Error getting data, skipping this user")
        self.redis_con.hdel("already_get_user", name_url)
        self.del_already_user(name_url)
        print(err)
        traceback.print_exc()
        pass

Finally we come to the user's about page, analyze the page elements, and use regular expressions or BeautifulSoup to extract the data we want to crawl.
The SQL statement here uses REPLACE INTO instead of INSERT INTO, which is a neat way to avoid duplicate-data problems; a small sketch of the idea follows.
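A minimal, self-contained sketch of the REPLACE INTO idea with pymysql. The table and column names are illustrative, not the article's full schema, and it assumes url is a PRIMARY KEY or UNIQUE key, which is what makes REPLACE INTO overwrite the existing row instead of creating a duplicate:

import pymysql

db = pymysql.connect(host='127.0.0.1', user='root', passwd='secret',
                     db='zhihu', charset='utf8')
cursor = db.cursor()
sql = "REPLACE INTO user(url, nickname, follower) VALUES (%s, %s, %s)"
try:
    # Running this twice with the same url leaves exactly one row,
    # updated to the latest values.
    cursor.execute(sql, ('/people/example-user', 'Example', 1024))
    db.commit()
except Exception:
    db.rollback()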

# Start crawling users; the program's overall entry point
def entrance(self):
    while 1:
        if int(self.redis_con.llen("user_queue")) < 1:
            self.get_index_page_user()
        else:
            # Pop a user name_url from the queue; Redis returns bytes, so decode utf-8
            name_url = str(self.redis_con.rpop("user_queue").decode('utf-8'))
            print("Processing name_url: " + name_url)
            self.get_user_info(name_url)
            if int(self.redis_con.llen("user_queue")) <= int(self.max_queue_len):
                self.get_all_follower(name_url)
                self.get_all_following(name_url)
        self.session.cookies.save()

def run(self):
    print(self.name + " is running")
    self.entrance()

Finally, the entry point:

if __name__ == '__main__':
    login = GetUser(999, "Login thread")
    threads = []
    for i in range(0, 4):
        m = GetUser(i, "Thread " + str(i))
        threads.append(m)
    for i in range(0, 4):
        threads[i].start()
    for i in range(0, 4):
        threads[i].join()

This is where the multithreading is started; to change how many threads are opened, just change the number 4. A small illustrative variation that reads the count from the configuration follows.
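A small illustrative variation, not the article's code: read the thread count from config.ini instead of hard-coding 4. The [sys] thread_num key is an assumption, and GetUser is the crawler class defined above:

import configparser

config = configparser.ConfigParser()
config.read("config.ini")
thread_num = int(config.get("sys", "thread_num", fallback="4"))   # assumed key, defaults to 4

threads = [GetUser(i, "Thread " + str(i)) for i in range(thread_num)]
for t in threads:
    t.start()
for t in threads:
    t.join()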

Docker

If that is too much trouble, you can refer to how I use Docker to set up a basic environment:

Both MySQL and Redis use the official images.

docker run --name mysql -itd mysql:latest
docker run --name redis -itd redis:latest

Then use docker-compose to run the Python image; my docker-compose.yml for Python:

python:
  container_name: python
  build: .
  ports:
    - "84:80"
  external_links:
    - memcache:memcache
    - mysql:mysql
    - redis:redis
  volumes:
    - /docker_containers/python/www:/var/www/html
  tty: true
  stdin_open: true
  extra_hosts:
    - "python:192.168.102.140"
  environment:
    PYTHONIOENCODING: utf-8

Finally, the source code is attached: GitHub https://github.com/kong36088/ZhihuSpider

Site download address: http://xiazai.jb51.net/201612/yuanma/ZhihuSpider(jb51.net).zip
