User information crawling on the Sina Weibo search page


After logging in successfully, we can move on to the next step ~

Next, we will enter a keyword to find relevant users and collect some basic information about them.

 

Environment

Tools

1. Chrome and its developer tools

2. Python 3.6

3. PyCharm

 

Libraries used in Python 3.6

import urllib.error
import urllib.request
import urllib.parse
import urllib
import re
import json
import pandas as pd
import time
import logging
import random
from lxml import etree

 

Keyword Search

First, enter keywords on the Weibo homepage to go to the search page.

 

After searching, we find that the URL is http://s.weibo.com/user/%25E4%25B8%258A%25E8%25AE%25BF&Refer=weibo_user

Obviously, the encoded segment after /user/ corresponds to the search keyword.

On the second page, the URL changes to http://s.weibo.com/user/%25E4%25B8%258A%25E8%25AE%25BF&page=2

At this point we guess that changing the number after "page=" to 1 will bring us back to the first page of results, which turns out to be the case.

The URL can therefore be divided into three parts: the fixed prefix http://s.weibo.com/user/, the encoded keyword, and the &page= parameter.

 

Now we need to work out the encoding of the middle segment, and it turns out that the keyword has been URL-encoded twice.
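
To check this, a quick sketch decodes the middle segment twice with urllib.parse (already imported above); the second decode should recover the original keyword:

from urllib.parse import unquote

encoded = '%25E4%25B8%258A%25E8%25AE%25BF'   # the middle segment of the search URL
print(unquote(encoded))             # decoded once: '%E4%B8%8A%E8%AE%BF'
print(unquote(unquote(encoded)))    # decoded twice: the original keyword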

Therefore, the URL of the final search page can be built as follows:

import urllib.request

keyword = 'petition'
once = urllib.request.quote(keyword)      # first URL encoding
pagecode = urllib.request.quote(once)     # second URL encoding

i = 1  # page number
url = 'http://s.weibo.com/user/' + pagecode + '&page=' + str(i)

 

Basic User Information Extraction

Next, I want to crawl some basic user information.

After some observation, we can preliminarily determine which user information fields can be crawled from the search page:

  • Weibo name -- name
  • Region -- location
  • Gender -- gender
  • Weibo address -- weibo_addr
  • Follows -- follow
  • Fans -- follower
  • Weibo count -- weibo_num
  • Introduction -- intro
  • Occupation -- career information
  • Education -- educational information
  • Tag -- tag

(The mandatory fields, shown in red in the original post, are name, location, gender, weibo_addr, follow, follower, and weibo_num.)

We will first crawl the mandatory fields, and then look at how to crawl the non-mandatory ones.
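
The original post does not show how the result container is set up; here is a minimal sketch, assuming one list per field and using the key names that appear in the code below:

# hypothetical initialization of the result container: one list per field
info_user = {key: [] for key in [
    'name', 'location', 'gender', 'weibo_addr',
    'follow', 'follower', 'weibo_num',
    'intro', 'tag', 'career information', 'educational information',
]}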

 

First, we observe the page source and locate the main block that contains the user list (you can use the Elements panel in Chrome's developer tools to find it).

As a result, we find that the block we need starts with <script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_user_feedList", so we can use a regular expression to find this content.

In addition, the content inside the parentheses is in JSON format.

Hooray ~~

This makes it very easy for us to extract the required HTML content ~~

data = urllib.request.urlopen(url, timeout=30).read().decode('utf-8')

lines = data.splitlines()
for line in lines:
    if not line.startswith('<script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_user_feedList","js":'):
        continue

    json_pattern = re.compile(r'\((.*)\)')
    # use a regular expression to pull out the json
    json_data = json_pattern.search(line).group(1)
    # load the json into a dictionary and extract the html content
    html = json.loads(json_data)['html']

 

Then we can start parsing the page content properly, beginning with the mandatory fields.

Here, we still rely on Chrome's developer tools.

Through the Elements panel, we locate the HTML corresponding to the Weibo name, and we can use etree and XPath from lxml to read it from the title attribute.

Similarly, we can obtain the other mandatory fields. All of the content is stored in a dict, where the value for each key is a list:

page = etree.HTML(html)
info_user['name'] = page.xpath('//a[@class="W_texta W_fb"]/@title')
info_user['weibo_addr'] = page.xpath('//a[@class="W_texta W_fb"]/@href')
info_user['gender'] = page.xpath('//p[@class="person_addr"]/span[1]/@title')
info_user['location'] = page.xpath('//p[@class="person_addr"]/span[2]/text()')
info_user['follow'] = page.xpath('//p[@class="person_num"]/span[1]/a/text()')
info_user['follower'] = page.xpath('//p[@class="person_num"]/span[2]/a/text()')
info_user['weibo_num'] = page.xpath('//p[@class="person_num"]/span[3]/a/text()')
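
Since each xpath() call returns its own list, a quick check (not in the original post) can confirm that the mandatory-field lists stay aligned, one entry per user on the page:

# hypothetical sanity check: every non-empty field list should have the same length
lengths = {key: len(values) for key, values in info_user.items() if values}
assert len(set(lengths.values())) <= 1, 'field lists are misaligned: %s' % lengths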

 

Finally, we crawl the non-mandatory fields.

The logic of the code for crawling this type of field is as follows:

Use etree from the lxml package to capture the subtree (class="person_detail").

Traverse the branches under this subtree to determine whether there is an introduction (class="person_info") and labels (class="person_label").

It is worth noting that some labels contain more than one entry, so we have to check for this and traverse all of the entries under a label.

Because the introduction and the labels sit in two different branches, we can write two separate functions for extraction:

# extract profile information
def info(self, p, path):
    '''
    extract the introduction of users
    :param p: input an etree
    :param path: input an xpath, which must be a string
    :return: a string
    '''
    if type(path) == str:
        info = p.xpath(path)
        if len(info) == 0:
            sub_info = ''
        else:
            sub_info = info[0]
        return sub_info
    else:
        print('Please enter the path as a string')

# extract label information: tags, education information, work information
def labels(self, p, path, key):
    '''
    extract labels, such as hobbies, education and job, of users
    :param p: input an etree
    :param path: input an xpath, which must be a string
    :param key: the keyword of the label
    :return: a string
    '''
    label = p.xpath(path)
    if len(label) == 0:
        sub_label = ''
    else:
        for l in label:
            label_name = re.compile('(.*?):').findall(l.xpath('./text()')[0])
            if label_name[0] == key:
                # read all of the entries under this label
                all_label = l.xpath('./a/text()')
                l = ''
                for i in all_label:
                    l = re.compile('\n\t(.*?)\n\t').findall(i)[0] + ',' + l
                sub_label = l
            else:
                sub_label = ''
    return sub_label

 

After constructing these functions, we can extract the information of all users ~

Note that each call returns a string for a single subtree; to cover all users on the current page, we traverse the subtrees and append the results to a list:

info_user['intro'] = []
info_user['tag'] = []
info_user['career information'] = []
info_user['educational information'] = []
others = page.xpath('//div[@class="person_detail"]')
for p in others:
    path1 = './div/p/text()'
    info_user['intro'].append(self.info(p, path1))
    path2 = './p[@class="person_label"]'
    info_user['tag'].append(self.labels(p, path2, 'tag'))
    info_user['career information'].append(self.labels(p, path2, 'career information'))
    info_user['educational information'].append(self.labels(p, path2, 'educational information'))
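
The imports at the top include pandas, but the original post does not show how the collected dict is saved; a minimal hedged sketch (the output file name is hypothetical):

# hypothetical: turn the per-field lists into a table and write it to disk
df = pd.DataFrame(info_user)
df.to_csv('weibo_users.csv', index=False, encoding='utf-8-sig')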

 

Traverse All Pages

After the basic user information on one page is successfully crawled, we need to traverse all of the result pages.

Here we could use a very clumsy method: look up how many pages there are in total and then write a for loop over that number.

No way !!! I would never do something that clumsy !!! There has to be a slicker way to traverse the pages!

 

Therefore, we go back to the developer tools and inspect the element for the next page, where we find something magical -- class="page next S_txt1 S_line1"

Isn't this a perfect anchor for locating the next page !!!

And so the code logic was born ~~

We check whether this "next page" link can be extracted to decide whether to move on to the next page (i += 1):

i = 1
flag = True
while flag:
    # build the url for the current page
    url = 'http://s.weibo.com/user/' + pagecode + '&page=' + str(i)
    try:
        # timeout setting
        data = urllib.request.urlopen(url, timeout=30).read().decode('utf-8')
    except Exception as e:
        print('exception --> ' + str(e))

    ............

    next_page = page.xpath('//a[@class="page next S_txt1 S_line1"]/@href')
    if len(next_page) == 0:
        flag = False
    else:
        page_num = re.compile(r'page=(\d*)').findall(next_page[0])[0]
        i = int(page_num)

 

Confetti ~~~ The overall logic of the code is now complete ~~~

Finally, we need to deal with anti-crawling measures. The blogger has not studied this problem in depth.

However, the imports of time and random at the top hint at one of the usual ideas: pacing requests with randomized delays.
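
A minimal sketch of that idea, with delay bounds chosen arbitrarily (none of this appears in the original post):

import random
import time

def polite_sleep(min_seconds=2, max_seconds=5):
    # hypothetical helper: wait a random interval between requests
    # so the crawler does not hit the server at a fixed, rapid rate
    time.sleep(random.uniform(min_seconds, max_seconds))

# usage: call polite_sleep() once per page inside the while loop above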

 

The full version of the code will not be posted here ~~ If you have any questions, please leave a reply below ~

That's a wrap, confetti ~~
