Using Python to crawl a school BBS and count the ratio of male to female users (1)
I. Project requirements
Each user id on the BBS corresponds to one user. During registration, the user enters a gender (male, female, or confidential).
On inspection, the ids of registered users run from 1 to 300000, so there are about 300,000 users.
I want to use Python to count the registered users on the BBS and the gender distribution of these users.
I also want to count the recently active users, broken down into male, female, and confidential.
An active user is defined as one whose "last activity time" is in 2015.
II. Final results
The gender information is saved in text files. Each row holds the information of one user, and the columns are:
[row number, id (masked), gender, last active time]
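As a rough illustration of how rows in this layout could be tallied afterwards, here is a minimal Python 3 sketch. The whitespace-separated layout, the sample rows, and the function name `tally` are assumptions made for the example; the post does not say which separator the files use.

```python
from collections import Counter


def tally(lines, active_year="2015"):
    """Count genders overall and among users last active in active_year."""
    total = Counter()
    active = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete rows
        # Assumed column order: row number, id, gender, date, time
        gender, last_active_date = fields[2], fields[3]
        total[gender] += 1
        if last_active_date.startswith(active_year + "-"):
            active[gender] += 1
    return total, active


# Hypothetical sample rows, not real data from the BBS.
sample = [
    "1 10001 female 2015-11-4 20:04",
    "2 10002 male 2014-3-1 09:30",
    "3 10003 confidential 2015-1-9 12:00",
]
total, active = tally(sample)
print(dict(total))
print(dict(active))
```

Keeping the overall count and the active count in separate counters means both questions from the requirements can be answered in a single pass over the files.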
III. Implementation ideas
Which page holds the gender information?
It is on each user's personal homepage.
Changing uid=256730 in the homepage address to another number gives the homepage of another user.
If the homepage does not show the gender, the profile link below can be used instead; its uid can be changed in the same way to reach other users' pages.
http://rs.xidian.edu.cn/home.php?mod=space&uid=256730&do=profile
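Building the profile address for an arbitrary uid could look like the following Python 3 sketch. The helper name `profile_url` is hypothetical; the query parameters are the ones visible in the link above.

```python
from urllib.parse import urlencode

BASE_URL = "http://rs.xidian.edu.cn/home.php"


def profile_url(uid):
    """Return the profile page address for the given user id."""
    query = urlencode({"mod": "space", "uid": uid, "do": "profile"})
    return BASE_URL + "?" + query


print(profile_url(256730))
```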
IV. How to store the data?
Use a database, or some other solution?
For ease of reading, we store the data in text files.
Storing all 300,000 users in a single file would make that file too large, and if the program terminated unexpectedly, all 300,000 users would have to be crawled again.
Instead, we store 1000 records per file, so in theory 300 files hold the 300,000 records.
The files are named like correct1-1001.txt and correct47001-48001.txt. Note that 1-1001 means the half-open range [1, 1001): it contains 1 but does not contain 1001.
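The chunking scheme can be sketched in Python 3 as follows. The function names `chunk_ranges` and `chunk_filename` are made up for illustration; only the file name pattern and the half-open convention come from the text.

```python
def chunk_ranges(first_id, last_id, size=1000):
    """Yield half-open (start, stop) ranges covering first_id..last_id."""
    start = first_id
    while start <= last_id:
        stop = min(start + size, last_id + 1)
        yield start, stop
        start = stop


def chunk_filename(start, stop):
    """Name a chunk file after its half-open id range."""
    return "correct%d-%d.txt" % (start, stop)


ranges = list(chunk_ranges(1, 300000))
print(len(ranges))
print(chunk_filename(*ranges[0]))
print(chunk_filename(*ranges[47]))
```

Because each file covers a fixed id range, an interrupted run only has to redo the chunk that was being written, not everything before it.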
1. Use regular expression matching to find the gender
View the webpage source code:
    <!-- Locate the gender field -->
    <li><em>gender</em>female</li>
    <!-- The last activity time can be found as well -->
    <li><em>last posting time</em>2015-11-4 20:04</li>
    <!-- If an id does not exist, a notice is displayed -->
    <p>sorry, the specified user space does not exist.</p>
We can use the re module for regular expression matching.
    import re

    sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
    timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
    notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
The labels on the page are Chinese, so they are written as Unicode escapes in the patterns. An online converter can produce the escapes, for example the webmaster tool at http://tool.chinaz.com/Tools/Unicode.aspx
- "Gender" (性别) is \u6027\u522b
- "Last activity time" (上次活动时间) is \u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4
- "Sorry, the specified user space does not exist" (抱歉，您指定的用户空间不存在) is \u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728
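The escapes can also be produced and checked in Python itself. The sketch below is written for Python 3, where every string is Unicode; the helper name `to_escapes` and the HTML fragment are assumptions for the example, not part of the original crawler.

```python
import re


def to_escapes(text):
    """Return the backslash-u escape spelling of text."""
    return text.encode("unicode_escape").decode("ascii")


# The escaped and the literal spelling denote the same string.
same = "\u6027\u522b" == "性别"
escaped = to_escapes("性别")
print(same)
print(escaped)

# A pattern written with escapes matches a page containing the characters.
sex_re = re.compile("em>\u6027\u522b</em>(.*?)</li")
sample_html = "<li><em>性别</em>女</li>"  # hypothetical fragment
match = sex_re.search(sample_html)
gender = match.group(1) if match else None
print(gender)
```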
Here is a simple function for extracting the gender from the page source. It uses urllib2 to send a GET request to the link myurl and reads the returned html. Mind the encoding: the html is decoded with unicode(html, 'utf-8') before the regular expression seWord is matched against it.
If the pattern matches (for example, the user has gender information), the matched value is returned; otherwise None is returned.
    import time
    import urllib2

    # Search the page at myurl for a match of seWord.
    # seWord is a compiled pattern written with unicode escapes.
    def getInfo(myurl, seWord):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
        req = urllib2.Request(url=myurl, headers=headers)
        time.sleep(0.3)  # pause between requests
        response = urllib2.urlopen(req)
        html = response.read()
        html = unicode(html, 'utf-8')  # decode first; otherwise seWord cannot match
        timeMatch = seWord.search(html)  # seWord is a unicode pattern
        if timeMatch:
            s = timeMatch.groups()
            return s[0]
        else:
            return None
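For readers on Python 3, where urllib2 no longer exists, a roughly equivalent sketch is shown below. It is not the author's code: the name `get_info`, the shortened User-Agent value, and the `opener` parameter (added so the function can be tried without network access) are all assumptions.

```python
import io
import re
import time
from urllib.request import Request, urlopen

HEADERS = {"User-Agent": "Mozilla/5.0"}  # assumed browser-like value


def get_info(url, pattern, opener=urlopen, delay=0.3):
    """Fetch url and return the first group matched by pattern, or None."""
    req = Request(url, headers=HEADERS)
    time.sleep(delay)  # pause between requests
    with opener(req) as response:
        html = response.read().decode("utf-8")  # bytes -> str
    match = pattern.search(html)
    return match.group(1) if match else None


# Offline demonstration with a fake opener, so no request is sent.
def fake_opener(req):
    return io.BytesIO("<li><em>性别</em>女</li>".encode("utf-8"))


sex_re = re.compile("em>\u6027\u522b</em>(.*?)</li")
result = get_info("http://example.invalid/", sex_re,
                  opener=fake_opener, delay=0)
print(result)
```

Passing the opener as a parameter is only a convenience for testing; in a real run the default `urlopen` would be used.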
V. Error handling
1. Network disconnection (workaround)
- Crawling over the campus network may take several days.
- If the network drops, we can reconnect, but the users requested during the outage will not have been fetched.
- The program takes a long time to run. Restarting from id = 1 after every disconnection is impractical, and there is no guarantee that the network will stay up the next time either.
- To avoid restarting from scratch, we record the user ids that were missed while the network was down.
- A few days later, once the program has finished, we run it again for the recorded ids.
2. Unable to obtain gender
- There are two cases:
- First, the user really has no gender (it was never entered).
- Second, the server is overloaded and our request for the page fails.
- This is handled like the case above: record the failed ids and re-run them later.
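The record-and-retry idea from both cases can be sketched in Python 3 as follows. The function names `crawl` and `crawl_with_retry`, the `fetch` callback, and the simulated failures are assumptions for illustration; the post only describes the approach in words.

```python
def crawl(ids, fetch):
    """Call fetch(uid) for each id and return (results, failed_ids)."""
    results, failed = {}, []
    for uid in ids:
        try:
            info = fetch(uid)
        except OSError:         # urllib's URLError is a subclass of OSError
            failed.append(uid)  # network problem: try again later
            continue
        if info is None:
            failed.append(uid)  # no gender found: possibly a server failure
        else:
            results[uid] = info
    return results, failed


def crawl_with_retry(ids, fetch, rounds=2):
    """Crawl ids, then re-run only the recorded failures up to rounds times."""
    results, failed = crawl(ids, fetch)
    for _ in range(rounds):
        if not failed:
            break
        more, failed = crawl(failed, fetch)
        results.update(more)
    return results, failed


# Simulated fetch: id 2 fails once, id 3 never has a gender.
calls = {}


def flaky_fetch(uid):
    calls[uid] = calls.get(uid, 0) + 1
    if uid == 2 and calls[uid] == 1:
        raise OSError("simulated network drop")
    if uid == 3:
        return None
    return "female"


results, failed = crawl_with_retry([1, 2, 3], flaky_fetch)
print(results, failed)
```

Ids that are still in `failed` after the retries are the ones most likely to belong to users who never entered a gender.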
Summary of key points
For the error SyntaxError: Non-ASCII character '\xe5' in file,
add # -*- coding: utf-8 -*- at the beginning of the file,
because Python 2 assumes that source files are ASCII by default.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
As this tag shows, the source of the BBS webpage is encoded in UTF-8.
This project involves Chinese characters, so encoding problems arise naturally.
    import sys
    print sys.getdefaultencoding()
Output:
    ascii
From the code above, we can see that the default encoding is ASCII, which cannot encode Chinese characters. UTF-8, one of the Unicode encodings, can. When Chinese characters are involved, we need to add the following code:
    reload(sys)  # makes setdefaultencoding available again
    sys.setdefaultencoding('utf-8')
This changes the default encoding to 'utf-8'.
Later, I noticed a small issue. Because the regular expression was written in unicode, the html had to be converted to unicode before searching.
It turned out that the original (undecoded) html can also be searched directly with a pattern written in Chinese characters.
    # -*- coding: utf-8 -*-
    html = response.read()
    # Search the raw page with the Chinese label written literally.
    sexRe = re.compile('em>性别</em>(.*?)</li')
    timeMatch = sexRe.search(html)
    if timeMatch:
        s = timeMatch.groups()
        print "string" + s[0]
    # Search the decoded page with the unicode-escaped pattern.
    html = unicode(html, 'utf-8')
    sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
    timeMatch = sexRe.search(html)
    if timeMatch:
        s = timeMatch.groups()
        print "unicode" + s[0]
Output:
    string female
    unicode female
Decoding also changes the length of the page:
    html = response.read()
    print len(html)
    html = unicode(html, 'utf-8')
    print len(html)
Output:
    35423
    33658
The raw byte string is longer than the decoded one because each Chinese character takes several bytes in UTF-8.
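The same effect can be reproduced on a small scale in Python 3, where the raw page is a bytes object and the decoded page is a str. The HTML fragment below is a hypothetical stand-in for the real page.

```python
import re

page = "<li><em>性别</em>女</li>"  # hypothetical fragment
raw = page.encode("utf-8")         # what response.read() would return

byte_len = len(raw)    # counts bytes
char_len = len(page)   # counts characters
print(byte_len, char_len)

# Searching the raw bytes works too, as long as the pattern is bytes as well.
raw_re = re.compile("em>性别</em>(.*?)</li".encode("utf-8"))
raw_match = raw_re.search(raw)
found = raw_match.group(1).decode("utf-8") if raw_match else None
print(found)
```

Each of the three Chinese characters in the fragment occupies three bytes in UTF-8, which accounts for the difference between the two lengths.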
That covers the preparation and the approach for counting the ratio of male to female users on the school BBS with Python. I hope it is helpful for your learning.