Using Python to crawl a school BBS and count the male-to-female ratio (1)

Last Update:2016-01-04 Source: Internet

Author: User

I. Project requirements

Each user ID on the BBS corresponds to one user. At registration, the user selects a gender (male, female, or confidential).

A check shows that registered user IDs range from 1 to 300000, so there are about 300,000 users.

I want to use Python to count the registered users on the BBS and the gender distribution of these users.

I also want to count the recently active users, broken down into male, female, and confidential.

An active user is defined as one whose "last activity time" falls in 2015.

II. Final results

The gender information is saved in text files. Each row holds the information of one user, and the columns are:
[row number, id (masked), gender, last activity time]
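The counting step described in the requirements could be sketched roughly as follows. The article does not show the exact file layout, so this sketch assumes tab-separated columns in the order listed above and timestamps that begin with the year; the function name tally and the sample rows are hypothetical.

```python
from collections import Counter

def tally(lines, active_year='2015'):
    """Count users per gender, overall and among users active in active_year."""
    total = Counter()
    active = Counter()
    for line in lines:
        # Assumed layout: row number, id, gender, last activity time (tab-separated)
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 4:
            continue  # skip malformed rows
        _row, _uid, gender, last_active = fields
        total[gender] += 1
        if last_active.startswith(active_year):
            active[gender] += 1
    return total, active

# Hypothetical sample rows, for illustration only
sample = [
    '1\t1001\tfemale\t2015-11-4 20:04',
    '2\t1002\tmale\t2013-5-1 09:30',
    '3\t1003\tconfidential\t2015-1-2 08:00',
]
total, active = tally(sample)
print(dict(total))
print(dict(active))
```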

III. Implementation ideas

Which page contains the gender information?

It can be read from a user's personal homepage.

Changing uid=256730 in the URL to another number gives another user's homepage.

In addition, if that page does not show the gender, the profile link below can be used instead. Its uid can likewise be changed to visit other users' pages.

http://rs.xidian.edu.cn/home.php?mod=space&uid=256730&do=profile
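Substituting the uid can be sketched as a small helper. The URL pattern is the profile link given above; the names PROFILE_URL and profile_url are made up for this illustration.

```python
# Profile link from the article, with the uid left as a placeholder
PROFILE_URL = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%d&do=profile'

def profile_url(uid):
    """Return the profile page URL for the given user id."""
    return PROFILE_URL % uid

print(profile_url(256730))
```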

4. How to store the data?

Should we use a database or some other solution?

For ease of reading, we chose plain text files.

Storing all 300,000 users in a single file would make the file too large. If the program were terminated unexpectedly, all 300,000 users would have to be crawled again.

So we store 1,000 records per file. In theory, 300 files are needed to hold 300,000 records.

The files are named like correct1-1001.txt and correct47001-48001.txt. Note that 1-1001 denotes the half-open range [1, 1001): it includes 1 and excludes 1001.
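The chunking and naming scheme can be sketched like this. The helper name chunk_ranges is hypothetical; only the file name pattern and the half-open ranges come from the text above.

```python
def chunk_ranges(first_id, last_id, size=1000):
    """Yield (start, stop, filename) for half-open id ranges [start, stop)."""
    for start in range(first_id, last_id + 1, size):
        stop = min(start + size, last_id + 1)
        yield start, stop, 'correct%d-%d.txt' % (start, stop)

chunks = list(chunk_ranges(1, 300000))
print(len(chunks))   # number of files needed
print(chunks[0])     # first chunk
print(chunks[47])    # the chunk covering ids 47001 to 48000
```

If the crawler stops partway, only the chunk that was in progress has to be fetched again.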

1. Use regular expression matching to find the gender

View the webpage source code (the labels are shown translated here; on the real page they are in Chinese):

    <!-- Locate the gender field -->
    <li><em>gender</em>female</li>
    <!-- The activity time can also be found -->
    <li><em>last posting time</em>2015-11-4 20:04</li>
    <!-- If an ID does not exist, a prompt is displayed -->
    <p>Sorry, the specified user space does not exist.</p>

We can use the re module for regular expression matching.

    import re

    # Gender, last activity time, and the "user space does not exist" notice
    sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
    timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
    notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
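As a rough illustration of how the three patterns might be applied to a page, here is a sketch run against hand-written sample markup. The function parse_profile and both sample strings are assumptions made for this example, not the real BBS output.

```python
import re

sex_re = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
time_re = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
notexist_re = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')

def parse_profile(html):
    """Return (gender, last_activity), or None if the user does not exist."""
    if notexist_re.search(html):
        return None
    sex = sex_re.search(html)
    when = time_re.search(html)
    return (sex.group(1) if sex else None, when.group(1) if when else None)

# Hand-written sample markup (not copied from the real site)
page = (u'<li><em>\u6027\u522b</em>\u5973</li>'
        u'<li><em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>2015-11-4 20:04</li>')
missing = u'<p>\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728</p>'

print(parse_profile(page))
print(parse_profile(missing))
```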

Because the page text is Chinese, the patterns are written with Unicode escapes. An online converter, such as the webmaster tool at http://tool.chinaz.com/Tools/Unicode.aspx, converts between Chinese text and \u escapes.

"Gender" (性别) is \u6027\u522b
"Last activity time" (上次活动时间) is \u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4
"Sorry, the specified user space does not exist" (抱歉，您指定的用户空间不存在) is
\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728
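The same conversion can also be done without an online tool. This is a minimal sketch using Python's built-in unicode_escape codec, written for Python 3; the helper names to_escapes and from_escapes are hypothetical.

```python
def to_escapes(text):
    """Turn text into its backslash-u escaped ASCII form."""
    return text.encode('unicode_escape').decode('ascii')

def from_escapes(escaped):
    """Turn a backslash-u escaped ASCII string back into text."""
    return escaped.encode('ascii').decode('unicode_escape')

escaped = to_escapes(u'\u6027\u522b')  # the word for "gender"
print(escaped)
print(from_escapes(escaped))
```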

Here is a simple way to read the gender from the page source. Use urllib2 to send a GET request to the link myurl and save the returned html. Mind the encoding: decode the page with unicode(html, 'utf-8'), then search the html with the regular expression seWord.

If the pattern matches (for example, the user has gender information), the captured text is returned; otherwise None is returned.

    import time
    import urllib2

    # Search the page at myurl for a match of seWord.
    # seWord is a compiled regular expression written as a unicode pattern.
    def getInfo(myurl, seWord):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
        req = urllib2.Request(url=myurl, headers=headers)
        time.sleep(0.3)  # pause briefly between requests
        response = urllib2.urlopen(req)
        html = response.read()
        # Decode first; otherwise the unicode pattern cannot match the raw bytes
        html = unicode(html, 'utf-8')
        timeMatch = seWord.search(html)
        if timeMatch:
            s = timeMatch.groups()
            return s[0]
        else:
            return None
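The function above targets Python 2. For readers on Python 3, a rough equivalent might look like the sketch below, which swaps urllib2 for urllib.request. The opener and delay parameters are additions made here so the function can be tried without network access; they are not part of the original code.

```python
import io
import re
import time
import urllib.request

def get_info(url, pattern, opener=urllib.request.urlopen, delay=0.3):
    """Fetch url and return the first group matched by pattern, or None."""
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = urllib.request.Request(url, headers=headers)
    time.sleep(delay)  # pause briefly between requests
    with opener(req) as response:
        html = response.read().decode('utf-8')  # bytes -> str before matching
    match = pattern.search(html)
    return match.group(1) if match else None

# Offline demonstration: a fake opener that returns canned bytes
sex_re = re.compile('em>\u6027\u522b</em>(.*?)</li')
canned = '<li><em>\u6027\u522b</em>\u5973</li>'.encode('utf-8')
result = get_info('http://example.invalid/', sex_re,
                  opener=lambda req: io.BytesIO(canned), delay=0)
print(result)
```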

V. Error handling

1. Network disconnection (resuming after an interruption)

Crawling over the campus network may take several days.
If the network drops, we can reconnect, but the users requested during the outage are not retrieved.
The program takes a long time to run, so restarting from id = 1 every time the network drops is impractical. There is also no guarantee that the network will stay up during the next run.
To avoid restarting, we record the user IDs that were missed while the network was down.
A few days later, once the program has finished, the recorded IDs are crawled again.

2. Unable to obtain the gender

There are two cases:
First, there really is no gender (the user did not enter it).
Second, the server is overloaded and our request for the page fails.
This is handled like the case above: record the failed id and retry it later.
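Recording missed IDs for a later pass could be sketched as follows. The names crawl, load_failed, and fake_fetch, as well as the one-id-per-line file format, are assumptions for illustration; the article only says that failed IDs are recorded and re-run.

```python
import os
import tempfile

def crawl(ids, fetch, failed_path):
    """Fetch each id; append ids that fail or return nothing to failed_path."""
    results = {}
    with open(failed_path, 'a') as failed:
        for uid in ids:
            try:
                value = fetch(uid)
            except IOError:  # urllib's URLError is an IOError subclass
                value = None
            if value is None:
                failed.write('%d\n' % uid)  # remember it for the retry pass
            else:
                results[uid] = value
    return results

def load_failed(failed_path):
    """Read back the ids recorded for retry."""
    with open(failed_path) as f:
        return [int(line) for line in f if line.strip()]

# Stand-in for the real page fetch: id 2 hits a network error, id 3 has no gender
def fake_fetch(uid):
    if uid == 2:
        raise IOError('network down')
    if uid == 3:
        return None
    return 'female'

failed_path = os.path.join(tempfile.mkdtemp(), 'failed.txt')
results = crawl(range(1, 5), fake_fetch, failed_path)
print(results)
print(load_failed(failed_path))
```

Both failure cases end up in the same file here, so a second pass over load_failed(...) retries users whose request failed as well as users who returned no gender.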

Knowledge point summary

For this error: SyntaxError: Non-ASCII character '\xe5' in file

add the following line at the beginning of the file:

    # -*- coding: utf-8 -*-

because Python 2 assumes by default that source files are ASCII-encoded.

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The source of the BBS webpages is encoded in UTF-8.

This project deals with Chinese text, so encoding problems are bound to come up.

    import sys
    print sys.getdefaultencoding()

Output:

    ascii

The output shows that the default encoding is ASCII, which cannot encode Chinese characters. UTF-8, one of the Unicode encodings, can. When Chinese characters are involved, we add these lines (Python 2 only):

    reload(sys)
    sys.setdefaultencoding('utf-8')

This changes the default encoding to 'utf-8'.

Later, I found a small problem. Because the regular expression was represented by unicode, I needed to convert html to unicode for searching.
Later, we found that we could use Chinese characters to search for the original html.

#-*-Coding: UTF-8-*-html = response. read () sexRe = re. compile ('em> gender </em> (.*?) </Li ') timeMatch = sexRe. search (html) if timeMatch: s = timeMatch. groups () print "string" + s [0] html = unicode (html, 'utf-8') sexRe = re. compile (u'em> \ u6027 \ u522b </em> (. *?) </Li ') timeMatch = sexRe. search (html) if timeMatch: s = timeMatch. groups () print "unicode" + s [0]

Output

String female unicode female html = response. read () print len (html) html = unicode (html, 'utf-8') # print len (html)

Output

html = response.read()  print len(html)  html = unicode(html, 'utf-8') #  print len(html)

Output

3542333658
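The difference between the two lengths can be reproduced on a small string. This sketch is written for Python 3, where decoded text is str and encoded data is bytes; the sample markup is made up.

```python
# 18 ASCII characters plus 3 Chinese characters
text = u'<li><em>\u6027\u522b</em>\u5973</li>'
raw = text.encode('utf-8')

print(len(text))  # number of characters
print(len(raw))   # number of bytes; each Chinese character here takes 3 bytes
```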

The above covers the preparation and analysis for counting the male-to-female ratio on the school BBS with Python. I hope it is helpful for your learning.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
