Requirement
Crawl the 2018 Sichuan University independent admission first-round pass list.
Prerequisites
1. Regular expressions (see the short sketch after this list).
2. Basic Python syntax, web crawling, and database operations.
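The script below leans on one regex idea: non-greedy capture groups pull the text out of HTML table cells. A minimal sketch, with an invented table row just for illustration:

import re

# Hypothetical row; the real pages are parsed with the same kind of pattern.
row = "<tr><td>Zhang San</td><td>Male</td></tr>"
print(re.findall(r"<td>(.*?)</td>", row))  # -> ['Zhang San', 'Male']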
Procedure
1. Fetch the web page.
2. Parse out the required data.
3. Fetch the next page and repeat steps 1 and 2 until the last page.
4. Store the parsed data in the database.
The full script in the Example section below implements all four steps.
Example
Using Python 3.6 and MySQL.
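The script assumes a MySQL database scdx_zzzs_db with a student table already exists; the original does not show the schema, so the following setup is an assumption (column names match the INSERT below, column types are guesses):

import pymysql

# Assumed one-off setup script; adjust credentials and column sizes as needed.
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       passwd='root', charset='utf8')
cursor = conn.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS scdx_zzzs_db DEFAULT CHARSET utf8")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS scdx_zzzs_db.student (
        id       INT AUTO_INCREMENT PRIMARY KEY,
        name     VARCHAR(64),
        sex      VARCHAR(8),
        school   VARCHAR(128),
        province VARCHAR(32)
    )
""")
conn.commit()
cursor.close()
conn.close()

With the table in place, the full crawler script follows.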
import re
import urllib.request
import urllib.error

import pymysql


def catch_page(url_addr):
    """Fetch a page and return its raw bytes, or None on failure."""
    try:
        return urllib.request.urlopen(url_addr).read()
    except urllib.error.URLError as e:
        if hasattr(e, 'code'):
            print('The server could not fulfil the request. Error code:', e.code)
        elif hasattr(e, 'reason'):
            print('Failed to reach the server; check the URL.\nReason:', e.reason)
        return None


def find_all_data(html):
    """Extract (name, sex, school, province) tuples from the table rows."""
    # [\s\S]*? matches any character, including newlines, non-greedily.
    pattern = (r"<tr>[\s\S]*?<td>(.*?)</td>"
               r"[\s\S]*?<td>(.*?)</td>"
               r"[\s\S]*?<td>(.*?)</td>"
               r"[\s\S]*?<td>(.*?)</td>[\s\S]*?</tr>")
    return re.findall(pattern, html)


def add_to_mysql(userdatas):
    """Insert the parsed rows into the student table."""
    conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                           passwd='root', db='scdx_zzzs_db', charset='utf8')
    cursor = conn.cursor()
    for userdata in userdatas:
        sql = ("INSERT INTO student (name, sex, school, province) "
               "VALUES ('%s', '%s', '%s', '%s');"
               % (userdata[0], userdata[1], userdata[2], userdata[3]))
        try:
            cursor.execute(sql)
            print("√ executed ----->> " + sql)
        except pymysql.MySQLError:
            print("× failed   ----->> " + sql)
    conn.commit()
    cursor.close()
    conn.close()


userdatas = []
i = 0
while i <= 3300:  # the list shows 30 records per page; `start` is the row offset
    url = ("https://gaokao.chsi.com.cn/zzbm/mdgs/detail.action"
           "?oid=476754340&lx=1&start=%d" % i)
    page = catch_page(url)
    if page is None:  # stop paging if a fetch failed
        break
    userdatas.extend(find_all_data(page.decode()))
    print(i)
    i += 30
add_to_mysql(userdatas)
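One design note on add_to_mysql: building the INSERT with Python string interpolation breaks as soon as a value contains a quote, and it invites SQL injection. A safer variant for the same table and data hands the values to pymysql as parameters:

sql = "INSERT INTO student (name, sex, school, province) VALUES (%s, %s, %s, %s)"
cursor.executemany(sql, userdatas)  # pymysql quotes each value itself
conn.commit()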
Once the script finishes, the data fetched from the web pages has been stored in the database.