This is an example of using Python to crawl CSDN download resource information. It uses Python's urllib module (urllib.urlopen in the source below) to fetch, for one CSDN user, every resource's URL, name, download count, score and other details. I wrote this article because I wanted to collect all the comments on my own resources. However, the comments are loaded dynamically by JS, so this article first briefly describes how to crawl the information by manually parsing the HTML pages.
Source
# coding=utf-8
import urllib
import time
import re
import os

#**************************************************
# Step 1: loop over the list pages to get the URL of each page
# http://download.csdn.net/user/eastmount/uploads/1
# http://download.csdn.net/user/eastmount/uploads/8
#**************************************************
num = 1      # running count of resources (46 in total)
number = 1   # list page number, 1-8

fileurl = open('csdn_url.txt', 'w+')
fileurl.write('****************Get resource URL*************\n\n')

while number < 9:
    url = 'http://download.csdn.net/user/eastmount/uploads/' + str(number)
    fileurl.write('Download list URL: ' + url + '\n')
    print unicode('Download list URL: ' + url, 'utf-8')
    content = urllib.urlopen(url).read()
    open('csdn.html', 'w+').write(content)

    # Get the block that contains the URLs
    # (exact matching would need to count the </div> tags)
    start = content.find(r'<div class="list-container mb-bg">')
    end = content.find(r'<div class="page_nav">')
    cutcontent = content[start:end]
    # print cutcontent

    # Extract the URLs from the block
    # They look like: <dt><div></div>
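The listing above is Python 2. As a minimal Python 3 sketch of the same block-cutting step, the snippet below slices a page between two marker strings. The helper name `cut_block` and the sample HTML are made up for illustration, and the marker strings are assumptions about how the page looked at the time.

```python
def cut_block(content, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker.

    Returns an empty string if either marker is missing, rather than
    slicing with the -1 that str.find returns on failure.
    """
    start = content.find(start_marker)
    if start == -1:
        return ''
    end = content.find(end_marker, start)
    if end == -1:
        return ''
    return content[start:end]


# Hypothetical page fragment, only to exercise the helper.
sample = ('<html><div class="list-container mb-bg"><dl><dt>item</dt></dl></div>'
          '<div class="page_nav">1 2 3</div></html>')
block = cut_block(sample,
                  '<div class="list-container mb-bg">',
                  '<div class="page_nav">')
print(block)
```

Checking for -1 explicitly matters here: with plain `content[start:end]`, a missing marker would silently produce a wrong slice instead of an obvious failure.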
Show Results
The display includes the resource URL, resource title, resource credits, download count, resource type, and resource size:
For example, to crawl the resources of the expert Guo Lin, the list page links are as follows (7 pages in total):
http://download.csdn.net/user/sinyu890807/uploads/1
http://download.csdn.net/user/sinyu890807/uploads/7
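Since the list pages differ only in the user name and the trailing page number, the URLs can be generated rather than edited by hand. This is a small Python 3 sketch; the function name `list_page_urls` is hypothetical, and the URL pattern is taken from the two links above.

```python
BASE = 'http://download.csdn.net/user/{user}/uploads/{page}'


def list_page_urls(user, pages):
    """Build the list-page URLs 1..pages for the given user name."""
    return [BASE.format(user=user, page=n) for n in range(1, pages + 1)]


urls = list_page_urls('sinyu890807', 7)
print(urls[0])
print(urls[-1])
```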
After simply changing the URL in the Python source code, the download page looks like this:
The results of the run are as follows:
HTML Analysis
First, parse the page source to get the URL and title of every resource in each list.
<dt> <div class="icon"></div> <div class="btns"></div>
The corresponding HTML is shown below:
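As a rough Python 3 sketch of this parsing step, the snippet below pulls the link and title out of each `<dt>` entry with regular expressions. The sample markup (the `<h3><a href=...>` part in particular), the `/detail/...` paths, and the function name `extract_resources` are assumptions for illustration. The real page structure should be checked in the browser before relying on these patterns.

```python
import re
from urllib.parse import urljoin

# One <dt>...</dt> entry per resource (assumed structure).
DT_RE = re.compile(r'<dt>(.*?)</dt>', re.S)
# First link inside an entry: href plus link text.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.S)


def extract_resources(block, base='http://download.csdn.net/'):
    """Return a list of (absolute_url, title) tuples found in block."""
    results = []
    for dt in DT_RE.findall(block):
        match = LINK_RE.search(dt)
        if match:
            href, title = match.groups()
            results.append((urljoin(base, href), title.strip()))
    return results


# Hypothetical fragment shaped like the entry shown above.
sample = '''
<dt> <div class="icon"></div> <div class="btns"></div>
  <h3><a href="/detail/example_user/123">Example resource</a></h3> </dt>
<dt> <div class="icon"></div> <div class="btns"></div>
  <h3><a href="/detail/example_user/456">Another resource</a></h3> </dt>
'''
resources = extract_resources(sample)
for url, title in resources:
    print(url, title)
```

Regular expressions are enough for a fixed, known layout like this one. For markup that varies, an HTML parser would be the more robust choice.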
Then follow each URL to the specific resource page to get what I call the message-box information:
The corresponding markup in Inspect Element is as follows; the value to extract is <span>0 </span>:
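A number wrapped in a `<span>` like this can be read with a short regular expression. The Python 3 sketch below assumes the first numeric `<span>` in the fragment is the wanted value. The function name `first_span_number` is hypothetical, and a real page would likely need a more specific pattern to pick the right `<span>`.

```python
import re

# A <span> whose content is only digits, with optional surrounding whitespace.
SPAN_INT_RE = re.compile(r'<span\b[^>]*>\s*(\d+)\s*</span>')


def first_span_number(html):
    """Return the first integer found inside a <span>, or None if absent."""
    match = SPAN_INT_RE.search(html)
    return int(match.group(1)) if match else None


print(first_span_number('<div><span>0 </span></div>'))
```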
The last thing I wanted to get is the comment information, but it is loaded by JS:
<div class="section-list panel panel-default"> <div class="panel-heading">
Finally, I hope this article is helpful to you! Next I plan to analyze how Python can get the comments loaded by JS. This article gives a simple example of analyzing a page by hand, and you can also use it to find which of a person's CSDN resources have the most downloads and the highest scores. It is basic knowledge, for reference only.
(By: Eastmount, 2015-7-21, 5 p.m. http://blog.csdn.net/eastmount/)
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
[Python Learning] A simple crawl of CSDN download resource information