Crawling Blogs with Python
By Wu Xueying
Take crawling Wang Yin's blog as an example:
import re
import urllib2

def getHtmlCode(url):
    # Download the raw HTML of a page.
    return urllib2.urlopen(url).read()

def findTitleUrl(htmlString):
    # Extract every href="..." value from the page.
    regTitleUrl = re.compile("href=\"(.+?)\"")
    return regTitleUrl.findall(htmlString)

def findTitleContent(htmlString):
    # Extract the link text between "> and </a>.
    regTitleContent = re.compile("\">(.+?)</a>")
    return regTitleContent.findall(htmlString)

htmlCode = getHtmlCode('http://www.yinwang.org/')
titleContent = findTitleContent(htmlCode)
titleUrl = findTitleUrl(htmlCode)
# The offsets 3 and 8 skip non-article links on the page;
# bound the loop so both indices stay in range.
for i in range(min(len(titleContent) - 3, len(titleUrl) - 8)):
    print titleContent[i + 3]
    print titleUrl[i + 8]
    htmlPage = getHtmlCode(titleUrl[i + 8])
    f = open("%s.html" % titleContent[i + 3], 'wb')
    f.write(htmlPage)
    f.close()
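The example above is Python 2 code (`urllib2`, `print` statements). A sketch of roughly the same crawler under Python 3 is shown below; the regular expressions and the index offsets are carried over from the original and assume the same page layout on yinwang.org:

```python
import re
import urllib.request


def get_html(url):
    # Download a page and decode it to text (assumes UTF-8 content).
    return urllib.request.urlopen(url).read().decode('utf-8', errors='replace')


def find_title_urls(html):
    # Extract every href="..." value from the page.
    return re.findall(r'href="(.+?)"', html)


def find_title_contents(html):
    # Extract the link text between "> and </a>.
    return re.findall(r'">(.+?)</a>', html)


if __name__ == '__main__':
    html = get_html('http://www.yinwang.org/')
    titles = find_title_contents(html)
    urls = find_title_urls(html)
    # The offsets 3 and 8 mirror the original script (skipping
    # non-article links); bound the loop so indices stay valid.
    for i in range(min(len(titles) - 3, len(urls) - 8)):
        page = get_html(urls[i + 8])
        with open('%s.html' % titles[i + 3], 'w', encoding='utf-8') as f:
            f.write(page)
```

Note that the hard-coded offsets are fragile: they break as soon as the page's navigation links change, which is one reason an HTML parser is usually preferred over regexes for real crawlers.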
Recommended learning process for Python scripting
Learning Process:
I. Lay a good foundation
1. Find a suitable beginner's book (Core Python Programming, 2nd edition and Dive Into Python are recommended), read it through once, and make sure you understand loops, conditionals, and the commonly used classes (skip anything too difficult for now).
2. Practice Python exercises frequently (Core Python Programming, 2nd edition has plenty of exercises after each chapter).
3. Join a Python discussion group.
4. Write blog posts summarizing what you learn about Python.
II. Start using Python for daily work.
For example: searching for files, batch processing, and web crawling with Python.
III. Start learning frameworks such as Django, Flask, and Tornado, and develop some web applications.
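As a sketch of the "searching for files" daily task mentioned in step II (the function name and arguments here are illustrative, not from the original):

```python
import os


def find_files(root, ext):
    """Recursively collect paths under `root` whose names end with `ext`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(ext):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == '__main__':
    # Example: list every Python file under the current directory.
    for path in find_files('.', '.py'):
        print(path)
```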
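For step III, a minimal Flask application gives a feel for what these frameworks do (Flask is one of the frameworks the text names; the route and message below are made up for illustration):

```python
from flask import Flask

app = Flask(__name__)


@app.route('/')
def index():
    # A single route that returns plain text.
    return 'Hello from Flask!'


if __name__ == '__main__':
    # Start the development server on http://127.0.0.1:5000/
    app.run(debug=True)
```

Django and Tornado follow the same basic idea, mapping URLs to handler functions, but with more structure (Django) or an asynchronous I/O model (Tornado).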
----------------------------
Resource recommendation:
A Byte of Python (Concise Python Tutorial)
Learning Programming with Children
Head First Python (Chinese edition)
Learn Python the Hard Way
Dive Into Python (Chinese edition, with course source code)
Core Python Programming
Deep Understanding of Python
The Python Standard Library
The Python Programming Guide
The Django Book (Chinese edition)
For a more systematic reference, see the official Python documentation and the official Django documentation. Learn, summarize, and above all keep practicing: that is how to learn Python.
How can Python fetch CSDN blog content?
import requests

r = requests.get('http://blog.csdn.net/u013055678')
A plain request like this may be blocked by CSDN's anti-crawler protection, so a browser-like User-Agent header needs to be sent:
r = requests.get('http://blog.csdn.net/u013055678',
                 headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:32.0) Gecko/20100101 Firefox/100'})
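Putting the pieces together, here is a hedged end-to-end sketch: the URL and User-Agent string come from the answer above, while the `fetch` helper, the timeout, and saving to `blog.html` are added for illustration:

```python
import requests

# A browser-like User-Agent so CSDN's anti-crawler check
# treats the request as coming from a normal browser.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:32.0) '
                  'Gecko/20100101 Firefox/100',
}


def fetch(url):
    # Download a page with the spoofed headers; raise on HTTP errors.
    r = requests.get(url, headers=HEADERS, timeout=10)
    r.raise_for_status()
    return r.text


if __name__ == '__main__':
    html = fetch('http://blog.csdn.net/u013055678')
    with open('blog.html', 'w', encoding='utf-8') as f:
        f.write(html)
```

`raise_for_status()` turns HTTP error codes (403, 404, ...) into exceptions, which makes it obvious when the anti-crawler protection has still rejected the request.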