Capturing webpage content with Python
Recently, I wanted to capture some data from the Internet for research. It takes only a little Python, so let's look at a simple implementation.
For example, suppose I want to capture Obama's weekly speeches.
Is there a one-step approach that can be implemented quickly in a powerful language such as Python?
First, let's look at the source code of the index page.
We can see that the information we need sits behind a small URL in the page source.
First, open the index page:
import sys, urllib
url = "http://www.putclub.com/html/radio/VOA/presidentspeech/index.html"
wp = urllib.urlopen(url)
print "start download..."
content = wp.read()
Next we will extract the content of each speech.
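The download step above uses the Python 2 API. For readers on Python 3, where `urllib.urlopen` moved to `urllib.request.urlopen`, a minimal sketch of the same step might look like the following. The `fetch` helper, its `opener` parameter, and the UTF-8 default are assumptions made for illustration; the real page may use a different encoding.

```python
import urllib.request


def fetch(url, opener=urllib.request.urlopen, encoding="utf-8"):
    """Download `url` and return the response body as text.

    `opener` is injectable so the function can be exercised without
    network access; `encoding` is an assumed default, not a known fact
    about the target site.
    """
    with opener(url) as response:
        raw = response.read()
    # Replace undecodable bytes instead of failing on an unexpected charset.
    return raw.decode(encoding, errors="replace")
```

Calling `fetch("http://www.putclub.com/html/radio/VOA/presidentspeech/index.html")` would then play the role of `wp.read()` in the original script.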
The idea is to search, after "center_box", for the text between each "href=" and "target". See the source code of the webpage.
The result is the relative path of each article; prefixing it with http://www.putclub.com/ gives the article's full URL.
print content.count("center_box")
index = content.find("center_box")
# Drop everything up to the first "center_box"
content = content[content.find("center_box")+1:]
# Keep the text between the first "href=" and the first "target"
content = content[content.find("href=")+7:content.find("target")-2]
filename = content
url = "http://www.putclub.com/" + content
print content
print url
wp = urllib.urlopen(url)
print "start download..."
content = wp.read()
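The snippet above picks out only the first link after "center_box". As a rough sketch of how the same string-search idea could collect every link, the following Python 3 function walks through all `href="..."` attributes that appear after the marker. The function name `extract_links`, its parameters, and the sample markup in the usage note are hypothetical; note that it gathers every link after the marker, not only those inside the box.

```python
from urllib.parse import urljoin


def extract_links(html, marker="center_box", base="http://www.putclub.com/"):
    """Return absolute URLs for every href="..." found after `marker`."""
    start = html.find(marker)
    if start == -1:
        return []  # marker not present, nothing to extract

    links = []
    pos = start
    while True:
        href = html.find('href="', pos)
        if href == -1:
            break
        begin = href + len('href="')
        end = html.find('"', begin)
        if end == -1:
            break  # unterminated attribute, stop scanning
        # urljoin handles both "/html/x.html" and "html/x.html" paths.
        links.append(urljoin(base, html[begin:end]))
        pos = end + 1
    return links
```

For example, with the made-up input `'<div class="center_box"><a href="/html/a.html" target="_blank">A</a></div>'`, the function returns a one-element list containing `http://www.putclub.com/html/a.html`.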
With the url of the article content, you can filter the content in the same way.
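# print content
# The start and end marker strings did not survive in this copy of the post; "..." marks each gap.
print content.count("...")
content = content[content.find("..."):]
content = content[:content.find("...")]
Save it and print it again:
filename = filename.replace('/', "-", filename.count("/"))
fp = open(filename, "w+")
fp.write(content)
fp.close()
print content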
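The filter-and-save step can be sketched in Python 3 as below. This is only an illustration under stated assumptions: the helper names `between`, `safe_filename`, and `save` are hypothetical, and the start and end markers are left as parameters because the ones the original script used are not known.

```python
def between(text, start_marker, end_marker):
    """Return the text after `start_marker` and before the next `end_marker`."""
    start = text.find(start_marker)
    if start == -1:
        return ""  # start marker missing
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return text[start:]  # no end marker, keep the rest
    return text[start:end]


def safe_filename(path):
    """Turn a URL path such as "/html/a.html" into a flat file name."""
    return path.strip("/").replace("/", "-")


def save(path, body, encoding="utf-8"):
    """Write `body` to a file named after `path` and return the file name."""
    name = safe_filename(path)
    with open(name, "w", encoding=encoding) as fp:
        fp.write(body)
    return name
```

Unlike the original `filename.replace(...)` call, `safe_filename` also strips leading and trailing slashes, so the saved file never starts with a dash.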
OK, all done! Save the script as a .pyw file. From then on, you only need to double-click it to save the content of Obama's weekly speech directly.