A while back, when I was writing blog posts, I captured some of the pictures with QQ's screenshot tool, so the image files ended up with names like "QQ screenshot 20120926174732-300x15.png" (in the real file names the "screenshot" part is written in Chinese). Yesterday, while backing up the site's files over FTP, I found that these Chinese names show up garbled in FlashFXP, which is uncomfortable to look at. So I wrote a Python script that crawls the whole site, collects the image names on each article page, and checks whether any of them has the QQ-screenshot form. When one does, the script writes the image address and the address of the article it belongs to into a file, and I then go through that file and fix the images one by one.
Okay, here is the program code:
# Python 2 script; requires the BeautifulSoup (bs4) package.
import urllib2
from bs4 import BeautifulSoup
import re
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

baseurl = "http://www.jb51.net/dont-worry.html"
# Note: the starting address is the address of the first article. From each
# article page, BeautifulSoup can be used to get the address of the previous article.
file = open(r"E:\123.txt", "a")


def pageloop(url):
    # Check every <img> on one article page and record the badly named ones.
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    img = soup.findAll(['img'])
    if img == []:
        print "The current page does not have a picture"
        return
    else:
        for myimg in img:
            link = myimg.get('src')
            print link
            # \S* matches the non-space characters (including Chinese ones) after "QQ".
            pattern = re.compile(r'QQ\S*[0-9]*png')
            badimg = pattern.findall(str(link))
            if badimg:
                print url
                file.write(link + "\n")
                file.write(url + "\n")


def getthenextpage(url):
    # Process the current page, then follow its "previous article" link.
    pageloop(url)
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    for spanclass in soup.findAll(attrs={"class": "article-nav-prev"}):
        # print spanclass
        if str(spanclass).find('article-nav-prev') != -1:
            pattern = re.compile(r'http://www.jb51.net/\S*html')
            pageurl = pattern.findall(str(spanclass))
            for i in pageurl:
                # print i
                getthenextpage(i)


getthenextpage(baseurl)
print "The end!"
file.close()
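The script above recurses once per article, so a long chain of "previous article" links could hit Python's recursion limit, and it downloads every page twice. As a rough illustration only, here is a minimal Python 3 sketch of the same idea written as a loop, using just the standard library instead of BeautifulSoup. The names `PageScanner`, `find_bad_images`, `fetch`, and `max_pages` are made up for this example, and so is the assumption that a bad name starts with "QQ" and ends in ".png".

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: a "bad" file name starts with "QQ" and ends in ".png".
# Ignoring case is a choice made for this sketch, not part of the original script.
BAD_NAME = re.compile(r'QQ\S*\.png', re.IGNORECASE)


class PageScanner(HTMLParser):
    """Collects <img> sources and the first link at or after the
    element whose class is "article-nav-prev"."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.prev_url = None
        self._in_prev = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'img' and attrs.get('src'):
            self.images.append(attrs['src'])
        if 'article-nav-prev' in (attrs.get('class') or '').split():
            self._in_prev = True
        if (tag == 'a' and self._in_prev
                and self.prev_url is None and attrs.get('href')):
            self.prev_url = attrs['href']


def find_bad_images(start_url, fetch, max_pages=1000):
    """Follow the "previous article" links from start_url and return a list
    of (image address, article address) pairs for badly named images.

    fetch is any callable that takes a URL and returns the page HTML as str.
    """
    found, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # remember visited pages so a link cycle cannot loop forever
        scanner = PageScanner()
        scanner.feed(fetch(url))
        for src in scanner.images:
            name = src.rsplit('/', 1)[-1]  # file-name part of the address
            if BAD_NAME.fullmatch(name):
                found.append((src, url))
        # Resolve relative links against the current page address.
        url = urljoin(url, scanner.prev_url) if scanner.prev_url else None
    return found


# Made-up sample pages standing in for real downloads.
pages = {
    'http://example.com/b.html':
        '<img src="/up/QQ截图20120926174732-300x15.png">'
        '<span class="article-nav-prev">'
        '<a href="http://example.com/a.html">prev</a></span>',
    'http://example.com/a.html': '<img src="/up/logo.png">',
}
print(find_bad_images('http://example.com/b.html', pages.__getitem__))
```

Against a live site, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode('utf-8')`, assuming the pages are UTF-8 encoded. Passing the download function in as a parameter also makes it possible to try the logic on sample HTML, as above, without touching the network.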
Finally, a word to those who, like me, have only just started running a website: it is best to name pictures using digits, English words, or pinyin. Otherwise, changing the names later is a lot of trouble. So it pays to build good habits from the start and give articles and pictures names that follow a proper naming convention.
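To make that naming advice concrete, here is a small Python 3 sketch that checks whether a file name sticks to ASCII letters, digits, dots, underscores, and hyphens, and proposes a cleaned-up name when it does not. The helper names `is_safe_name` and `suggest_name` and the fallback prefix `img` are hypothetical, and the "safe" character set is only one reasonable choice. The sketch simply drops the offending characters; it does not convert Chinese to pinyin.

```python
import os
import re

# Assumption: "safe" means ASCII letters, digits, '.', '_' and '-',
# with a letter or digit as the first character.
SAFE_NAME = re.compile(r'[A-Za-z0-9][A-Za-z0-9._-]*')
UNSAFE_CHARS = re.compile(r'[^A-Za-z0-9._-]')


def is_safe_name(filename):
    """Return True if the whole file name uses only the safe characters."""
    return SAFE_NAME.fullmatch(filename) is not None


def suggest_name(filename, prefix='img'):
    """Drop unsafe characters from the name, keeping the extension.

    If nothing is left of the stem, fall back to the given prefix.
    """
    stem, ext = os.path.splitext(filename)
    stem = UNSAFE_CHARS.sub('', stem).lstrip('._-')
    ext = UNSAFE_CHARS.sub('', ext)
    return (stem or prefix) + ext


print(is_safe_name('QQ截图20120926174732-300x15.png'))   # False
print(suggest_name('QQ截图20120926174732-300x15.png'))   # QQ20120926174732-300x15.png
```

A check like this could be run before uploading, so that problem names are caught at the start rather than after the articles have been published.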