Example: parsing 115 Network Disk links from a webpage's source code with Python
This article describes how to use Python to parse 115 Network Disk links out of a web page's source code. It is shared here for your reference. The method is as follows:
The file 1.txt holds the saved source code of the web page http://bbs.pediy.com/showthread.php? T0000144788133.
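The code is as follows:
import re

if __name__ == "__main__":
    # 1.txt holds the saved page source
    with open("c:\\1.txt") as fp:
        # Match everything from "http://u" to the end of the line
        https = re.compile(r"(http://u.*)")
        for url in https.findall(fp.read()):
            print(url)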
Output result:
http://u.115.com/file/f61cb107c8
http://u.115.com/file/f6806f45b8
http://u.115.com/file/f6ec42d4d3
http://u.115.com/file/f6deb05ec4
http://u.115.com/file/f6e51f6838
http://u.115.com/file/f66edaf8d3
http://u.115.com/file/f6d07e07b9
http://u.115.com/file/f6d7f585a8
http://u.115.com/file/f639d8b3cf
http://u.115.com/file/f6dcadbde6
http://u.115.com/file/f6ea3f01c1
http://u.115.com/file/f65b96a06f
http://u.115.com/file/f682da085a
http://u.115.com/file/f6486e698
http://u.115.com/file/f6b7491d9f
http://u.115.com/file/f622b7f9a7
http://u.115.com/file/f64e2424b9
http://u.115.com/file/f6e5132d4d
http://u.115.com/file/f655c10e86
http://u.115.com/file/f6b22e64e6
http://u.115.com/file/f6812126a4
http://u.115.com/file/f6523e625c
http://u.115.com/file/f63e0ccb28
http://u.115.com/file/f611e07b8a#
http://u.115.com/file/f6e047bccc#
http://u.115.com/file/f6d348d781#
http://u.115.com/file/f6ada24153#
http://u.115.com/file/f64f97518b#
http://u.115.com/file/f6f9ba96f8#
http://u.115.com/file/f650e06f38#
http://u.115.com/file/f683ee5b2a#
http://u.115.com/file/f69009bfc2#
http://u.115.com/file/f6ea427646#
http://u.115.com/file/f6acdc6b7f#
http://u.115.com/file/f6c85745d0#
http://u.115.com/file/f61a26cf12#
http://u.115.com/file/f631edf5c6#
http://u.115.com/file/f6b0fa6fb8#
http://u.115.com/file/f6f5fe8962#
http://u.115.com/file/f6bf975e0#
http://u.115.com/file/f6d522784c#
http://u.115.com/file/f6b5ac9991#
http://u.115.com/file/f62e80ced5#
http://u.115.com/file/f6bff09c0c#
http://u.115.com/file/f663fc4a54#
http://u.115.com/file/blpk4pv1
http://u.115.com/file/c4rjotdz
http://u.115.com/file/f6a960aca8#
http://u.115.com/file/efnn38jr
http://u.115.com/file/c4leomjd
http://u.115.com/file/dlpw9s6i
http://u.115.com/file/f6d3cbebe0#
http://u.115.com/file/f6de8062b2#
http://u.115.com/file/ef8og8la
http://u.115.com/file/f6f6391ac6#
http://u.115.com/file/f628d256ae#
http://u.115.com/file/f66a049dc9#
http://u.115.com/file/f62bf1750a#
http://u.115.com/file/f642e47260#
http://u.115.com/file/f693eb7c89#
http://u.115.com/file/f6ed68ba9b#
http://u.115.com/file/f6f099c3f9#
http://u.115.com/file/f61ac19339#
http://u.115.com/file/f6f3c78d2c#
http://u.115.com/file/f6696f6348#
http://u.115.com/file/f6e88eeefb#
http://u.115.com/file/f66471e4eb#
http://u.115.com/file/f672da54ae#
http://u.115.com/file/dnasw0kp#
http://u.115.com/file/dnagnndx#
http://u.115.com/file/clwr2xxg#
http://u.115.com/file/bhbcnnwe#
http://u.115.com/file/aq2rp9ga#
http://u.115.com/file/e601turs#
http://u.115.com/file/dn46qs7x#
http://u.115.com/file/clwonrwg#
http://u.115.com/file/dn43i7jf#
http://u.115.com/file/bhbgrnfz#
http://u.115.com/file/dnsl0kxp#
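The pattern (http://u.*) used above is greedy: it captures everything from "http://u" to the end of the line, which is probably why many of the results carry a trailing "#". The following is a minimal sketch of a tighter variant, written for Python 3. The name extract_links and the assumption that a file ID consists only of lowercase letters and digits are illustrative guesses based on the output above, not part of the original code.

```python
import re

# Assumption: a file ID is made up of lowercase letters and digits only,
# as in the output listed above. Adjust the character class if IDs differ.
LINK_RE = re.compile(r"http://u\.115\.com/file/[0-9a-z]+")


def extract_links(text):
    """Return the 115 links found in text, in order, without duplicates."""
    seen = set()
    links = []
    for url in LINK_RE.findall(text):
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


if __name__ == "__main__":
    # Made-up sample input that mimics a fragment of saved page source.
    sample = (
        '<a href="http://u.115.com/file/f611e07b8a#">part 1</a>\n'
        "http://u.115.com/file/blpk4pv1 http://u.115.com/file/blpk4pv1\n"
    )
    for url in extract_links(sample):
        print(url)
```

Because the pattern stops at the first character that cannot belong to an ID, the trailing "#" and any text after the link on the same line are left out, and repeated links are reported once.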
I hope this article will help you with Python programming.
Python webpage capture code
import urllib2
# The URL must include the scheme (http://); without it urlopen() raises ValueError.
print(urllib2.urlopen("http://www.baidu.com").read())
# urllib2.urlopen() returns a file-like object, so read its content with read().
# You can also save the page to a local file.
outfile = open("output.html", "wb")
outfile.write(urllib2.urlopen("http://www.baidu.com").read())
outfile.close()
Reference: docs.python.org/library/urllib2.html
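The urllib2 module exists only in Python 2. In Python 3 its functionality lives in urllib.request. The following is a rough Python 3 sketch of the same fetch-and-save steps. The helper names fetch and save_page and the 10-second timeout are choices made for this illustration.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the raw bytes served at url."""
    # urlopen() returns a file-like response object; read() yields bytes.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_page(url, path):
    """Download url, write the bytes to path unchanged, return the byte count."""
    data = fetch(url)
    # Binary mode keeps the page bytes exactly as the server sent them.
    with open(path, "wb") as outfile:
        outfile.write(data)
    return len(data)
```

With these helpers, save_page("http://www.baidu.com", "output.html") would correspond to the file-saving lines above. The result of fetch() is bytes, so it has to be decoded (for example with .decode("utf-8"), if that is the page's encoding) before it is searched with a text regular expression.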
The source code printed by Python is different from the source code on the web page.
I have run into this problem too. For example, Google Translate's result is clearly displayed on the web page, yet it does not appear in the source code that the script fetches. The reason is that some websites take anti-scraping measures, so the request has to simulate a browser. In other cases, additional requests are needed to handle cookies and account authentication.
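As a hedged illustration of what simulating a browser can mean in practice, the Python 3 sketch below sends a browser-like User-Agent header and keeps cookies between requests. The User-Agent string and the helper names are assumptions made for this example. Whether this is enough depends on the site, and a site that requires a login needs further steps that are not shown here.

```python
import http.cookiejar
import urllib.request

# Assumed example value; the User-Agent string of any real browser could be used.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def make_browser_opener():
    """Build an opener that stores cookies and sends them back on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def make_browser_request(url):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


def fetch_like_browser(opener, url, timeout=10):
    """Fetch url through the cookie-aware opener and return the raw bytes."""
    response = opener.open(make_browser_request(url), timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()
```

A typical use would be to call make_browser_opener() once and then pass the same opener to fetch_like_browser() for every page, so that cookies set by an earlier response are sent with later requests.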