A very simple web crawler written in Python
A classmate passed me this small web crawler. I found it interesting, so I am sharing it here. One caveat: use Python 2.3. Under Python 3.4 it will not run as written, because print is a statement here rather than a function, and urllib.urlopen moved to urllib.request.urlopen in Python 3.
The Python program is as follows:
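    import re, urllib

    strtxt = ""
    x = 1  # output file counter: 1.txt, 2.txt, ...
    ff = open("Wangzhi.txt", "r")
    for line in ff.readlines():
        line = line.strip()  # drop the trailing newline before using the line as a URL
        f = open(str(x) + ".txt", "w+")
        print line
        # collect the contents of every <p>...</p> block on the page
        n = re.findall(r"<p>(.*?)<\/p>", urllib.urlopen(line).read(), re.M)
        for i in n:
            if len(i) != 0:
                i = i.replace("", "")  # search string is empty in the source, so this is a no-op as written
                i = i.replace("<strong>", "")
                i = i.replace("</strong>", "")
                strtxt = strtxt + i
        # strip anchor tags and <span> blocks from the collected text
        strtxt = re.sub(r"<a href=(.*?)>", r"", strtxt)
        strtxt = re.sub(r"<a (.*?)>", r"", strtxt)
        strtxt = re.sub(r"<span>(.*?)</span>", r"", strtxt)
        strtxt = re.sub(r"<\/[Aa]>", r"", strtxt)
        #print strtxt
        f.write(strtxt)
        strtxt = ""
        f.close()  # the parentheses are required; a bare f.close does nothing
        x = x + 1
    ff.close()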
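For readers who only have Python 3, the same idea can be sketched roughly as follows. This is a port under assumptions, not the original program: the function names extract_paragraph_text and crawl are made up for illustration, the "gbk" page encoding is a guess about these particular pages, and the empty-string replace from the original is dropped because it has no effect.

```python
import re
import urllib.request


def extract_paragraph_text(html):
    """Return the contents of all <p>...</p> blocks with inline markup removed."""
    # Same non-greedy pattern as the original program. Like the original, it
    # only matches a bare <p> tag whose content sits on a single line.
    paragraphs = re.findall(r"<p>(.*?)</p>", html, re.M)
    text = ""
    for p in paragraphs:
        if len(p) != 0:
            p = p.replace("<strong>", "").replace("</strong>", "")
            text += p
    text = re.sub(r"<a href=(.*?)>", "", text)       # opening anchor tags with href
    text = re.sub(r"<a (.*?)>", "", text)            # any other opening anchor tags
    text = re.sub(r"<span>(.*?)</span>", "", text)   # spans, including their content
    text = re.sub(r"</[Aa]>", "", text)              # closing anchor tags
    return text


def crawl(url_file="Wangzhi.txt", encoding="gbk"):
    """Fetch each URL listed in url_file and write its text to 1.txt, 2.txt, ..."""
    with open(url_file, "r") as fh:
        urls = [line.strip() for line in fh if line.strip()]  # skip blank lines
    for index, url in enumerate(urls, start=1):
        with urllib.request.urlopen(url) as response:
            # encoding is an assumption; undecodable bytes are replaced
            html = response.read().decode(encoding, errors="replace")
        with open(str(index) + ".txt", "w", encoding="utf-8") as out:
            out.write(extract_paragraph_text(html))
```

Calling crawl() in the directory that holds Wangzhi.txt should produce the numbered output files. The text extraction is kept in its own function so it can be tried on a plain string without any network access.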
The contents of Wangzhi.txt are as follows:
http://sports.163.com/14/1126/22/AC0TVK4E00052UUC.html
http://sports.163.com/14/1126/22/AC0TGD4700052UUC.html
http://sports.163.com/14/1126/22/AC0TAHNK00052UUC.html
Results Analysis:
Running the program produces 3 output files (1.txt, 2.txt and 3.txt), one for each of the 3 URLs, each holding the text extracted from the corresponding page.
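To make the numbering concrete, here is a small Python 3 sketch of how output files line up with URLs. The helper name output_files is hypothetical and only mirrors the counter used in the program: the first URL goes to 1.txt, the second to 2.txt, and so on.

```python
def output_files(urls):
    """Map each output file name to the URL whose text it holds."""
    # The counter starts at 1, as x does in the program.
    return {str(i) + ".txt": url for i, url in enumerate(urls, start=1)}


urls = [
    "http://sports.163.com/14/1126/22/AC0TVK4E00052UUC.html",
    "http://sports.163.com/14/1126/22/AC0TGD4700052UUC.html",
    "http://sports.163.com/14/1126/22/AC0TAHNK00052UUC.html",
]
mapping = output_files(urls)
for name, url in mapping.items():
    print(name, "<-", url)
```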