1. First, find an online idiom website.
2. Examine the structure of the web page and define the regular expression.
Inspect the markup around each idiom to see what distinguishes it in the page source. The idioms all sit inside <a> tags, for example <a href="/cy0/93.html">Anrupanshi</a>: each idiom is the anchor text, and each idiom links to a different page. The only part of the link that varies is the pair of numbers in "/cy0/93.html", so the regular expression just needs to match those two numbers and capture the anchor text: reg = r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'.
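As a quick check of the pattern in step 2, the sketch below runs it against the sample anchor tag. Only the "/cy0/93.html" link comes from the page described here; the second anchor and the variable names are made up for illustration, and the snippet is written for Python 3.

```python
import re

# Pattern from step 2: two numeric groups for the link, one lazy group for the text.
reg = r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'

# Sample markup: the first anchor is the example from the text,
# the second is a hypothetical non-idiom link that should not match.
sample = (
    '<li><a href="/cy0/93.html">Anrupanshi</a></li>'
    '<li><a href="/about.html">About</a></li>'
)

# findall returns one tuple per match: (first number, second number, anchor text).
matches = re.findall(reg, sample)
for first_number, second_number, idiom in matches:
    print(first_number, second_number, idiom)
```

Because the pattern has three groups, the idiom itself is the third element of each tuple, which is why the crawler reads it with index 2.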
3. Now for the code:
# author: Jiqunpeng
# time: 20121124
# Python 2 script (uses urllib.urlopen and the print statement)
import urllib
import re

def gethtml(url):  # read the HTML content from the URL
    page = urllib.urlopen(url)
    html = page.read()
    page.close()
    return html

def getdictionary(html):  # match the idioms
    reg = r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'
    diclist = re.compile(reg).findall(html)
    return diclist

def getitemsite():  # number of pages per initial letter, counted by hand
    itemsite = {}  # start with an empty dictionary
    itemsite["A"] = 3
    itemsite["B"] = 21
    itemsite["C"] = 19
    itemsite["D"] = 18
    itemsite["E"] = 2
    itemsite["F"] = 14
    itemsite["G"] = 13
    itemsite["H"] = 15
    itemsite["J"] = 23
    itemsite["K"] = 6
    itemsite["L"] = 15
    itemsite["M"] = 12
    itemsite["N"] = 5
    itemsite["O"] = 1
    itemsite["P"] = 6
    itemsite["Q"] = 16
    itemsite["R"] = 8
    itemsite["S"] = 26
    itemsite["T"] = 12
    itemsite["W"] = 13
    itemsite["X"] = 16
    itemsite["Y"] = 35
    itemsite["Z"] = 21
    return itemsite

if __name__ == "__main__":
    dicfile = open("dic.txt", "w+")  # file the idioms are saved to
    domainsite = "http://chengyu.itlearner.com/list/"
    itemsite = getitemsite()
    for key, values in itemsite.items():
        for index in range(1, values + 1):
            site = key + "_" + str(index) + ".html"
            dictionary = getdictionary(gethtml(domainsite + site))
            for dic in dictionary:
                # dic[2] is the anchor text; the suffix tags it as an idiom for word segmentation
                dicfile.write(dic[2] + "@@CY\n")
        print 'Finished crawling idioms starting with ' + key
    dicfile.close()
    print 'All idioms crawled'
The idioms are saved to a text file, one per line, each followed by the suffix tag @@CY that marks it as an idiom.
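For reference, here is a minimal sketch of how a tagged line is formed and how the idiom can be recovered from it later. The helper names tag_line and untag_line are hypothetical and not part of the script; the sample entries are placeholders. The only thing taken from the script is the @@CY suffix.

```python
SUFFIX = "@@CY"  # suffix tag the crawler appends to each idiom

def tag_line(idiom):
    """Format one idiom the way the crawler writes it: idiom + tag + newline."""
    return idiom + SUFFIX + "\n"

def untag_line(line):
    """Recover the idiom from a saved line, or return None if the line is not tagged."""
    line = line.rstrip("\n")
    if line.endswith(SUFFIX):
        return line[:-len(SUFFIX)]
    return None

# Placeholder entries standing in for crawled idioms.
lines = [tag_line(word) for word in ["idiom_one", "idiom_two"]]
recovered = [untag_line(line) for line in lines]
```

Checking for the suffix before stripping it means that untagged lines, if the file is later merged with other word lists, are reported as None instead of being silently truncated.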
Finally, a word of caution: a regular expression that looks obviously correct may still fail to match, and whitespace characters are a common cause. Take this snippet, for example:
<div class="avatar_name">
<a href="/u/kkun/" title="kkun">kkun</a>
</div>
You cannot tell by eye which whitespace characters separate the first and second lines. To find out, run index = html.find('avatar_name') and then evaluate a slice starting at that index in the interactive interpreter, for example html[4677:4677+100] when the index is 4677; the interpreter prints the otherwise invisible characters as escapes such as \r\n or \t.
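The whitespace check can be sketched as follows. The html string here is a hand-built stand-in for a downloaded page, and the \r\n and tab between the lines are assumed purely for illustration; a real page may use different whitespace. repr makes the hidden characters visible, and \s* in the pattern lets the match tolerate them. The snippet is written for Python 3.

```python
import re

# Hand-built stand-in for the downloaded page; the \r\n and \t are assumptions.
html = '<div class="avatar_name">\r\n\t<a href="/u/kkun/" title="kkun">kkun</a>\r\n</div>'

# Locate the marker and show the raw characters that follow it.
index = html.find('avatar_name')
print(repr(html[index:index + 40]))

# A pattern that leaves no room for whitespace between the tags finds nothing;
# allowing \s* between the tags matches.
strict = re.search(r'<div class="avatar_name"><a href="/u/(.*?)/"', html)
tolerant = re.search(r'<div class="avatar_name">\s*<a href="/u/(.*?)/"', html)
```

Here strict is None while tolerant captures the user name, so placing \s* wherever the markup may break across lines is usually enough to make such a pattern robust.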