1. First, find an online idiom website.
2. Inspect the structure of the web page and define the regular expression.
Look at the markup around each idiom to see what the entries have in common. Viewing the page source shows that each idiom to grab sits inside a link tag, such as the one for Anrupanshi: the idiom is simply the link text, and each idiom points to a different link. In practice only the numbers in a path like "/cy0/93.html" differ, so the regular expression just needs to match a number twice, plus the link text. Define the pattern as reg = r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'.
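The matching step can be sketched in isolation: a pattern with two numeric groups for the link path and one lazy group for the link text, run against a small sample. The sample markup, the second link, and the helper name `extract_idioms` are assumptions made for illustration; only the "/cy0/93.html" path shape comes from the text above.

```python
import re

# Hypothetical sample of the list-page markup; only the "/cy0/93.html"
# path shape is taken from the article, the rest is illustrative.
SAMPLE_HTML = (
    '<li><a href="/cy0/93.html">Anrupanshi</a></li>\n'
    '<li><a href="/cy1/1204.html">Another idiom</a></li>\n'
)

# Two numeric groups for the path, one lazy group for the link text.
IDIOM_LINK = re.compile(r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>')


def extract_idioms(html):
    """Return the link text (third group) of every idiom link in html."""
    return [match[2] for match in IDIOM_LINK.findall(html)]


print(extract_idioms(SAMPLE_HTML))
```

Because the pattern has three groups, `findall` returns one three-item tuple per link, which is why the idiom itself is taken from index 2.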
3. Write the code:
# author: Jiqunpeng
# time: 20121124
import urllib
import re

def gethtml(url):  # read the HTML content from the URL
    page = urllib.urlopen(url)  # Python 2 API
    html = page.read()
    page.close()
    return html

def getdictionary(html):  # match the idioms
    # Two numbers from the link path, then the idiom text. The tag part of
    # this pattern is reconstructed from the description in step 2.
    reg = r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'
    diclist = re.compile(reg).findall(html)
    return diclist

def getitemsite():  # page counts for each starting letter, tallied by hand
    itemsite = {}  # declare an empty dictionary
    itemsite["A"] = 3
    itemsite["E"] = 2
    itemsite["K"] = 6
    itemsite["N"] = 5
    itemsite["O"] = 1
    itemsite["P"] = 6
    itemsite["R"] = 8
    # The counts for B, C, D, F, G, H, J, L, M, Q, S, T, W, X, Y and Z
    # are missing from this copy; add them from the site's list pages.
    return itemsite

if __name__ == "__main__":
    dicfile = open("dic.txt", "w+")  # file that stores the idioms
    domainsite = "http://chengyu.itlearner.com/list/"
    itemsite = getitemsite()
    for key, values in itemsite.items():
        for index in range(1, values + 1):
            site = key + "_" + str(index) + ".html"
            dictionary = getdictionary(gethtml(domainsite + site))
            for dic in dictionary:
                # tag the entry as an idiom, for use during word segmentation
                dicfile.write(dic[2] + "@@CY\n")
        print key + ': idioms for this letter crawled'
    dicfile.close()
    print 'All idioms crawled'
The idioms are saved to a text file, each followed by a suffix tag.
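The page-address loop and the suffix tag can be tried out without touching the network. The sketch below uses Python 3 syntax; the helper names `page_urls` and `tag_idiom` are hypothetical, the tag is assumed to be `@@CY`, and the two-letter page-count table is only a small excerpt for illustration.

```python
BASE_URL = "http://chengyu.itlearner.com/list/"


def page_urls(page_counts, base=BASE_URL):
    """Build one list-page URL per letter and page number, e.g. A_1.html."""
    urls = []
    for letter in sorted(page_counts):
        for index in range(1, page_counts[letter] + 1):
            urls.append(base + letter + "_" + str(index) + ".html")
    return urls


def tag_idiom(idiom, tag="@@CY"):
    """Append the suffix tag and a newline, one idiom per stored line."""
    return idiom + tag + "\n"


if __name__ == "__main__":
    for url in page_urls({"A": 3, "E": 2}):
        print(url)
    print(repr(tag_idiom("Anrupanshi")))
```

Sorting the letters makes the crawl order predictable, which is a convenience added here rather than something the listing above does.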
Finally, note that a regular expression you are sure is correct may still fail to match. Whitespace characters are the usual culprit. For example, suppose you want to parse:
Kkun
You cannot tell by eye which whitespace characters sit between the first line and the second. To find out, locate the position with index = html.find('avatar_name') and then inspect a slice such as html[4677:4677+100], which shows exactly which characters are there.
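One way to make those invisible characters visible is to print the `repr()` of a slice around the match position, and then allow for them in the pattern with `\s*`. The markup below is hypothetical, since the original snippet did not survive; only the `avatar_name` marker and the name Kkun come from the text.

```python
import re

# Hypothetical markup: a Windows line ending and a tab before the link.
html = '<div class="avatar_name">\r\n\t<a href="/u/kkun">Kkun</a>\r\n</div>'

index = html.find("avatar_name")
# repr() shows \r, \n and \t as escapes instead of rendering them.
print(repr(html[index:index + 40]))

# A pattern that assumes nothing sits between the tags does not match...
strict = re.search(r'avatar_name"><a [^>]*>(.*?)</a>', html)
# ...while \s* absorbs whatever whitespace is actually there.
tolerant = re.search(r'avatar_name">\s*<a [^>]*>(.*?)</a>', html)

print(strict)             # None
print(tolerant.group(1))  # Kkun
```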