Crawling an idiom website with Python regular expressions

Source: Internet
Author: User

1. First, find an online idiom website

2. Inspect the structure of the web page and define the regular expression

Look at the page source around the idioms you want to capture and see what the tags have in common. Every idiom sits inside an <a> tag, for example <a href="/cy0/93.html">Anrupanshi</a>. The idiom is the anchor text, and each idiom links to a different page. Only the two numbers in "/cy0/93.html" change from link to link, so the regular expression has to match those two numbers and capture the anchor text. The pattern is defined as: <a href="/cy(\d+)/(\d+)\.html">(.*?)</a>
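
As a quick check of the pattern, the sketch below runs it against the sample anchor tag quoted above. The second tag, the surrounding <li> markup and the variable names are made up for illustration; only the "/cy0/93.html" link comes from the page.

```python
import re

# Pattern from the text: two numeric groups for the link, one lazy group for the idiom.
PATTERN = re.compile(r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>')

# The first tag is the example from the text; the second is a hypothetical
# neighbour added only to show that several links are matched in one pass.
sample = (
    '<li><a href="/cy0/93.html">Anrupanshi</a></li>'
    '<li><a href="/cy12/4567.html">Example</a></li>'
)

# findall returns one tuple per link: (folder number, page number, anchor text).
matches = PATTERN.findall(sample)
for folder, page, idiom in matches:
    print(folder, page, idiom)
```

Because the pattern has three groups, the idiom itself is always the third element of each tuple.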
3. The code

The code is as follows:

# author: Jiqunpeng
# time: 20121124
import re
import urllib.request


def gethtml(url):  # read the HTML content from the URL (as raw bytes)
    page = urllib.request.urlopen(url)
    html = page.read()
    page.close()
    return html


def getdictionary(html):  # match the idioms
    # Bytes pattern, so the page can be searched without decoding it first.
    reg = rb'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'
    diclist = re.compile(reg).findall(html)
    return diclist


def getitemsite():  # number of pages per initial letter, counted by hand
    itemsite = {}  # start with an empty dictionary
    itemsite["A"] = 3
    itemsite["B"] = 21
    itemsite["C"] = 19
    itemsite["D"] = 18
    itemsite["E"] = 2
    itemsite["F"] = 14
    itemsite["G"] = 13
    itemsite["H"] = 15
    itemsite["J"] = 23
    itemsite["K"] = 6
    itemsite["L"] = 15
    itemsite["M"] = 12
    itemsite["N"] = 5
    itemsite["O"] = 1
    itemsite["P"] = 6
    itemsite["Q"] = 16
    itemsite["R"] = 8
    itemsite["S"] = 26
    itemsite["T"] = 12
    itemsite["W"] = 13
    itemsite["X"] = 16
    itemsite["Y"] = 35
    itemsite["Z"] = 21
    return itemsite


if __name__ == "__main__":
    dicfile = open("dic.txt", "wb")  # file the idioms are saved to
    domainsite = "http://chengyu.itlearner.com/list/"
    itemsite = getitemsite()
    for key, values in itemsite.items():
        for index in range(1, values + 1):
            site = key + "_" + str(index) + ".html"
            dictionary = getdictionary(gethtml(domainsite + site))
            for dic in dictionary:
                # tag the entry as an idiom, for later use in word segmentation
                dicfile.write(dic[2] + b"@@CY\n")
        print("Idioms starting with " + key + " crawled")
    dicfile.close()
    print("All idioms crawled")

The idioms are saved to a text file, one per line, each followed by a suffix tag.
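
To make the file format concrete, here is a minimal sketch of how a later step might strip the tag again. It assumes the tag is the literal suffix "@@CY" at the end of each line; the helper name strip_tag and the sample lines are hypothetical.

```python
TAG = "@@CY"  # suffix assumed to follow every idiom in the output file


def strip_tag(line):
    """Return the idiom without the trailing tag, or None if the line is not tagged."""
    text = line.rstrip("\n")
    if text.endswith(TAG):
        return text[: -len(TAG)]
    return None


# Hypothetical lines in the format the crawler writes, plus one untagged line.
lines = ["Anrupanshi@@CY\n", "not an idiom\n"]
idioms = [strip_tag(line) for line in lines]
print(idioms)
```
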
Finally, a caveat about designing regular expressions: a pattern that looks obviously correct may still fail to match. When that happens, check the whitespace characters. For example, when parsing:

<div class="avatar_name">
<a href="/u/kkun/" title="kkun">kkun</a>
</div>

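You cannot tell by eye which whitespace characters separate the first and second lines. To find out, locate the fragment with index = html.find('avatar_name') and then look at a slice starting there, for example html[4677:4677+100] if the index is 4677. In the interactive interpreter the slice is displayed with its whitespace characters written as escapes such as \r\n or \t.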
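
The inspection step can be sketched as follows. The page fragment, including its \r\n line endings, its indentation and the lowercase class name, is an assumption made up for the example. The two patterns are likewise illustrative: one hard-codes a single space between the tags, the other uses \s* to accept any run of whitespace.

```python
import re

# Hypothetical page fragment; the exact whitespace between the tags is assumed.
html = '<div class="avatar_name">\r\n    <a href="/u/kkun/" title="kkun">kkun</a>\r\n</div>'

# Locate the fragment and print its repr so the whitespace shows up as escapes.
index = html.find('avatar_name')
print(repr(html[index:index + 40]))

# A pattern that expects exactly one space between the tags does not match...
strict = re.search(r'<div class="avatar_name"> <a href="/u/(.*?)/"', html)
# ...while \s* tolerates any whitespace, including line breaks and indentation.
tolerant = re.search(r'<div class="avatar_name">\s*<a href="/u/(.*?)/"', html)

print(strict is None)
print(tolerant.group(1))
```

Once the actual whitespace is known, either match it exactly or, more robustly, use \s* wherever the markup may contain line breaks.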
