Crawling an idiom website with Python regular expressions

Source: Internet
Author: User
1. First, find an online idiom website

2. Inspect the structure of the web page and define the regular expression

Look at what characterizes the tag that holds each idiom. Viewing the page source shows that every idiom sits inside a link tag, for example <a href="/cy0/93.html">Anrupanshi</a>. The idiom is the link text, and each idiom points to a different link; in practice only the two numbers in "/cy0/93.html" differ. The regular expression therefore only needs to match a number twice and capture the link text, along the lines of reg = '<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'.
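
To make the matching step concrete, here is a minimal sketch that runs such a pattern over a small piece of markup. The sample HTML is made up for illustration (only the "/cy0/93.html" link comes from the text above; the second idiom and the "About" link are hypothetical), and the real page may carry extra attributes that the pattern would need to allow for.

```python
import re

# Hypothetical sample markup; only the first link is taken from the article.
SAMPLE_HTML = (
    '<li><a href="/cy0/93.html">Anrupanshi</a></li>\n'
    '<li><a href="/cy12/4051.html">Aibushishou</a></li>\n'
    '<li><a href="/about.html">About</a></li>\n'
)

# Two numeric groups for the link path, one lazy group for the link text.
IDIOM_LINK = re.compile(r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>')


def extract_idioms(html):
    """Return the link text (third capture group) of every idiom link."""
    # findall() returns one tuple per match when the pattern has several groups.
    return [match[2] for match in IDIOM_LINK.findall(html)]


print(extract_idioms(SAMPLE_HTML))
```

Because the pattern has three capture groups, each match is a tuple, which is why the third element (index 2) holds the idiom itself. The "About" link is skipped because its path does not fit the "/cy<number>/<number>.html" shape.
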
3. The code

The code is as follows:

# author: Jiqunpeng
# time: 20121124
# Written for Python 3; the page is handled as raw bytes, so no encoding is assumed.
import re
import urllib.request


def gethtml(url):
    """Read the HTML content from the URL (returned as bytes)."""
    with urllib.request.urlopen(url) as page:
        return page.read()


def getdictionary(html):
    """Match the idioms; each result is a tuple (number, number, idiom)."""
    # Pattern follows the description above: idiom links look like
    # <a href="/cy0/93.html">...</a>. Adjust it if the real markup differs.
    reg = rb'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'
    diclist = re.compile(reg).findall(html)
    return diclist


def getitemsite():
    """Page counts per initial letter, tallied by hand."""
    itemsite = {}  # start with an empty dictionary
    itemsite["A"] = 3
    itemsite["E"] = 2
    itemsite["K"] = 6
    itemsite["N"] = 5
    itemsite["O"] = 1
    itemsite["P"] = 6
    itemsite["R"] = 8
    # Counts for the remaining initials are not given here; check the
    # site's list pages and add them before running a full crawl.
    return itemsite


if __name__ == "__main__":
    domainsite = "http://chengyu.itlearner.com/list/"
    itemsite = getitemsite()
    with open("dic.txt", "wb") as dicfile:  # file that stores the idioms
        for key, values in itemsite.items():
            for index in range(1, values + 1):
                site = key + "_" + str(index) + ".html"
                dictionary = getdictionary(gethtml(domainsite + site))
                for dic in dictionary:
                    # Tag the entry as an idiom; used later for word segmentation.
                    dicfile.write(dic[2] + b"@@CY\n")
            print(key + ": idioms crawled")
    print("All idioms crawled")

The idioms are saved to a text file, one per line, each with a suffix tag appended.
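
As a rough illustration of that file format, the sketch below writes tagged lines and reads them back. It assumes the tag is the literal string "@@CY" followed by a newline; the helper names write_tagged and read_tagged are hypothetical and not part of the original script, and an in-memory buffer stands in for the text file.

```python
import io

IDIOM_TAG = "@@CY"  # assumed suffix tag marking a line as an idiom


def write_tagged(out, idioms):
    """Write one idiom per line, each followed by the suffix tag."""
    for idiom in idioms:
        out.write(idiom + IDIOM_TAG + "\n")


def read_tagged(lines):
    """Yield the idiom from every line that ends with the suffix tag."""
    for line in lines:
        entry = line.rstrip("\n")
        if entry.endswith(IDIOM_TAG):
            yield entry[:-len(IDIOM_TAG)]


buffer = io.StringIO()  # stands in for the output file
write_tagged(buffer, ["Anrupanshi", "Aibushishou"])
buffer.seek(0)
print(list(read_tagged(buffer)))
```

Keeping the tag at the end of the line means a later step, such as a word segmenter, can tell idiom entries apart from any other vocabulary stored in the same file.
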
Finally, note that a regular expression you are sure is correct may still fail to match. Whitespace characters are a common cause. For example, suppose you want to parse:

Kkun


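You cannot tell which whitespace characters separate the first line from the second. To find out, locate the spot with index = html.find('avatar_name') and then look at a slice starting there; for instance, if index is 4677, html[4677:4677+100] reveals the otherwise invisible characters.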
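
The inspection described above can be sketched as follows. The markup here is hypothetical (only the 'avatar_name' marker and the name kkun come from the text), and the "\r\n" plus tab between the lines is just one example of whitespace that is easy to overlook.

```python
import re

# Hypothetical markup with a Windows line ending and a tab between the tags.
html = '<div class="avatar_name">\r\n\t<a href="/u/kkun">kkun</a>\r\n</div>'

index = html.find('avatar_name')
# repr() prints escapes such as \r, \n and \t instead of rendering them.
print(repr(html[index:index + 40]))

# A pattern that expects exactly one "\n" between the tags does not match...
strict = re.compile(r'avatar_name">\n<a href="[^"]*">(.*?)</a>')
# ...while \s* tolerates any run of whitespace, including "\r\n\t".
tolerant = re.compile(r'avatar_name">\s*<a href="[^"]*">(.*?)</a>')

print(strict.search(html))
print(tolerant.search(html).group(1))
```

Once the real whitespace is visible, replacing a hard-coded newline or space in the pattern with \s* is usually enough to make the expression match again.
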
