Crawling idioms with Python, step by step

Source: Internet
Author: User

An NLP project I am working on requires a library of idioms. What I found online comes with detailed explanations, but I need only the idioms themselves, so I wrote a Python program to crawl them.

1. First, find an online idiom website

The website I selected is chengyu.itlearner.com; its list pages live under http://chengyu.itlearner.com/list/.

2. View the page structure and define the regular expression

Look at the tags that hold the idioms to be captured. The page source shows that every idiom sits inside an <a> tag, for example: <a href="/cy0/93.html">ANRU Rock</a>. Each idiom is the text of a link, and each idiom has a different link: the numbers in "/cy0/93.html" vary from one idiom to the next. The regular expression therefore matches digits twice and captures the link text as a third group: reg = "<a href=\"/cy(\d+)/(\d+)\.html\">(.*?)</a>".
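As a quick check of the pattern described above, the following minimal sketch runs it against the sample tag only. The variable names are my own and no network access is involved.

```python
import re

# Pattern from the text: two numeric groups from the link, then the idiom text
reg = r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'

# Sample tag taken from the page source quoted above
sample = '<a href="/cy0/93.html">ANRU Rock</a>'

matches = re.findall(reg, sample)
print(matches)  # [('0', '93', 'ANRU Rock')]
```

Each match is a tuple of three groups, which is why the crawler reads the idiom from index 2.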

3. The code

    # Author: jiqunpeng
    # Date: 20121124
    import re
    import urllib.request


    def gethtml(url):
        # Read the raw HTML bytes from the URL
        with urllib.request.urlopen(url) as page:
            return page.read()


    def getdictionary(html):
        # Match the idiom links; the third group is the idiom text.
        # The page is handled as raw bytes so no encoding has to be assumed.
        reg = rb'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'
        return re.compile(reg).findall(html)


    def getitemsite():
        # Manually counted number of list pages for each initial letter
        return {
            "A": 3, "B": 21, "C": 19, "D": 18, "E": 2, "F": 14,
            "G": 13, "H": 15, "J": 23, "K": 6, "L": 15, "M": 12,
            "N": 5, "O": 1, "P": 6, "Q": 16, "R": 8, "S": 26,
            "T": 12, "W": 13, "X": 16, "Y": 35, "Z": 21,
        }


    if __name__ == "__main__":
        domainsite = "http://chengyu.itlearner.com/list/"
        itemsite = getitemsite()
        # Save the idioms to a file, written as the bytes that were received
        with open("dic.txt", "wb") as dicfile:
            for key, values in itemsite.items():
                for index in range(1, values + 1):
                    site = key + "_" + str(index) + ".html"
                    dictionary = getdictionary(gethtml(domainsite + site))
                    for dic in dictionary:
                        # The "@cy" suffix tags each entry for word segmentation
                        dicfile.write(dic[2] + b"@cy\n")
                print(key + ": idioms for this letter crawled")
        print("All idioms have been crawled")

The program saves the idioms to a text file and appends a suffix tag to each one. This approach is clumsy: the crawler should find the next page automatically instead of relying on page counts determined by hand beforehand. I will improve it when I have time; lately I have been busy with projects and exams.
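One possible way to avoid counting pages by hand is to keep requesting pages for a letter until one comes back missing or without idiom links. The sketch below is only an illustration under that assumption (I have not checked how the site behaves past the last page). The `crawl_letter` function and its `fetch` parameter are hypothetical names, and an in-memory dictionary with made-up content stands in for the website.

```python
import re

LINK_RE = re.compile(r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>')


def crawl_letter(letter, fetch, max_pages=1000):
    """Collect idioms for one letter, stopping at the first page without matches.

    `fetch` is a hypothetical callable that takes a page name such as
    "A_1.html" and returns its HTML as text, or None if the page is missing.
    """
    idioms = []
    for index in range(1, max_pages + 1):
        html = fetch(letter + "_" + str(index) + ".html")
        if not html:
            break  # missing page: assume we ran past the last one
        found = LINK_RE.findall(html)
        if not found:
            break  # no idiom links: assume we ran past the last one
        idioms.extend(text for _, _, text in found)
    return idioms


# Usage with an in-memory stand-in for the website (made-up page contents)
fake_site = {
    "A_1.html": '<a href="/cy0/1.html">first</a><a href="/cy0/2.html">second</a>',
    "A_2.html": '<a href="/cy0/3.html">third</a>',
}
result = crawl_letter("A", fake_site.get)
print(result)  # ['first', 'second', 'third']
```

Passing the fetch function in as a parameter keeps the paging logic testable without network access; a real run would pass a function that downloads and decodes each page.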

Note: when designing a regular expression, you may find that the expression looks correct but still fails to match. Watch out for whitespace characters, for example when parsing:

 
<div class="avatar_name">
<a href="/u/Kkun/" title="Kkun">Kkun</a></div>

The whitespace between the first and second lines is invisible when you read the source. To reveal it, locate the snippet with index = html.find('avatar_name') and inspect html[index:index + 100] (in my case html[4677:4677 + 100]).
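The whitespace check described above can be sketched as follows. The HTML string is a made-up stand-in for the real page, with a line break and indentation placed between the two tags. `repr()` shows whitespace as escape sequences, and `\s*` in the pattern tolerates it.

```python
import re

# Made-up snippet: a line break and indentation sit between the two tags
html = '<div class="avatar_name">\n    <a href="/u/Kkun/" title="Kkun">Kkun</a></div>'

index = html.find("avatar_name")
# repr() prints whitespace as escape sequences such as \n
print(repr(html[index:index + 40]))

# A strict pattern fails because it expects the tags to be adjacent
strict = re.findall(r'<div class="avatar_name"><a href="/u/(.*?)/"', html)
# Allowing optional whitespace with \s* makes the match succeed
tolerant = re.findall(r'<div class="avatar_name">\s*<a href="/u/(.*?)/"', html)
print(strict, tolerant)  # [] ['Kkun']
```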

 

 
