An idiom dictionary is required for an NLP project I am working on, and I need the idioms themselves, nothing else. A site online has a detailed listing of exactly what I am looking for, so I wrote a Python program to crawl the idioms.
1. First, find an online idiom website
The website I selected is http://chengyu.itlearner.com.
2. View the webpage structure and define the regular expression

Let's look at the characteristics of the tags that wrap the idioms we want to capture. Checking the page source shows that every idiom sits inside an <a> tag, for example: <a href="/cy0/93.html">安如磐石</a>. An idiom is simply the text of a link. Different idioms have different links; in practice only the numbers in "/cy0/93.html" differ, so the regular expression captures both numbers: reg = r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'.
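To check that the expression behaves as intended before pointing it at the live site, here is a small self-contained test (Python 3 syntax; the idiom texts and URLs below are placeholders shaped like the site's markup, not real entries):

```python
import re

# Hypothetical snippet mimicking the listing page's markup.
html = ('<a href="/cy0/93.html">idiom one</a>'
        '<a href="/cy1/208.html">idiom two</a>')

# Two numeric groups (directory number, page number) plus the idiom text.
reg = r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'

matches = re.findall(reg, html)
print(matches)  # [('0', '93', 'idiom one'), ('1', '208', 'idiom two')]
```

With `findall`, each match is a tuple of the three groups, so the idiom text is at index 2 of each tuple, which is exactly what the crawler below writes to the output file.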
3. The code
```python
# Author: jiqunpeng
# Time:   20121124
import urllib
import re


def getHtml(url):
    # Read the HTML content from url
    page = urllib.urlopen(url)
    html = page.read()
    page.close()
    return html


def getDictionary(html):
    # Match the idiom links; the third group is the idiom text itself
    reg = r'<a href="/cy(\d+)/(\d+)\.html">(.*?)</a>'
    dicList = re.compile(reg).findall(html)
    return dicList


def getItemSite():
    # Manually counted number of listing pages for each initial letter
    itemSite = {}
    itemSite["A"] = 3
    itemSite["B"] = 21
    itemSite["C"] = 19
    itemSite["D"] = 18
    itemSite["E"] = 2
    itemSite["F"] = 14
    itemSite["G"] = 13
    itemSite["H"] = 15
    itemSite["J"] = 23
    itemSite["K"] = 6
    itemSite["L"] = 15
    itemSite["M"] = 12
    itemSite["N"] = 5
    itemSite["O"] = 1
    itemSite["P"] = 6
    itemSite["Q"] = 16
    itemSite["R"] = 8
    itemSite["S"] = 26
    itemSite["T"] = 12
    itemSite["W"] = 13
    itemSite["X"] = 16
    itemSite["Y"] = 35
    itemSite["Z"] = 21
    return itemSite


if __name__ == "__main__":
    dicFile = open("dic.txt", "w+")  # file that collects the idioms
    domainSite = "http://chengyu.itlearner.com/list/"
    itemSite = getItemSite()
    for key, values in itemSite.items():
        for index in range(1, values + 1):
            site = key + "_" + str(index) + ".html"
            dictionary = getDictionary(getHtml(domainSite + site))
            for dic in dictionary:
                dicFile.write(dic[2] + "@cy\n")  # "@cy" suffix tag, used later for word segmentation
        print key + ' idioms crawled'
    dicFile.close()
    print 'All idioms have been crawled'
```
The program saves the idioms to a TXT file and appends a suffix tag to each one. Counting the pages by hand is clumsy: the crawler should discover the next page automatically instead of relying on a precomputed page count. I will improve that when I find time, since the project plus exams keep me busy at the moment.
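The hand-counted page table could be replaced by automatic pagination along these lines. This is only a sketch under two assumptions: listing pages are named `key_index.html` as in the code above, and fetching a nonexistent page raises an error (the real site might instead return an error page, which would need a different stop check). The fetch and parse functions are injected so the loop can be tested offline:

```python
def crawl_letter(key, fetch, parse):
    """Collect idioms for one initial letter without knowing the page count.

    fetch(site) returns the HTML of one listing page, or raises IOError
    when the page does not exist; parse(html) returns the matches on it.
    """
    idioms = []
    index = 1
    while True:
        try:
            html = fetch("%s_%d.html" % (key, index))
        except IOError:      # no such page: we ran past the last one
            break
        found = parse(html)
        if not found:        # an existing but empty page also ends the letter
            break
        idioms.extend(found)
        index += 1
    return idioms


# Offline demonstration with a fake two-page "site" for the letter a.
pages = {"a_1.html": "one two", "a_2.html": "three"}

def fake_fetch(site):
    if site not in pages:
        raise IOError(site)
    return pages[site]

print(crawl_letter("a", fake_fetch, str.split))  # ['one', 'two', 'three']
```

In the real crawler, `fetch` would wrap `getHtml` and `parse` would be `getDictionary`, so `getItemSite()` and its manual counts could be deleted entirely.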
Note: when designing a regular expression, you may find that an apparently correct expression still fails to match. Pay attention to invisible whitespace characters. For example, when parsing:

```html
<div class="avatar_name">
    <a href="/u/kkun/" title="kkun">kkun</a>
</div>
```

you cannot see the whitespace between the first and second lines. Use index = html.find('avatar_name') and print html[index : index + 100] to reveal the hidden characters.
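As a small illustration of the pitfall (the snippet string mirrors the markup above, and the pattern names are mine):

```python
import re

# The <a> tag sits on its own line, so a newline plus indentation
# separates it from the surrounding <div> tags.
html = ('<div class="avatar_name">\n'
        '    <a href="/u/kkun/" title="kkun">kkun</a>\n'
        '</div>')

# A pattern that assumes the tags are adjacent finds nothing:
strict = r'<div class="avatar_name"><a href="(.*?)" title="(.*?)">(.*?)</a></div>'
print(re.findall(strict, html))  # []

# Allowing optional whitespace between the tags makes it match:
loose = r'<div class="avatar_name">\s*<a href="(.*?)" title="(.*?)">(.*?)</a>\s*</div>'
print(re.findall(loose, html))  # [('/u/kkun/', 'kkun', 'kkun')]
```

Inserting `\s*` wherever the page author may have added line breaks or indentation is usually enough; slicing the raw string, as suggested above, is the quickest way to find out where the hidden whitespace actually is.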