Today I tried writing a web crawler in Python. The goal is to visit a website, pick out the information of interest, and save it to an Excel file in a fixed layout.
The code mainly uses the following Python features. Since I am not yet familiar with Python, I have pasted the code below as well.
1. Open a Web page with a URL
import urllib2
# string_full_link holds the full URL of the page to fetch
data = urllib2.urlopen(string_full_link).read().decode('utf8')
print data
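The snippet above is Python 2, where `urllib2` is available. In Python 3 the same call lives in `urllib.request`. The following is only a minimal sketch of the equivalent fetch; the function name `fetch_page` and the `encoding` and `timeout` defaults are my own assumptions, not part of the original code.

```python
import urllib.request


def fetch_page(url, encoding="utf8", timeout=10):
    """Download a page and return its body as decoded text.

    fetch_page is a hypothetical helper name; the timeout value is an
    arbitrary choice so that a stalled server does not block forever.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw_bytes = response.read()       # body as bytes
    return raw_bytes.decode(encoding)     # bytes -> str
```

Calling `fetch_page(string_full_link)` would then play the role of the `data = ...` line above.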
2. Match with regular expressions
import re
# Matching plain ASCII text
# \S* matches a run of non-whitespace characters, i.e. the attribute value
reg = """a href=\S* target='_blank' title=\S*"""
diclist = re.compile(reg).findall(data)
print diclist
# Matching Chinese text: write the pattern with the corresponding Unicode escapes
reg = u"\u5730\u5740\S*"  # \u5730\u5740 is "address" in Chinese
addrlist = re.compile(reg).findall(sub_data)
print addrlist
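To make the two patterns easier to try out, here is a small Python 3 sketch that runs them against a made-up page fragment. The sample text, the variable names, and the added capture groups are assumptions for illustration; the real page will look different. In Python 3 every `str` is already Unicode, so the `u` prefix is not needed.

```python
import re

# Hypothetical page fragment, invented for this example.
sample_html = (
    "<a href=/shop/1 target='_blank' title=Alpha >Alpha</a>\n"
    "地址:Central,HongKong\n"
)

# Parentheses capture only the attribute values instead of the whole match.
link_pattern = re.compile(r"a href=(\S*) target='_blank' title=(\S*)")
links = link_pattern.findall(sample_html)

# The re module understands \uXXXX escapes itself, so a raw string works here.
addr_pattern = re.compile(r"\u5730\u5740\S*")
addresses = addr_pattern.findall(sample_html)

print(links)
print(addresses)
```

With capture groups, `findall` returns one tuple per link, which is convenient when each value later goes into its own spreadsheet column.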
3. Write data to Excel file
import xlrd  # xlrd reads .xls files; only xlwt is needed for writing
import xlwt
file = xlwt.Workbook()
table = file.add_sheet('HK', cell_overwrite_ok=True)
# index is the row number; name, addr and tel are the extracted values
print index, name, addr, tel
table.write(index, 0, name)
table.write(index, 1, addr)
table.write(index, 2, tel)
file.save("D:\\test.xls")
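If `xlwt` is not installed, one alternative (a different technique from the one above, not what the original code does) is to write a CSV file with the standard library, which Excel can open directly. This is a rough Python 3 sketch; the function name `save_rows` and the header labels are assumptions.

```python
import csv


def save_rows(path, rows):
    """Write (name, addr, tel) rows to a CSV file that Excel can open.

    save_rows is a hypothetical helper. The utf-8-sig encoding adds a
    byte order mark, which helps Excel detect UTF-8 and show Chinese text.
    """
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "addr", "tel"])  # header row (assumed labels)
        writer.writerows(rows)                    # one line per record
```

For example, `save_rows("D:\\test.csv", [(name, addr, tel)])` would write a header line followed by one record.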
First web crawler written using Python