Premise:
python3.4
Windows
Purpose: Search for articles through the Sogou search interface and export the titles and links to an Excel sheet.
Notes: Requires the xlsxwriter module. The program was written on 2017/7/11, so if it stops working later, the site has probably changed. The program is fairly simple: a little over 40 lines once the comments are removed.
Business:
Approach: open the initial URL -> get the titles and links -> change the page number and repeat step two -> write the collected titles and links to Excel.
The first step in writing a crawler is to go through the operation by hand once (a small aside).
Open the URL mentioned above and search for something, for example "image recognition". The URL changes to "" (the parts marked in red are the important parameters): type=1 searches official accounts and is ignored for now, while the code below uses type=2 to search articles; query= is the search keyword, already URL-encoded; and there is a hidden parameter, page=1.
When you jump to the second page, you will see "".
So the URL can be built like this:
    url = 'http://weixin.sogou.com/weixin?type=2&query=' + search + '&page=' + str(page)
search is the keyword being searched for; encode it with quote() before inserting it into the URL:
    search = urllib.request.quote(search)
page is the loop variable:
    for page in range(1, pagenum + 1):
        url = 'http://weixin.sogou.com/weixin?type=2&query=' + search + '&page=' + str(page)
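The URL construction above can be wrapped in a small helper. This is only a sketch: build_url and build_urls are hypothetical names, and it calls quote from urllib.parse, which is the same function that urllib.request.quote refers to.

```python
from urllib.parse import quote

# Endpoint taken from the URL shown above.
BASE_URL = 'http://weixin.sogou.com/weixin'


def build_url(keyword, page):
    """Return the search URL for one result page (hypothetical helper)."""
    # quote() percent-encodes the keyword so it is safe inside a query string.
    return '{0}?type=2&query={1}&page={2}'.format(BASE_URL, quote(keyword), page)


def build_urls(keyword, pagenum):
    """Return the URLs for pages 1..pagenum."""
    return [build_url(keyword, page) for page in range(1, pagenum + 1)]
```

Calling build_urls('image recognition', 3) gives three URLs that differ only in the page parameter.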
With the full URL in hand, request it and read the response (create an opener object and add a header to it):
    import urllib.request
    header = ('User-Agent', 'Mozilla/5.0')
    opener = urllib.request.build_opener()
    opener.addheaders = [header]
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode()
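An alternative to installing a global opener is to attach the header to a single Request object. The sketch below assumes the page is UTF-8 encoded (the same default that decode() uses above); build_request, fetch, and the 10-second timeout are illustrative choices, not part of the original program.

```python
import urllib.request


def build_request(url):
    """Create a request carrying a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})


def fetch(url, timeout=10):
    """Download a page and return its text (assumes UTF-8 content)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode('utf-8')
```

The header applies only to requests built this way, so other code that uses urllib in the same process is left untouched.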
Once the page content has been fetched, use a regular expression to extract the relevant data:
    import re
    finddata = re.compile('<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>').findall(data)
    # finddata is a list of (link, title) tuples: [('', ''), ('', '')]
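To see what the pattern captures, it can be run against a small hand-written snippet. The HTML below is an assumed sample modeled on the pattern itself, not a real response from the site.

```python
import re

PATTERN = re.compile(
    r'<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>'
)

# Assumed sample markup; the real page may differ.
sample = (
    '<a target="_blank" href="http://example.com/s?a=1&amp;b=2" '
    'id="x" uigs="article_title_0">'
    '<em><!--red_beg-->Image<!--red_end--></em> recognition</a>'
)

# Each match is a (link, title) tuple, still containing the raw markup.
pairs = PATTERN.findall(sample)
```

The captured link still contains "&amp;" and the title still contains the highlight markup, which is why a cleanup step is needed afterwards.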
The data captured by the regular expression contains noise: a stray 'amp;' in the links and unrelated markup in the titles ('<em><...><....></em>'). Both are removed with replace():
    title = title.replace('<em><!--red_beg-->', '')
    title = title.replace('<!--red_end--></em>', '')
    link = link.replace('amp;', '')
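A slightly more general cleanup is sketched below. It swaps the replace() calls for a different technique: html.unescape decodes entities such as "&amp;", and a regular expression strips the comment and em tags. clean_title and clean_link are hypothetical helper names.

```python
import html
import re

# Matches HTML comments such as <!--red_beg--> and opening/closing <em> tags.
MARKUP = re.compile(r'<!--.*?-->|</?em>')


def clean_title(raw):
    """Strip highlight markup from a title and decode HTML entities."""
    return html.unescape(MARKUP.sub('', raw))


def clean_link(raw):
    """Decode HTML entities in a link, e.g. '&amp;' becomes '&'."""
    return html.unescape(raw)
```

For links, the result matches link.replace('amp;', '') whenever the only entity present is "&amp;".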
Save the cleaned title and link in a list (link first, then title):
    title_link.append(link)
    title_link.append(title)
The titles and links from the search are now collected; the next step is to write them to Excel.
First create the Excel file:
    import xlsxwriter
    workbook = xlsxwriter.Workbook(search + '.xlsx')
    worksheet = workbook.add_worksheet()
Write the data from title_link to Excel:
    for i in range(0, len(title_link), 2):
        worksheet.write('A' + str(i + 1), title_link[i + 1])  # title
        worksheet.write('C' + str(i + 1), title_link[i])      # link
    workbook.close()
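The index arithmetic in that loop is easy to get wrong, so here is a sketch that separates it from the Excel calls. cell_plan is a hypothetical helper, and unlike the loop above it packs results into consecutive rows instead of every other row.

```python
def cell_plan(title_link, first_row=1):
    """Yield (cell, value) pairs for a flat [link, title, link, title, ...] list."""
    for i in range(0, len(title_link), 2):
        row = first_row + i // 2                 # one spreadsheet row per article
        yield 'A' + str(row), title_link[i + 1]  # title goes in column A
        yield 'C' + str(row), title_link[i]      # link goes in column C
```

Each pair can be passed straight to worksheet.write(cell, value); using first_row=2 leaves row 1 free for a header.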
Full code:
    """
    python3.4 + Windows
    Feather van -2017/7/11-
    Searches for articles and saves their titles and links to Excel
    10-second delay per page to avoid being rate-limited
    import urllib.request, xlsxwriter, re, time
    """
    import re
    import time
    import urllib.request

    import xlsxwriter

    search = str(input("search article: "))
    pagenum = int(input('search page: '))
    # Name the workbook after the raw keyword, before it is URL-encoded
    workbook = xlsxwriter.Workbook(search + '.xlsx')
    search = urllib.request.quote(search)
    title_link = []

    # Send a browser-like User-Agent with every request
    header = ('User-Agent', 'Mozilla/5.0')
    opener = urllib.request.build_opener()
    opener.addheaders = [header]
    urllib.request.install_opener(opener)

    for page in range(1, pagenum + 1):
        url = 'http://weixin.sogou.com/weixin?type=2&query=' + search + '&page=' + str(page)
        data = urllib.request.urlopen(url).read().decode()
        finddata = re.compile('<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>').findall(data)
        # finddata is a list of (link, title) tuples: [('', ''), ('', '')]
        for i in range(len(finddata)):
            title = finddata[i][1]
            title = title.replace('<em><!--red_beg-->', '')
            title = title.replace('<!--red_end--></em>', '')
            # The title may contain curly quotes
            title = title.replace('“', '"')
            title = title.replace('”', '"')
            link = finddata[i][0]
            link = link.replace('amp;', '')
            title_link.append(link)
            title_link.append(title)
        print('Page ' + str(page))
        time.sleep(10)

    worksheet = workbook.add_worksheet()
    # The column widths are not legible in the source; 70 and 100 are assumed values
    worksheet.set_column('A:A', 70)
    worksheet.set_column('C:C', 100)
    bold = workbook.add_format({'bold': True})
    worksheet.write('A1', 'title', bold)
    worksheet.write('C1', 'link', bold)
    for i in range(0, len(title_link), 2):
        # Start at row 2 so the header row is not overwritten
        worksheet.write('A' + str(i + 2), title_link[i + 1])
        worksheet.write('C' + str(i + 2), title_link[i])
    # The file is only written to disk when the workbook is closed
    workbook.close()
    print('Import Excel finished!')