Crawling WeChat Articles with Python 3

Source: Internet
Author: User
Prerequisites:

python3.4

Windows

Purpose: Search for related WeChat articles through the Sogou search interface and export the titles and links to an Excel spreadsheet.

Description: Requires the XlsxWriter module. The program was written on 2017/7/11; the site may have changed since then, in which case the program will no longer work. The program is fairly simple — about 40 lines excluding comments.

Approach:

Idea: open the initial URL → extract the titles and links from the page → loop over pages, repeating the second step → write the collected titles and links to Excel.

The first step in writing the crawler is to run the search manually once (a brief aside).

Open the Sogou WeChat search page and enter a keyword, for example "image recognition". After searching, the URL changes to one of the form weixin.sogou.com/weixin?type=...&query=...&page=.... The important parameters: type=2 searches for articles (type=1 searches for official accounts), query holds the search keyword (already URL-encoded), and there is a page parameter, hidden as page=1 on the first page.

Jump to the second page and the URL shows page=2.

So the URL can be built as:

url = 'http://weixin.sogou.com/weixin?type=2&query=' + search + '&page=' + str(page)

search is the keyword being searched for; URL-encode it with quote():

search = urllib.request.quote(search)
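As a quick sanity check on the encoding step — a minimal sketch using a made-up keyword; quote() percent-encodes the UTF-8 bytes of the keyword:

```python
import urllib.request

# quote() percent-encodes the UTF-8 bytes so the keyword is safe inside a URL
search = urllib.request.quote('图像识别')  # "image recognition", an example keyword
print(search)  # %E5%9B%BE%E5%83%8F%E8%AF%86%E5%88%AB
```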

page drives the loop over result pages:

for page in range(1, pagenum + 1):
    url = 'http://weixin.sogou.com/weixin?type=2&query=' + search + '&page=' + str(page)
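The same query string can also be built with urllib.parse.urlencode, which encodes every parameter in one call — an alternative sketch, not what the original script does:

```python
from urllib.parse import urlencode

# urlencode() percent-encodes each value and joins the parameters with '&'
base = 'http://weixin.sogou.com/weixin'
urls = [base + '?' + urlencode({'type': 2, 'query': '图像识别', 'page': page})
        for page in range(1, 3)]
print(urls[0])
# http://weixin.sogou.com/weixin?type=2&query=%E5%9B%BE%E5%83%8F%E8%AF%86%E5%88%AB&page=1
```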

With the full URL in hand, access it and fetch the data (create an opener object and add a User-Agent header):

import urllib.request
header = ('User-Agent', 'Mozilla/5.0')
opener = urllib.request.build_opener()
opener.addheaders = [header]
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read().decode()
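Installing a global opener works, but the header can also be attached per request with urllib.request.Request — an equivalent sketch (the URL below is just the pattern from above with a dummy keyword):

```python
import urllib.request

url = 'http://weixin.sogou.com/weixin?type=2&query=test&page=1'
# Attach the User-Agent to this one request instead of installing a global opener
# (note: urllib normalizes the stored header key to 'User-agent')
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
print(req.headers)  # {'User-agent': 'Mozilla/5.0'}
# data = urllib.request.urlopen(req).read().decode()  # the actual fetch
```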

With the page content fetched, use a regular expression to extract the relevant data:

import re
finddata = re.compile('<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>').findall(data)
# finddata = [('link', 'title'), ('link', 'title'), ...]
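To see what the two capture groups return, here is the same pattern applied to a hand-made fragment shaped like a Sogou result entry (mock data, not real site output):

```python
import re

# A mock fragment in the same shape as a Sogou search result entry
data = ('<a target="_blank" href="http://mp.weixin.qq.com/s?src=1&amp;id=42" '
        'uigs="article_title_0"><em><!--red_beg-->Image<!--red_end--></em> recognition</a>')
pattern = re.compile('<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>')
finddata = pattern.findall(data)
print(finddata)
# [('http://mp.weixin.qq.com/s?src=1&amp;id=42',
#   '<em><!--red_beg-->Image<!--red_end--></em> recognition')]
```

The first group grabs the link (still carrying the `&amp;` entity) and the second grabs the title (still wrapped in highlight markup) — exactly the noise the next step removes.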

The data extracted by the regex contains noise: links carry a stray 'amp;' (left over from the &amp; entity) and titles carry unrelated markup ('<em><!--red_beg-->...<!--red_end--></em>'). Both are removed with replace():

title = title.replace('<em><!--red_beg-->', '')
title = title.replace('<!--red_end--></em>', '')
link = link.replace('amp;', '')
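The stray 'amp;' appears because the raw HTML contains &amp; entities; html.unescape() from the standard library decodes these (and the &ldquo;/&rdquo; entities that show up in titles) in one pass — an alternative sketch on mock data:

```python
import html

# Mock values in the shape the regex returns
link = 'http://mp.weixin.qq.com/s?src=1&amp;timestamp=2'
title = '<em><!--red_beg-->Image<!--red_end--></em> &ldquo;AI&rdquo;'

# unescape() decodes every HTML entity, so no per-entity replace() calls are needed
link = html.unescape(link)
title = html.unescape(title.replace('<em><!--red_beg-->', '').replace('<!--red_end--></em>', ''))
print(link)   # http://mp.weixin.qq.com/s?src=1&timestamp=2
print(title)  # Image “AI”
```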

Save the processed title and link in the list

title_link.append(link)
title_link.append(title)
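Because the list interleaves link, title, link, title, …, the pairs can be recovered later by slicing — a small self-contained check on sample data:

```python
# Build the interleaved list the same way the crawler does, from sample results
title_link = []
for link, title in [('http://example.com/a', 'Article A'),
                    ('http://example.com/b', 'Article B')]:
    title_link.append(link)
    title_link.append(title)

# Even indices hold links, odd indices hold the matching titles
pairs = list(zip(title_link[0::2], title_link[1::2]))
print(pairs)  # [('http://example.com/a', 'Article A'), ('http://example.com/b', 'Article B')]
```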

That gives the searched titles and links; next, import them into Excel.

Create the Excel workbook first:

import xlsxwriter
workbook = xlsxwriter.Workbook(search + '.xlsx')
worksheet = workbook.add_worksheet()

Import the data from title_link into Excel:

for i in range(0, len(title_link), 2):
    # i // 2 + 1 converts the list index to a 1-based spreadsheet row
    worksheet.write('A' + str(i // 2 + 1), title_link[i + 1])
    worksheet.write('C' + str(i // 2 + 1), title_link[i])
workbook.close()
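The same index arithmetic can be sanity-checked against plain CSV using only the standard library — a sketch on sample data (the file name is made up):

```python
import csv

# Sample data in the same interleaved shape: link, title, link, title, ...
title_link = ['http://example.com/a', 'Article A', 'http://example.com/b', 'Article B']

with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'link'])
    # Step through the even (link) indices; title_link[i + 1] is the matching title
    for i in range(0, len(title_link), 2):
        writer.writerow([title_link[i + 1], title_link[i]])
```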

Full code:

'''
python3.4 + Windows
Feather van - 2017/7/11
Searches Sogou for WeChat articles and saves titles and links to Excel.
A 10-second delay per page helps avoid being rate-limited.
Modules used: urllib.request, xlsxwriter, re, time
'''
import urllib.request
import re
import time
import xlsxwriter

search = str(input('Search articles: '))
pagenum = int(input('Number of pages: '))
workbook = xlsxwriter.Workbook(search + '.xlsx')
search = urllib.request.quote(search)
title_link = []
for page in range(1, pagenum + 1):
    url = 'http://weixin.sogou.com/weixin?type=2&query=' + search + '&page=' + str(page)
    header = ('User-Agent', 'Mozilla/5.0')
    opener = urllib.request.build_opener()
    opener.addheaders = [header]
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode()
    finddata = re.compile('<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>').findall(data)
    # finddata = [('link', 'title'), ('link', 'title'), ...]
    for i in range(len(finddata)):
        title = finddata[i][1]
        title = title.replace('<em><!--red_beg-->', '')
        title = title.replace('<!--red_end--></em>', '')
        # The title may contain quotation-mark entities
        title = title.replace('&ldquo;', '"')
        title = title.replace('&rdquo;', '"')
        link = finddata[i][0]
        link = link.replace('amp;', '')
        title_link.append(link)
        title_link.append(title)
    print('Page ' + str(page) + ' done')
    time.sleep(10)
worksheet = workbook.add_worksheet()
worksheet.set_column('A:A', 70)
worksheet.set_column('C:C', 70)
bold = workbook.add_format({'bold': True})
worksheet.write('A1', 'title', bold)
worksheet.write('C1', 'link', bold)
for i in range(0, len(title_link), 2):
    # Data starts at row 2, below the bold header row
    worksheet.write('A' + str(i // 2 + 2), title_link[i + 1])
    worksheet.write('C' + str(i // 2 + 2), title_link[i])
workbook.close()
print('Import to Excel finished!')
