0x00 Preface
The day before yesterday, while playing around, I was collecting a host's subdomains through Baidu search and found it a real hassle: paging through the results one page at a time
and manually checking whether each subdomain was a duplicate was painful enough. I happened to be learning regular expressions recently, so this was a good chance to practice.
0x01 Code
# coding=utf-8
# Author: anka9080
import urllib2
import re

# Fetch the first results page to read the total number of results.
url = 'http://www.haosou.com/s?src=360sou_newhome&q=site:tjut.edu.cn&pn=1'
req = urllib2.Request(url)
res = urllib2.urlopen(req)
html = res.read().decode('utf-8')
# The results page shows the Chinese text for "found about N related results".
pagestr = re.search(ur'找到相关结果约(.*?)个', html)
page = pagestr.group(1)
# Keep only the digits (the count may contain separators such as commas).
formatnum = '0123456789'
for c in page:
    if c not in formatnum:
        page = page.replace(c, '')
page = int(page) / 10
print page
# Number of result pages to crawl, capped at 5
if page > 5:
    page = 5
newItems = []
# Capture the host name that follows the linkinfo/cite markup.
pattern = re.compile(r'linkinfo\"\>\<cite\>(.+?\.tjut\.edu\.cn)')
for p in range(0, page):
    # str(p) appends the page number itself, not the literal letter 'p'.
    url = 'http://www.haosou.com/s?src=360sou_newhome&q=site:tjut.edu.cn&pn=' + str(p)
    req = urllib2.Request(url)
    res = urllib2.urlopen(req)
    html = res.read().decode('utf-8')
    items = re.findall(pattern, html)
    # De-duplicate
    for item in items:
        if item not in newItems:
            newItems.append(item)
# Print the de-duplicated list of subdomains
for item in newItems:
    print item
# print html
The test results are as follows:
1330www.tjut.edu.cnmy.tjut.edu.cnjw.tjut.edu.cnjyzx.tjut.edu.cnlib.tjut.edu.cncs.tjut.edu.cnyjs.tjut.edu.cnmail.tjut.edu . cnacm.tjut.edu.cnwww.tjut.edu.cnmy.tjut.edu.cnjw.tjut.edu.cnjyzx.tjut.edu.cnlib.tjut.edu.cncs.tjut.edu.cnyjs.tjut.edu.cn Mail.tjut.edu.cnacm.tjut.edu.cnwww.tjut.edu.cnmy.tjut.edu.cnjw.tjut.edu.cnjyzx.tjut.edu.cnlib.tjut.edu.cncs.tjut.edu.cnyj S.tjut.edu.cnmail.tjut.edu.cnacm.tjut.edu.cnwww.tjut.edu.cnmy.tjut.edu.cnjw.tjut.edu.cnjyzx.tjut.edu.cnlib.tjut.edu.cncs . tjut.edu.cnyjs.tjut.edu.cnmail.tjut.edu.cnacm.tjut.edu.cnwww.tjut.edu.cnmy.tjut.edu.cnjw.tjut.edu.cnjyzx.tjut.edu.cnlib . tjut.edu.cncs.tjut.edu.cnyjs.tjut.edu.cnmail.tjut.edu.cnacm.tjut.edu.cn
0x02 Summary
The general approach is:
First, access the URL through urllib2.Request() and urllib2.urlopen(),
and get the number of search result pages from the returned page.
To improve efficiency, if there are more than 5 pages, only the first 5 pages of search results are crawled.
Then de-duplicate the matches to get the list of subdomains :)
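The steps above can be tried without touching the network. Below is a minimal sketch, written for Python 3 rather than the Python 2 of the script, that runs the count parsing, the 5-page cap, the regex extraction and the de-duplication on made-up input. The helper names (pages_to_crawl, extract_subdomains, dedupe), the count string "13,305" and the sample markup are all assumptions for illustration, not real haosou responses.

```python
import re

RESULTS_PER_PAGE = 10  # the script divides the result count by 10
MAX_PAGES = 5          # crawl at most the first 5 pages


def pages_to_crawl(count_text):
    """Turn a result-count string such as '13,305' into a capped page count."""
    digits = re.sub(r"\D", "", count_text)  # drop commas and other non-digits
    return min(int(digits) // RESULTS_PER_PAGE, MAX_PAGES)


def extract_subdomains(html, domain):
    """Return the hosts ending in .<domain> that follow the linkinfo/cite markup."""
    pattern = re.compile(r'linkinfo"><cite>(.+?\.' + re.escape(domain) + ")")
    return pattern.findall(html)


def dedupe(items):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


# Hypothetical sample markup standing in for two downloaded result pages.
sample_pages = [
    '<p class="linkinfo"><cite>www.tjut.edu.cn/index</cite></p>'
    '<p class="linkinfo"><cite>lib.tjut.edu.cn</cite></p>',
    '<p class="linkinfo"><cite>www.tjut.edu.cn</cite></p>'
    '<p class="linkinfo"><cite>jw.tjut.edu.cn/news</cite></p>',
]

found = []
for html in sample_pages[:pages_to_crawl("13,305")]:
    found.extend(extract_subdomains(html, "tjut.edu.cn"))
subdomains = dedupe(found)
print(subdomains)
```

Using dict.fromkeys for the de-duplication keeps the order in which hosts were first seen, the same result as the "if item not in newItems" loop, without rescanning the list for every item.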
The most painful part along the way was Python's escape characters. It would be great to have an expert nearby to ask whenever a question comes up ~
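On the escaping pain: as far as Python's re module is concerned, the quote and angle brackets in the pattern are ordinary characters, so only the dots really need a backslash. The following is a small sketch (Python 3, with a made-up snippet of markup) that compares the escaped and unescaped forms and shows re.escape, which builds the escaped text for a literal string.

```python
import re

# The pattern as written in the script, with '"', '>' and '<' escaped.
verbose = re.compile(r'linkinfo\"\>\<cite\>(.+?\.tjut\.edu\.cn)')
# The same pattern with only the dots escaped.
minimal = re.compile(r'linkinfo"><cite>(.+?\.tjut\.edu\.cn)')

# Hypothetical snippet of a results page.
snippet = 'class="linkinfo"><cite>mail.tjut.edu.cn</cite>'
same = verbose.findall(snippet) == minimal.findall(snippet)

# An unescaped dot matches any character, so this wrongly accepts "tjutXedu.cn".
loose = re.search(r"tjut.edu\.cn", "tjutXedu.cn") is not None

# A raw string keeps the backslash as typed; without the r prefix it must be doubled.
raw_equals_doubled = r"\." == "\\."

# re.escape produces the escaped form of a literal string.
escaped = re.escape("tjut.edu.cn")
print(same, loose, raw_equals_doubled, escaped)
```

Writing patterns as raw strings (the r prefix) means each backslash only has to satisfy the regex engine, not the string literal rules as well.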
Later I plan to use the query results from http://dns.aizhan.com/ to obtain the IP address and the other sites hosted on the same server directly
The text in the image is quoted from: http://developer.51cto.com/art/201403/431104.htm (the original blog link is dead)
[Python] Crawling search engine results to get all subdomains of a specified host