[Python] Crawling search engine results to collect all second-level domains of a given host

Source: Internet
Author: User

    

0x00 Preface

The day before yesterday, while playing around, I looked up a host's second-level domains by hand through Baidu search and found it tedious: paging through the results one page at a time and checking by eye whether each second-level domain was a duplicate was painful. Since I have been learning regular expressions recently, this made a good practice exercise.

0x01 Code

    

    # coding=utf-8
    # Author: anka9080
    # Python 2 script
    import re
    import urllib2

    url = 'http://www.haosou.com/s?src=360sou_newhome&q=site:tjut.edu.cn&pn=1'
    req = urllib2.Request(url)
    res = urllib2.urlopen(req)
    html = res.read().decode('utf-8')

    # The result-count banner on the page reads "找到相关结果约...个"
    # (about ... related results found); capture the number in between.
    pagestr = re.search(ur'找到相关结果约(.*?)个', html)
    page = pagestr.group(1)

    # Keep only the digits, since the count may contain separators
    formatnum = '0123456789'
    for c in page:
        if c not in formatnum:
            page = page.replace(c, '')
    page = int(page) / 10
    print page  # number of pages of search results

    # Crawl at most the first 5 pages
    if page > 5:
        page = 5

    # Raw string, so the backslashes reach the regex engine unchanged
    pattern = re.compile(r'linkinfo">\s*<cite>(.+?\.tjut\.edu\.cn)')

    newItems = []
    # pn is taken to be 1-based here, matching the first request above
    for p in range(1, page + 1):
        url = 'http://www.haosou.com/s?src=360sou_newhome&q=site:tjut.edu.cn&pn=' + str(p)
        req = urllib2.Request(url)
        res = urllib2.urlopen(req)
        html = res.read().decode('utf-8')
        items = re.findall(pattern, html)
        # Deduplicate
        for item in items:
            if item not in newItems:
                newItems.append(item)

    # Print the deduplicated list of subdomains
    for item in newItems:
        print item
    # print html

The test results are as follows (the run printed the same list of nine domains five times in a row; it is shown once here):

    1330
    www.tjut.edu.cn
    my.tjut.edu.cn
    jw.tjut.edu.cn
    jyzx.tjut.edu.cn
    lib.tjut.edu.cn
    cs.tjut.edu.cn
    yjs.tjut.edu.cn
    mail.tjut.edu.cn
    acm.tjut.edu.cn

  

0x02 Summary

The overall approach is roughly as follows:

First, request the URL with urllib2.Request() and urllib2.urlopen().

Next, read the number of search result pages from the returned HTML.

To save time, if there are more than 5 pages, only the first 5 pages of search results are crawled.

Finally, deduplicate the matches to get the list of second-level domains :)
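The page-count and deduplication steps can be sketched offline, without any network request. This is a minimal illustration in Python 3 (the script in this post targets Python 2), not the original code: the function names `count_pages` and `collect_subdomains`, the rounding-up of the page count, and the sample HTML are all assumptions made for the example.

```python
import re


def count_pages(count_text, per_page=10, cap=5):
    """Turn a result-count string such as '1,330' into a capped page count."""
    digits = re.sub(r'\D', '', count_text)  # drop commas and other non-digits
    if not digits:
        return 0
    total = int(digits)
    pages = -(-total // per_page)  # ceiling division (an assumption of this sketch)
    return min(pages, cap)


def collect_subdomains(pages_html, host):
    """Return subdomains of `host` found in <cite> tags, in first-seen order."""
    pattern = re.compile(r'<cite>\s*([\w.-]+\.' + re.escape(host) + r')')
    seen = []
    for html in pages_html:
        for name in pattern.findall(html):
            if name not in seen:  # skip names already collected
                seen.append(name)
    return seen


# Hypothetical HTML fragments standing in for two result pages
sample_pages = [
    '<p class="linkinfo"><cite>www.tjut.edu.cn/index</cite></p>'
    '<p class="linkinfo"><cite>lib.tjut.edu.cn</cite></p>',
    '<p class="linkinfo"><cite>www.tjut.edu.cn</cite></p>'
    '<p class="linkinfo"><cite>jw.tjut.edu.cn/x</cite></p>',
]

print(count_pages('1,330'))
print(collect_subdomains(sample_pages, 'tjut.edu.cn'))
```

Keeping the parsing separate from the download makes it possible to check the regular expression against saved pages before pointing it at a live search engine.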

The painful part along the way was Python's escape characters. It would be great to have an expert nearby to ask whenever a question like this comes up ~
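One way to ease the escaping trouble, sketched here in Python 3 as a suggestion rather than as what the original script does, is to combine raw strings with `re.escape()` so that the dots in the host name are matched literally. The variable names `loose` and `strict` and the helper `is_subdomain` are made up for this example.

```python
import re

host = 'tjut.edu.cn'

# Unescaped: each '.' in the host matches any character,
# so look-alike names can slip through.
loose = re.compile(r'[\w-]+\.' + host + r'$')

# Escaped: re.escape() turns each '.' into a literal '\.'.
strict = re.compile(r'[\w-]+\.' + re.escape(host) + r'$')


def is_subdomain(name, pattern):
    """Return True if `name` matches `pattern` from its first character."""
    return pattern.match(name) is not None


print(re.escape(host))
print(is_subdomain('www.tjut.edu.cn', strict))
print(is_subdomain('www.tjutXeduYcn', loose))
print(is_subdomain('www.tjutXeduYcn', strict))
```

Building the pattern from a `host` variable this way also avoids hand-writing `\.` for every new target domain.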

Later, I plan to use http://dns.aizhan.com/ to look up the IP and side-station (other sites on the same server) information for the results directly.
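As a rough stand-in for the IP part of that plan, the collected names could be resolved through the local DNS resolver. This Python 3 sketch does not use dns.aizhan.com and does not cover side-station lookups; the function name `resolve_hosts` and its `resolver` parameter are assumptions introduced for illustration.

```python
import socket


def resolve_hosts(hosts, resolver=socket.gethostbyname):
    """Map each host name to an IPv4 address string, or None if lookup fails."""
    results = {}
    for host in hosts:
        try:
            results[host] = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results


def fake_resolver(host):
    """Hypothetical resolver used to try the function without network access."""
    table = {'www.example.test': '192.0.2.10'}
    if host not in table:
        raise socket.gaierror('name not found')
    return table[host]


print(resolve_hosts(['www.example.test', 'nope.example.test'], fake_resolver))
```

Passing the resolver in as a parameter keeps the function testable offline; calling `resolve_hosts(names)` with the default performs real DNS lookups.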

Text in the image is quoted from: http://developer.51cto.com/art/201403/431104.htm (the original blog link is no longer valid)
