Python crawler: collecting 360 search associated words

Last Update:2014-06-16 Source: Internet

Author: User



A video was once made about this crawler, and I have taken the opportunity to write it up here. 360 has evolved since then, so the original method is now slightly buggy; this is explained later. The outline is as follows:

Language: Python 2.7.6

Modules: urllib, urllib2, re, time

Objective: Enter an arbitrary keyword and capture its associated words

Version: w1

Principle: Open the 360 search home page. Before entering a keyword, right-click the page and choose "Inspect Element", then open the "Network" tab and watch the "Name" column. After you enter a keyword, the corresponding request appears in that column. Click it and look only at the "Headers" and "Preview" tabs: "Headers" shows the "Request URL" and the header information (host, proxy, and so on), and "Preview" shows the response. For my input it looks like this:

suggest_so({"query":"technology","result":[{"word":"technology Aesthetics"},{"word":"Technology Court"},{"word":"Science and Technology Department"},{"word":"Science and Technology Management Research"},{"word":"","obdata":"{\"t\":\"video\",\"d\":[2,\"http:\/\/define qhimg.com\/d\/examples\",\"\u9ad8\u79d1\u6280\u5c11\u5973\u55b5\",\"http:\/\/www.360kan.com\/TV\/q4pwa7lrg4lnn.html\",3,12]}"},{"word":"technology daily"},{"word":"Major Advantages and Disadvantages of technological development"},{"word":"super powerful technology"},{"word":"Technology Network"},{"word":"scientific and technological progress and countermeasures"}],"version":""});

Obviously, we only need to capture the "word" values. I forgot to mention the Request URL, which in this case is http://sug.so.360.cn/suggest?callback=suggest_so&encodein=UTF-8&encodeout=UTF-8&format=json&fields=word,obdata&word=%E7%A7%91%E6%8A%80%20 as shown under "Headers". After entering several different keywords, we found that only the final part, "%E7%A7%91%E6%8A%80%20", changes. In other words, everything before it stays the same and can be reused directly, while the last part changes with the input keyword. That part is URL-encoded, and the urllib.quote() method produces this encoding.
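As a rough illustration of the URL construction described above, here is a minimal sketch. It assumes Python 3, where urllib.quote() lives at urllib.parse.quote(); the helper name build_suggest_url and the constant BASE_URL are hypothetical and not part of the original program. The encoded fragment from the text decodes to the Chinese word for "technology" followed by a space.

```python
# Assumption: Python 3. In Python 2.7 the same function is urllib.quote().
from urllib.parse import quote

# Fixed part of the Request URL, copied from the text above.
BASE_URL = (
    "http://sug.so.360.cn/suggest?callback=suggest_so"
    "&encodein=UTF-8&encodeout=UTF-8&format=json"
    "&fields=word,obdata&word="
)


def build_suggest_url(keyword):
    """Return the suggestion URL for a keyword (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of the keyword.
    return BASE_URL + quote(keyword)


# "technology" in Chinese plus a trailing space, as in the captured URL.
encoded = quote("科技 ")
print(encoded)
print(build_suggest_url("科技 "))
```

Only the part after "word=" differs between keywords, which is why the rest of the URL can be kept as a constant.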

Operation: 1. Add header information and read the web page. Related methods: urllib2.Request(), urllib2.urlopen(), urllib2.urlopen().read()

2. Regular expression matching. Method: see the usage of the re module; re.findall() is used here.

The code is as follows:
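The matching step can be tried offline before any request is sent. The sketch below runs on a shortened sample in the shape of the response shown earlier (the sample string is illustrative, cut down from that response). The function names are hypothetical, and the JSON-based variant is an alternative offered here for comparison, not the article's method.

```python
import json
import re

# Shortened sample in the shape of the response shown earlier (illustrative).
sample = (
    'suggest_so({"query":"technology","result":['
    '{"word":"technology daily"},{"word":"Technology Network"}],'
    '"version":""});'
)


def extract_words_regex(text):
    """Collect every "word" value with a non-greedy regular expression."""
    return re.findall(r'"word":"(.*?)"', text)


def extract_words_json(text):
    """Alternative: strip the callback wrapper and parse the JSON payload."""
    payload = text[text.index("(") + 1 : text.rindex(")")]
    return [item["word"] for item in json.loads(payload)["result"]]


print(extract_words_regex(sample))
print(extract_words_json(sample))
```

The regular expression is shorter, but it stops at the first double quote, so a value containing an escaped quote would be cut off; parsing the payload as JSON avoids that at the cost of stripping the callback wrapper first.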

    # coding: utf-8
    import urllib
    import urllib2
    import re
    import time

    # URL-encode the keyword and append it to the fixed part of the URL
    gjc = urllib.quote("tech")
    url = "http://sug.so.360.cn/suggest?callback=suggest_so&encodein=UTF-8&encodeout=UTF-8&format=json&fields=word,obdata&word=" + gjc
    print url

    # Request the page and read the raw response
    req = urllib2.Request(url)
    html = urllib2.urlopen(req).read()
    # Decode the UTF-8 bytes into unicode
    unicodePage = html.decode("UTF-8")

    # Regular expression: findall returns a list of all matches
    ss = re.findall('"word":"(.*?)"', unicodePage)
    for item in ss:
        print item

Result:

If the line unicodePage = html.decode("UTF-8") is left out, the returned values are interspersed with garbled characters. Let us check whether the result is right: open 360 and enter "technology". The result is as follows:

Don't worry about the order of the first and second related words. The result changed on my second request, probably because 360 itself keeps changing. You can try other keywords.

Well, the general framework is in place. This is only an initial version and cannot be used without restrictions; what we need now is to make it run smoothly. What are the problems? The revised version below sends browser-like header information, routes requests through an IP proxy chosen at random from a list, and sleeps for 2 seconds after each keyword:

    # coding: utf-8
    # -------------------
    # Program: crawler that collects 360 search related words
    # Language: python2.7
    # Version: w1
    # Time: 2014-06-14
    # Author: wxx
    # ---------------------
    import urllib
    import urllib2
    import re
    import time
    from random import choice

    # IP proxy list
    iplist = ["14.29.117.36:80", "222.66.115.229:80", "59.46.72.245:8080"]
    ip = choice(iplist)
    # print ip

    # Keyword list, searched in order
    keywords = ["group", "technology", "python"]
    for m in keywords:
        # quote converts m to URL encoding
        gjc = urllib.quote(m)
        url = "http://sug.so.360.cn/suggest?callback=suggest_so&encodein=UTF-8&encodeout=UTF-8&format=json&fields=word,obdata&word=" + gjc
        # Header information
        headers = {
            "GET": url,
            "Host": "sug.so.360.cn",
            "Referer": "http://www.so.com/",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 QIHU 360SE"
        }
        # Use the IP proxy server (the scheme key must be lowercase "http")
        proxy_handler = urllib2.ProxyHandler({'http': 'http://' + ip})
        opener = urllib2.build_opener(proxy_handler)
        urllib2.install_opener(opener)
        req = urllib2.Request(url)
        for key in headers:
            req.add_header(key, headers[key])
        html = urllib2.urlopen(req).read()
        # Decode the UTF-8 bytes into unicode
        unicodePage = html.decode("UTF-8")
        # Regular expression: findall returns a list of all matches
        ss = re.findall('"word":"(.*?)"', unicodePage)
        for item in ss:
            print item
        # Sleep for 2 seconds
        time.sleep(2)
Result:
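For readers on Python 3, where urllib2 was merged into urllib.request, the header and proxy setup could be sketched roughly as follows. This is an assumption-laden adaptation, not the author's code: the helper name build_opener_and_request is hypothetical, the "GET" entry is left out because it is not an HTTP header, and nothing is sent over the network here. The proxy addresses are copied from the listing and may no longer be reachable.

```python
# Assumption: Python 3, where urllib2's classes live in urllib.request.
import urllib.request
from random import choice

# Proxy addresses copied from the listing; they may no longer be reachable.
PROXIES = ["14.29.117.36:80", "222.66.115.229:80", "59.46.72.245:8080"]

HEADERS = {
    "Host": "sug.so.360.cn",
    "Referer": "http://www.so.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 "
                  "QIHU 360SE",
}

URL = (
    "http://sug.so.360.cn/suggest?callback=suggest_so"
    "&encodein=UTF-8&encodeout=UTF-8&format=json"
    "&fields=word,obdata&word=tech"
)


def build_opener_and_request(url, proxy):
    """Return (opener, request) without sending anything (hypothetical helper)."""
    # The key is the lowercase URL scheme that should go through the proxy.
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    opener = urllib.request.build_opener(handler)
    request = urllib.request.Request(url, headers=HEADERS)
    return opener, request


opener, request = build_opener_and_request(URL, choice(PROXIES))
print(request.get_header("Referer"))
# To actually send the request (needs network access and a working proxy):
# html = opener.open(request, timeout=10).read()
```

Calling opener.open() directly, rather than installing the opener globally, keeps the proxy choice local to one request, which makes it easier to pick a different proxy per keyword.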

Optimization considerations for the next version:

1. Allow users to enter keywords themselves instead of defining a keyword list in advance.

2. Press Enter to move on to the next keyword.

3. Save the output to a txt file.

4. When the user enters exit, the program exits.
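One possible shape for these four items is sketched below, assuming Python 3. The function run_session and its parameters read_line, fetch_words and out_path are hypothetical names, and the input source and the word fetcher are passed in so the loop can be tried without a terminal or network access.

```python
import os
import tempfile


def run_session(read_line, fetch_words, out_path):
    """Read keywords until 'exit' and append each result to a text file.

    read_line and fetch_words are injected (hypothetical names) so that
    the loop can run without a terminal or a network connection.
    """
    handled = 0
    with open(out_path, "a", encoding="utf-8") as out:
        while True:
            keyword = read_line().strip()
            if keyword == "exit":      # item 4: typing exit ends the program
                break
            if not keyword:            # bare Enter: ask for the next keyword
                continue
            for word in fetch_words(keyword):
                out.write(word + "\n")  # item 3: save results as plain text
            handled += 1
    return handled


# Dry run with canned input and a stand-in fetcher instead of real requests.
answers = iter(["tech", "", "python", "exit"])
path = os.path.join(tempfile.mkdtemp(), "words.txt")
handled = run_session(
    lambda: next(answers),
    lambda kw: [kw + " news", kw + " daily"],
    path,
)
print(handled)
```

In real use, read_line could be something like lambda: input("keyword: ") and fetch_words a function that requests the suggestion URL and extracts the "word" values.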

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
