There was once a video dedicated to this crawler, and I have since organized the material into this write-up. In the meantime 360 has not stood still and has evolved, so the original method is now slightly buggy; this is explained later. The outline is as follows:
Language: Python 2.7.6
Modules: urllib, urllib2, re, time
Objective: enter an arbitrary keyword and capture its related search words
Version: w1
Principle: Open the 360 search home page. Before entering a keyword, right-click the page and choose "Inspect Element", then open the "Network" panel and watch the "Name" column. After you enter a keyword, the corresponding requests appear below; we only need to look at the "Headers" and "Preview" tabs. "Headers" shows the "Request URL" and the header information (Host, User-Agent, and so on), and "Preview" shows the response. Here is an example from my input:
suggest_so({"query":"technology","result":[{"word":"technology Aesthetics"},{"word":"Technology Court"},{"word":"Science and Technology Department"},{"word":"Science and Technology Management Research"},{"word":"","obdata":"{\"t\":\"video\",\"d\":[2,\"http:\/\/define qhimg.com\/d\/examples\",\"\u9ad8\u79d1\u6280\u5c11\u5973\u55b5\",\"http:\/\/www.360kan.com\/TV\/q4pwa7lrg4lnn.html\",3,12]}"},{"word":"technology daily"},{"word":"Major Advantages and Disadvantages of technological development"},{"word":"super powerful technology"},{"word":"Technology Network"},{"word":"scientific and technological progress and countermeasures"}],"version":""});
Obviously, we only need to capture the "word" values. I forgot to mention one thing: the Request URL is http://sug.so.360.cn/suggest?callback=suggest_so&encodein=UTF-8&encodeout=UTF-8&format=json&fields=word,obdata&word=%E7%A7%91%E6%8A%80%20 and, after trying several keywords, we found that only the final part, "%E7%A7%91%E6%8A%80%20", changes. In other words, everything before it stays the same and can be used directly, while the trailing part varies with the keyword entered. That trailing part is the URL-encoded keyword, which the urllib.quote() method produces.
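As a minimal sketch of that URL construction: the fixed prefix is copied from the Request URL above, and only the keyword is percent-encoded. The names BASE_URL and build_suggest_url are made up for this illustration. The import fallback is an assumption so that the snippet also runs on Python 3, where quote lives in urllib.parse.

```python
# -*- coding: utf-8 -*-
try:
    from urllib import quote          # Python 2, as used in this article
except ImportError:
    from urllib.parse import quote    # Python 3 location of the same function

# Fixed part of the Request URL observed in the browser's Network panel.
BASE_URL = ("http://sug.so.360.cn/suggest?callback=suggest_so"
            "&encodein=UTF-8&encodeout=UTF-8&format=json"
            "&fields=word,obdata&word=")


def build_suggest_url(keyword):
    """Append the percent-encoded keyword to the fixed URL prefix."""
    return BASE_URL + quote(keyword)


# The keyword from the example above, including its trailing space.
print(build_suggest_url("科技 "))
```

The printed URL should end in word=%E7%A7%91%E6%8A%80%20, matching the one captured in the browser.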
Steps: 1. Add header information and read the web page. Related methods: urllib2.Request(), urllib2.urlopen(), urllib2.urlopen().read()
2. Match the words with a regular expression. Method: see the usage of the re module.
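The two steps above can be sketched roughly as follows. The helper names (build_request, fetch_page, extract_words) are made up for this illustration, the User-Agent is a shortened placeholder, and the sample string is a trimmed copy of the response shown earlier with its obdata value simplified. The import fallback is an assumption so that the snippet also runs on Python 3.

```python
try:
    import urllib2 as urlrequest           # Python 2, as used in this article
except ImportError:
    import urllib.request as urlrequest    # Python 3 equivalent
import re

# Header values are illustrative; the User-Agent is a shortened placeholder.
HEADERS = {
    "Host": "sug.so.360.cn",
    "Referer": "http://www.so.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64)",
}


def build_request(url):
    """Step 1: create a request that carries browser-like headers."""
    return urlrequest.Request(url, headers=HEADERS)


def fetch_page(url):
    """Send the request and decode the UTF-8 body (needs network access)."""
    return urlrequest.urlopen(build_request(url)).read().decode("utf-8")


def extract_words(page):
    """Step 2: collect every non-empty "word" value from the response text."""
    return [w for w in re.findall(r'"word":"(.*?)"', page) if w]


# Trimmed sample response, so the regex can be tried without the network.
sample = ('suggest_so({"query":"technology","result":['
          '{"word":"technology Aesthetics"},{"word":"Technology Court"},'
          '{"word":"","obdata":"{}"},{"word":"technology daily"}],'
          '"version":""});')

print(extract_words(sample))
```

Filtering out empty strings is a choice made for this sketch: the entry that carries "obdata" has an empty "word", which the plain findall call would otherwise return as a blank item.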
The code is as follows:
# coding: utf-8
import urllib
import urllib2
import re
import time

gjc = urllib.quote("tech")
url = "http://sug.so.360.cn/suggest?callback=suggest_so&encodein=UTF-8&encodeout=UTF-8&format=json&fields=word,obdata&word=" + gjc
print url
req = urllib2.Request(url)
html = urllib2.urlopen(req).read()
unicodePage = html.decode("UTF-8")
# regular expression: the findall method returns a list
ss = re.findall('"word":\"(.*?)\"', unicodePage)
for item in ss:
    print item
Result:
Without the line unicodePage = html.decode("UTF-8"), the returned values are interspersed with garbled characters. Let's check that the output is right: open 360 and enter "technology". The result is as follows:
Don't worry about the order of the first and second related words: when I sent the request a second time, the result had changed. It may be that 360 itself is changing; you can try other keywords.
Well, the general framework is in place. This is only an initial version and cannot be run without restrictions; what we need next is to make it run smoothly. What are the problems? The revised version below addresses them by sending browser-like request headers, routing requests through a proxy IP, and pausing with time.sleep between requests:
# coding: UTF-8
# -------------------
# program: crawler collecting 360 search related words
# language: python2.7
# version: w1
# Time: 2014-06-14
# Author: wxx
# ---------------------
import urllib
import urllib2
import re
import time
from random import choice

# ip proxy list
iplist = ["14.29.117.36:80", "222.66.115.229:80", "59.46.72.245:8080"]
ip = choice(iplist)
# print ip

# keyword list, searched in order
keywords = ["group", "technology", "python"]
for m in keywords:
    # quote converts m to URL encoding
    gjc = urllib.quote(m)
    url = "http://sug.so.360.cn/suggest?callback=suggest_so&encodein=UTF-8&encodeout=UTF-8&format=json&fields=word,obdata&word=" + gjc
    # header information
    headers = {
        "Host": "sug.so.360.cn",
        "Referer": "http://www.so.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 QIHU 360SE"
    }
    # use the IP proxy server
    proxy_handler = urllib2.ProxyHandler({'http': 'http://' + ip})
    opener = urllib2.build_opener(proxy_handler)
    urllib2.install_opener(opener)
    req = urllib2.Request(url)
    for key in headers:
        req.add_header(key, headers[key])
    html = urllib2.urlopen(req).read()
    # convert the response to unicode
    unicodePage = html.decode("UTF-8")
    # regular expression: the findall method returns a list
    ss = re.findall('"word":\"(.*?)\"', unicodePage)
    for item in ss:
        print item
    # sleep for 2 seconds
    time.sleep(2)
Result:
Optimization considerations for the next version:
1. Let users enter keywords themselves instead of defining a keyword list in advance.
2. Press Enter to move on to the next keyword.
3. Save the output to a txt file.
4. When the user enters exit, the program exits.
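One possible shape for that next version is sketched below. It is not the author's implementation, and it is written for Python 3. The names run_session, read_line, fetch_words and out_path are made up for this illustration; fetch_words stands for whatever function returns the related words for one keyword, for example the crawler code above wrapped in a function.

```python
import io


def run_session(read_line, fetch_words, out_path):
    """Read keywords until "exit"; append each keyword's words to out_path.

    read_line   -- callable returning the next line the user typed
    fetch_words -- callable mapping a keyword to a list of related words
    Returns the number of keywords that were processed.
    """
    count = 0
    with io.open(out_path, "a", encoding="utf-8") as out:
        while True:
            keyword = read_line().strip()
            if keyword == "exit":
                break                  # item 4: the user typed exit
            if not keyword:
                continue               # blank line: ask for the next keyword
            for word in fetch_words(keyword):
                out.write(u"%s\n" % word)   # item 3: save to a txt file
            count += 1
    return count


# Interactive use (my_fetch is a hypothetical stand-in for the crawler):
# run_session(lambda: input("keyword (exit to quit): "), my_fetch, "words.txt")
```

Passing the input and fetch functions in as arguments keeps the loop free of network and console dependencies, so it can be tried with canned data first.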