Python crawler to collect 360 Search suggestion keywords

Source: Internet
Author: User
Tags: keyword, list

There was once a video devoted to this crawler, and it was quite good. I have since sorted the material out, and by now 360 is no longer naive and has evolved, so using the original method is somewhat buggy; this will be explained later. The outline is as follows:

Language: python2.7.6

Module: urllib, urllib2, re, time

Objective: To enter an arbitrary keyword and capture its search suggestion words

Version: w1

Principle: Open the 360 search home page and search for "Taobao". Before entering a keyword, right-click the page and choose "Inspect Element", then open the "Network" tab. After entering the keyword, the corresponding request appears in the "Name" column; we only need to look at its "Headers" and "Preview" tabs. "Headers" shows the "Request URL" and the header information (Host, User-Agent, and so on), while "Preview" shows the response. An example for my input:

```
suggest_so({"query": "technology", "result": [
    {"word": "Technology Aesthetics"},
    {"word": "Technology Court"},
    {"word": "Science and Technology Department"},
    {"word": "Science and Technology Management Research"},
    {"word": "", "obdata": "{\"t\":\"video\",\"d\":[2,\"http:\\/\\/qhimg.com\\/d\\/examples\",\"\\u9ad8\\u79d1\\u6280\\u5c11\\u5973\\u55b5\",\"http:\\/\\/www.360kan.com\\/TV\\/q4pwa7lrg4lnn.html\",3,12]}"},
    {"word": "Technology Daily"},
    {"word": "Major Advantages and Disadvantages of Technological Development"},
    {"word": "Super Powerful Technology"},
    {"word": "Technology Network"},
    {"word": "Scientific and Technological Progress and Countermeasures"}
], "version": ""});
```

Obviously, we just need to capture the word values inside. I forgot to mention: the Request URL contains a link like http://sug.so.360.cn/suggest?callback=suggest_so&encodein=utf-8&encodeout=utf-8&format=json&fields=word,obdata&word=%E7%A7%91%E6%8A%80. After entering keywords many times, we found that only the trailing "%E7%A7%91%E6%8A%80" part changes. That is to say, the front part stays the same and can be used directly, while the tail varies with the input keyword. It is the URL encoding of the keyword, which can be produced with the urllib.quote() method.
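The percent-encoded tail is easy to reproduce with the standard library. A minimal sketch, written with Python 3's urllib.parse for convenience (Python 2's urllib.quote behaves the same way on a UTF-8 byte string):

```python
from urllib.parse import quote

# The fixed front part of the Request URL
base = ("http://sug.so.360.cn/suggest?callback=suggest_so&encodein=utf-8"
        "&encodeout=utf-8&format=json&fields=word,obdata&word=")
# Percent-encode the keyword "科技" ("technology") as UTF-8,
# exactly the tail we observed in the Request URL
encoded = quote("科技")
print(encoded)         # %E7%A7%91%E6%8A%80
print(base + encoded)  # the full suggestion URL
```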

Steps: 1. Add header information and read the web page. Related methods: urllib2.Request(), urllib2.urlopen(), and urllib2.urlopen().read().
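In Python 3 this step uses urllib.request instead of urllib2; a small sketch of building the request object with a header attached (no network call is made here, we only inspect what was built):

```python
import urllib.request

url = "http://sug.so.360.cn/suggest?callback=suggest_so&word=%E7%A7%91%E6%8A%80"
# Build the request and attach a User-Agent header, as in step 1;
# urllib.request.urlopen(req).read() would then fetch the raw bytes
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
print(req.get_full_url())
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```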

2. Match with a regular expression. Related method: the re module's findall().
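The extraction step can be tried against a canned response first; the pattern below is the same non-greedy idea used in the crawler code, and the sample string is made up for illustration in the shape of 360's response:

```python
import re

# A made-up suggest_so response fragment, same shape as 360's JSONP
page = ('suggest_so({"query": "tech", "result": '
        '[{"word": "tech news"}, {"word": "tech daily"}], "version": ""});')
# Non-greedy capture of everything between "word": " and the closing quote
words = re.findall(r'"word": "(.*?)"', page)
print(words)  # ['tech news', 'tech daily']
```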

The Code is as follows:

```python
# coding: utf-8
import urllib
import urllib2
import re
import time

gjc = urllib.quote("tech")
url = ("http://sug.so.360.cn/suggest?callback=suggest_so&encodein=utf-8"
       "&encodeout=utf-8&format=json&fields=word,obdata&word=" + gjc)
print url
req = urllib2.Request(url)
html = urllib2.urlopen(req).read()
unicodePage = html.decode("utf-8")
# regular expression; the findall method returns a list of matches
ss = re.findall('"word":"(.*?)"', unicodePage)
for item in ss:
    print item
```
Result:

Without the unicodePage = html.decode("UTF-8") step, the printed values come out interspersed with garbled characters. Let's verify that the results are right: open 360 search and enter "technology". The result is as follows:
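The decode step itself is easy to demonstrate: urlopen().read() returns raw bytes, 360's bytes are UTF-8, and printing them undecoded is what produces the mojibake. A small illustration:

```python
# The UTF-8 bytes of the keyword "科技" ("technology"),
# the same kind of raw data urlopen().read() returns
raw = b"\xe7\xa7\x91\xe6\x8a\x80"
text = raw.decode("utf-8")
print(text)  # 科技
```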

Don't worry about the order of the first and second suggestion words; the results of my repeated requests changed between runs, probably because 360's suggestions themselves keep changing. You can try other keywords as well.

Well, the general framework has been implemented. This is an initial version and cannot be used as-is without restrictions; what remains is polishing. What are the problems? The keyword is hard-coded, the request carries no header or proxy disguise, and requests follow one another too quickly, so we pause between them. The improved version below adds a keyword list, header information, a randomly chosen IP proxy, and a two-second sleep after each request:

```python
# coding: utf-8
# ---------------------------------------------
# Program: crawler to collect 360 search suggestion words
# Language: python2.7
# Version: w1
# Time: 2014-06-14
# Author: wxx
# ---------------------------------------------
import urllib
import urllib2
import re
import time
from random import choice

# ip proxy list
iplist = ["14.29.117.36:80", "222.66.115.229:80", "59.46.72.245:8080"]
ip = choice(iplist)
# print ip

# keyword list, searched in order
list = ["group", "technology", "python"]
for m in list:
    # quote converts m to URL encoding
    gjc = urllib.quote(m)
    url = ("http://sug.so.360.cn/suggest?callback=suggest_so&encodein=utf-8"
           "&encodeout=utf-8&format=json&fields=word,obdata&word=" + gjc)
    # header information
    headers = {
        "GET": url,
        "Host": "sug.so.360.cn",
        "Referer": "http://www.so.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 QIHU 360SE"
    }
    # use the IP proxy server
    proxy_handler = urllib2.ProxyHandler({'http': 'http://' + ip})
    opener = urllib2.build_opener(proxy_handler)
    urllib2.install_opener(opener)
    req = urllib2.Request(url)
    for key in headers:
        req.add_header(key, headers[key])
    html = urllib2.urlopen(req).read()
    # convert the response into unicode
    unicodePage = html.decode("utf-8")
    # the findall method returns a list of matches
    ss = re.findall('"word":"(.*?)"', unicodePage)
    for item in ss:
        print item
    # sleep for 2 seconds
    time.sleep(2)
```
Result:
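As an aside, the two anti-blocking tricks in the improved code above can be isolated on their own: picking a random proxy and pacing the requests. A sketch using the sample proxy list from the code (those addresses are almost certainly dead by now):

```python
from random import choice

# the sample proxy list from the code above
iplist = ["14.29.117.36:80", "222.66.115.229:80", "59.46.72.245:8080"]

def pick_proxy(candidates):
    # choose one proxy at random, in the dict shape ProxyHandler expects
    return {"http": "http://" + choice(candidates)}

proxy = pick_proxy(iplist)
print(proxy)
# In Python 3 this dict would be passed to urllib.request.ProxyHandler(proxy),
# and time.sleep(2) after each fetch paces the requests.
```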

Optimization considerations for the next version:

1. Allow users to enter keywords themselves instead of defining a keyword list in advance.

2. Press Enter to move on to the next keyword.

3. Save the output results to a txt file.

4. When the user enters "exit", the program exits.
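The four items above can be sketched together. The run() helper below is hypothetical, not from the article: it takes keywords from an iterable (standing in for raw_input/input) and a fetch function injected in place of the real network call, so the loop logic can be shown without a console or network:

```python
def run(commands, fetch, outfile="suggestions.txt"):
    """Process keywords one by one until 'exit'; save suggestions to a txt file."""
    with open(outfile, "w") as f:          # item 3: save output to a txt file
        for keyword in commands:           # items 1-2: one keyword per Enter press
            if keyword == "exit":          # item 4: entering exit quits
                break
            for word in fetch(keyword):    # the real fetch would hit the suggest URL
                f.write(word + "\n")

# Hypothetical stand-in for the network fetch, for illustration only
fake_fetch = lambda kw: [kw + " news", kw + " daily"]
run(["tech", "exit", "python"], fake_fetch, outfile="demo_suggestions.txt")
content = open("demo_suggestions.txt").read()
print(content)
```

Note that "python" is never fetched, because the loop stops at "exit" first.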
