Python crawler: Case one: index

Source: Internet
Author: User
Tags: virtual environment


Install the dependencies:

pip install beautifulsoup4
pip install requests
pip install selenium

Download PhantomJS (PhantomJS is a headless browser with no interface, used here to execute the page's JavaScript) and install Firebug for Firefox.

Create a directory named baidupc, cd baidupc, then create a virtual environment: virtualenv macp
To activate it on Windows, run macp\Scripts\activate; on Mac, run source macp/bin/activate.
The advantage of a virtual environment is that it is independent: you can experiment freely without affecting your original environment.
Unzip phantomjs-2.1.1-macosx.zip and copy the phantomjs binary into baidupc/macp/bin (PhantomJS provides different archives for different systems; under Windows the corresponding virtualenv directory is baidupc\macp\Scripts).
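
Before writing the crawler, it's worth a quick sanity check that selenium can drive the PhantomJS binary you just copied; a minimal sketch (pass executable_path to webdriver.PhantomJS if phantomjs is not on your PATH):

#coding=utf-8
from selenium import webdriver

# Launch the headless PhantomJS browser; with the virtualenv activated,
# its bin/Scripts directory is on the PATH, so the copied binary is found.
driver = webdriver.PhantomJS()
driver.get('http://index.so.com/')
print driver.title  # if this prints the page title, the toolchain works
driver.quit()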

Case 1: the index data display is very intuitive; enter a keyword on the homepage and look at the URL:

http://index.so.com/#trend?q=Ode to Joy

(the real URL is actually http://index.so.com/#trend?q=%e6%ac%a2%e4%b9%90%e9%a2%82 — the Chinese keyword 欢乐颂, 'Ode to Joy', is URL-encoded). Adding a date range to the query gives three URL forms:

http://index.so.com/#trend?q=%e6%ac%a2%e4%b9%90%e9%a2%82&t=7
http://index.so.com/#trend?q=%e6%ac%a2%e4%b9%90%e9%a2%82&t=30
http://index.so.com/#trend?q=%e6%ac%a2%e4%b9%90%e9%a2%82&t=201603|201605

Knowing this, we can then find the data nodes we need in the HTML and work out how to get the basic data.
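
You can reproduce the encoded form with urllib.quote (a small standalone illustration; the hex case differs from the address bar but is equivalent):

#coding=utf-8
import urllib

keyword = '欢乐颂'  # 'Ode to Joy'
print urllib.quote(keyword)  # %E6%AC%A2%E4%B9%90%E9%A2%82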
#coding=utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import urllib
from selenium import webdriver

class QH():
    def pc(self, name, date='7'):  # date defaults to 7 days; pass '30' for 30 days, '201603|201605' for a custom month range
        url_name = urllib.quote(name)  # urllib.quote() URL-encodes the Chinese keyword
        url = 'http://index.so.com/#trend?q=' + url_name + '&t=' + date
        driver = webdriver.PhantomJS()  # webdriver.PhantomJS() launches the PhantomJS browser
        driver.get(url)
        # search index, search index MoM, search index YoY (all national data)
        sszs = driver.find_element_by_xpath('//*[@id="bd_overview"]/div[2]/table/tbody/tr/td[1]').text
        sszshb = driver.find_element_by_xpath('//*[@id="bd_overview"]/div[2]/table/tbody/tr/td[2]').text
        sszstb = driver.find_element_by_xpath('//*[@id="bd_overview"]/div[2]/table/tbody/tr/td[3]').text
        driver.quit()  # quit() closes the browser
        return sszs + '|' + sszshb + '|' + sszstb

s = QH()
print s.pc('欢乐颂')                    # 'Ode to Joy', default 7 days
print s.pc('欢乐颂', '30')              # 30 days
print s.pc('欢乐颂', '201603|201605')   # custom month range


Results:

1,392,286|36.28%|>1000%
657,310|>1000%|>1000%
657,310|>1000%|>1000%

(An interesting quirk here: on the page, the 201603|201605 view displays the same numbers as the 7-day view, yet the 201603|201605 data I crawled matches the 30-day data. By common sense the month-range query should match the 30-day data, so it is unclear why the page display matches the 7-day numbers.)
The above is the simplest version. Looking at the index page, you will notice that a region can be selected for the data; by default the program above crawls only the national figures. To get regional data we need to analyze the page. Open Firefox with Firebug, click 'change' and select 'Zhejiang': you will find this triggers an AJAX request. Under Firebug's XHR panel there is a GET address; copying that URL into the browser returns JSON, which is the data the AJAX call brings back.

URL: http://index.so.com/index.php?a=overviewjson&q=Ode to Joy&area=Zhejiang (the actual URL is http://index.so.com/index.php?a=overviewjson&q=%e6%ac%a2%e4%b9%90%e9%a2%82&area=%e6%b5%99%e6%b1%9f)

The returned JSON data:

{"Status": 0, "data": [{"Query": "\u6b22\u4e50\u9882", "data": {"Week_year_ratio": ">1000%", "Month_year_ratio": " >1000% "," Week_chain_ratio ":" 31.52% "," Month_chain_ratio ":" >1000% "," Week_index ": 97521," Month_index ": 47646} }], "MSG": false}


You will find that it contains only the 7-day and 30-day figures: search index, search index MoM, and search index YoY.
Code:

#coding=utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import urllib
from selenium import webdriver

class QH():
    def pc(self, name, dq='全国'):  # dq is the region; defaults to 全国 (nationwide)
        url_name = urllib.quote(name)
        dq_name = urllib.quote(dq)
        url = 'http://index.so.com/index.php?a=overviewjson&q=' + url_name + '&area=' + dq_name
        driver = webdriver.PhantomJS()
        driver.get(url)
        json = driver.find_element_by_xpath('/html/body/pre').text  # the raw JSON is rendered inside a <pre> tag
        driver.quit()
        return json

s = QH()
print s.pc('欢乐颂')          # 'Ode to Joy', nationwide
print s.pc('欢乐颂', '浙江')  # 'Ode to Joy', Zhejiang


Results:

{"Status": 0, "data": [{"Query": "\u6b22\u4e50\u9882", "data": {"Week_year_ratio": ">1000%", "Month_year_ratio": " >1000% "," Week_chain_ratio ":" 36.28% "," Month_chain_ratio ":" >1000% "," Week_index ": 1392286," Month_index " : 657310}], "MSG": false}{"status": 0, "data": [{"Query": "\u6b22\u4e50\u9882", "data": {"Week_year_ratio": ">1000%" , "Month_year_ratio": ">1000%", "Week_chain_ratio": "31.52%", "Month_chain_ratio": ">1000%", "Week_index": 97521 , "Month_index": 47646}}], "MSG": false}



week_year_ratio    7-day search index YoY
month_year_ratio   30-day search index YoY
week_chain_ratio   7-day search index MoM
month_chain_ratio  30-day search index MoM
week_index         7-day search index
month_index        30-day search index
If you want to crawl a batch of keywords, you can write a configuration file and loop over it to crawl the data. If you only want specific fields rather than the raw JSON, you can parse the JSON and store the result in a database or a file, as sketched below.
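
A minimal sketch of that idea (the keywords.txt file and the chosen fields are illustrative assumptions):

#coding=utf-8
import json

s = QH()  # the class defined in the code above
keywords = [line.strip() for line in open('keywords.txt')]  # hypothetical config file, one keyword per line
for kw in keywords:
    fields = json.loads(s.pc(kw))['data'][0]['data']
    print kw, fields['week_index'], fields['week_chain_ratio'], fields['week_year_ratio']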
As for the trend graph and the other sections (the demand map and the demographic profile), I haven't found a way to get their data from an interface; from what I saw, that data can be obtained from the HTML.
In general, writing good code for this kind of data analysis takes about a day; polishing the code would take roughly two or three days more. Testing is harder to estimate, since the code may hide bugs you overlooked; after all, testing is the most time-consuming part.

