Python crawler: Case one: index

Source: Internet
Author: User
Tags: virtual environment


Install the dependencies:

pip install beautifulsoup4
pip install requests
pip install selenium

Download PhantomJS (PhantomJS is a headless browser with no interface, used here to execute the page's JavaScript) and install Firebug for Firefox.

Create a directory named baidupc, cd baidupc, then create a virtual environment: virtualenv macp
To activate it on Windows, run macp\Scripts\activate; on Mac, run source macp/bin/activate.
The advantage of a virtual environment is that it is independent: you can experiment freely without affecting your original environment.
Unzip phantomjs-2.1.1-macosx.zip and copy the phantomjs binary into baidupc/macp/bin (PhantomJS provides different archives for different systems; under Windows the corresponding virtualenv directory is baidupc\macp\Scripts).
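
Before writing the crawler, it's worth a quick sanity check that selenium can drive the PhantomJS binary you just copied; a minimal sketch (pass executable_path to webdriver.PhantomJS if phantomjs is not on your PATH):

#coding=utf-8
from selenium import webdriver

# Launch the headless PhantomJS browser; with the virtualenv activated,
# its bin/Scripts directory is on the PATH, so the copied binary is found.
driver = webdriver.PhantomJS()
driver.get('http://index.so.com/')
print driver.title  # if this prints the page title, the toolchain works
driver.quit()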

Case 1: the index data display is very intuitive; enter a keyword on the homepage and look at the URL:

http://index.so.com/#trend?q=Ode to Joy

(the real URL is actually http://index.so.com/#trend?q=%e6%ac%a2%e4%b9%90%e9%a2%82 — the Chinese keyword 欢乐颂, 'Ode to Joy', is URL-encoded). Adding a date range to the query gives three URL forms:

http://index.so.com/#trend?q=%e6%ac%a2%e4%b9%90%e9%a2%82&t=7
http://index.so.com/#trend?q=%e6%ac%a2%e4%b9%90%e9%a2%82&t=30
http://index.so.com/#trend?q=%e6%ac%a2%e4%b9%90%e9%a2%82&t=201603|201605

Knowing this, we can then find the data nodes we need in the HTML and work out how to get the basic data.
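
You can reproduce the encoded form with urllib.quote (a small standalone illustration; the hex case differs from the address bar but is equivalent):

#coding=utf-8
import urllib

keyword = '欢乐颂'  # 'Ode to Joy'
print urllib.quote(keyword)  # %E6%AC%A2%E4%B9%90%E9%A2%82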
#coding=utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import urllib
from selenium import webdriver

class QH():
    def pc(self, name, date='7'):  # date defaults to 7 days; pass '30' for 30 days, '201603|201605' for a custom month range
        url_name = urllib.quote(name)  # urllib.quote() URL-encodes the Chinese keyword
        url = 'http://index.so.com/#trend?q=' + url_name + '&t=' + date
        driver = webdriver.PhantomJS()  # webdriver.PhantomJS() launches the PhantomJS browser
        driver.get(url)
        # search index, search index MoM, search index YoY (all national data)
        sszs = driver.find_element_by_xpath('//*[@id="bd_overview"]/div[2]/table/tbody/tr/td[1]').text
        sszshb = driver.find_element_by_xpath('//*[@id="bd_overview"]/div[2]/table/tbody/tr/td[2]').text
        sszstb = driver.find_element_by_xpath('//*[@id="bd_overview"]/div[2]/table/tbody/tr/td[3]').text
        driver.quit()  # quit() closes the browser
        return sszs + '|' + sszshb + '|' + sszstb

s = QH()
print s.pc('欢乐颂')                    # 'Ode to Joy', default 7 days
print s.pc('欢乐颂', '30')              # 30 days
print s.pc('欢乐颂', '201603|201605')   # custom month range


Results:

1,392,286|36.28%|>1000%
657,310|>1000%|>1000%
657,310|>1000%|>1000%

(An interesting quirk here: on the page, the 201603|201605 view displays the same numbers as the 7-day view, yet the 201603|201605 data I crawled matches the 30-day data. By common sense the month-range query should match the 30-day data, so it is unclear why the page display matches the 7-day numbers.)
The above is the simplest version. Looking at the index page, you will notice that a region can be selected for the data; by default the program above crawls only the national figures. To get regional data we need to analyze the page. Open Firefox with Firebug, click 'change' and select 'Zhejiang': you will find this triggers an AJAX request. Under Firebug's XHR panel there is a GET address; copying that URL into the browser returns JSON, which is the data the AJAX call brings back.

URL: http://index.so.com/index.php?a=overviewjson&q=Ode to Joy&area=Zhejiang (the actual URL is http://index.so.com/index.php?a=overviewjson&q=%e6%ac%a2%e4%b9%90%e9%a2%82&area=%e6%b5%99%e6%b1%9f)

The returned JSON data:

{"Status": 0, "data": [{"Query": "\u6b22\u4e50\u9882", "data": {"Week_year_ratio": ">1000%", "Month_year_ratio": " >1000% "," Week_chain_ratio ":" 31.52% "," Month_chain_ratio ":" >1000% "," Week_index ": 97521," Month_index ": 47646} }], "MSG": false}


You will find that it contains only the 7-day and 30-day figures: search index, search index MoM, and search index YoY.
Code:

#coding=utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import urllib
from selenium import webdriver

class QH():
    def pc(self, name, dq='全国'):  # dq is the region; defaults to 全国 (nationwide)
        url_name = urllib.quote(name)
        dq_name = urllib.quote(dq)
        url = 'http://index.so.com/index.php?a=overviewjson&q=' + url_name + '&area=' + dq_name
        driver = webdriver.PhantomJS()
        driver.get(url)
        json = driver.find_element_by_xpath('/html/body/pre').text  # the raw JSON is rendered inside a <pre> tag
        driver.quit()
        return json

s = QH()
print s.pc('欢乐颂')          # 'Ode to Joy', nationwide
print s.pc('欢乐颂', '浙江')  # 'Ode to Joy', Zhejiang


Results:

{"Status": 0, "data": [{"Query": "\u6b22\u4e50\u9882", "data": {"Week_year_ratio": ">1000%", "Month_year_ratio": " >1000% "," Week_chain_ratio ":" 36.28% "," Month_chain_ratio ":" >1000% "," Week_index ": 1392286," Month_index " : 657310}], "MSG": false}{"status": 0, "data": [{"Query": "\u6b22\u4e50\u9882", "data": {"Week_year_ratio": ">1000%" , "Month_year_ratio": ">1000%", "Week_chain_ratio": "31.52%", "Month_chain_ratio": ">1000%", "Week_index": 97521 , "Month_index": 47646}}], "MSG": false}



week_year_ratio    7-day search index YoY
month_year_ratio   30-day search index YoY
week_chain_ratio   7-day search index MoM
month_chain_ratio  30-day search index MoM
week_index         7-day search index
month_index        30-day search index
If you want to crawl a batch of keywords, you can write a configuration file and loop over it to crawl the data. If you only want specific fields rather than the raw JSON, you can parse the JSON and store the result in a database or a file, as sketched below.
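
A minimal sketch of that idea (the keywords.txt file and the chosen fields are illustrative assumptions):

#coding=utf-8
import json

s = QH()  # the class defined in the code above
keywords = [line.strip() for line in open('keywords.txt')]  # hypothetical config file, one keyword per line
for kw in keywords:
    fields = json.loads(s.pc(kw))['data'][0]['data']
    print kw, fields['week_index'], fields['week_chain_ratio'], fields['week_year_ratio']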
As for the trend graph and the other sections (the demand map and the demographic profile), I haven't found a way to get their data from an interface; from what I saw, that data can be obtained from the HTML.
In general, writing good code for this kind of data analysis takes about a day; polishing the code would take roughly two or three days more. Testing is harder to estimate, since the code may hide bugs you overlooked; after all, testing is the most time-consuming part.

