[Python] Using the requests library to work with the HTTP protocol: collecting data from the Beihang confession wall


I recently read the book "Illustrated HTTP". Although it is called "illustrated", to be honest the illustrations are not very useful. Still, the book itself is good: after two hours of skimming, I now have a general understanding of the HTTP protocol. For someone like me who does not do front-end development, that much knowledge should be enough.

On with the Python tinkering journey!

Requests is the only Non-GMO HTTP library for Python, safe for human consumption.

Warning: Recreational use of other HTTP libraries could result in dangerous side-effects, including: security vulnerabilities, verbose code, reinventing the wheel, constantly reading documentation, depression, headaches, or even death.

(Excerpted from the official requests documentation. I think this is the most brutal roasting urllib has ever received, lol.)

1. Analysis

The most heavily used confession-wall system is a WeChat public account called "Beihang Micro-Life" (many universities seem to have a corresponding "XX Micro-Life" account, all operated in basically the same way; I don't really understand why...).

Because the WeChat public platform does not expose an open API, and access involves a complex authentication mechanism, it is difficult to crawl the articles directly from a link. The approach most often used online is to crawl indirectly through Sogou's WeChat search (http://weixin.sogou.com/).
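As a minimal sketch of the indirect approach (the query value is the URL-encoded name of the wall, and the link-matching pattern mirrors the full script in section 2; both depend on Sogou's current markup and may break when it changes):

import re
import requests

# Fetch page 1 of Sogou WeChat search results for the confession wall.
search_url = ('http://weixin.sogou.com/weixin?type=2&ie=utf8'
              '&query=%E5%8C%97%E8%88%AA%E8%A1%A8%E7%99%BD%E5%A2%99&page=1')
html = requests.get(search_url).text

# Each search hit links to the actual article hosted on mp.weixin.qq.com.
article_links = re.findall('<a href="(.*?)" target="_blank"', html)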

This is similar to a crawler I wrote earlier to collect Beihang University's academic news.

But it is not exactly the same; there are a few extra wrinkles:

    • Sogou restricts access: without logging in, you can only view the first 10 pages of search results
    • Sogou has anti-crawling measures: frequent access triggers a CAPTCHA

In the spirit of not introducing too many side issues into a small project, I took the following simple but inelegant approach (a short sketch follows the list):

    • First log in manually in the browser and save the cookies, then have the crawler send that cookie with every request
    • Throttle the crawl with time.sleep
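A minimal sketch of both workarounds, assuming the cookie string has been copied out of the browser's developer tools after a manual login (the header names match the full script in section 2):

import time
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Cookie': 'PASTE_SAVED_COOKIE_STRING_HERE',  # saved from the logged-in browser session
}

for page_no in range(1, 35):
    url = ('http://weixin.sogou.com/weixin?type=2&ie=utf8'
           '&query=%E5%8C%97%E8%88%AA%E8%A1%A8%E7%99%BD%E5%A2%99&page=%d' % page_no)
    resp = requests.get(url, headers=headers)
    # ... parse resp.text ...
    time.sleep(10)  # crude throttle so the CAPTCHA is not triggered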
2. Coding and debugging

The biggest advantage of requests is how intuitive its syntax is. Constructing a standard set of headers is this simple!

# _*_ coding: utf-8 _*_
import re
import sys
import time

import requests

reload(sys)
sys.setdefaultencoding('utf-8')

# Regexes for pulling article links out of Sogou search results, and
# timestamps / post bodies out of the article pages themselves.
pattern_url = re.compile('^<a href="(.*?)" target="_blank" id="sogou_vr_11002601_title_." uigs_exp_id=', re.S | re.M)
date_pattern = re.compile(r"<strong><span style='color: rgb\(112, 48, 160\);'>(.*?)</span>", re.S)
date_pattern2 = re.compile("</p><p>(.*?)</p><p><br/>", re.S)
date_pattern_time = re.compile('<em id="post-date" class="rich_media_meta rich_media_meta_text">(.*?)</em>', re.S)
# The img pattern was lost in the original listing; assuming here that it
# matched the inline <img> tags WeChat uses for emoji.
sub_pattern_img = re.compile('<img[^>]*>', re.S)
sub_pattern_amp = re.compile('&amp;')

head = {
    'Host': 'weixin.sogou.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    # Paste the cookie string saved from a logged-in browser session here,
    # so the crawler is not limited to the first 10 result pages.
    'Cookie': 'PASTE_SAVED_COOKIE_STRING_HERE',
}

# An HTTP proxy, if needed, can be passed via requests.get(..., proxies=proxies).
proxies = {
    "http": "http://116.252.158.157:8998",
}

print 'HELLO'
for page_no in range(1, 35):
    print 'PAGE' + str(page_no)
    # query=%E5%8C%97%E8%88%AA%E8%A1%A8%E7%99%BD%E5%A2%99 is the URL-encoded search term
    search_url = ('http://weixin.sogou.com/weixin?query=%E5%8C%97%E8%88%AA%E8%A1%A8%E7%99%BD%E5%A2%99'
                  '&_sug_type_=&sut=805&lkt=0%2c0%2c0&_sug_=y&type=2&sst0=1479460274521'
                  '&page=' + str(page_no) + '&ie=utf8&w=01019900&dr=1')
    search_result = requests.get(search_url, headers=head)
    for i in re.findall(pattern_url, search_result.text):
        url = re.sub(sub_pattern_amp, '&', i)  # un-escape &amp; in the article link
        article = requests.get(url)
        post_time = re.findall(date_pattern_time, article.text)
        out = open(''.join(post_time) + '.txt', 'w')  # one file per post, named by its date
        print ''.join(post_time)
        m = re.findall(date_pattern, article.text)
        if not m:  # fall back to the plainer page layout
            m = re.findall(date_pattern2, article.text)
        for k in m:
            text = re.sub(sub_pattern_img, '<EMOJI>', k)  # replace emoji images with a marker
            out.write(text)
            out.write('\n')
        out.close()
        time.sleep(10)  # throttle so Sogou's anti-crawler does not kick in

The crawl speed is not fast... but to avoid being blocked by the anti-crawler measures, this is the only way.
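One small refinement (my own aside, not in the script above) is to randomize the delay so requests do not arrive on a perfectly regular beat, which is an easy pattern for anti-crawler systems to spot:

import random
import time

def polite_sleep(base=10, jitter=5):
    # Sleep base +/- jitter seconds; uniform jitter breaks the fixed rhythm.
    time.sleep(base + random.uniform(-jitter, jitter))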

3. PostScript

Python is really getting more and more interesting. Of course, besides using all kinds of libraries to build fun features, I do need to shore up my grasp of Python's basic syntax.

The basic moves are now more or less down. Next, I want to try:

    • Efficient data cleansing (BeautifulSoup, XPath, XML ...)
    • Automatic CAPTCHA handling (Tesseract, machine learning algorithms ...)
    • Automated testing (Selenium 2)
    • More web theory (JavaScript, jQuery, CSS)
    • Database technology (SQL)

I hope to wrap up my practice with data collection as soon as possible. After all, data analysis is the real highlight.
