PYTHON-18: Source code for multi-threaded scraping of Baidu Tieba post content

Source: Internet
Author: User


The source code, with comments, is given below.

# -*- coding: utf8 -*-
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import requests
import json

# These three lines work around encoding problems (Python 2 only).
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

'''
Delete content.txt before rerunning: the file is opened in append mode,
so output from earlier runs would otherwise pile up.
'''

# Write one reply to the file in the following format.
def towrite(contentdict):
    f.writelines(u'Reply time:' + str(contentdict['topic_reply_time']) + '\n')
    f.writelines(u'Reply content:' + unicode(contentdict['topic_reply_content']) + '\n')
    f.writelines(u'Reply person:' + contentdict['user_name'] + '\n')

_header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}

# Extract the replies from a given URL.
def spider(url):
    html = requests.get(url, headers=_header)
    print url
    selector = etree.HTML(html.text)
    # Get every floor (post) on this page.
    content_field = selector.xpath('//div[@class="l_post j_l_post l_post_bright"]')
    item = {}
    # Iterate over the floors.
    for each in content_field:
        '''
        data-field="{
            "author": {"user_id": 830583117, "user_name": "huluxiao855",
                       "name_u": "huluxiao855&ie=utf-8", "user_sex": 0,
                       "portrait": "4db168756c757869616f3835358131",
                       "is_like": 1, "level_id": 4,
                       "level_name": "\u719f\u6089\u82f9\u679c",
                       "cur_score": "", "bawu": 0, "props": null},
            "content": {"post_id": 62881461599, "is_anonym": false,
                        "open_id": "tbclient", "open_type": "apple",
                        "date": "2015-01-11 22:09", "vote_crypt": "",
                        "post_no": 203, "type": "0", "comment_num": 1,
                        "ptype": "0", "is_saveface": false, "props": null,
                        "post_index": 0, "pb_tpoint": null}
        }"
        '''
        reply_info = json.loads(each.xpath('@data-field')[0].replace('&quot', ''))
        # reply_info is a dict; its fields follow the structure shown in the comment above.
        author = reply_info['author']['user_name']
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/div[@class="d_post_content j_d_post_content"]/text()')[0]
        reply_time = reply_info['content']['date']
        print content
        print reply_time
        print author
        item['user_name'] = author
        item['topic_reply_content'] = content
        item['topic_reply_time'] = reply_time
        towrite(item)

'''
If this .py file is executed directly, "__name__ == '__main__'" is true.
If it is imported from another .py file, __name__ is the name of this
module instead of __main__.
This has a practical use: debugging code placed under
"if __name__ == '__main__'" is skipped when an external module imports the
file, but runs normally when the module file is executed directly to
troubleshoot a problem.
'''
if __name__ == '__main__':
    # Create a pool of 4 worker threads.
    pool = ThreadPool(4)
    # The second argument, 'a', opens the file in append mode.
    f = open('content.txt', 'a')
    # List that holds the page URLs.
    page = []
    # Build the URLs for pages 1 to 20 in a loop.
    for i in range(1, 21):
        newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
        page.append(newpage)
    # Run the crawler on all pages with multiple threads.
    results = pool.map(spider, page)
    pool.close()
    pool.join()
    f.close()

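In the script above, all four worker threads call the write function on the same open file, and each reply is written in three separate calls, so lines from different pages may end up interleaved in content.txt. One possible variation, shown here only as a sketch in Python 3, is to have each worker return its records and let the main thread do all the writing; pool.map returns results in the same order as the input URLs. The names fake_spider, format_record, crawl and SAMPLE_DATA_FIELD are hypothetical and not part of the original code. fake_spider makes no network request: it parses a trimmed copy of the data-field sample from the comment above, and the reply text is a placeholder.

```python
import io
import json
from multiprocessing.dummy import Pool as ThreadPool

# Trimmed copy of the data-field sample quoted in the source comment.
SAMPLE_DATA_FIELD = (
    '{"author": {"user_id": 830583117, "user_name": "huluxiao855"},'
    ' "content": {"post_id": 62881461599, "date": "2015-01-11 22:09",'
    ' "post_no": 203}}'
)


def fake_spider(url):
    """Hypothetical stand-in for spider(): returns records, performs no I/O."""
    page_no = url.rsplit('=', 1)[1]
    reply_info = json.loads(SAMPLE_DATA_FIELD)
    return [{
        'user_name': reply_info['author']['user_name'],
        'topic_reply_content': 'reply on page ' + page_no,  # placeholder text
        'topic_reply_time': reply_info['content']['date'],
    }]


def format_record(record):
    """Render one reply in the same three-line layout the script uses."""
    return 'Reply time: {}\nReply content: {}\nReply person: {}\n'.format(
        record['topic_reply_time'],
        record['topic_reply_content'],
        record['user_name'],
    )


def crawl(urls, out, workers=4):
    """Fetch pages in worker threads, then write from the calling thread only."""
    pool = ThreadPool(workers)
    try:
        pages = pool.map(fake_spider, urls)  # results keep the order of urls
    finally:
        pool.close()
        pool.join()
    count = 0
    for records in pages:
        for record in records:
            out.write(format_record(record))
            count += 1
    return count


urls = ['http://tieba.baidu.com/p/3522395718?pn=' + str(i) for i in range(1, 4)]
buffer = io.StringIO()  # stands in for the open content.txt file
written = crawl(urls, buffer)
```

To use this with the real crawler, fake_spider would be replaced by a version of spider that returns its items instead of calling the write function, and buffer by the opened output file.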