The commented source code is given in full below.
#-*-coding:utf8-*-
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import requests
import json

# These three lines work around encoding problems (Python 2 only).
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

'''
Delete content.txt before you rerun the script: the file is opened in
append mode, so old output would otherwise pile up.
'''

# Write one reply to the file in the following format.
def towrite(contentdict):
    f.writelines(u'Reply time:' + str(contentdict['topic_reply_time']) + '\n')
    f.writelines(u'Reply content:' + unicode(contentdict['topic_reply_content']) + '\n')
    f.writelines(u'Reply person:' + contentdict['user_name'] + '\n')

_header = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}

# Extract the replies from a given URL.
def spider(url):
    html = requests.get(url, headers=_header)
    print url
    selector = etree.HTML(html.text)
    # Get every post ("floor") on this page of the thread.
    content_field = selector.xpath('//div[@class="l_post j_l_post l_post_bright"]')
    item = {}
    # Iterate over the posts.
    for each in content_field:
        '''
        data-field="{"author":{"user_id":830583117,"user_name":"huluxiao855",
        "name_u":"huluxiao855&ie=utf-8","user_sex":0,
        "portrait":"4db168756c757869616f3835358131","is_like":1,"level_id":4,
        "level_name":"\u719f\u6089\u82f9\u679c","cur_score":...,"bawu":0,
        "props":null},"content":{"post_id":62881461599,"is_anonym":false,
        "open_id":"tbclient","open_type":"apple","date":"2015-01-11 22:09",
        "vote_crypt":"","post_no":203,"type":"0","comment_num":1,"ptype":"0",
        "is_saveface":false,"props":null,"post_index":0,"pb_tpoint":null}}"
        '''
        reply_info = json.loads(each.xpath('@data-field')[0].replace('&quot', ''))
        # reply_info is a dict; follow the structure shown in the comment
        # above to pull out the fields we need.
        author = reply_info['author']['user_name']
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/div[@class="d_post_content j_d_post_content"]/text()')[0]
        reply_time = reply_info['content']['date']
        print content
        print reply_time
        print author
        item['user_name'] = author
        item['topic_reply_content'] = content
        item['topic_reply_time'] = reply_time
        towrite(item)

'''
If we run a .py file directly, "__name__ == '__main__'" is true in that file.
If we import the file from another .py file, __name__ holds the name of the
module instead of '__main__'.

This is also useful for debugging: code placed under
"if __name__ == '__main__'" is skipped when an external module imports the
file, but it runs normally when we execute the module file directly to
troubleshoot a problem.
'''
if __name__ == '__main__':
    # Create a pool of 4 worker threads.
    pool = ThreadPool(4)
    # The second argument, 'a', means append to the file.
    f = open('content.txt', 'a')
    # A list that holds the page URLs.
    page = []
    # Build the URLs for pages 1 to 20 in a loop.
    for i in range(1, 21):
        newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
        page.append(newpage)
    # Crawl the pages with multiple threads.
    results = pool.map(spider, page)
    pool.close()
    pool.join()
    f.close()
PYTHON-18: Source code for a multi-threaded crawler that scrapes Baidu Tieba post content
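The crawler shares one file handle between four worker threads, so records from different pages can end up interleaved. The following is a minimal Python 3 sketch of the same Pool.map pattern with the writes guarded by a lock. It makes no network requests and does not use lxml: fetch_page, parse_reply, crawl, SAMPLE_FIELD and the in-memory buffer out are hypothetical stand-ins introduced only for illustration. It also decodes HTML entities with html.unescape rather than stripping them with replace, which is a substitution of mine and not what the original script does.

```python
import html
import io
import json
import threading
from multiprocessing.dummy import Pool as ThreadPool

# Hypothetical stand-in for one post's data-field attribute (abridged).
SAMPLE_FIELD = ('{&quot;author&quot;:{&quot;user_name&quot;:&quot;huluxiao855&quot;},'
                '&quot;content&quot;:{&quot;date&quot;:&quot;2015-01-11 22:09&quot;}}')


def parse_reply(raw_field):
    """Decode HTML entities, then keep the two fields the crawler writes out."""
    info = json.loads(html.unescape(raw_field))
    return {'user_name': info['author']['user_name'],
            'topic_reply_time': info['content']['date']}


def fetch_page(url):
    """Hypothetical stub: a real crawler would download and parse the URL here."""
    reply = parse_reply(SAMPLE_FIELD)
    reply['url'] = url
    return reply


write_lock = threading.Lock()
out = io.StringIO()  # stands in for open('content.txt', 'a')


def spider(url):
    reply = fetch_page(url)
    # Only one thread writes at a time, so each record stays contiguous.
    with write_lock:
        out.write('Reply time: %s\n' % reply['topic_reply_time'])
        out.write('Reply person: %s\n' % reply['user_name'])
    return reply


def crawl(urls, workers=4):
    pool = ThreadPool(workers)
    # map blocks until every URL is processed and returns results in input order.
    results = pool.map(spider, urls)
    pool.close()
    pool.join()
    return results


if __name__ == '__main__':
    pages = ['http://tieba.baidu.com/p/3522395718?pn=' + str(i) for i in range(1, 21)]
    print(len(crawl(pages)), 'pages processed')
```

Because the lock wraps both write calls, the two lines of a record are always adjacent in the output, whichever thread finishes first.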