Inspecting JSON-crawled files with BeautifulSoup to check whether they contain a P tag
# -*- coding: utf-8 -*-
import re

from bs4 import BeautifulSoup


# Parse one crawled post with BeautifulSoup.
def bs4analysis(html_doc):
    soup = BeautifulSoup(html_doc, "lxml")
    if soup.find_all('a'):
        # Print the text of the first <a> tag and whatever follows it.
        print(soup.a.string)
        print(soup.a.next_sibling)
    elif html_doc.find('#') >= 0:
        # No link, but the text still carries a "#topic#" marker.
        print('have a theme')
        p = re.split('#', html_doc)
        # Print every piece; indexing p[2] directly fails when the text has only one '#'.
        for i, part in enumerate(p):
            print('p%d%s' % (i, part))
    else:
        print('haha')


# Sample posts in the shapes that appear in the crawled data.
html_doc = """<a class='k' href='https://m.weibo.cn/k/SHU graduation season?from=feed'>#毕业季#</a> Cloud blessing! I wish all 2017 graduates and the global people a bright future. <a data-url="http://t.cn/RootR20" href="https://m.weibo.cn/p/index?containerid=230444def4f80e7a017ab35b3e37cadc001f32&url_type=39&object_type=video&pos=1&luicode=10000011&lfid=1076033243026514&featurecode=20000320&ep=f9u8aqkyn%2c3243026514%2cf9u8aqkyn%2c3243026514" data-hide=""><span class="url-icon"></span></i><span class="surl-text">Sec Video</a>"""
html_doc2 = """#早安# million Wood Agnoy Xinyu, Xiaofeng before the Awakening, Four Seasons lovely only spring, a thing can be crazy young. --Wang Guowei"""
html_doc3 = """<a class='k' href='https://m.weibo.cn/k/notice?from=feed'>#通知公告#</a> Because of a burst water main in the men's area of the South District bathroom, 2 bathrooms will be closed from today. Please plan accordingly.
"""
html_doc4 = """I made the headlines: the signing of the Academy of Fine Arts and the inauguration ceremony of Shanghai Wusong International Art City Development Institute <a data-url="http://t.cn/RK2rQFs" href="http://media.weibo.cn/article?object_id=1022%3a2309404126988389488631&url_type=39&object_type=article&pos=1&luicode=10000011&lfid=1076033243026514&id=2309404126988389488631&ep=fbk5fbymp%2c3243026514%2cfbk5fbymp%2c3243026514" data-hide=""><span class="url-icon"></span></i><span class="surl-text">Signing of the Academy of Fine Arts and Shanghai Wusong International Art City Development Institute opening ceremony held</a>"""
html_doc5 = """<a class='k' href='https://m.weibo.cn/k/SHU share?from=feed'>#分享#</a> earthshaking, years flies <span class="url-icon"></span>"""

if __name__ == '__main__':
    # Analyse the crawled file line by line.
    with open('shuweibo.txt', 'r', encoding='utf-8') as f:
        for line in f:
            print('*******************')
            bs4analysis(line)
            print('*******************')
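The script above only prints what it finds. As a minimal sketch, the same branching can return the topic and the remaining text so the result can be stored or checked. The function name `extract_topic` is hypothetical and not part of the original script. It also assumes the built-in `html.parser` instead of `lxml`, to avoid the extra dependency, and uses `get_text()` rather than `.string`, since `.string` returns `None` when a tag has more than one child.

```python
import re

from bs4 import BeautifulSoup


def extract_topic(html_doc):
    """Return (topic, rest) for one post fragment.

    Hypothetical helper: topic is the text of the first <a> tag or a
    leading "#...#" marker, or None if neither is present.
    """
    # Assumption: html.parser is good enough for these short fragments.
    soup = BeautifulSoup(html_doc, "html.parser")
    first_link = soup.find("a")
    if first_link is not None:
        topic = first_link.get_text(strip=True)
        sibling = first_link.next_sibling
        rest = str(sibling).strip() if sibling is not None else ""
        return topic, rest

    # No link: look for a plain-text "#topic#" marker at the start.
    match = re.match(r"\s*#([^#]+)#\s*(.*)", html_doc, re.S)
    if match:
        return "#" + match.group(1).strip() + "#", match.group(2).strip()

    return None, html_doc.strip()


if __name__ == "__main__":
    sample = "<a class='k' href='https://m.weibo.cn/k/x?from=feed'>#topic#</a> hello world"
    print(extract_topic(sample))
```

Returning a tuple instead of printing keeps the parsing step separate from the output step, so the results could be written to a file such as `analysis.txt` by the caller.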