Example: http://xyzp.haitou.cc/article/722427.html
The first step is to download each page. You can use os.system("wget " + str(url)) or urllib2.urlopen(url); both are simple enough that they need no further explanation.
Then comes the main part: extracting the information.
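Both download routes can be sketched as follows. This is a Python 3 sketch (urllib2 became urllib.request in Python 3); the function names and the fname argument are illustrative, not from the article:

```python
import os
import urllib.request


def wget_cmd(url):
    # Shell command equivalent of the article's os.system wget call.
    return "wget " + str(url)


def download_page(url, fname):
    # Library route: fetch the page with urllib.request (urllib2 in
    # Python 2) and save the raw bytes to a local file.
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    with open(fname, "wb") as fw:
        fw.write(data)
```

For batch crawling, the os.system route shells out once per page, while the urllib route keeps everything in-process and lets you handle HTTP errors directly.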
#!/usr/bin/env python
# coding=utf-8
from bs4 import BeautifulSoup
from pymongo import MongoClient
import codecs
import os
import re
import sys

reload(sys)
sys.setdefaultencoding("utf-8")


def get_jdstr(fname):
    """Parse one saved page and pull out the job-description fields."""
    retdict = {}
    with open(fname) as fr:
        soup = BeautifulSoup(fr.read().replace('""', '"'))
    jdstr = soup.get_text()
    retdict["inc_name"] = soup.title.string.split()[0]
    retdict["page_content"] = soup.find_all("div", "panel-body panel-body-text")[0].get_text()
    retdict["index_url"] = re.search("http://xyzp.haitou.cc/article/\d+.html", jdstr).group()
    retdict["info_from"] = soup.find_all("p", "text-ellipsis")[0].contents[1].get_text()
    retdict["workplace"] = soup.find_all("p", "text-ellipsis")[1].contents[1].get_text()
    retdict["info_tag"] = soup.find_all("p", "text-ellipsis")[2].contents[1].get_text()
    retdict["pub_time"] = soup.find_all("p", "text-ellipsis")[3].contents[1].get_text()
    return retdict


def jd_extr():
    """Extract every local .html page and also dump the fields to a CSV file."""
    fnames = [fname for fname in os.listdir("./") if fname.endswith(".html")]
    fw = codecs.open("tmp_jd_haitou_clean.csv", "w", "utf-8")
    res = []
    for fname in fnames[1:500]:
        tmp = []
        retdict = get_jdstr(fname)
        res.append(retdict)
        for k, v in retdict.iteritems():
            tmp.append(v)
        fw.write(", ".join(tmp) + "\n")
        fw.write("===" * 20 + "\n")
        print fname, "done!"
    return res


def change2html():
    """Rename downloaded .txt files to .html so jd_extr() picks them up."""
    fnames = [fname for fname in os.listdir("./") if fname.endswith(".txt")]
    for fname in fnames:
        cmd = "mv " + str(fname) + " " + fname[:-3] + "html"
        print cmd
        os.system(cmd)


def store2mongodb():
    """Insert the extracted records into the jd_haitou database."""
    client = MongoClient("localhost", 27017)
    db = client.jd_haitou
    documents = jd_extr()
    for d in documents:
        db.haitouJD.insert(d)
    mycol = db["haitouJD"]
    print mycol.count()


def split_jd_test_data(fname='./tmp_jd_haitou_clean.csv'):
    """Pull the index URLs back out of the CSV, one per line with a count."""
    fw = codecs.open('./split_jd_res.csv', 'w', 'utf-8')
    fr = codecs.open(fname, 'r', 'utf-8')
    indexurl = re.compile("http://xyzp.haitou.cc/article/\d+.html")
    for line in fr:
        if indexurl.search(line):
            url = indexurl.search(line).group()
            cnt = '1'  # default is 1
            fw.write(url + "\t" + cnt + "\n")
    fr.close()
    fw.close()


if __name__ == "__main__":
    jd_extr()  # write the cleaned CSV
    store2mongodb()
    split_jd_test_data()
    print "done"
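The index_url field is recovered with a plain regex over the page text, and split_jd_test_data reuses the same pattern on the CSV. A minimal Python 3 sketch of that step (note: the dot before "html" is escaped here, which the article's own pattern leaves unescaped, so this version is slightly stricter):

```python
import re

# Pattern for haitou article URLs; \. is an assumption tightening
# the article's original unescaped dot.
INDEX_URL = re.compile(r"http://xyzp\.haitou\.cc/article/\d+\.html")


def extract_index_url(text):
    # Return the first article URL found in the text, or None.
    m = INDEX_URL.search(text)
    return m.group() if m else None
```

Compiling the pattern once and searching once per line also avoids the doubled search() call the article performs inside the loop.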
In summary: BS4 (BeautifulSoup) is used to extract the content information from the downloaded pages, and the results are deposited into a MongoDB database.
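The article's change2html helper renames files by shelling out to mv, which only works on Unix-like systems. A portable sketch using os.rename instead (the dirpath parameter is an addition; the article always works in "./"):

```python
import os


def change2html(dirpath):
    # Rename every .txt file in dirpath to .html, mirroring the
    # article's os.system("mv ...") loop but without a shell.
    renamed = []
    for fname in os.listdir(dirpath):
        if fname.endswith(".txt"):
            new = fname[:-3] + "html"
            os.rename(os.path.join(dirpath, fname),
                      os.path.join(dirpath, new))
            renamed.append(new)
    return sorted(renamed)
```

Beyond portability, this also avoids breaking on filenames that contain spaces or shell metacharacters.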