Python crawler practice


Using BeautifulSoup (bs4) to strip specific tags (here, the <br> tags) out of scraped HTML before saving the text.
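Stripped down to its essentials, the idea is: parse the page, detach the unwanted tags with extract() (decompose() also works), and only then collect the paragraphs. A minimal sketch with a made-up HTML snippet; the full script below applies the same pattern to the downloaded chapter pages:

from bs4 import BeautifulSoup

html = "<p>line one<br/>line two<br style='color:rgb(0,0,0);'/>line three</p>"
soup = BeautifulSoup(html, "html.parser")

# soup("br") is shorthand for soup.find_all("br"); extract() detaches each tag from the tree
for br in soup("br"):
    br.extract()

print(soup.find_all("p"))   # [<p>line oneline twoline three</p>] -- no <br> left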

# url
import easygui as g
import urllib.request
from bs4 import BeautifulSoup
import os
import sys
import re
import config.story2 as urls   # local config module holding url1/url2/url3


# Ask the user for the URLs via an easygui form; fields marked with * are required
def set_url():
    msg = "請填寫一下資訊(其中帶*號的項為必填項)"            # "please fill in the fields; * marks required items"
    title = "爬蟲練習"
    fieldNames = ["*小說目錄位址", "*組裝前半段", "後半段"]   # novel index URL, URL prefix, URL suffix
    fieldValues = g.multenterbox(msg, title, fieldNames)
    while True:
        if fieldValues is None:        # user cancelled the dialog
            break
        errmsg = ""
        for i in range(len(fieldNames)):
            option = fieldNames[i].strip()
            if fieldValues[i].strip() == "" and option[0] == "*":
                errmsg += ("【%s】為必填項   " % fieldNames[i])   # "%s is a required field"
        if errmsg == "":
            break
        fieldValues = g.multenterbox(errmsg, title, fieldNames, fieldValues)
    return fieldValues


# Download the index page and collect each chapter title with its download URL
def get_urls(seed_url, pre_url, last_url):
    # maps chapter title -> chapter URL
    storyList = {}
    response = urllib.request.urlopen(seed_url)
    html = response.read().decode('utf-8')
    bs = BeautifulSoup(html, "html.parser")
    contents = bs.find_all("div", {"class": "c-line-bottom"})
    for each in contents:
        # read the chapter id from the link's data-nsrc attribute
        nsrc = each.a["data-nsrc"]
        # assemble the chapter URL
        seed_url = pre_url + nsrc + last_url
        # the chapter title is the text of the <p> tag
        title = each.p.string
        storyList[title] = seed_url
    return storyList


# Fetch every chapter and save it to disk
def getStory():
    savepath = r"E:\\stories\\"
    storyList = get_urls(urls.url1, urls.url2, urls.url3)
    storyNames = list(storyList.keys())
    for i in range(len(storyNames)):
        # fetch the chapter page
        html = urllib.request.urlopen(storyList[storyNames[i]]).read().decode('utf-8')
        bs = BeautifulSoup(html, "html.parser")
        [s.extract() for s in bs('br')]   # this turned out to work: strip every <br> from the soup first
        content = bs.find_all('p')
        # [ss.extract() for ss in content('p')]  # tried it here instead -- fails:
        #                                        # TypeError: 'ResultSet' object is not callable
        # Also tried removing the <br> markup by plain string replacement -- did not work either:
        # oldstr = r'<br style="font-size:16px;font-weight:normal;' \
        #          r'margin-left:4px;margin-right:4px;float:none;color:rgb(0, 0, 0);' \
        #          r'text-align:-webkit-auto;text-indent:0px;white-space:normal;' \
        #          r'text-overflow:clip;clear:none;display:inline;"/>'
        # print(content)
        with open(savepath + storyNames[i] + ".txt", 'w') as f:
            f.writelines(str(content))


# download(get_url())
# get_url()
getStory()
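One note on the commented-out experiment in getStory(): find_all() returns a ResultSet, which behaves like a plain list of Tag objects and is not callable, so calling it as in the commented-out content('p') line raises the TypeError mentioned there. Tag removal has to happen on the soup, or on each Tag individually. A small self-contained sketch, not part of the original script:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a<br/>b</p><p>c<br/>d</p>", "html.parser")
content = soup.find_all("p")       # a ResultSet: a list of Tag objects

# content("br")                    # TypeError: 'ResultSet' object is not callable
for p in content:                  # but it can be iterated like a list,
    for br in p.find_all("br"):    # so removing <br> per paragraph still works
        br.decompose()

print(content)                     # [<p>ab</p>, <p>cd</p>]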

 
