Following an expert's tutorial, I wrote a simple crawler for the Douban Movie Top250. Here are my notes on the process.
Tutorial link: here
After finishing the crawler, running it produced an error:
AttributeError: 'NoneType' object has no attribute 'find'  # i.e. a None object has no 'find' attribute
Cause of the error: the object being called is None, so calling .find() on it raises the exception. The failing line is the one that builds the next-page link: the 10th (last) page has no next-page link, so the result of soup.find('span', attrs={'class': 'next'}).find('a') should be checked before deciding what to return, which is how the tutorial does it, but I collapsed everything into a single statement. A minimal fix sketch follows my code below.
import requests
from bs4 import BeautifulSoup

def download_page(url):
    data = requests.get(url).content
    soup = BeautifulSoup(data, 'lxml')
    movie_list_soup = soup.find('ol', attrs={'class': 'grid_view'})
    movie_name_list = []
    for movie_li in movie_list_soup.find_all('li'):
        detail = movie_li.find('div', attrs={'class': 'hd'})
        movie_name = detail.find('span', attrs={'class': 'title'})
        movie_name_list.append(movie_name.text)
    next_page = down_url + soup.find('span', attrs={'class': 'next'}).find('a')['href']  # the error occurs here
    if next_page:
        return movie_name_list, next_page
    return movie_name_list, None

down_url = 'https://movie.douban.com/top250'
url = down_url
with open('g://movie_name_top250.txt', 'w') as f:
    while url:
        movie, url = download_page(url)
        f.write(str(movie))
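A minimal way to fix the failing line is to split the chained calls apart and check each result for None before building the link. This is only a sketch against my code above; next_span and next_link are names I introduce here:

# Replaces the marked line and the return statements: check each lookup
# result for None instead of chaining everything into one expression.
next_page = None
next_span = soup.find('span', attrs={'class': 'next'})
if next_span is not None:
    next_link = next_span.find('a')  # None on the last page, which has no next-page link
    if next_link is not None:
        next_page = down_url + next_link['href']
return movie_name_list, next_page

With this change the function returns None as the URL on the last page, so the while loop in the writing code stops cleanly.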
Here is the code given in the tutorial, which is worth studying:
#!/usr/bin/env python
# encoding=utf-8
"""Crawl the Douban movie TOP250 - full sample code"""
import codecs
import requests
from bs4 import BeautifulSoup

DOWNLOAD_URL = 'http://movie.douban.com/top250/'


def download_page(url):
    return requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/47.0.2526.80 Safari/537.36'
    }).content


def parse_html(html):
    soup = BeautifulSoup(html)
    movie_list_soup = soup.find('ol', attrs={'class': 'grid_view'})

    movie_name_list = []
    for movie_li in movie_list_soup.find_all('li'):
        detail = movie_li.find('div', attrs={'class': 'hd'})
        movie_name = detail.find('span', attrs={'class': 'title'}).getText()
        movie_name_list.append(movie_name)

    next_page = soup.find('span', attrs={'class': 'next'}).find('a')
    if next_page:
        return movie_name_list, DOWNLOAD_URL + next_page['href']
    return movie_name_list, None


def main():
    url = DOWNLOAD_URL
    with codecs.open('movies', 'wb', encoding='utf-8') as fp:
        while url:
            html = download_page(url)
            movies, url = parse_html(html)
            fp.write(u'{movies}\n'.format(movies='\n'.join(movies)))


if __name__ == '__main__':
    main()
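One small note if you run the tutorial code with a newer version of BeautifulSoup: constructing BeautifulSoup(html) without naming a parser prints a "no parser was explicitly specified" warning, so the parser can be made explicit (html.parser ships with the standard library; 'lxml' also works if that package is installed):

# Explicit parser choice avoids bs4's "no parser was explicitly specified" warning.
soup = BeautifulSoup(html, 'html.parser')  # or BeautifulSoup(html, 'lxml')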
Reference: Web crawler with Python - 03. Douban movie TOP250