Cat's Eye Movie Crawl (ii): Requests+beautifulsoup and store data in MySQL database

Last Update:2018-06-26 Source: Internet

Author: User

Tags webp

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

The last article through the requests+ to crawl the cat's-eye movie list, this time through the Requests+beautifulsoup to crawl again (in fact, this site is more suitable to use the BeautifulSoup library crawl)

1. Analyze the Web page source code First

You can see that each movie message is contained in a bunch of <dd>...</dd> tags, so the first step is to parse all the <dd> tag pairs through the BeautifulSoup library and then from the <dd> The <i> tag in which the position of the tag is parsed, the <p> tag where the movie name is located, the <p> tag where the release time is located, and the <p> tag where the score is located.

2. Information Extraction Code

#Coding:utf-8#AUTHOR:HMK fromBs4ImportBeautifulSoupImportRequestsImportBs4url='HTTP://MAOYAN.COM/BOARD/4'Header= {"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",              "accept-encoding":"gzip, deflate, SDCH",              "Accept-language":"zh-cn,zh;q=0.8",              "Cache-control":"max-age=0",              "Connection":"keep-alive",              "Host":"maoyan.com",              "Referer":"Http://maoyan.com/board",              "upgrade-insecure-requests":"1",              "user-agent":"mozilla/5.0 (Windows NT 6.1; Win64; x64) applewebkit/537.36 (khtml, like Gecko) chrome/49.0.2623.75 safari/537.36"}r= Requests.get (URL, headers=header) R.encoding=r.apparent_encodinghtml=R.textsoup= BeautifulSoup (HTML,'Html.parser')#Print (Soup.find_all (' DD '))List=[]#define a list, save all the movie data, do not define in the loop, or each time will be emptied, and finally will only leave the last movie data  forDdinchSoup.find_all ('DD'): Index= Dd.i.string#Movie Rankings    #Print (index)Movie = Dd.find ('P', class_='name'). String#Movie Name    #print (movie.string)Release_times = Dd.find ('P', class_='Releasetime')#Release TimeRelease_time =release_times.string#print (release_time.string)s = Dd.find ('P', class_='score'). Contents#scoreScore = s[0].string+s[1].string#stitching the integer and fractional parts of fractions    #print (score[0].string+score[1].string)list.append ([Index,movie,release_time,score])#add each movie's rank, name, release time, score to a list, and then append to a larger listPrint(list)

the focus of the above code is how the information in the for loop is extracted and then combined, with the following ideas: (1) First extract all the <dd> tag pairs from the page, and use the For loop to set each group <dd>The tag is assigned to a DD variable, and each DD variable is a tag object of a BS4 element;2) After getting the return object of the DD tag, you can use the Find method to extract the sub-tags of the DD tag (beginning to fall into a misunderstanding, because the printed DD content is the label element, and then think about whether it can be passed into the BeautifulSoup,
To generate a new BeautifulSoup object, the actual proof is no, because the type of DD is already <class 'Bs4.element.Tag'>, and the type of html=r.text that was previously passed in is <class 'Str'>, obviously can't do this!!
So just print out the object type when you can't figure it out .3) Extraction Ranking--use DD.I.STRING,DD.I to extract the first I tag under the DD label, just the first I tag under the DD tab, plus. String, which represents the extracted text (4) Extract the name of the movie-Using Dd.find ('P', class_='name'). String, extract the P label for the class property under the DD tag, because the movie name is under the P tag (5) Extract release time-Using Dd.find ('P', class_='Releasetime')(6) Extract the score-because the score is divided into 2 parts, the integer part and the fractional part, and belongs to the I label under a P tag respectively, so that the Tag.contents method (tag's. Contents property can output the child nodes of the tag as a list).
The 2 parts are then stitched together to form a complete score, as follows: Dd.find ('P', class_='score'). Contents[0].string + Dd.find ('P', class_='score'). contents[1].string

3. Complete code

#Coding:utf-8#AUTHOR:HMKImportRequests fromBs4ImportBeautifulSoupImportpymysql.cursorsdefget_html (URL, header):Try: R= Requests.get (Url=url, Headers=header, timeout=20) r.encoding=r.apparent_encodingifR.status_code = = 200:            returnR.textElse:            returnNoneexcept:        returnNonedefget_data (HTML, list_data): Soup= BeautifulSoup (HTML,'Html.parser') DD= Soup.find_all ('DD')     forTinchdd:ranking= T.i.string#rankingMovie = T.find ('P', class_='name'). String Release_time= T.find ('P', class_='Releasetime'). String Score= T.find ('P', class_='score'). Contents[0].string + T.find ('P', class_='score'). contents[1].string list_data.append ([ranking, movie, Release_time, score])defwrite_sql (data): Conn= Pymysql.connect (host='localhost', the user='Root', Password='123456', DB='Test', CharSet='UTF8') cur=conn.cursor () forIinchData:"""The data parameter here is a list that matches and processes (is a large list that contains all the movie information, each of which has its own list; iterate over a large list to extract each set of movie information so that each group of movie information extracted is a small list, Then you can write each set of movie information to the database."""movie= I#For each set of movie information, this can be seen as a set of movie data to be inserted into the databasesql ="INSERT INTO Maoyan_movie (Ranking,movie,release_time,score) values (%s,%s,%s,%s)"  #SQL INSERT Statement        Try: Cur.execute (sql, Movie)#execute the SQL statement, which refers to the data to be inserted into the databaseConn.commit ()#do not forget to submit the action after the insert is complete            Print('Import succeeded')        except:            Print('Import Failed') Cur.close ()#Close CursorsConn.close ()#Close ConnectiondefMain (): Start_url='HTTP://MAOYAN.COM/BOARD/4'Depth= 10#Crawl Depth (page flip)Header = {"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",              "accept-encoding":"gzip, deflate, SDCH",              "Accept-language":"zh-cn,zh;q=0.8",              "Cache-control":"max-age=0",              "Connection":"keep-alive",              "Host":"maoyan.com",              "Referer":"Http://maoyan.com/board",              "upgrade-insecure-requests":"1",              "user-agent":"mozilla/5.0 (Windows NT 6.1; Win64; x64) applewebkit/537.36 (khtml, like Gecko) chrome/49.0.2623.75 safari/537.36"}     forIinchRange (depth): URL= Start_url +'? offset='+ STR (10 *i) HTML=get_html (URL, header) List_data=[] get_data (HTML, List_data) Write_sql (list_data)Print(List_data)if __name__=="__main__": Main ()

Cat's Eye Movie Crawl (ii): Requests+beautifulsoup and store data in MySQL database

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More