The code is as follows:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import random
import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='127.0.0.1',
                             port=3306,
                             user='root',
                             password='database password',
                             db='scraping',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
cur = connection.cursor()
random.seed()  # seed the random generator from system time/entropy

def store(title, content):
    # Use parameterized placeholders; pymysql handles quoting and escaping
    cur.execute("INSERT INTO pages (title, content) VALUES (%s, %s)",
                (title, content))
    cur.connection.commit()

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    title = bsObj.find("h1").get_text()
    print(title)
    content = bsObj.find("div", {"id": "mw-content-text"}).find("p").get_text()
    print(content)
    store(title, content)
    # Follow only internal article links: paths that start with /wiki/
    # and contain no colon (colons mark special pages such as File: or Talk:)
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
        # print(newArticle)
        links = getLinks(newArticle)
finally:
    cur.close()
    connection.close()
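The href filter in getLinks relies on a negative-lookahead regular expression. A small standalone sketch (the example URLs are illustrative, not from the original) shows what it accepts and rejects:

```python
import re

# Keep only internal article links: paths starting with /wiki/ that
# contain no colon (colons indicate special pages like File: or Talk:).
wiki_link = re.compile(r"^(/wiki/)((?!:).)*$")

print(bool(wiki_link.match("/wiki/Kevin_Bacon")))       # True:  plain article path
print(bool(wiki_link.match("/wiki/File:Example.jpg")))  # False: colon marks a special page
print(bool(wiki_link.match("/w/index.php")))            # False: not under /wiki/
```

The lookahead `(?!:)` checks, before consuming each character, that it is not a colon, so any colon anywhere after the `/wiki/` prefix causes the match to fail.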
Results
Note:
Since we will encounter all kinds of characters on Wikipedia, it is best to have the database support Unicode with the following four statements:
ALTER DATABASE scraping CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE pages CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE pages CHANGE title title VARCHAR(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE pages CHANGE content content VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
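The ALTER statements above assume the scraping database and its pages table already exist. If they do not, a minimal schema (an assumption for illustration, not part of the original) could be created first:

```sql
CREATE DATABASE scraping CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

USE scraping;

CREATE TABLE pages (
    id      INT NOT NULL AUTO_INCREMENT,  -- surrogate key for each stored page
    title   VARCHAR(200),                  -- article title from the <h1> tag
    content VARCHAR(10000),                -- first paragraph of the article body
    created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);
```

Creating the database and columns with utf8mb4 from the start makes the four ALTER statements unnecessary.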
Crawling Wikipedia entries and storing them in a database using pymysql