In this section we will look for a solution to the "Six Degrees of Wikipedia" problem. That is, starting from Eric Idle's entry page (https://en.wikipedia.org/wiki/Eric_Idle), we want to reach Kevin Bacon's entry page (https://en.wikipedia.org/wiki/Kevin_Bacon) in the minimum number of link clicks.
You should already know how to write Python code that retrieves an arbitrary page from the Wikipedia site and extracts the links on that page:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
If you look at the list of links this produces, you'll see all the entry links you'd expect: "Apollo 13", "Philadelphia", "Primetime Emmy Award", and so on. However, there are also some links we don't need:
wikimediafoundation.org/wiki/Privacy_policy
en.wikipedia.org/wiki/Wikipedia:Contact_us
In fact, every page on Wikipedia is full of sidebar, header, and footer links, as well as links to category pages, talk pages, and other pages that are not entries:
/wiki/Category:Articles_with_unsourced_statements_from_April_2014
/wiki/Talk:Kevin_Bacon
A friend of mine who was recently working on a similar Wikipedia-scraping project told me that, in order to decide whether a Wikipedia link pointed to an entry page, he wrote a very large filter function of more than 100 lines of code. Unfortunately, he probably didn't take the time at the start of the project to compare "entry links" against "other links", or he might have discovered the trick. If you look closely at the links that point to entry pages (as opposed to other internal pages), you'll find that they all have three things in common:
• They are all inside the div tag whose id is bodyContent.
• The URLs do not contain colons.
• The URLs all begin with /wiki/.
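The last two rules can be folded into a single regular expression, ^(/wiki/)((?!:).)*$. As a minimal sketch (the sample hrefs below are made up for illustration), here is how that pattern separates entry links from the rest:

```python
import re

# Pattern from the rules above: must start with /wiki/ and contain no colon.
entry_link = re.compile("^(/wiki/)((?!:).)*$")

samples = [
    "/wiki/Kevin_Bacon",                              # entry link -> accepted
    "/wiki/Talk:Kevin_Bacon",                         # talk page (colon) -> rejected
    "/wiki/Category:1958_births",                     # category page (colon) -> rejected
    "//wikimediafoundation.org/wiki/Privacy_policy",  # not /wiki/-relative -> rejected
]

for href in samples:
    print(href, bool(entry_link.match(href)))
```

The (?!:) is a negative lookahead: each character the star consumes is first checked to ensure it is not a colon, so any href containing a colon fails to match.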
We can use these rules to revise the code slightly so that it retrieves only entry links:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])
If you run this code, you'll see a list of all the entry links that the Kevin Bacon entry on Wikipedia points to.
Of course, writing a program that finds all the entry links in this one static Wikipedia entry is interesting, but not very useful in practice. We need to turn this code into something more like the following:
• A function getLinks that takes a Wikipedia entry URL of the form /wiki/<entry name> as its parameter and returns a list, in the same form, of all the entry URLs linked from that page.
• A main routine that calls getLinks with a starting entry, randomly selects one entry link from the returned list, and calls getLinks again, repeating until we stop it deliberately or until the new page has no entry links, at which point the program stops.
The complete code is as follows:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed with the current time (newer Python versions require a numeric,
# string, or bytes seed, so a raw datetime object is not passed here).
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
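The loop's random-selection step is independent of the scraping itself. As a small illustration (with a hard-coded list of hrefs standing in for the tags getLinks returns), the same selection logic looks like this:

```python
import random

# Illustrative stand-ins for the href attributes getLinks would return.
links = ["/wiki/Apollo_13", "/wiki/Philadelphia", "/wiki/Primetime_Emmy_Award"]

random.seed(0)  # fix the seed so repeated runs pick the same "random" link
new_article = links[random.randint(0, len(links) - 1)]
print(new_article)
```

Note that random.choice(links) is an equivalent, slightly more idiomatic way to write the selection; the randint form mirrors the code above.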
Python learning: traversing a single domain with random link selection