Python Learning----Traversing a Single Domain with a Random Walk

Find the "Wikipedia six-degree separation theory" method. That is to say, we are going to implement from the Edgar · Edel's entry page (Https://en.wikipedia.org/wiki/Eric_Idle) starts with a minimum number of link clicks to find Kevin · Bacon's entry page (Https://en.wikipedia.org/wiki/Kevin_Bacon).

You should already know how to write Python code that fetches an arbitrary page from the Wikipedia site and extracts the links on that page:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
    bsObj = BeautifulSoup(html, "html.parser")
    # Find every anchor tag on the page and print its href attribute
    for link in bsObj.findAll("a"):
        if 'href' in link.attrs:
            print(link.attrs['href'])

If you look at the list of links it produces, you'll see all the entry links you'd expect: "Apollo 13", "Philadelphia", "Primetime Emmy Award", and so on. However, there are also some links we don't need:

wikimediafoundation.org/wiki/Privacy_policy

en.wikipedia.org/wiki/Wikipedia:Contact_us

In fact, every page on Wikipedia is filled with sidebar, header, and footer links, as well as links to category pages, talk pages, and other pages that are not entries:

/wiki/Category:Articles_with_unsourced_statements_from_April_2014

/wiki/Talk:Kevin_Bacon

I recently had a friend who was working on a Wikipedia-scraping project much like this one. He mentioned that, in order to decide whether a link inside Wikipedia points to an entry page, he had written a very large filter function of more than 100 lines of code. Unfortunately, he probably didn't take the time at the start of the project to compare "entry links" against "other links", or he might have found the trick. If you look closely at the links that point to entry pages (as opposed to other internal pages), you'll find that they all have three things in common:

• They are all inside the div whose id is bodyContent.

• The URLs do not contain colons.

• The URLs all begin with /wiki/.
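
Before adjusting the scraper, here is a minimal sketch showing how a single regular expression can enforce the second and third rules (the first rule is handled separately by restricting the search to the bodyContent div). The sample hrefs below are illustrative, not fetched from Wikipedia:

    import re

    # Rules 2 and 3 in one pattern: starts with /wiki/, followed by any
    # run of characters none of which is a colon, through to the end.
    entry_pattern = re.compile("^(/wiki/)((?!:).)*$")

    # Illustrative sample hrefs (assumed for demonstration)
    samples = [
        "/wiki/Kevin_Bacon",                              # entry link -> matches
        "/wiki/Talk:Kevin_Bacon",                         # talk page -> colon, rejected
        "/wiki/Category:Articles_with_unsourced_statements_from_April_2014",
        "//wikimediafoundation.org/wiki/Privacy_policy",  # wrong prefix, rejected
    ]

    for href in samples:
        print(href, "->", bool(entry_pattern.match(href)))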

We can use these rules to tweak the code a little so that it retrieves only entry links:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
    bsObj = BeautifulSoup(html, "html.parser")
    # Search only inside the bodyContent div, keeping hrefs that start
    # with /wiki/ and contain no colon
    for link in bsObj.find("div", {"id": "bodyContent"}).findAll(
            "a", href=re.compile("^(/wiki/)((?!:).)*$")):
        if 'href' in link.attrs:
            print(link.attrs['href'])

If you run this code, you'll see links to all the other entries that Wikipedia's Kevin Bacon entry points to.

Of course, writing a program that finds all the entry links in this one hard-coded Wikipedia entry is interesting, but of little practical use. We need the program to take the following form:

• A function getLinks that takes a Wikipedia entry URL of the form /wiki/<Entry_Name> as a parameter and returns a list of all the entry URL links on that page, in the same form.

• A main body that calls getLinks with a starting entry, randomly selects one entry link from the returned list, and calls getLinks again, repeating until we actively stop the program or until no entry links are found on the new page.

The complete code is as follows:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import datetime
    import random
    import re

    # Seed the random number generator with the current time
    random.seed(datetime.datetime.now().timestamp())

    def getLinks(articleUrl):
        html = urlopen("http://en.wikipedia.org" + articleUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        return bsObj.find("div", {"id": "bodyContent"}).findAll(
            "a", href=re.compile("^(/wiki/)((?!:).)*$"))

    links = getLinks("/wiki/Kevin_Bacon")
    while len(links) > 0:
        # Pick a random entry link on the current page and follow it
        newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
        print(newArticle)
        links = getLinks(newArticle)
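
The random walk above wanders from entry to entry indefinitely; it does not find the minimum number of clicks promised at the start. As a minimal sketch of that goal (not part of the original code; findShortestPath is a hypothetical helper built on the getLinks function defined above), a breadth-first search explores pages level by level, so the first path that reaches the target is a shortest one. Note that it issues one HTTP request per visited page, which is why the depth limit matters:

    from collections import deque

    def findShortestPath(start, target, maxDepth=3):
        # Breadth-first search over entry links: the queue holds whole
        # paths so the winning path can be returned directly.
        queue = deque([[start]])
        visited = {start}
        while queue:
            path = queue.popleft()
            if len(path) - 1 >= maxDepth:    # don't expand past maxDepth clicks
                break
            for link in getLinks(path[-1]):  # reuses getLinks from above
                href = link.attrs["href"]
                if href == target:
                    return path + [href]
                if href not in visited:
                    visited.add(href)
                    queue.append(path + [href])
        return None                          # no path within maxDepth clicks

    # Example usage (one request per visited page, so this can be slow):
    # print(findShortestPath("/wiki/Eric_Idle", "/wiki/Kevin_Bacon"))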
