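First, regular expressions
* matches the preceding item 0 or more times, e.g. a*b*
+ matches the preceding item at least once
[] matches any one of the characters inside the brackets
() groups a subexpression
{m,n} matches the preceding item between m and n times
[^] matches any one character that is not inside the brackets
| means "or"
. matches any single character (except a newline by default)
^ matches the start of the string, e.g. ^a matches strings that start with a
\ is the escape character
$ is the opposite of ^: it matches at the end of the string
(?!...) negative lookahead: matches only if the text that follows does not match ...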
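The operators above can be tried out with Python's standard re module. This is a minimal sketch; the helper name whole_match and the sample strings are made up for illustration.

```python
import re

def whole_match(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result of whole_match)
examples = [
    (r"a*b*", "aaabb", True),     # * : zero or more
    (r"a*b*", "", True),          # zero repetitions also match
    (r"a+b", "b", False),         # + : at least one "a" is required
    (r"[abc]x", "bx", True),      # [] : any one of a, b, c
    (r"(ab)+", "ababab", True),   # () : the group "ab" repeated
    (r"a{2,3}", "aa", True),      # {m,n} : between 2 and 3 times
    (r"a{2,3}", "aaaa", False),   # four is too many
    (r"[^0-9]+", "ab1", False),   # [^] : a digit is not allowed
    (r"cat|dog", "dog", True),    # | : either alternative
    (r"a.c", "a-c", True),        # . : any single character
    (r"a\.c", "a-c", False),      # \. : only a literal dot
]

for pattern, text, expected in examples:
    result = whole_match(pattern, text)
    print(f"{pattern!r:12} {text!r:10} -> {result}")
    assert result == expected

# Anchors and lookahead are easier to see with re.search
starts_with_a = re.search(r"^a", "abc") is not None       # "abc" starts with a
ends_with_a = re.search(r"a$", "abc") is not None         # "abc" does not end with a
a_not_before_b = re.search(r"a(?!b)", "ab") is not None   # the only a is followed by b
```

re.fullmatch is used for most rows so that each result depends only on the operator being shown, not on where in the string a partial match might start.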
Second, getting attributes
Get all the attributes of a tag:
myTag.attrs
Get the source location of an image (its src attribute):
myImgTag.attrs["src"]
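A small sketch of attribute access, assuming the bs4 package is installed. The HTML snippet and its values are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to demonstrate .attrs
html = '<div><img src="/images/logo.png" alt="Logo"></div>'
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img")

# .attrs is a dictionary of every attribute on the tag
print(img.attrs)

# Index it like any dictionary to read a single attribute
print(img.attrs["src"])

# .get() returns None instead of raising KeyError when the attribute is absent
print(img.get("width"))
```

Attribute names come back in lowercase from html.parser, so the key is "src" rather than "SRC".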
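A function that collects the article links on a Wikipedia page, followed by a random walk from link to link:
import datetime
import random
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only links that start with /wiki/ and contain no colon
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    # Pick a random link on the current page and follow it
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)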
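The link filter in the function above is the pattern ^(/wiki/)((?!:).)*$, which keeps paths that start with /wiki/ and contain no colon afterward. The sketch below checks that pattern against a few made-up href values without touching the network. The sample list is an assumption for illustration, not output from a real page.

```python
import re

# Starts with /wiki/, then any characters as long as none of them is a colon
article_link = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical href values of the kind found on a Wikipedia page
candidates = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_film_actors",  # namespace pages contain a colon
    "/wiki/File:Kevin_Bacon.jpg",
    "#cite_note-1",                         # in-page anchor
    "https://example.com/wiki/Page",        # external link
    "/wiki/Philadelphia",
]

kept = [href for href in candidates if article_link.search(href)]
print(kept)
```

The negative lookahead (?!:) is checked before every character, which is why a colon anywhere after /wiki/ rejects the link.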
Python Network data Acquisition II