Of course, a web crawler that does nothing but hop from one page to the next is fairly boring. To make crawlers useful, we need to do something with each page while we are on it. Let's look at how to build a scraper that collects the page title, the first paragraph of body text, and the link to the edit page (if one exists).
As always, the first step in deciding how to do this is to look at a few pages on the site and work out a collection pattern. By examining a handful of Wikipedia pages, both article pages and non-article pages such as the privacy policy page, the following rules emerge:
• All titles (on every page, whether an article page, an edit history page, or any other page) live under an h1 → span tag, and it is the only h1 tag on the page.
• As mentioned earlier, all body text lives inside the div#bodyContent tag. But if we want to get more specific and grab just the first paragraph of text, it is better to use div#mw-content-text → p (selecting the first paragraph tag only). This holds for every page except file pages (for example, https://en.wikipedia.org/wiki/File:Orbit_of_274301_Wikipedia.svg), which do not have a content text section.
• Edit links appear only on article pages. When present, the link sits inside the li#ca-edit tag, under li#ca-edit → span → a. All three rules are checked in the sketch below.
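Before wiring these rules into a crawler, it can help to verify them on a single article page. A minimal sketch, assuming BeautifulSoup 4 and the markup described above (the article URL used here is just an illustration, not one taken from this section):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch one ordinary article page and test each of the three rules
html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")

# Rule 1: every page has exactly one h1 title
print(bsObj.h1.get_text())

# Rule 2: the first paragraph of body text (missing on file pages)
print(bsObj.find(id="mw-content-text").find("p"))

# Rule 3: the edit link, present only on article pages
print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])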
Adjusting the previous crawling code, we can combine the two tasks into a single program that both crawls and collects (or at least prints) data:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")

    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks("")
This for loop is essentially the same as in the original crawling program, with the addition of a printed dashed line to separate the output of different pages.
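One practical note, not raised in the text above: getLinks calls itself for every new link it finds, so on a site as large as Wikipedia it will hit Python's default recursion limit (roughly 1000 frames) long before it runs out of pages. A hedged sketch of an iterative variant that keeps the same selectors but replaces the recursion with an explicit queue:

from urllib.request import urlopen
from bs4 import BeautifulSoup
from collections import deque
import re

pages = set()
queue = deque([""])          # start at the Wikipedia home page

while queue:
    pageUrl = queue.popleft()
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! Continuing anyway.")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs and link.attrs['href'] not in pages:
            newPage = link.attrs['href']
            print("----------------\n" + newPage)
            pages.add(newPage)
            queue.append(newPage)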
Because we can never be sure that every kind of data exists on every page, the print statements are ordered by how likely each piece of data is to appear, from most to least likely. The h1 title tag appears on every page (as far as we can tell, at least), so we try to get that data first. The body text appears on most pages (except file pages), so it is the second piece of data retrieved. The Edit button appears only on pages that already have both a title and body text, but not on all of those pages, so we print that data last.
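If you would rather print whatever a page does have instead of stopping at the first missing piece, one alternative to the single try/except above (my own sketch, not part of the original program) is to guard each field separately:

# Expects a BeautifulSoup object; prints only the pieces that exist on the page.
def printPageData(bsObj):
    # Title: present on every page, but check anyway
    title = bsObj.h1
    if title is not None:
        print(title.get_text())

    # Body text: missing on file pages
    content = bsObj.find(id="mw-content-text")
    if content is not None and content.find("p") is not None:
        print(content.find("p"))

    # Edit link: present only on article pages
    editTag = bsObj.find(id="ca-edit")
    if editTag is not None and editTag.find("a") is not None:
        print(editTag.find("a").attrs.get('href'))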