Python learning----Collecting the entire site


If a crawler does nothing but jump from one page to the next, it is not very useful. To make it effective, we need to do something with each page as we crawl it. Let's see how to build a crawler that collects the page title, the first paragraph of the body text, and the link to the edit page (if there is one).

As always, the first step in deciding how to do this is to look at a few pages on the site and work out a collection pattern. After looking at several Wikipedia pages, both entry pages and non-entry pages such as the privacy policy page, the following rules emerge:

• All titles (on all pages, whether an entry page, an edit-history page, or any other page) are in h1 → span tags, and there is only one h1 tag per page.

• As mentioned earlier, all body text lives in the div#bodyContent tag. If we only want the first paragraph of text, though, it is better to use div#mw-content-text → p (selecting only the first paragraph tag). This works on every page except file pages (for example, https://en.wikipedia.org/wiki/File:Orbit_of_274301_Wikipedia.svg), which have no content-text section at all.

• Edit links appear only on entry pages. When an edit link exists, it is found inside the li#ca-edit tag, at li#ca-edit → span → a, as the sketch below illustrates.
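Before assembling the full crawler, it may help to see how each rule translates into a BeautifulSoup lookup on a single page. The sketch below is not part of the original program; it uses the Kevin_Bacon entry page purely as an example and follows the ids and tag paths described in the rules above:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
    bsObj = BeautifulSoup(html, "html.parser")

    # Rule 1: the title lives in the page's only h1 tag
    print(bsObj.h1.get_text())

    # Rule 2: the first paragraph of body text is the first <p>
    # inside div#mw-content-text
    print(bsObj.find(id="mw-content-text").findAll("p")[0])

    # Rule 3: on entry pages the edit link sits at li#ca-edit → span → a
    print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])

On a file page or other non-entry page, one or more of these lookups would return None, which is exactly why the full program below wraps them in a try/except block.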

Adjusting the earlier code, we can build a combined crawling and data-collecting (or at least data-printing) program:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    pages = set()

    def getLinks(pageUrl):
        global pages
        html = urlopen("http://en.wikipedia.org" + pageUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        try:
            print(bsObj.h1.get_text())
            print(bsObj.find(id="mw-content-text").findAll("p")[0])
            print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
        except AttributeError:
            print("This page is missing some attributes! No worries though!")
        for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
            if 'href' in link.attrs:
                if link.attrs['href'] not in pages:
                    # We have encountered a new page
                    newPage = link.attrs['href']
                    print("----------------\n" + newPage)
                    pages.add(newPage)
                    getLinks(newPage)

    getLinks("")

The for loop at the end is basically the same as in the original crawling program (apart from printing a dashed line to separate the output for different pages).

Because there is no way to guarantee that every kind of data is present on every page, the print statements are ordered by how likely each piece of data is to appear, from most likely to least likely. The title (in the h1 tag) appears on every page, whichever page it is, so we try to fetch it first. The body content appears on most pages (except file pages), so it is fetched second. The Edit button appears only on pages where both a title and body content already exist, but not on all such pages, so we print that data last.
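One caveat, not discussed in the text above but worth keeping in mind: getLinks calls itself for every new link it encounters, and Python's default recursion limit is roughly 1,000 stack frames, so on a site as large as Wikipedia this crawler will eventually stop with a RecursionError. For short experiments the limit can be raised as sketched below, though rewriting the crawler around an explicit queue of pages to visit is the more robust fix:

    import sys

    # Raise the recursion cap before calling getLinks("").
    # This only postpones the problem; an iterative crawler with a
    # queue of pending URLs avoids deep recursion entirely.
    sys.setrecursionlimit(5000)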
