Python crawler: Scrapy's LinkExtractor

Background:

Usually, when we crawl a site, what we actually want is the content under certain tags. A site's home page often contains many items or links to detailed content, and if we only pull content out of one large tag, efficiency suffers. Most sites are built from fixed templates, showing different information to the user in the same layout, and that is exactly what makes LinkExtractor well suited to whole-site crawling. Why? Because with a series of parameters such as XPath, CSS and regular expressions, you can collect exactly the links you want across the whole site, rather than just the linked content under one fixed tag.

import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        link = LinkExtractor(restrict_xpaths='//ul[@class="cont_xiaoqu"]/li')
        links = link.extract_links(response)
        print(links)

links is a list of Link objects; each Link carries the attributes url, text, fragment and nofollow.

Let's iterate over this list.

for link in links:
    print(link)

Each Link contains the URL we want to extract, so how do we get at it?

Directly inside the for loop: link.url and link.text give us the URL and the link text we want.

for link in links:
    print(link.url, link.text)
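In a real spider the extracted URLs are normally turned into new requests instead of being printed. Here is a minimal sketch of how the loop above could yield follow-up requests from inside the same spider class; the parse_detail callback is a hypothetical name, not part of the original example:

    def parse(self, response):
        link = LinkExtractor(restrict_xpaths='//ul[@class="cont_xiaoqu"]/li')
        links = link.extract_links(response)
        for link_obj in links:
            # follow each extracted link and hand the page to parse_detail
            yield scrapy.Request(link_obj.url, callback=self.parse_detail)

    def parse_detail(self, response):
        # placeholder callback: just log which detail page was reached
        self.logger.info("visited %s", response.url)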

LinkExtractor is not limited to XPath-based extraction; it accepts quite a few parameters.

>allow: receives a regular expression or a list of regular expressions and extracts the links whose absolute URLs match. If the parameter is empty, all links are extracted by default.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        pattern = r'/gsschool/.+\.shtml'
        link = LinkExtractor(allow=pattern)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)

>deny: receives a regular expression or a list of regular expressions. The opposite of allow: links whose absolute URLs match the expression are excluded, and everything that does not match is extracted.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        pattern = r'/gsschool/.+\.shtml'
        link = LinkExtractor(deny=pattern)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
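The two parameters can also be combined in one extractor: a link is kept only if it matches allow and does not match deny. A short sketch for the parse method above; the /gsschool/about/ exclusion pattern is made up purely for illustration:

        # keep .shtml pages under /gsschool/, but drop anything under the
        # illustrative /gsschool/about/ path
        link = LinkExtractor(allow=r'/gsschool/.+\.shtml', deny=r'/gsschool/about/')
        links = link.extract_links(response)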

>allow_domains: receives a domain name or a list of domain names and only extracts links pointing to the specified domains.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        domain = ['gaosivip.com', 'gaosiedu.com']
        link = LinkExtractor(allow_domains=domain)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)

>deny_domains: the opposite of allow_domains; receives a domain name or a list of domains and extracts all links except those pointing to the denied domains.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        domain = ['gaosivip.com', 'gaosiedu.com']
        link = LinkExtractor(deny_domains=domain)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)

>restrict_xpaths: the parameter used in the example at the very beginning; it receives an XPath expression or a list of XPath expressions and only extracts links from the regions those expressions select.

>restrict_css: works the same way as restrict_xpaths but takes CSS selectors. Both parameters come up very often, so they are worth mastering; personally I prefer XPath.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        link = LinkExtractor(restrict_css='ul.cont_xiaoqu > li')
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)

>tags: receives a tag (string) or a list of tags and extracts links found inside the specified tags; defaults to tags=('a', 'area').

>attrs: receives an attribute (string) or a list of attributes and extracts links from the specified attributes; defaults to attrs=('href',). In the example below, the value of the href attribute of every a tag on the page is extracted.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        link = LinkExtractor(tags='a', attrs='href')
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
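Finally, for the whole-site crawling mentioned at the start, a LinkExtractor is most often plugged into a CrawlSpider rule rather than called by hand in parse. A minimal sketch using the same start URL; the spider name, the allow pattern and the parse_school callback are illustrative names, not part of the original article:

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GsschoolSpider(CrawlSpider):
    name = "gsschool"
    allowed_domains = ["gaosiedu.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    rules = (
        # follow every /gsschool/xxx.shtml link and hand the response to parse_school
        Rule(LinkExtractor(allow=r'/gsschool/.+\.shtml'), callback='parse_school', follow=True),
    )

    def parse_school(self, response):
        # placeholder callback: just report which page was reached
        self.logger.info("school page: %s", response.url)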
