Python crawler: Scrapy's LinkExtractor

Background:

Usually, when we crawl a site, what we actually want is the content under certain tags. A site's home page often contains many items or links to detailed content, and if we only pull content out of one large tag, efficiency suffers. Most sites are built from fixed templates, showing different information to the user in the same layout, and that is exactly what makes LinkExtractor well suited to whole-site crawling. Why? Because with a series of parameters such as XPath, CSS and regular expressions, you can collect exactly the links you want across the whole site, rather than just the linked content under one fixed tag.

import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        link = LinkExtractor(restrict_xpaths='//ul[@class="cont_xiaoqu"]/li')
        links = link.extract_links(response)
        print(links)

links is a list of Link objects; each Link carries the attributes url, text, fragment and nofollow.

Let's iterate over this list.

for link in links:
    print(link)

Each Link contains the URL we want to extract, so how do we get at it?

Directly inside the for loop: link.url and link.text give us the URL and the link text we want.

for link in links:
    print(link.url, link.text)
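In a real spider the extracted URLs are normally turned into new requests instead of being printed. Here is a minimal sketch of how the loop above could yield follow-up requests from inside the same spider class; the parse_detail callback is a hypothetical name, not part of the original example:

    def parse(self, response):
        link = LinkExtractor(restrict_xpaths='//ul[@class="cont_xiaoqu"]/li')
        links = link.extract_links(response)
        for link_obj in links:
            # follow each extracted link and hand the page to parse_detail
            yield scrapy.Request(link_obj.url, callback=self.parse_detail)

    def parse_detail(self, response):
        # placeholder callback: just log which detail page was reached
        self.logger.info("visited %s", response.url)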

LinkExtractor is not limited to XPath-based extraction; it accepts quite a few parameters.

>allow: receives a regular expression or a list of regular expressions and extracts the links whose absolute URLs match. If the parameter is empty, all links are extracted by default.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        pattern = r'/gsschool/.+\.shtml'
        link = LinkExtractor(allow=pattern)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)

>deny: receives a regular expression or a list of regular expressions. The opposite of allow: links whose absolute URLs match the expression are excluded, and everything that does not match is extracted.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        pattern = r'/gsschool/.+\.shtml'
        link = LinkExtractor(deny=pattern)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
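The two parameters can also be combined in one extractor: a link is kept only if it matches allow and does not match deny. A short sketch for the parse method above; the /gsschool/about/ exclusion pattern is made up purely for illustration:

        # keep .shtml pages under /gsschool/, but drop anything under the
        # illustrative /gsschool/about/ path
        link = LinkExtractor(allow=r'/gsschool/.+\.shtml', deny=r'/gsschool/about/')
        links = link.extract_links(response)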

>allow_domains: receives a domain name or a list of domain names and only extracts links pointing to the specified domains.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        domain = ['gaosivip.com', 'gaosiedu.com']
        link = LinkExtractor(allow_domains=domain)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)

>deny_domains: the opposite of allow_domains; receives a domain name or a list of domains and extracts all links except those pointing to the denied domains.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        domain = ['gaosivip.com', 'gaosiedu.com']
        link = LinkExtractor(deny_domains=domain)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)

>restrict_xpaths: the parameter used in the example at the very beginning; it receives an XPath expression or a list of XPath expressions and only extracts links from the regions those expressions select.

>restrict_css: works the same way as restrict_xpaths but takes CSS selectors. Both parameters come up very often, so they are worth mastering; personally I prefer XPath.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        link = LinkExtractor(restrict_css='ul.cont_xiaoqu > li')
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)

>tags: receives a tag (string) or a list of tags and extracts links found inside the specified tags; defaults to tags=('a', 'area').

>attrs: receives an attribute (string) or a list of attributes and extracts links from the specified attributes; defaults to attrs=('href',). In the example below, the value of the href attribute of every a tag on the page is extracted.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        link = LinkExtractor(tags='a', attrs='href')
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
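Finally, for the whole-site crawling mentioned at the start, a LinkExtractor is most often plugged into a CrawlSpider rule rather than called by hand in parse. A minimal sketch using the same start URL; the spider name, the allow pattern and the parse_school callback are illustrative names, not part of the original article:

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GsschoolSpider(CrawlSpider):
    name = "gsschool"
    allowed_domains = ["gaosiedu.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    rules = (
        # follow every /gsschool/xxx.shtml link and hand the response to parse_school
        Rule(LinkExtractor(allow=r'/gsschool/.+\.shtml'), callback='parse_school', follow=True),
    )

    def parse_school(self, response):
        # placeholder callback: just report which page was reached
        self.logger.info("school page: %s", response.url)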
