Background:
When we crawl a site, we usually want the content behind every category tag. A site's home page links to many item and detail pages, and extracting only the links under one large tag is inefficient. Most sites are built from fixed templates that present information to the user in a uniform way, which makes LinkExtractor a great fit for whole-site crawling. Why? Because by setting parameters such as XPath or CSS expressions, you collect every link you want across the entire site, not just the links under one fixed tag.
```python
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        link = LinkExtractor(restrict_xpaths='//ul[@class="cont_xiaoqu"]/li')
        links = link.extract_links(response)
        print(links)
```
extract_links() returns a list of Link objects, so let's iterate over it.
```python
for link in links:
    print(link)
```
Each Link object contains a URL we want to extract, so how do we get at it?
Directly inside the for loop, link.url and link.text give us the URL and anchor text we want.
```python
for link in links:
    print(link.url, link.text)
```
Don't worry, LinkExtractor offers more than just XPath extraction; it takes many other parameters.
>allow: Receives a regular expression or a list of regular expressions and extracts the links whose absolute URLs match the expression. If this parameter is empty, everything is extracted by default.
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        pattern = r'/gsschool/.+\.shtml'
        link = LinkExtractor(allow=pattern)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
```
>deny: Receives a regular expression or a list of regular expressions; the opposite of allow — links whose absolute URLs match the expression are excluded. In other words, anything that matches the regular expression is not extracted.
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        pattern = r'/gsschool/.+\.shtml'
        link = LinkExtractor(deny=pattern)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
```
>allow_domains: Receives a domain name or a list of domain names and extracts only links pointing to the specified domains.
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        domain = ['gaosivip.com', 'gaosiedu.com']
        link = LinkExtractor(allow_domains=domain)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
```
>deny_domains: The opposite of allow_domains — receives a domain name or a list of domain names to reject, and extracts all matching URLs except those pointing to the denied domains.
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        domain = ['gaosivip.com', 'gaosiedu.com']
        link = LinkExtractor(deny_domains=domain)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
```
>restrict_xpaths: This is what the example at the very beginning uses — it receives an XPath expression or a list of XPath expressions and extracts the links inside the regions those expressions select.
>restrict_css: This parameter is used as often as restrict_xpaths, so make sure you master both; personally, I prefer XPath.
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        link = LinkExtractor(restrict_css='ul.cont_xiaoqu > li')
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
```
>tags: Receives a tag (string) or a list of tags and extracts links inside the specified tags; defaults to tags=('a', 'area').
>attrs: Receives an attribute (string) or a list of attributes and extracts links from the specified attributes; defaults to attrs=('href',). For example, with the settings shown below, the value of the href attribute of every a tag on the page is extracted.
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        link = LinkExtractor(tags='a', attrs='href')
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
```