The URL filter re-filters the URLs that have been extracted. The filtering criteria differ from application to application: a general-purpose search engine such as Baidu or Google typically does little or no filtering, while a vertical search engine or targeted crawler may keep only URLs that satisfy certain conditions, for example discarding image URLs, or keeping only URLs from a specific site. The URL filter is therefore a module that is tightly coupled to the application.
using System;
using System.Collections.Generic;
using Crawler.Common;

namespace Crawler.Processing
{
    public class UrlFilter
    {
        // Removes every URI that matches at least one of the given regex patterns.
        public static List<Uri> RemoveByRegex(List<Uri> uris, params string[] regexs)
        {
            var uriList = new List<Uri>(uris);
            for (var i = 0; i < uriList.Count; i++)
            {
                foreach (var r in regexs)
                {
                    if (!RegexHelper.IsMatch(uriList[i].ToString(), r))
                        continue;
                    // Bug fix: the original removed from the input list (uris)
                    // while iterating over the copy (uriList).
                    uriList.RemoveAt(i);
                    i--;
                    break; // this URI is gone; stop checking the remaining patterns
                }
            }
            return uriList;
        }

        // Keeps only the URIs that match at least one of the given regex patterns.
        public static List<Uri> SelectByRegex(List<Uri> uris, params string[] regexs)
        {
            var uriList = new List<Uri>();
            foreach (var t in uris)
                foreach (var r in regexs)
                    if (RegexHelper.IsMatch(t.ToString(), r))
                        if (!uriList.Contains(t))
                            uriList.Add(t);
            return uriList;
        }
    }
}
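To see the two methods in action, here is a minimal self-contained sketch of a targeted-crawl scenario: keep only URLs from one site, then drop image URLs. Since the article's `RegexHelper` class is not shown, this sketch substitutes the standard `Regex.IsMatch` for it; the URLs and patterns are made-up illustrations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class UrlFilterDemo
{
    // Same idea as SelectByRegex above, using Regex.IsMatch in place of RegexHelper.
    public static List<Uri> SelectByRegex(List<Uri> uris, params string[] regexs) =>
        uris.Where(u => regexs.Any(r => Regex.IsMatch(u.ToString(), r)))
            .Distinct()
            .ToList();

    // Same idea as RemoveByRegex above: drop URIs matching any pattern.
    public static List<Uri> RemoveByRegex(List<Uri> uris, params string[] regexs) =>
        uris.Where(u => !regexs.Any(r => Regex.IsMatch(u.ToString(), r)))
            .ToList();

    public static void Main()
    {
        var uris = new List<Uri>
        {
            new Uri("http://example.com/index.html"),
            new Uri("http://example.com/logo.png"),
            new Uri("http://other.com/page.html"),
        };

        // Vertical crawl: only example.com pages, and no image URLs.
        var kept = RemoveByRegex(
            SelectByRegex(uris, @"^http://example\.com/"),
            @"\.(png|jpg|gif)$");

        foreach (var u in kept)
            Console.WriteLine(u); // prints http://example.com/index.html
    }
}
```

Chaining the two filters this way mirrors how an application would compose site-selection and type-exclusion rules, which is exactly why the article calls the URL filter an application-specific module.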
[Crawler Learning Notes] URL Filter Module UrlFilter