[Crawler learning notes] URL filtering module: UrlFilter
UrlFilter filters the extracted URLs a second time. The filtering criteria vary by application: a general-purpose search engine such as Baidu or Google typically filters very little, while a vertical-search or focused crawler may keep only URLs that satisfy certain conditions, for example excluding image URLs or restricting the crawl to a specific website. UrlFilter is therefore a module tightly coupled to the application.
using System;
using System.Collections.Generic;
using Crawler.Common;

namespace Crawler.Processing
{
    public class UrlFilter
    {
        // Blacklist filter: remove every URL that matches any of the given regex patterns.
        public static List<Uri> RemoveByRegex(List<Uri> uris, params string[] regexs)
        {
            var uriList = new List<Uri>(uris);
            for (var i = 0; i < uriList.Count; i++)
            {
                foreach (var r in regexs)
                {
                    if (!RegexHelper.IsMatch(uriList[i].ToString(), r))
                        continue;
                    // Remove from the copy we return (not the caller's list),
                    // step the index back, and stop checking the removed entry.
                    uriList.RemoveAt(i);
                    i--;
                    break;
                }
            }
            return uriList;
        }

        // Whitelist filter: keep only URLs that match at least one of the given regex patterns.
        public static List<Uri> SelectByRegex(List<Uri> uris, params string[] regexs)
        {
            var uriList = new List<Uri>();
            foreach (var t in uris)
                foreach (var r in regexs)
                    if (RegexHelper.IsMatch(t.ToString(), r))
                        if (!uriList.Contains(t))
                            uriList.Add(t);
            return uriList;
        }
    }
}
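A minimal sketch of the two filters in use. The `RegexHelper` class comes from `Crawler.Common` and is not shown in these notes, so a stand-in that simply wraps `System.Text.RegularExpressions.Regex.IsMatch` is assumed here; the sample URLs and patterns are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace Crawler.Common
{
    // Stand-in for the real RegexHelper (assumption: a thin wrapper
    // over Regex.IsMatch; the actual helper is not shown in these notes).
    public static class RegexHelper
    {
        public static bool IsMatch(string input, string pattern)
            => Regex.IsMatch(input, pattern);
    }
}

namespace Crawler.Processing
{
    using Crawler.Common;

    public class UrlFilter
    {
        // Blacklist: drop URLs matching any pattern (returns a filtered copy).
        public static List<Uri> RemoveByRegex(List<Uri> uris, params string[] regexs)
        {
            var uriList = new List<Uri>(uris);
            for (var i = 0; i < uriList.Count; i++)
                foreach (var r in regexs)
                {
                    if (!RegexHelper.IsMatch(uriList[i].ToString(), r))
                        continue;
                    uriList.RemoveAt(i);
                    i--;
                    break;
                }
            return uriList;
        }

        // Whitelist: keep URLs matching at least one pattern, without duplicates.
        public static List<Uri> SelectByRegex(List<Uri> uris, params string[] regexs)
        {
            var uriList = new List<Uri>();
            foreach (var t in uris)
                foreach (var r in regexs)
                    if (RegexHelper.IsMatch(t.ToString(), r))
                        if (!uriList.Contains(t))
                            uriList.Add(t);
            return uriList;
        }
    }

    public static class Demo
    {
        public static void Main()
        {
            var uris = new List<Uri>
            {
                new Uri("http://example.com/page.html"),
                new Uri("http://example.com/logo.png"),
                new Uri("http://other.com/photo.jpg"),
            };

            // Blacklist style: drop image URLs, as a vertical crawler might.
            var noImages = UrlFilter.RemoveByRegex(uris, @"\.(png|jpg|gif)$");
            Console.WriteLine(noImages.Count);   // 1: only page.html survives

            // Whitelist style: keep only URLs on one site.
            var siteOnly = UrlFilter.SelectByRegex(uris, @"^http://example\.com/");
            Console.WriteLine(siteOnly.Count);   // 2: page.html and logo.png
        }
    }
}
```

Note that `RemoveByRegex` works on a copy of the input list, so the caller's original list of candidate URLs is left untouched and can be fed through several different filters.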