For example, when I crawled the Students Online site, certain notices were not crawled, such as the "Cofco Blessing" fund application notice. Analysis showed that the original notice links were being filtered out. The following is an annotated walk-through of the URL filter configuration file, regex-urlfilter.txt; if you later need to change the filtering behavior, you can modify this file to suit your own situation:
Notes: in the configuration file, a line beginning with "#" is a comment; a URL matching a regular expression that begins with "-" is filtered out, while a URL matching a regular expression that begins with "+" is kept. In the regular expressions, "^" matches the beginning of the string, "$" matches the end, and "[]" denotes a character set. The extra explanatory comment under each rule is one I added.
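To make the first-match semantics concrete, here is a minimal sketch (in Python, not Nutch's actual Java implementation) of how a regex-urlfilter.txt style filter decides whether to keep a URL; the rule list and URLs are illustrative assumptions:

```python
import re

def url_passes(url, rules):
    """Return True if the URL is kept. Each rule is a ('+' or '-', regex)
    pair; the FIRST rule whose regex matches the URL decides the outcome.
    If no rule matches, the URL is ignored, as in Nutch."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == '+'
    return False  # no pattern matched: ignore the URL

# Illustrative rules in the same spirit as regex-urlfilter.txt
rules = [
    ('-', r'^(file|ftp|mailto):'),        # drop non-HTTP protocol links
    ('-', r'\.(gif|jpg|png|css|zip)$'),   # drop image/archive suffixes
    ('+', r'.'),                          # accept anything else
]

print(url_passes('ftp://example.com/a', rules))           # False
print(url_passes('http://example.com/page.html', rules))  # True
```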
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
# (filter out file:, ftp:, and other links that are not HTML/HTTP protocol links)
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# (filter out links to images and similar file formats)
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]  the original rule filtered out links containing these special
# characters; because I wanted to crawl more links, I relaxed the rule so
# that links containing ? and = are no longer filtered out
-[*!@]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# (filter out some specially formatted links)
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
# (accept all remaining links; you can modify this line to accept only
# links of the type you specify)
+.
Explanation of the reason: the announcement link to be crawled is http://www.online.sdu.edu.cn/news/article.php?pid=636514943, which contains the ? and = characters, so it was filtered out by the special-character rule. By modifying the regex-urlfilter.txt configuration file as shown above, links to such announcements can finally be crawled.
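The effect of the change can be checked directly. The sketch below uses Python's `re` module (not Nutch itself) to show that the announcement URL matches the default special-character rule `-[?*!@=]` and is therefore dropped, but does not match the relaxed rule `-[*!@]`:

```python
import re

url = 'http://www.online.sdu.edu.cn/news/article.php?pid=636514943'

default_rule = r'[?*!@=]'   # default Nutch rule: rejects URLs containing ? * ! @ =
modified_rule = r'[*!@]'    # modified rule: ? and = are now allowed through

print(bool(re.search(default_rule, url)))   # True  -> URL would be filtered out
print(bool(re.search(modified_rule, url)))  # False -> URL is kept
```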
Nutch: how to modify regex-urlfilter.txt to crawl eligible links