For example, when I crawled the Students Online site, certain notices were not crawled, such as the "Cofco Blessing" fund application notice. Analysis showed that the original notice links were being filtered out. The following is an annotated walk-through of the URL filter configuration file, regex-urlfilter.txt; if you later need to change the filtering behavior, you can modify this file to suit your own situation:
Notes: in the configuration file, a line beginning with "#" is a comment; a URL matching a regular expression that begins with "-" is filtered out, while a URL matching a regular expression that begins with "+" is kept. In the regular expressions, "^" matches the beginning of the string, "$" matches the end, and "[]" denotes a character set. The extra explanatory comment under each rule is one I added.
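To make the first-match semantics concrete, here is a minimal sketch (in Python, not Nutch's actual Java implementation) of how a regex-urlfilter.txt style filter decides whether to keep a URL; the rule list and URLs are illustrative assumptions:

```python
import re

def url_passes(url, rules):
    """Return True if the URL is kept. Each rule is a ('+' or '-', regex)
    pair; the FIRST rule whose regex matches the URL decides the outcome.
    If no rule matches, the URL is ignored, as in Nutch."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == '+'
    return False  # no pattern matched: ignore the URL

# Illustrative rules in the same spirit as regex-urlfilter.txt
rules = [
    ('-', r'^(file|ftp|mailto):'),        # drop non-HTTP protocol links
    ('-', r'\.(gif|jpg|png|css|zip)$'),   # drop image/archive suffixes
    ('+', r'.'),                          # accept anything else
]

print(url_passes('ftp://example.com/a', rules))           # False
print(url_passes('http://example.com/page.html', rules))  # True
```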
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
# (filter out file:, ftp:, and other links that are not HTML/HTTP protocol links)
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# (filter out links to images and similar file formats)
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]  the original rule filtered out links containing these special
# characters; because I wanted to crawl more links, I relaxed the rule so
# that links containing ? and = are no longer filtered out
-[*!@]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# (filter out some specially formatted links)
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
# (accept all remaining links; you can modify this line to accept only
# links of the type you specify)
+.
Explanation of the reason: the announcement link to be crawled is http://www.online.sdu.edu.cn/news/article.php?pid=636514943, which contains the ? and = characters, so it was filtered out by the special-character rule. By modifying the regex-urlfilter.txt configuration file as shown above, links to such announcements can finally be crawled.
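The effect of the change can be checked directly. The sketch below uses Python's `re` module (not Nutch itself) to show that the announcement URL matches the default special-character rule `-[?*!@=]` and is therefore dropped, but does not match the relaxed rule `-[*!@]`:

```python
import re

url = 'http://www.online.sdu.edu.cn/news/article.php?pid=636514943'

default_rule = r'[?*!@=]'   # default Nutch rule: rejects URLs containing ? * ! @ =
modified_rule = r'[*!@]'    # modified rule: ? and = are now allowed through

print(bool(re.search(default_rule, url)))   # True  -> URL would be filtered out
print(bool(re.search(modified_rule, url)))  # False -> URL is kept
```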
Nutch: how to modify regex-urlfilter.txt to crawl eligible links