Nutch: How to modify regex-urlfilter.txt to crawl the links you need

Source: Internet
Author: User


For example, when I crawled the Students Online site, I found that some specific notices were not being crawled, such as the "Cofco Blessing" Fund application notice. Analysis showed that the links to these notices were being filtered out. Below is an analysis of the URL filter configuration file, regex-urlfilter.txt. If you need to modify the file later, adapt it to your own situation:

Description: In the configuration file, lines that begin with "#" are comments. URLs that match a regular expression prefixed with "-" are filtered out, and URLs that match one prefixed with "+" are kept. In a regular expression, "^" matches the beginning of the string, "$" matches the end of the string, and "[]" denotes a character set. The second comment line above each rule (originally written in Chinese) is my own addition.

  # Licensed to the Apache Software Foundation (ASF) under one or more
  # contributor license agreements.  See the NOTICE file distributed with
  # this work for additional information regarding copyright ownership.
  # The ASF licenses this file to You under the Apache License, Version 2.0
  # (the "License"); you may not use this file except in compliance with
  # the License.  You may obtain a copy of the License at
  #
  #     http://www.apache.org/licenses/LICENSE-2.0
  #
  # Unless required by applicable law or agreed to in writing, software
  # distributed under the License is distributed on an "AS IS" BASIS,
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.

  # The default url filter.
  # Better for whole-internet crawling.

  # Each non-comment, non-blank line contains a regular expression
  # prefixed by '+' or '-'.  The first matching pattern in the file
  # determines whether a URL is included or ignored.  If no pattern
  # matches, the URL is ignored.

  # skip file: ftp: and mailto: urls
  # Filter out file:, ftp:, mailto: and other links that are not web pages
  -^(file|ftp|mailto):

  # skip image and other suffixes we can't yet parse
  # Filter out links to images and similar file formats
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

  # skip URLs containing certain characters as probable queries, etc.
  # The original rule, -[?*!@=], filters out links containing these special characters. To crawl more links, the rule is changed so that links containing ? and = are no longer filtered out
  -[*!@]

  # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
  # Filter out links with a repeating path segment
  -.*(/[^/]+)/[^/]+\1/[^/]+\1/

  # accept anything else
  # Accept all remaining links. You can change this rule to accept only the kinds of links you specify
  +.
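The first-match behaviour described in the file's comments can be sketched in a few lines of Python. This is only an illustration of the rule semantics, not Nutch's own implementation: the helper names `parse_rules` and `is_accepted` are hypothetical, the sketch assumes patterns are matched anywhere in the URL (as `re.search` does), and the host-restricted `+` rule is an assumed example of accepting only the links you care about.

```python
import re

def parse_rules(text):
    """Return a list of (accept, compiled_pattern) tuples in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def is_accepted(rules, url):
    """The first matching rule decides; if nothing matches, the URL is ignored."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# Hypothetical rule set: the final rule accepts a single host instead of "+."
EXAMPLE_RULES = r"""
# skip non-web schemes
-^(file|ftp|mailto):
# skip URLs containing *, ! or @
-[*!@]
# accept only pages on one host (assumed restriction, for illustration)
+^https?://www\.online\.sdu\.edu\.cn/
"""

rules = parse_rules(EXAMPLE_RULES)
print(is_accepted(rules, "http://www.online.sdu.edu.cn/news/article.php?pid=636514943"))  # True
print(is_accepted(rules, "ftp://www.online.sdu.edu.cn/file.txt"))  # False
print(is_accepted(rules, "http://example.com/"))  # False
```

Because rules are evaluated top to bottom, a narrow `+` rule only helps if no earlier `-` rule has already matched the URL.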

Explanation: The notice link to be crawled is http://www.online.sdu.edu.cn/news/article.php?pid=636514943. It contains the characters "?" and "=", so the rule that filters out special characters rejected it. After regex-urlfilter.txt is modified as shown above, links to such notices can be crawled.
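To check this reasoning outside of Nutch, the following sketch reports which rule is the first to match the notice URL under the original and the modified character rule. It assumes unanchored matching (Python's `re.search`) and reduces each rule set to the two relevant lines; `first_match` is a hypothetical helper, not part of Nutch.

```python
import re

URL = "http://www.online.sdu.edu.cn/news/article.php?pid=636514943"

def first_match(rules, url):
    """Return the first rule (sign + pattern) that matches the URL, or None."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign + pattern
    return None

# Only the special-character rule and the final catch-all are modelled here.
DEFAULT_RULES = [("-", r"[?*!@=]"), ("+", r".")]
MODIFIED_RULES = [("-", r"[*!@]"), ("+", r".")]

print(first_match(DEFAULT_RULES, URL))   # -[?*!@=]  (URL rejected)
print(first_match(MODIFIED_RULES, URL))  # +.        (URL accepted)
```

Note that allowing "?" and "=" admits every query-string URL on the site, so the crawl may grow considerably.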
