"Turn" 44 Java web crawler open source software

Source: Internet
Author: User

Original address: http://www.oschina.net/project/lang/19?tag=64&sort=time

  • Minimalist web crawler component WebFetch

    WebFetch is a dependency-free, minimalist web page crawling component: a micro crawler that can run on mobile devices. WebFetch aims to: require no third-party jar dependencies, keep memory usage low, make good use of the CPU, crawl the network quickly, offer a simple and straightforward API, run stably on Android devices, and stay small and flexible enough to integrate easily. ... More WebFetch information

  • Open source crawler framework GuozhongCrawler

    GuozhongCrawler is an open source crawler framework that requires no configuration and is convenient for secondary development. It provides a simple and flexible API; only a small amount of code is needed to implement a crawler. Its design was inspired by a survey of crawler frameworks at home and abroad. It is fully modular and covers the entire crawler lifecycle (link extraction, page download, content extraction, ... More GuozhongCrawler information

  • Web crawler Kamike.collect

    Another Simple Crawler: yet another web crawler, with support for crawling through a proxy server (e.g., to get across a firewall). 1. Data is stored in MySQL. 2. Before use, edit the database connection settings in WEB-INF/config.ini, mainly the database name, user name, and password. 3. Then visit http://127.0.0.1/fetch/install to create the database tables automatically. ... More Kamike.collect information

  • Web-based crawler Spider-web

    Spider-web is a web-based crawler configured via XML. It supports crawling most pages, and supports saving, downloading, etc. of the crawled content. The configuration file format is: <?xml version="1.0" encoding="UTF-8"?> <content> <url type="simple"> <!-- simple/complex --> <url_head>http://www.oschina .... More Spider-web information

  • Ugly Cow Mini Collector

    The Ugly Cow Mini Collector is a professional network data collection/information mining tool developed with Java Swing. Through flexible configuration it can quickly and easily crawl structured text, pictures, files, and other resources from web pages, which can then be edited and filtered before being published to a website. Architecture: the system is based on Swing + Spring 3.2.4 + MyBatis 3 ... More Ugly Cow Mini Collector information

  • Java crawler WebCollector

    Crawler profile: WebCollector is a Java crawler framework (kernel) that needs no configuration and is convenient for secondary development. It provides a streamlined API with which a powerful crawler can be written in just a small amount of code. Crawler kernel: WebCollector is committed to maintaining a stable, extensible crawler core on which developers can build flexibly. The kernel has a very strong ... More WebCollector information
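    For a sense of the API, here is a rough sketch based on the WebCollector 2.x examples; the package paths, method names, and demo URLs are assumptions that vary across releases, so treat this as illustrative rather than the project's canonical code:

      import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
      import cn.edu.hfut.dmic.webcollector.model.Page;
      import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;

      // Minimal breadth-first crawler: follow matching links, print page titles.
      public class NewsCrawler extends BreadthCrawler {

          public NewsCrawler(String crawlPath, boolean autoParse) {
              super(crawlPath, autoParse);
              addSeed("http://news.example.com/");          // hypothetical seed URL
              addRegex("http://news.example.com/show_.*");  // which links to follow
          }

          @Override
          public void visit(Page page, CrawlDatums next) {
              // Called once per fetched page.
              System.out.println(page.getUrl() + " : " + page.select("title").text());
          }

          public static void main(String[] args) throws Exception {
              new NewsCrawler("crawl_data", true).start(3); // crawl 3 levels deep
          }
      }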

  • Web data extraction client Webstraktor

    Webstraktor is a programmable WWW data extraction client that provides a scripting language for collecting, extracting, and storing data, including images, from the Web. The scripting language uses regular expression and XPath syntax. The standard output format is XML, and ASCII, UTF-8, and ISO-8859-1 encodings are supported. Logging and tracing facilities are provided. ... More Webstraktor information

  • Network data crawling framework Tinyspider

    Tinyspider is a network data crawling framework based on Tiny HtmlParser. Maven coordinates: <dependency> <groupId>org.tinygroup</groupId> <artifactId>org.tinygroup.spider</artifactId> <version>0.1.0-SNAPSHOT</version> </dependency> Web crawlers of this kind are generally used for full-text retrieval ... More Tinyspider information

  • Scripting language CrawlScript

    CrawlScript is a scripting language for web crawlers on the Java platform. A web crawler is a program that automatically obtains web page information. There are many web crawler libraries for Java, C++, and other languages, but developing on top of them is cumbersome: a lot of code is required to complete even a simple operation. To address this problem we developed the CrawlScript scripting language, which ... More CrawlScript information

  • AJAX page crawling and parsing plugin based on Apache Nutch and HtmlUnit extensions: Nutch-htmlunit

    Nutch-htmlunit plugin project introduction: built on Apache Nutch 1.8 and the HtmlUnit component, it implements full page content fetching and parsing for AJAX-loaded pages. In the project's own words: "According to the implementation of Apache Nutch 1.8, we can't get the dynamic HTML information from fetch pages including AJ..." A plain-HtmlUnit fetch sketch follows the update note below. ... More Nutch-htmlunit information

    Last updated: Nutch-htmlunit 1.8 released: AJAX page crawling and parsing plugin based on Apache Nutch and HtmlUnit extensions, posted 10 months ago
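    For context, fetching a JavaScript-rendered page with plain HtmlUnit looks roughly like this. This is a minimal sketch against HtmlUnit's public API with a placeholder URL, not the plugin's own code; try-with-resources on WebClient assumes a recent HtmlUnit release:

      import com.gargoylesoftware.htmlunit.WebClient;
      import com.gargoylesoftware.htmlunit.html.HtmlPage;

      public class AjaxFetchSketch {
          public static void main(String[] args) throws Exception {
              try (WebClient webClient = new WebClient()) {
                  webClient.getOptions().setJavaScriptEnabled(true);    // execute page scripts
                  webClient.getOptions().setThrowExceptionOnScriptError(false);
                  HtmlPage page = webClient.getPage("http://example.com/ajax-page");
                  webClient.waitForBackgroundJavaScript(5000);          // let AJAX calls settle
                  System.out.println(page.asXml());                     // DOM after JS ran
              }
          }
      }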

  • Web crawler Goodcrawler

    Goodcrawler (GC) is a web crawler for vertical domains, and at the same time an out-of-the-box search engine. GC is based on HttpClient, HtmlUnit, Jsoup, and ElasticSearch. GC features: 1. templates with DSL characteristics; 2. distributed and extensible; 3. thanks to HtmlUnit, better JavaScript support; ... More Goodcrawler information
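    As a point of reference for one of those building blocks, Jsoup fetches and parses a page like this. This is a minimal standalone sketch against Jsoup's public API, not GC's own template DSL:

      import org.jsoup.Jsoup;
      import org.jsoup.nodes.Document;
      import org.jsoup.nodes.Element;

      public class JsoupSketch {
          public static void main(String[] args) throws Exception {
              // Fetch and parse a page, then list every link target on it.
              Document doc = Jsoup.connect("http://www.oschina.net/").get();
              System.out.println("Title: " + doc.title());
              for (Element link : doc.select("a[href]")) {
                  System.out.println(link.attr("abs:href"));
              }
          }
      }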

  • Vertical crawler WebMagic

    WebMagic is a crawler framework that needs no configuration and is convenient for secondary development. It provides a simple and flexible API; a crawler can be implemented in just a small amount of code. Here is a snippet that crawls OSChina blogs (completed below the update note): Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*/blog/*")) .... More WebMagic information

    Last updated: WebMagic 0.5.2 released, Java crawler framework, posted 1 year ago
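    Completing that snippet along the lines of the WebMagic 0.5.x API; the thread count and the .run() call are assumptions based on the project's documentation, not part of the original snippet:

      import us.codecraft.webmagic.Spider;
      import us.codecraft.webmagic.processor.SimplePageProcessor;

      public class OschinaBlogCrawler {
          public static void main(String[] args) {
              // Start from the seed URL and follow links matching the wildcard pattern.
              Spider.create(new SimplePageProcessor(
                              "http://my.oschina.net/",
                              "http://my.oschina.net/*/blog/*"))
                    .thread(5)   // five worker threads
                    .run();      // block until the crawl finishes
          }
      }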

  • Retrieval crawler framework Heydr

    Heydr is a lightweight, open source, multi-threaded vertical retrieval crawler framework based on Java, released under the GNU GPL v3 license. Users can build their own vertical resource crawlers with Heydr to prepare data for vertical search engines. More Heydr information

  • Opm-server-mirror

    Code update 2009-11-25: added anti-crawler functionality; direct web access to the server now redirects to Google. Usage: download index.zip, unzip it to get index.php, upload index.php to an overseas server that supports PHP and cURL, then open http://www.your_website.com/your_folder_if_any/; if the page redirects to Goo ... More Opm-server-mirror information

  • Java web spider/crawler Spiderman

    Spiderman: yet another Java web spider/crawler. Spiderman is a web spider built on a microkernel + plugin architecture. Its goal is to capture complex target web pages and, in a simple way, parse the information in them into the business data one needs. Key features: * flexible and scalable microkernel + plugin architecture; Spiderman provides up to ... More Spiderman information

  • Web search and crawler leopdo

    Web search and crawler written in Java, including full-text and categorized vertical search, as well as a word segmenter. More leopdo information

  • OWASP AJAX Crawling Tool

    OWASP AJAX Crawling Tool (FuzzOps-NG): an AJAX crawler produced by OWASP, written in Java and open source. More OWASP AJAX Crawling Tool information

  • AJAX crawling and testing tool Crawljax

    Crawljax: written in Java, open source. Crawljax is a Java tool for automatically crawling and testing modern AJAX web applications. More Crawljax information

  • Common Crawl

    The CommonCrawl source library provides a custom InputFormat implementation for Hadoop. Common Crawl supplies a sample program, BasicArcFileReaderSample.java (located in org.commoncrawl.samples), that shows how to configure the InputFormat. ... More Common Crawl information

  • Data collection system Chukwa

    What is Chukwa? Simply put, it is a data collection system that gathers all kinds of data into files suitable for Hadoop to process, ready for Hadoop to run various MapReduce jobs over. Chukwa itself also provides a number of built-in features that help with collecting and organizing data. To display them more simply and intuitively ... More Chukwa information

  • Simple HTTP crawler Httpbot

    Httpbot is a simple wrapper around the java.net.HttpURLConnection class. It makes it easy to fetch web content, and it automatically manages sessions, handles 301 redirects, and so on. Although it is not as powerful as HttpClient, which supports the full HTTP protocol, it is very flexible and meets all of my current needs. ... More Httpbot information
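    For comparison, here is what fetching a page with raw java.net.HttpURLConnection looks like without such a wrapper. This is a plain-JDK sketch with a placeholder URL, not Httpbot's own API:

      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import java.net.HttpURLConnection;
      import java.net.URL;

      public class RawFetch {
          public static void main(String[] args) throws Exception {
              URL url = new URL("http://www.example.com/");
              HttpURLConnection conn = (HttpURLConnection) url.openConnection();
              conn.setInstanceFollowRedirects(true); // follows same-protocol 301/302s
              conn.setRequestProperty("User-Agent", "Mozilla/5.0");
              StringBuilder body = new StringBuilder();
              try (BufferedReader in = new BufferedReader(
                      new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                  String line;
                  while ((line = in.readLine()) != null) {
                      body.append(line).append('\n');
                  }
              }
              System.out.println(body);
          }
      }

    Cookie handling (java.net.CookieManager) and cross-protocol redirects are left to the caller here, which is exactly the sort of bookkeeping a wrapper like Httpbot automates.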

  • Web mining toolkit Bixo

    Bixo is an open source web mining toolkit that is developed and runs on top of Hadoop. By assembling custom Cascading pipes, you can quickly create web mining applications optimized for a particular use case. More Bixo information

  • Web crawler Crawlzilla

    Crawlzilla is free software that helps you build a search engine easily. With it, you no longer have to rely on a commercial company's search engine, nor worry about indexing your company's internal website. Its core is the Nutch project; it integrates additional related packages and adds an installation and management UI of its own design, making it easier for users to get started. Beyond basic crawling, Crawlzilla ... More Crawlzilla information

  • Web crawler Ex-crawler

    Ex-crawler is a web crawler developed in Java. The project is divided into two parts: a daemon and a flexible, configurable web crawler. It uses a database to store web page information. More Ex-crawler information

  • Web crawler Playfish

    Playfish is a web crawler built with Java that integrates multiple open source Java components and uses XML configuration files to achieve a highly customizable and extensible crawler. The open source jars it uses include HttpClient (content fetching), dom4j (configuration file parsing), and Jericho (HTML parsing), all already under the war package's lib directory. This project ... More Playfish information

  • Web crawler jcrawl

    jcrawl is a small, high-performance web crawler that can fetch various types of files from web pages based on user-defined patterns, such as email addresses or QQ numbers. More jcrawl information
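    To illustrate the kind of user-defined pattern matching described above, here is a generic plain-JDK sketch; it is not jcrawl's own API, and the pattern is deliberately simplified:

      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      public class EmailExtractor {
          // A deliberately simple email pattern, for illustration only.
          private static final Pattern EMAIL =
                  Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

          public static void main(String[] args) {
              String html = "<p>Contact: alice@example.com or bob@example.org</p>";
              Matcher m = EMAIL.matcher(html);
              while (m.find()) {
                  System.out.println(m.group()); // each address found in the page
              }
          }
      }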

  • Java multi-threaded web crawler crawler4j

    crawler4j is an open source Java class library that provides a simple interface for crawling web pages. It can be used to build a multi-threaded web crawler. Example code (completed below): import java.util.ArrayList; import java.util.regex.Pattern; import edu.uci.ics.crawler4j.crawler.Page; import edu.uci.ics.cr ... More crawler4j information
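    A fuller sketch along the lines of the crawler4j README; the shouldVisit signature and the seed URL shown here follow the 4.x releases and differ slightly in older versions, so treat this as illustrative:

      import edu.uci.ics.crawler4j.crawler.CrawlConfig;
      import edu.uci.ics.crawler4j.crawler.CrawlController;
      import edu.uci.ics.crawler4j.crawler.Page;
      import edu.uci.ics.crawler4j.crawler.WebCrawler;
      import edu.uci.ics.crawler4j.fetcher.PageFetcher;
      import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
      import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
      import edu.uci.ics.crawler4j.url.WebURL;

      public class MyCrawler extends WebCrawler {
          @Override
          public boolean shouldVisit(Page referringPage, WebURL url) {
              // Stay on a single site; skip everything else.
              return url.getURL().startsWith("http://www.ics.uci.edu/");
          }

          @Override
          public void visit(Page page) {
              System.out.println("Visited: " + page.getWebURL().getURL());
          }

          public static void main(String[] args) throws Exception {
              CrawlConfig config = new CrawlConfig();
              config.setCrawlStorageFolder("/tmp/crawler4j"); // intermediate crawl data
              PageFetcher fetcher = new PageFetcher(config);
              RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
              CrawlController controller = new CrawlController(config, fetcher, robots);
              controller.addSeed("http://www.ics.uci.edu/");
              controller.start(MyCrawler.class, 5); // five crawler threads
          }
      }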

  • Web crawler framework Smart and Simple Web Crawler

    Smart and Simple Web Crawler is a web crawler framework with integrated Lucene support. The crawler can start from a single link or an array of links, and offers two traversal modes: maximum iterations and maximum depth. Filters can be set to restrict which links are crawled; three filters are provided by default: ServerFilter, BeginningPathFilter, and RegularE ... More Smart and Simple Web Crawler information

  • Tool for generating PDFs from URLs: h2p

    A solution for generating a bookmarked PDF document from a batch of URLs. The h2p-file is an XML file that mainly describes each URL's information and the hierarchy among the URLs; the h2p-tool then generates the bookmarked PDF document from the h2p-file. The URL hierarchy can also be rendered directly through XSL, so a cooperating site's support for h2p stays simple ... More h2p information

  • Web search crawler Blueleech

    Blueleech is an open source program that starts from a specified URL, searches all reachable links, and then the links beyond them. While searching, it can download everything the links lead to, or only content within a predefined range. More Blueleech information

  • Job information crawler Jobhunter

    Jobhunter is designed to automatically fetch recruitment information from a number of large sites, such as ChinaHR, 51Job, Zhaopin, and more. Jobhunter searches for the email address of each job posting and automatically sends an application text to that address. More Jobhunter information

  • Java web crawler JSpider

    JSpider is a Java implementation of a web spider. JSpider's invocation format is as follows: jspider [url] [configname]. The URL must include the protocol name, such as http://, otherwise an error is reported. If configname is omitted, the default configuration is used. JSpider's behavior is driven by configuration files, which specify, for example, which plugins to use and how results are stored ... More JSpider information

  • ItSucks

    ItSucks is an open source Java web spider (web robot, crawler) project. It supports download templates and regular expressions for defining download rules, and provides a Swing GUI. More ItSucks information

  • Web-Harvest

    Web-Harvest is an open source Java web data extraction tool. It collects specified web pages and extracts useful data from them. Web-Harvest mainly uses techniques such as XSLT, XQuery, and regular expressions to operate on text/xml. More Web-Harvest information

  • JoBo

    JoBo is a simple tool for downloading entire websites. It is essentially a web spider. Compared with other download tools, its main advantage is the ability to fill in forms automatically (e.g., automatic login) and to use cookies to handle sessions. JoBo also has flexible download rules (based on URL, size, MIME type, etc.) to restrict what is downloaded. ... More JoBo information

  • LARM

    LARM provides a pure Java search solution for users of the Jakarta Lucene search engine framework. It contains methods for indexing files and database tables, as well as a crawler for indexing websites. More LARM information

  • Arachnid

    Arachnid is a Java-based web spider framework. It contains a simple HTML parser capable of parsing an input stream containing HTML content. By implementing a subclass of Arachnid you can develop a simple web spider, adding just a few lines of code to be called after each page on a website is parsed. The Arachnid download package contains two example spider applications ... More Arachnid information

  • Spindle Spider

    Spindle is a web index/search tool built on the Lucene toolkit. It includes an HTTP spider for creating an index and a search class for querying those indexes. The Spindle project provides a set of JSP tag libraries so that JSP-based sites can add search functionality without developing any Java classes. ... More Spindle information

  • Arale Spider

    Arale is designed primarily for personal use and, unlike other crawlers, does not focus on page indexing. Arale can download an entire website, or selected resources from a website. It can also map dynamic pages to static pages. More Arale information

  • WebLech

    WebLech is a powerful tool for downloading and mirroring websites. It supports downloading a site according to functional requirements while mimicking the behavior of a standard web browser as closely as possible. WebLech has a functional console and is multi-threaded. More WebLech information

  • WebSPHINX

    WebSPHINX is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that automatically browses and processes web pages. WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library. More WebSPHINX information

  • Web crawler Heritrix

    Heritrix is an open source, extensible web crawler project. Users can use it to fetch the resources they want from the web. Heritrix is designed to strictly follow the exclusion instructions in robots.txt files and META robots tags. Its greatest strength is its good extensibility, which lets users implement their own crawl logic. Heritrix is a crawler framework; its organizational structure ... More Heritrix information

  • Web crawler YaCy

    YaCy is a distributed web search engine. It is also an HTTP cache proxy server. The project is a new approach to building a peer-to-peer web index network. It can search your own index or the global index, and you can crawl your own web pages or start a distributed crawl. More YaCy information

    Last updated: YaCy 1.4 released, distributed web search engine, posted 2 years ago

  • Search engine Nutch

    Nutch is an open source search engine implemented in Java. It provides all the tools we need to run our own search engine, including full-text search and a web crawler. Nutch's founder is Doug Cutting, who also founded the Lucene, Hadoop, and Avro open source projects. Nutch was born in August 2002; it is an Apache project implemented in Java ... More Nutch information

    Last updated: Apache Nutch 1.10 released, search engine, posted 1 month ago

"Turn" 44 Java web crawler open source software

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.