"Turn" 44 Java web crawler open source software

Source: Internet
Author: User

Original address: http://www.oschina.net/project/lang/19?tag=64&sort=time

  • Minimalist web crawler component WebFetch

    WebFetch is a dependency-free, minimalist web page crawling component: a micro crawler that can run on mobile devices. WebFetch aims to: require no third-party jar dependencies, keep memory usage low, make good use of the CPU, crawl the network quickly, offer a simple and straightforward API, run stably on Android devices, and stay small and flexible enough to integrate easily. ... More WebFetch information

  • Open source crawler framework GuozhongCrawler

    GuozhongCrawler is an open source crawler framework that requires no configuration and is convenient for secondary development. It provides a simple and flexible API; only a small amount of code is needed to implement a crawler. Its design was inspired by a survey of crawler frameworks at home and abroad. It is fully modular and covers the entire crawler lifecycle (link extraction, page download, content extraction, ... More GuozhongCrawler information

  • Web crawler Kamike.collect

    Another Simple Crawler: yet another web crawler, with support for crawling through a proxy server (e.g., to get across a firewall). 1. Data is stored in MySQL. 2. Before use, edit the database connection settings in WEB-INF/config.ini, mainly the database name, user name, and password. 3. Then visit http://127.0.0.1/fetch/install to create the database tables automatically. ... More Kamike.collect information

  • Web-based crawler Spider-web

    Spider-web is a web-based crawler configured via XML. It supports crawling most pages, and supports saving, downloading, etc. of the crawled content. The configuration file format is: <?xml version="1.0" encoding="UTF-8"?> <content> <url type="simple"> <!-- simple/complex --> <url_head>http://www.oschina .... More Spider-web information

  • Ugly Cow Mini Collector

    The Ugly Cow Mini Collector is a professional network data collection/information mining tool developed with Java Swing. Through flexible configuration it can quickly and easily crawl structured text, pictures, files, and other resources from web pages, which can then be edited and filtered before being published to a website. Architecture: the system is based on Swing + Spring 3.2.4 + MyBatis 3 ... More Ugly Cow Mini Collector information

  • Java crawler WebCollector

    Crawler profile: WebCollector is a Java crawler framework (kernel) that needs no configuration and is convenient for secondary development. It provides a streamlined API with which a powerful crawler can be written in just a small amount of code. Crawler kernel: WebCollector is committed to maintaining a stable, extensible crawler core on which developers can build flexibly. The kernel has a very strong ... More WebCollector information
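    For a sense of the API, here is a rough sketch based on the WebCollector 2.x examples; the package paths, method names, and demo URLs are assumptions that vary across releases, so treat this as illustrative rather than the project's canonical code:

      import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
      import cn.edu.hfut.dmic.webcollector.model.Page;
      import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;

      // Minimal breadth-first crawler: follow matching links, print page titles.
      public class NewsCrawler extends BreadthCrawler {

          public NewsCrawler(String crawlPath, boolean autoParse) {
              super(crawlPath, autoParse);
              addSeed("http://news.example.com/");          // hypothetical seed URL
              addRegex("http://news.example.com/show_.*");  // which links to follow
          }

          @Override
          public void visit(Page page, CrawlDatums next) {
              // Called once per fetched page.
              System.out.println(page.getUrl() + " : " + page.select("title").text());
          }

          public static void main(String[] args) throws Exception {
              new NewsCrawler("crawl_data", true).start(3); // crawl 3 levels deep
          }
      }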

  • Web data extraction client Webstraktor

    Webstraktor is a programmable WWW data extraction client that provides a scripting language for collecting, extracting, and storing data, including images, from the Web. The scripting language uses regular expression and XPath syntax. The standard output format is XML, and ASCII, UTF-8, and ISO-8859-1 encodings are supported. Logging and tracing facilities are provided. ... More Webstraktor information

  • Network data crawling framework Tinyspider

    Tinyspider is a network data crawling framework based on Tiny HtmlParser. Maven coordinates: <dependency> <groupId>org.tinygroup</groupId> <artifactId>org.tinygroup.spider</artifactId> <version>0.1.0-SNAPSHOT</version> </dependency> Web crawlers of this kind are generally used for full-text retrieval ... More Tinyspider information

  • Scripting language CrawlScript

    CrawlScript is a scripting language for web crawlers on the Java platform. A web crawler is a program that automatically obtains web page information. There are many web crawler libraries for Java, C++, and other languages, but developing on top of them is cumbersome: a lot of code is required to complete even a simple operation. To address this problem we developed the CrawlScript scripting language, which ... More CrawlScript information

  • AJAX page crawling and parsing plugin based on Apache Nutch and HtmlUnit extensions: Nutch-htmlunit

    Nutch-htmlunit plugin project introduction: built on Apache Nutch 1.8 and the HtmlUnit component, it implements full page content fetching and parsing for AJAX-loaded pages. In the project's own words: "According to the implementation of Apache Nutch 1.8, we can't get the dynamic HTML information from fetch pages including AJ..." A plain-HtmlUnit fetch sketch follows the update note below. ... More Nutch-htmlunit information

    Last updated: Nutch-htmlunit 1.8 released: AJAX page crawling and parsing plugin based on Apache Nutch and HtmlUnit extensions, posted 10 months ago
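    For context, fetching a JavaScript-rendered page with plain HtmlUnit looks roughly like this. This is a minimal sketch against HtmlUnit's public API with a placeholder URL, not the plugin's own code; try-with-resources on WebClient assumes a recent HtmlUnit release:

      import com.gargoylesoftware.htmlunit.WebClient;
      import com.gargoylesoftware.htmlunit.html.HtmlPage;

      public class AjaxFetchSketch {
          public static void main(String[] args) throws Exception {
              try (WebClient webClient = new WebClient()) {
                  webClient.getOptions().setJavaScriptEnabled(true);    // execute page scripts
                  webClient.getOptions().setThrowExceptionOnScriptError(false);
                  HtmlPage page = webClient.getPage("http://example.com/ajax-page");
                  webClient.waitForBackgroundJavaScript(5000);          // let AJAX calls settle
                  System.out.println(page.asXml());                     // DOM after JS ran
              }
          }
      }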

  • Web crawler Goodcrawler

    Goodcrawler (GC) is a web crawler for vertical domains, and at the same time an out-of-the-box search engine. GC is based on HttpClient, HtmlUnit, Jsoup, and ElasticSearch. GC features: 1. templates with DSL characteristics; 2. distributed and extensible; 3. thanks to HtmlUnit, better JavaScript support; ... More Goodcrawler information
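    As a point of reference for one of those building blocks, Jsoup fetches and parses a page like this. This is a minimal standalone sketch against Jsoup's public API, not GC's own template DSL:

      import org.jsoup.Jsoup;
      import org.jsoup.nodes.Document;
      import org.jsoup.nodes.Element;

      public class JsoupSketch {
          public static void main(String[] args) throws Exception {
              // Fetch and parse a page, then list every link target on it.
              Document doc = Jsoup.connect("http://www.oschina.net/").get();
              System.out.println("Title: " + doc.title());
              for (Element link : doc.select("a[href]")) {
                  System.out.println(link.attr("abs:href"));
              }
          }
      }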

  • Vertical crawler WebMagic

    WebMagic is a crawler framework that needs no configuration and is convenient for secondary development. It provides a simple and flexible API; a crawler can be implemented in just a small amount of code. Here is a snippet that crawls OSChina blogs (completed below the update note): Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*/blog/*")) .... More WebMagic information

    Last updated: WebMagic 0.5.2 released, Java crawler framework, posted 1 year ago
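    Completing that snippet along the lines of the WebMagic 0.5.x API; the thread count and the .run() call are assumptions based on the project's documentation, not part of the original snippet:

      import us.codecraft.webmagic.Spider;
      import us.codecraft.webmagic.processor.SimplePageProcessor;

      public class OschinaBlogCrawler {
          public static void main(String[] args) {
              // Start from the seed URL and follow links matching the wildcard pattern.
              Spider.create(new SimplePageProcessor(
                              "http://my.oschina.net/",
                              "http://my.oschina.net/*/blog/*"))
                    .thread(5)   // five worker threads
                    .run();      // block until the crawl finishes
          }
      }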

  • Retrieval crawler framework Heydr

    Heydr is a lightweight, open source, multi-threaded vertical retrieval crawler framework based on Java, released under the GNU GPL v3 license. Users can build their own vertical resource crawlers with Heydr to prepare data for vertical search engines. More Heydr information

  • Opm-server-mirror

    Code update 2009-11-25: added anti-crawler functionality; direct web access to the server now redirects to Google. Usage: download index.zip, unzip it to get index.php, upload index.php to an overseas server that supports PHP and cURL, then open http://www.your_website.com/your_folder_if_any/; if the page redirects to Goo ... More Opm-server-mirror information

  • Java web spider/crawler Spiderman

    Spiderman: yet another Java web spider/crawler. Spiderman is a web spider built on a microkernel + plugin architecture. Its goal is to capture complex target web pages and, in a simple way, parse the information in them into the business data one needs. Key features: * flexible and scalable microkernel + plugin architecture; Spiderman provides up to ... More Spiderman information

  • Web search and crawler leopdo

    Web search and crawler written in Java, including full-text and categorized vertical search, as well as a word segmenter. More leopdo information

  • OWASP AJAX Crawling Tool

    OWASP AJAX Crawling Tool (FuzzOps-NG): an AJAX crawler produced by OWASP, written in Java and open source. More OWASP AJAX Crawling Tool information

  • AJAX crawling and testing tool Crawljax

    Crawljax: written in Java, open source. Crawljax is a Java tool for automatically crawling and testing modern AJAX web applications. More Crawljax information

  • Common Crawl

    The CommonCrawl source library provides a custom InputFormat implementation for Hadoop. Common Crawl supplies a sample program, BasicArcFileReaderSample.java (located in org.commoncrawl.samples), that shows how to configure the InputFormat. ... More Common Crawl information

  • Data collection system Chukwa

    What is Chukwa? Simply put, it is a data collection system that gathers all kinds of data into files suitable for Hadoop to process, ready for Hadoop to run various MapReduce jobs over. Chukwa itself also provides a number of built-in features that help with collecting and organizing data. To display them more simply and intuitively ... More Chukwa information

  • Simple HTTP crawler Httpbot

    Httpbot is a simple wrapper around the java.net.HttpURLConnection class. It makes it easy to fetch web content, and it automatically manages sessions, handles 301 redirects, and so on. Although it is not as powerful as HttpClient, which supports the full HTTP protocol, it is very flexible and meets all of my current needs. ... More Httpbot information
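    For comparison, here is what fetching a page with raw java.net.HttpURLConnection looks like without such a wrapper. This is a plain-JDK sketch with a placeholder URL, not Httpbot's own API:

      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import java.net.HttpURLConnection;
      import java.net.URL;

      public class RawFetch {
          public static void main(String[] args) throws Exception {
              URL url = new URL("http://www.example.com/");
              HttpURLConnection conn = (HttpURLConnection) url.openConnection();
              conn.setInstanceFollowRedirects(true); // follows same-protocol 301/302s
              conn.setRequestProperty("User-Agent", "Mozilla/5.0");
              StringBuilder body = new StringBuilder();
              try (BufferedReader in = new BufferedReader(
                      new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                  String line;
                  while ((line = in.readLine()) != null) {
                      body.append(line).append('\n');
                  }
              }
              System.out.println(body);
          }
      }

    Cookie handling (java.net.CookieManager) and cross-protocol redirects are left to the caller here, which is exactly the sort of bookkeeping a wrapper like Httpbot automates.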

  • Web mining toolkit Bixo

    Bixo is an open source web mining toolkit that is developed and runs on top of Hadoop. By assembling custom Cascading pipes, you can quickly create web mining applications optimized for a particular use case. More Bixo information

  • Web crawler Crawlzilla

    Crawlzilla is free software that helps you build a search engine easily. With it, you no longer have to rely on a commercial company's search engine, nor worry about indexing your company's internal website. Its core is the Nutch project; it integrates additional related packages and adds an installation and management UI of its own design, making it easier for users to get started. Beyond basic crawling, Crawlzilla ... More Crawlzilla information

  • Web crawler Ex-crawler

    Ex-crawler is a web crawler developed in Java. The project is divided into two parts: a daemon and a flexible, configurable web crawler. It uses a database to store web page information. More Ex-crawler information

  • Web crawler Playfish

    Playfish is a web crawler built with Java that integrates multiple open source Java components and uses XML configuration files to achieve a highly customizable and extensible crawler. The open source jars it uses include HttpClient (content fetching), dom4j (configuration file parsing), and Jericho (HTML parsing), all already under the war package's lib directory. This project ... More Playfish information

  • Web crawler jcrawl

    jcrawl is a small, high-performance web crawler that can fetch various types of files from web pages based on user-defined patterns, such as email addresses or QQ numbers. More jcrawl information
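    To illustrate the kind of user-defined pattern matching described above, here is a generic plain-JDK sketch; it is not jcrawl's own API, and the pattern is deliberately simplified:

      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      public class EmailExtractor {
          // A deliberately simple email pattern, for illustration only.
          private static final Pattern EMAIL =
                  Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

          public static void main(String[] args) {
              String html = "<p>Contact: alice@example.com or bob@example.org</p>";
              Matcher m = EMAIL.matcher(html);
              while (m.find()) {
                  System.out.println(m.group()); // each address found in the page
              }
          }
      }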

  • Java multi-threaded web crawler crawler4j

    crawler4j is an open source Java class library that provides a simple interface for crawling web pages. It can be used to build a multi-threaded web crawler. Example code (completed below): import java.util.ArrayList; import java.util.regex.Pattern; import edu.uci.ics.crawler4j.crawler.Page; import edu.uci.ics.cr ... More crawler4j information
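    A fuller sketch along the lines of the crawler4j README; the shouldVisit signature and the seed URL shown here follow the 4.x releases and differ slightly in older versions, so treat this as illustrative:

      import edu.uci.ics.crawler4j.crawler.CrawlConfig;
      import edu.uci.ics.crawler4j.crawler.CrawlController;
      import edu.uci.ics.crawler4j.crawler.Page;
      import edu.uci.ics.crawler4j.crawler.WebCrawler;
      import edu.uci.ics.crawler4j.fetcher.PageFetcher;
      import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
      import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
      import edu.uci.ics.crawler4j.url.WebURL;

      public class MyCrawler extends WebCrawler {
          @Override
          public boolean shouldVisit(Page referringPage, WebURL url) {
              // Stay on a single site; skip everything else.
              return url.getURL().startsWith("http://www.ics.uci.edu/");
          }

          @Override
          public void visit(Page page) {
              System.out.println("Visited: " + page.getWebURL().getURL());
          }

          public static void main(String[] args) throws Exception {
              CrawlConfig config = new CrawlConfig();
              config.setCrawlStorageFolder("/tmp/crawler4j"); // intermediate crawl data
              PageFetcher fetcher = new PageFetcher(config);
              RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
              CrawlController controller = new CrawlController(config, fetcher, robots);
              controller.addSeed("http://www.ics.uci.edu/");
              controller.start(MyCrawler.class, 5); // five crawler threads
          }
      }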

  • Web crawler framework Smart and Simple Web Crawler

    Smart and Simple Web Crawler is a web crawler framework with integrated Lucene support. The crawler can start from a single link or an array of links, and offers two traversal modes: maximum iterations and maximum depth. Filters can be set to restrict which links are crawled; three filters are provided by default: ServerFilter, BeginningPathFilter, and RegularE ... More Smart and Simple Web Crawler information

  • Tool for generating PDFs from URLs: h2p

    A solution for generating a bookmarked PDF document from a batch of URLs. The h2p-file is an XML file that mainly describes each URL's information and the hierarchy among the URLs; the h2p-tool then generates the bookmarked PDF document from the h2p-file. The URL hierarchy can also be rendered directly through XSL, so a cooperating site's support for h2p stays simple ... More h2p information

  • Web search crawler Blueleech

    Blueleech is an open source program that starts from a specified URL, searches all reachable links, and then the links beyond them. While searching, it can download everything the links lead to, or only content within a predefined range. More Blueleech information

  • Job information crawler Jobhunter

    Jobhunter is designed to automatically fetch recruitment information from a number of large sites, such as ChinaHR, 51Job, Zhaopin, and more. Jobhunter searches for the email address of each job posting and automatically sends an application text to that address. More Jobhunter information

  • Java web crawler JSpider

    JSpider is a Java implementation of a web spider. JSpider's invocation format is as follows: jspider [url] [configname]. The URL must include the protocol name, such as http://, otherwise an error is reported. If configname is omitted, the default configuration is used. JSpider's behavior is driven by configuration files, which specify, for example, which plugins to use and how results are stored ... More JSpider information

  • ItSucks

    ItSucks is an open source Java web spider (web robot, crawler) project. It supports download templates and regular expressions for defining download rules, and provides a Swing GUI. More ItSucks information

  • Web-Harvest

    Web-Harvest is an open source Java web data extraction tool. It collects specified web pages and extracts useful data from them. Web-Harvest mainly uses techniques such as XSLT, XQuery, and regular expressions to operate on text/xml. More Web-Harvest information

  • JoBo

    JoBo is a simple tool for downloading entire websites. It is essentially a web spider. Compared with other download tools, its main advantage is the ability to fill in forms automatically (e.g., automatic login) and to use cookies to handle sessions. JoBo also has flexible download rules (based on URL, size, MIME type, etc.) to restrict what is downloaded. ... More JoBo information

  • LARM

    LARM provides a pure Java search solution for users of the Jakarta Lucene search engine framework. It contains methods for indexing files and database tables, as well as a crawler for indexing websites. More LARM information

  • Arachnid

    Arachnid is a Java-based web spider framework. It contains a simple HTML parser capable of parsing an input stream containing HTML content. By implementing a subclass of Arachnid you can develop a simple web spider, adding just a few lines of code to be called after each page on a website is parsed. The Arachnid download package contains two example spider applications ... More Arachnid information

  • Spindle Spider

    Spindle is a web index/search tool built on the Lucene toolkit. It includes an HTTP spider for creating an index and a search class for querying those indexes. The Spindle project provides a set of JSP tag libraries so that JSP-based sites can add search functionality without developing any Java classes. ... More Spindle information

  • Arale Spider

    Arale is designed primarily for personal use and, unlike other crawlers, does not focus on page indexing. Arale can download an entire website, or selected resources from a website. It can also map dynamic pages to static pages. More Arale information

  • WebLech

    WebLech is a powerful tool for downloading and mirroring websites. It supports downloading a site according to functional requirements while mimicking the behavior of a standard web browser as closely as possible. WebLech has a functional console and is multi-threaded. More WebLech information

  • WebSPHINX

    WebSPHINX is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that automatically browses and processes web pages. WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library. More WebSPHINX information

  • Web crawler Heritrix

    Heritrix is an open source, extensible web crawler project. Users can use it to fetch the resources they want from the web. Heritrix is designed to strictly follow the exclusion instructions in robots.txt files and META robots tags. Its greatest strength is its good extensibility, which lets users implement their own crawl logic. Heritrix is a crawler framework; its organizational structure ... More Heritrix information

  • Web crawler YaCy

    YaCy is a distributed web search engine. It is also an HTTP cache proxy server. The project is a new approach to building a peer-to-peer web index network. It can search your own index or the global index, and you can crawl your own web pages or start a distributed crawl. More YaCy information

    Last updated: YaCy 1.4 released, distributed web search engine, posted 2 years ago

  • Search engine Nutch

    Nutch is an open source search engine implemented in Java. It provides all the tools we need to run our own search engine, including full-text search and a web crawler. Nutch's founder is Doug Cutting, who also founded the Lucene, Hadoop, and Avro open source projects. Nutch was born in August 2002; it is an Apache project implemented in Java ... More Nutch information

    Last updated: Apache Nutch 1.10 released, search engine, posted 1 month ago

"Turn" 44 Java web crawler open source software

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.