Source: http://www.oschina.net/project/tag/64/spider?lang=0&os=0&sort=view&
- Search engine: Nutch
Nutch is an open-source search engine implemented in Java. It provides all the tools we need to run our own search engine, including full-text search and a web crawler. Although web search is a basic requirement for navigating the Internet, the number of existing web search engines is declining, and this is likely to evolve into a situation where one company monopolizes almost all web... More Nutch information.
Latest update: [One blog per day] A study of Nutch's URL regular-expression filtering mechanism was published 20 days ago.
- Web crawler: Grub Next Generation
Grub Next Generation is a distributed web crawler system, including a client and a server, that can be used to maintain web page indexes. More Grub Next Generation information.
Latest update: Grub Next Generation 1.0 was released three years ago.
- Website data collection software: Network Miner Collector (formerly Soukey Picking)
Soukey Picking is open-source website data collection software based on the .NET platform, and the only open-source software of its kind. Although Soukey Picking is open source, that does not limit the functionality it offers, which is even richer than that of some commercial software. Soukey Picking currently provides the following main features: 1. multi-task, multi-thread... More Network Miner Collector (formerly Soukey Picking) information.
- PHP web crawler and search engine: PHPDig
PHPDig is a web crawler and search engine developed in PHP. It builds a vocabulary by indexing dynamic and static pages. When a query is searched, it displays the result pages containing the keywords according to certain ranking rules. PHPDig includes a template system and can index PDF, Word, Excel, and PowerPoint documents. PHPDig is suited to... More PHPDig information.
- Website content collector: Snoopy
Snoopy is a powerful website content collector (crawler). It provides functions such as fetching web page content and submitting forms. More Snoopy information.
- Java web crawler: JSpider
JSpider is a web spider implemented in Java. It is invoked as follows: jspider [URL] [ConfigName]. The URL must include the protocol name, such as http://, otherwise an error is reported. If ConfigName is omitted, the default configuration is used. JSpider's behavior is controlled by its configuration files, which specify, for example, which plug-ins are used and how results are stored... More JSpider information.
- Web crawler: NWebCrawler
NWebCrawler is an open-source web crawler written in C#.
- Web crawler: Heritrix
Heritrix is an open-source, scalable web crawler project. Users can use it to fetch the resources they want from the Internet. Heritrix is designed to strictly respect the exclusion directives in robots.txt files and META robots tags. Its outstanding advantage is its excellent scalability, which lets users easily implement their own crawling logic. Heritrix is a crawler framework that organizes its... More Heritrix information.
- Web crawler framework: Scrapy
Scrapy is a crawler framework implemented purely in Python, built on the Twisted asynchronous processing framework. You only need to develop a few modules to implement a crawler, which makes it very convenient for scraping web page content and all kinds of images. More Scrapy information.
Latest update: A post on using Scrapy to build a website scraper was published six months ago.
- Vertical crawler: WebMagic
WebMagic is a crawler framework that requires no configuration and is convenient for secondary development. It provides a simple yet flexible API, so a working crawler can be implemented with only a small amount of code. The following code crawls the OSChina blog: Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*/blog/*")).t... (a fuller sketch follows after this entry). More WebMagic information.
Latest update: WebMagic 0.5.2, a Java crawler framework, was released one month ago.
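Below is a minimal, self-contained sketch that expands the truncated snippet above, assuming the WebMagic 0.5.x API (us.codecraft.webmagic.Spider and a two-argument SimplePageProcessor constructor); constructor and method signatures may differ in other WebMagic versions.

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.SimplePageProcessor;

public class OschinaBlogCrawler {
    public static void main(String[] args) {
        // Start from the blog home page and follow links matching the URL pattern.
        // SimplePageProcessor applies WebMagic's default link-extraction rules.
        Spider.create(new SimplePageProcessor("http://my.oschina.net/",
                                              "http://my.oschina.net/*/blog/*"))
              .thread(5)  // crawl with 5 worker threads
              .run();     // block until the crawl finishes
    }
}
```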
- OpenWebSpider
OpenWebSpider is an open-source, multi-threaded web spider (robot, crawler) and search engine with many interesting features. More OpenWebSpider information.
- Java multi-threaded web crawler: Crawler4j
Crawler4j is an open-source Java class library that provides a simple interface for crawling web pages and can be used to build a multi-threaded web crawler. Sample code: import java.util.ArrayList; import java.util.regex.Pattern; import edu.uci.ics.crawler4j.crawler.Page; import edu.uci.ics.cr... (a fuller sketch follows after this entry). More Crawler4j information.
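The truncated sample above is the start of a crawler class. A minimal sketch of how such a class typically looks is shown below, assuming the Crawler4j 4.x API in which shouldVisit takes (Page, WebURL) (older releases used shouldVisit(WebURL) only); the domain is a placeholder. In a real application, a CrawlController configured with a CrawlConfig instantiates this class on multiple threads.

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.regex.Pattern;

public class MyCrawler extends WebCrawler {
    // Skip typical binary/static resources.
    private static final Pattern FILTERS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|zip|pdf)$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Stay on one site (placeholder domain) and avoid filtered file types.
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // Called for every page that was fetched successfully.
        System.out.println("Visited: " + page.getWebURL().getURL());
    }
}
```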
- Web page capture/information extraction software: MetaSeeker
The MetaSeeker (GooSeeker) V4.11.2 toolkit for web page capture, information extraction, and data extraction has been officially released. The online version is free to download and use, and the source code can be read. Since its launch it has been very popular. Its main application field is vertical search (also called professional search), where high-speed, large-scale, and precise crawling are the domain of its topic-focused web crawler DataScrap... More MetaSeeker information.
- Java web spider/crawler: Spiderman
Spiderman - yet another Java web spider/crawler. Spiderman is a web spider based on a microkernel + plug-in architecture. Its goal is to capture complex target web pages and parse them into the business data you need in a simple way. Main features: * flexible and highly extensible; with its microkernel + plug-in architecture, Spiderman provides up to... More Spiderman information.
- Web crawler: Methanol
Methanol is modular, customizable web crawler software. Its main advantage is its high speed. More Methanol information.
- Web crawler/web spider: Larbin
Larbin is an open-source web crawler/web spider developed independently by the young French developer Sébastien Ailleret. Larbin aims to follow page URLs for extended crawling, ultimately providing a broad data source for search engines. Larbin is only a crawler; that is, it only fetches web pages, and how to parse them is left to the user... More Larbin information.
- Crawler: Sinawler
The first crawler program in China targeting Weibo data! Originally named "Sina Weibo Crawler". After logging in, you specify a user as the starting point, and the crawler follows that user's followees and followers as leads, traversing these connections to collect basic user information, Weibo posts, and comment data. The data obtained by this application can serve as data support for scientific research and development related to Sina Weibo, but it may not be used for commercial... More Sinawler information.
- [Free] Dead link checking software: Xenu
Xenu Link Sleuth may be the smallest yet most powerful software you have ever seen for checking dead links on a website. You can open a local web page file and check its links, or enter any URL to check. It lists a site's live links and dead links separately, and clearly analyzes redirected links. It supports multiple threads and can check links... (a generic sketch of the idea follows after this entry). More Xenu information.
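Xenu itself is a closed-source Windows GUI tool, but the core operation it performs, requesting each link and classifying it by HTTP status, can be sketched generically. The following Java snippet is only an illustration of that idea, not Xenu's code; the URL is a placeholder.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class DeadLinkCheck {
    /** Returns true if the link answers with a non-error HTTP status. */
    static boolean isAlive(String link) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
            conn.setRequestMethod("HEAD");          // headers only, no body
            conn.setInstanceFollowRedirects(true);  // follow 301/302 redirects
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            return conn.getResponseCode() < 400;    // 4xx/5xx count as dead
        } catch (Exception e) {
            return false;                           // unreachable, malformed, or timed out
        }
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/";     // placeholder URL
        System.out.println(url + (isAlive(url) ? " is alive" : " is dead"));
    }
}
```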
- Web-Harvest
Web-Harvest is an open-source Java web data extraction tool. It can collect specified web pages and extract useful data from those pages. Web-Harvest mainly uses technologies such as XSLT, XQuery, and regular expressions to perform text/XML operations (see the embedding sketch after this entry). More Web-Harvest information.
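For orientation, this is roughly how Web-Harvest is embedded from Java code, assuming the classic 2.x API (org.webharvest.definition.ScraperConfiguration and org.webharvest.runtime.Scraper); the configuration file name and working directory are placeholders, and the XML configuration itself holds the XSLT/XQuery/regex extraction pipeline.

```java
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

public class WebHarvestExample {
    public static void main(String[] args) throws Exception {
        // The XML config declares the pipeline: an <http> fetch followed by
        // <xpath>/<xquery>/<regexp> processors that extract the wanted data.
        ScraperConfiguration config = new ScraperConfiguration("extract.xml"); // placeholder config file
        Scraper scraper = new Scraper(config, "work");                         // placeholder working dir
        scraper.execute();
    }
}
```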
- Web crawling tool: Playfish
Playfish is a web crawling tool built with Java technology. It integrates multiple open-source Java components and uses XML configuration files to achieve highly customizable and extensible web page capturing. The open-source JAR packages it uses include HttpClient (content fetching), dom4j (configuration file parsing), and Jericho (HTML parsing), all of which are already in the lib directory of the WAR package. This... More Playfish information.
- Easy-to-use network data collection system
The system is built with the mainstream programming language PHP and a MySQL database. Using custom collection rules, or shared rules downloaded from the project's website, you can collect the data you need from a website or a group of websites, and you can also share your own collection rules with everyone. The built-in data browser and editor let you edit the collected data. All of the system's code is completely open source... More Network Data Collection System information.
- Web crawler: YaCy
YaCy is a P2P-based distributed web search engine that also serves as an HTTP caching proxy server. The project is a new approach to building a P2P web index network. It can search your own index or the global index, crawl your own web pages, or start distributed crawls. More YaCy information.
Latest update: YaCy 1.4, the distributed web search engine, was released one year ago.
- Web crawler framework: Smart and Simple Web Crawler
Smart and Simple Web Crawler is a web crawler framework with integrated Lucene support. The crawler can start from a single link or an array of links and offers two traversal modes: maximum iterations and maximum depth. Filters can be set to restrict which links are crawled; three filters are provided by default: ServerFilter, BeginningPathFilter, and RegularE... More Smart and Simple Web Crawler information.
- Web crawler: Crawlzilla
Crawlzilla is free software that helps you easily build a search engine. With it, you no longer need to rely on commercial companies' search engines, nor worry about how to index data on company-internal websites. With the Nutch project at its core, it integrates more related packages and adds an installation and management UI, making it easier for users to get started. In addition to crawling basic... More Crawlzilla information.
- Simple HTTP crawler: HttpBot
HttpBot is a simple wrapper around the java.net.HttpURLConnection class. It makes it easy to fetch web page content, manage sessions automatically, and handle 301 redirects automatically. Although it is not as powerful as HttpClient and does not support the complete HTTP protocol, it is very flexible and meets all of my current needs... (a plain HttpURLConnection sketch follows after this entry for comparison). More HttpBot information.
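For comparison, this is roughly what fetching a page looks like with the underlying java.net.HttpURLConnection and no wrapper; it is a generic sketch rather than HttpBot's own API, and the URL is a placeholder.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RawFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com/");          // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true);                  // follow redirects within the same protocol
        conn.setRequestProperty("User-Agent", "demo-crawler");  // identify the client

        // Read the response body line by line.
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        System.out.println(body.length() + " characters fetched");
    }
}
```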
- News collector: NZBGet
NZBGet is a news collector whose downloads from newsgroups are described by NZB-format files. It can be used in standalone mode or in server/client mode. In standalone mode, it downloads files using an NZB file given as a command-line parameter. The server and the client are both the single executable "nzbget". Functions and features: a console interface using plain text, colored text, or... More NZBGet information.
- Web crawler: Ex-Crawler
Ex-Crawler is a web crawler developed in Java. The project is divided into two parts: a daemon process and a flexible, configurable web crawler. Web page information is stored in a database. More Ex-Crawler information.
- Recruitment information crawler: JobHunter
JobHunter is designed to automatically collect recruitment information from several large sites, such as ChinaHR, 51job, and Zhaopin. JobHunter searches for the email address of each job posting and automatically sends an application text to that address. More JobHunter information.
- Web crawler framework: HiSpider
HiSpider is a fast, high-performance spider. Strictly speaking, it is only the framework of a spider system, without the details fleshed out. Currently it only handles URL extraction, URL de-duplication, asynchronous DNS resolution, and task queueing; it supports distributed downloading across N machines and targeted downloading of specific websites (a whitelist must be configured in hispiderd.ini). Features... More HiSpider information.
- Perl crawler: Combine
Combine is an open, extensible web resource crawler developed in Perl. More Combine information.
- Web crawler: jcrawl
jcrawl is a small web crawler with excellent performance. It can grab various types of files from web pages based on user-defined tokens, such as email addresses and QQ numbers. More jcrawl information.
- Distributed web crawler: Ebot
Ebot is a scalable, distributed web crawler developed in the Erlang language. URLs are saved in a database and can be queried through RESTful HTTP requests. More Ebot information.
- Multi-threaded web crawler: Spidernet
Spidernet is a multi-threaded web crawler modeled on a recursive tree. It supports fetching text/html resources, lets you set the crawl depth and the maximum number of bytes to download, supports gzip decoding and both GBK (GB2312) and UTF-8 encoded resources, and stores results in SQLite data files. In the source code, TODO: comments mark incomplete functionality, and code contributions are welcome... More Spidernet information.
- ItSucks
ItSucks is an open-source Java web spider (web robot, crawler) project. Download rules can be flexibly defined using download templates and regular expressions. It provides a Swing GUI. More ItSucks information.
- Web search crawler: BlueLeech
BlueLeech is an open-source program that starts from a specified URL and searches all available links, as well as the links reachable from them. While searching, it can download all of the content the links point to, or only pre-defined content. More BlueLeech information.
- URL monitoring script: urlwatch
urlwatch is a Python script used to monitor specified URLs; once the content at a URL changes, you are notified by email. Basic features: easy to configure — URLs are specified in a text file, one URL per line; easily hackable (clean Python implementation); can run as a cronjob and m... (a generic sketch of the idea follows after this entry). More urlwatch information.
Latest update: urlwatch 1.8 was released four years ago.
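urlwatch is a Python script, but the idea it implements, fetching each URL, hashing the content, comparing it with the hash stored on the previous run, and notifying on change, is easy to sketch. The Java snippet below (Java 11+) is a generic illustration of that idea, not urlwatch's code; the URL and state file are placeholders.

```java
import java.math.BigInteger;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class UrlChangeCheck {
    public static void main(String[] args) throws Exception {
        String url = "http://www.example.com/";     // placeholder URL (urlwatch reads these from a text file)
        Path stateFile = Path.of("last-hash.txt");  // placeholder state file

        // Fetch the page body and hash it.
        byte[] body = new URL(url).openStream().readAllBytes();
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(body);
        String hash = new BigInteger(1, digest).toString(16);

        // Compare with the hash stored on the previous run.
        String previous = Files.exists(stateFile) ? Files.readString(stateFile).trim() : "";
        if (!hash.equals(previous)) {
            System.out.println(url + " changed");    // urlwatch would send an email here
            Files.writeString(stateFile, hash);
        }
    }
}
```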
- Methabot
Methabot is high-speed, configurable crawler software for the web, FTP, and local file systems. More Methabot information.
- Web search and crawler: Leopdo
A web search engine and crawler written in Java, including full-text search and category-based vertical search, as well as a word segmentation system. More Leopdo information.
- Web crawler tool: NCrawler
NCrawler is a web crawler tool that lets developers easily build applications with web crawling capabilities. It is extensible, allowing developers to add support for other resource types (such as PDF, Word, and Excel files). NCrawler uses multiple threads (... More NCrawler information.
- Ajax crawling and testing: Crawljax
Crawljax is written in Java and open source. It is a Java tool for automated crawling and testing of modern Ajax web applications (see the sketch after this entry).
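A minimal sketch of driving Crawljax programmatically is shown below, assuming the Crawljax 3.x Java API (com.crawljax.core.CrawljaxRunner and CrawljaxConfiguration); method names may differ between versions, and the start URL and depth limit are placeholders.

```java
import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;

public class CrawljaxExample {
    public static void main(String[] args) {
        // Build a configuration for the Ajax application to crawl (placeholder URL).
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("http://www.example.com/");
        builder.setMaximumDepth(2);                  // limit how deep state exploration goes
        builder.crawlRules().clickDefaultElements(); // click anchors, buttons, and similar elements

        // Run the crawl; Crawljax drives a browser and explores the dynamic DOM states.
        CrawljaxRunner runner = new CrawljaxRunner(builder.build());
        runner.call();
    }
}
```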