Nutch is an open source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.
Nutch was founded by Doug Cutting, who also founded the open source projects Lucene, Hadoop, and Avro.
Nutch was started in August 2002 and is an open source Apache project. Since version 1.2, Nutch has shifted from being a search engine to being a web crawler. It later split into two major branches, 1.X and 2.X. The biggest difference between them is that 2.X abstracts the underlying data storage layer so that it can support a variety of storage technologies.
Four Java open source projects grew out of Nutch's evolution: Hadoop, Tika, Gora, and Crawler Commons. Today all four are growing quickly and are widely used, especially Hadoop, which has become a de facto standard for massive data processing. Tika draws on a variety of existing open source content-parsing projects to extract metadata and structured text from files in many formats. Gora supports persisting big data to a variety of storage implementations, and Crawler Commons is a set of common web crawler components.
Nutch aims to make it easy for anyone to set up a world-class web search engine at a fraction of the usual cost. To accomplish this ambitious goal, Nutch must be able to:
Fetch billions of pages per month
Maintain an index of these pages
Serve thousands of searches per second against that index
Provide high-quality search results
Operate at minimal cost
Online Javadoc: http://tool.oschina.net/apidocs/apidoc?api=nutch2.0