Hadoop + Lucene + Nutch
Hadoop is an open-source implementation of the ideas behind Google's GFS (Google File System) and MapReduce, which makes it a distributed computing platform. It is not only a distributed file system for storage but also a framework for running distributed applications on large clusters built from general-purpose (commodity) machines.
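To make the MapReduce model mentioned above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the job. It is plain Java and does not use the Hadoop API; the class name WordCountSketch and its method names are made up for illustration. On a real cluster each phase would run in parallel on many machines, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the map -> shuffle -> reduce flow.
// Hypothetical names throughout; this is NOT the Hadoop API.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values of one key into a single count.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: run the three phases in order over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(emitted).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("hadoop runs mapreduce", "nutch runs on hadoop")));
    }
}
```

The point of the split is that map and reduce only ever see one line or one key at a time, so a framework is free to distribute them across machines and handle the shuffle in between.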
Lucene is a high-performance full-text indexing and search toolkit written in Java that can be easily embedded in applications to give them full-text search. Nutch is a search engine application built on Lucene: Lucene provides the text indexing and search APIs, and Nutch adds the data capture (web crawling) function on top of them.
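The core idea behind a full-text index engine is the inverted index, which maps each term to the documents that contain it. The toy below sketches that idea in plain Java. It is not Lucene's API and leaves out everything that makes Lucene fast and useful (analyzers, scoring, on-disk segments); the class InvertedIndexSketch and its methods are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: term -> sorted set of document ids.
// Hypothetical names; this is NOT the Lucene API.
class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexing: record the document id under every term of the text.
    void addDocument(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Searching: return ids of documents containing ALL query terms.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : tokenize(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term seeds the result
            } else {
                result.retainAll(docs);         // later terms narrow it down
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Very rough tokenizer: lowercase, split on anything non-alphanumeric.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "Hadoop is a distributed computing platform");
        index.addDocument(2, "Lucene is a full-text index engine");
        index.addDocument(3, "Nutch is a search engine built on Lucene and Hadoop");
        System.out.println(index.search("lucene engine"));
    }
}
```

In the division of labor described above, a crawler such as Nutch would be the component that fetches pages and feeds their text to the indexing step.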
Before Nutch 0.8.0, Hadoop was part of Nutch. Starting with Nutch 0.8.0, NDFS (the Nutch Distributed File System) and the MapReduce implementation were split out into a new open-source project, Hadoop. As a result, the architecture of Nutch 0.8.0 changed fundamentally from earlier versions, and it is built entirely on Hadoop.