Apache Accumulo
The Apache Accumulo sorted, distributed key/value store is based on Google's BigTable design. It is built on top of Apache Hadoop, ZooKeeper, and Thrift. It features a few novel improvements on the BigTable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process. Categories: database | Languages: Java | PMC: Apache Accumulo
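As a sketch of the cell-level access labels, the snippet below uses the 1.x-style Accumulo Java client to write a value guarded by a visibility expression; the instance name, ZooKeeper quorum, table name, and credentials are placeholder assumptions.

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.ColumnVisibility;

    public class AccumuloVisibilityExample {
      public static void main(String[] args) throws Exception {
        // Connect through ZooKeeper (instance name, quorum, and credentials are placeholders).
        Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
            .getConnector("user", new PasswordToken("secret"));

        BatchWriter writer = conn.createBatchWriter("records", new BatchWriterConfig());

        // Each cell carries its own visibility expression; only scans whose
        // authorizations satisfy "admin|audit" will ever return this value.
        Mutation m = new Mutation("row-001");
        m.put("details", "ssn", new ColumnVisibility("admin|audit"),
            new Value("123-45-6789".getBytes()));
        writer.addMutation(m);
        writer.close();
      }
    }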
Apache Ambari
Apache Ambari makes Hadoop cluster provisioning, managing, and monitoring dead simple. Categories: big-data | Languages: Java, Python, JavaScript | PMC: Apache Ambari
Apache Avro
Apache Avro is a data serialization system. Categories: library, big-data | Languages: C, C++, C#, Java, PHP, Python, Ruby | PMC: Apache Avro
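A minimal sketch of Avro serialization in Java using the generic API: a record schema is declared in Avro's JSON schema language and a record is written to a container file. The schema fields and output file name are illustrative assumptions.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroExample {
      public static void main(String[] args) throws Exception {
        // A record schema defined in Avro's JSON schema language.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Serialize to an Avro container file; the schema travels with the data.
        try (DataFileWriter<GenericRecord> out =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
          out.create(schema, new File("users.avro"));
          out.append(user);
        }
      }
    }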
Apache Chukwa
Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and MapReduce framework and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results of the collected data. Categories: hadoop | Languages: Java, JavaScript | PMC: Apache Chukwa
Apache Drill
Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google's Dremel. Categories: big-data | Languages: Java | PMC: Apache Drill
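To illustrate the SQL-over-anything idea, the sketch below runs a query over raw JSON files through Drill's JDBC driver; the ZooKeeper address, file path, and column names are placeholder assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DrillQueryExample {
      public static void main(String[] args) throws Exception {
        // Connect via Drill's JDBC driver; the ZooKeeper quorum is a placeholder.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=zk1:2181");
             Statement stmt = conn.createStatement();
             // Query JSON files in place via the dfs storage plugin, no schema definition needed.
             ResultSet rs = stmt.executeQuery(
                 "SELECT t.user_id, COUNT(*) AS events "
                 + "FROM dfs.`/data/clickstream.json` t GROUP BY t.user_id")) {
          while (rs.next()) {
            System.out.println(rs.getString("user_id") + "\t" + rs.getLong("events"));
          }
        }
      }
    }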
Apache Giraph
Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Categories: big-data | Languages: Java | PMC: Apache Giraph
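A rough sketch of Giraph's vertex-centric, iterative model: each vertex keeps the largest value it has seen and floods it to its neighbours until nothing changes. The class name and value types are illustrative assumptions modeled on the Giraph 1.x computation API.

    import java.io.IOException;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;

    // Propagate the maximum value through the graph, one superstep at a time.
    public class MaxValueComputation
        extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

      @Override
      public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                          Iterable<DoubleWritable> messages) throws IOException {
        double max = vertex.getValue().get();
        for (DoubleWritable msg : messages) {
          max = Math.max(max, msg.get());
        }
        // Only flood the value when it changed (or on the first superstep).
        if (getSuperstep() == 0 || max > vertex.getValue().get()) {
          vertex.setValue(new DoubleWritable(max));
          sendMessageToAllEdges(vertex, new DoubleWritable(max));
        }
        vertex.voteToHalt();
      }
    }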
Apache Hadoop
Hadoop is a distributed computing platform. This includes the Hadoop Distributed File System (HDFS) and an implementation of MapReduce. Categories: database | Languages: Java | PMC: Apache Hadoop
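The canonical illustration of the MapReduce model is word count; the sketch below shows a mapper and reducer written against the org.apache.hadoop.mapreduce API (class names are illustrative, and the job driver is omitted).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emit (word, 1) for every token in a line of input.
    class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE);
          }
        }
      }
    }

    // Reducer: sum the counts emitted for each word.
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }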
Apache Hama
Apache Hama is an efficient and scalable general-purpose BSP computing engine which can be used to speed up a large variety of compute-intensive analytics applications. Categories: big-data | Languages: Java | PMC: Apache Hama
Apache HBase
Use Apache HBase software when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. Categories: database | Languages: Java | PMC: Apache HBase
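A minimal sketch of the random read/write access pattern using the post-1.0 HBase Java client; the table name, row key, and column family are placeholder assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

          // Random write: one row keyed by reversed URL, one cell in the "contents" family.
          Put put = new Put(Bytes.toBytes("com.example/index"));
          put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
              Bytes.toBytes("<html>...</html>"));
          table.put(put);

          // Random read of the same cell.
          Result result = table.get(new Get(Bytes.toBytes("com.example/index")));
          byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
          System.out.println(Bytes.toString(html));
        }
      }
    }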
Apache Hive
The Apache Hive (TM) data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop (TM), it provides tools to enable easy data extract/transform/load (ETL), a mechanism to impose structure on a variety of data formats, access to files stored either directly in Apache HDFS (TM) or in other data storage systems such as Apache HBase (TM), and query execution via MapReduce. Hive defines a simple SQL-like query language, called HiveQL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that is not supported by the built-in capabilities of the language. HiveQL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs). Categories: database | Languages: Java | PMC: Apache Hive
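As a sketch of imposing structure on files in HDFS and querying them with HiveQL, the snippet below talks to HiveServer2 over JDBC; the server address, credentials, table layout, and data location are placeholder assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQLExample {
      public static void main(String[] args) throws Exception {
        // HiveServer2 host/port, database, and table are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

          // Impose structure on files already sitting in HDFS.
          stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS page_views "
              + "(user_id STRING, url STRING, ts BIGINT) "
              + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
              + "LOCATION '/data/page_views'");

          // Executed under the hood as one or more MapReduce jobs.
          try (ResultSet rs = stmt.executeQuery(
                   "SELECT url, COUNT(*) AS hits FROM page_views "
                   + "GROUP BY url ORDER BY hits DESC LIMIT 10")) {
            while (rs.next()) {
              System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
          }
        }
      }
    }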
Apache Lucene Core
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Categories: database | Languages: Java | PMC: Apache Lucene
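A small index-then-search sketch with the Lucene Java API (written against the Lucene 8+ classes; exact constructors vary across versions, and the field name and query text are illustrative assumptions).

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneExample {
      public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory();  // in-memory index for the example

        // Index a single document with one analyzed, stored text field.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
          Document doc = new Document();
          doc.add(new TextField("body",
              "Apache Lucene is a full-text search engine library", Field.Store.YES));
          writer.addDocument(doc);
        }

        // Parse a free-text query and search the index.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
          IndexSearcher searcher = new IndexSearcher(reader);
          TopDocs hits = searcher.search(
              new QueryParser("body", analyzer).parse("search library"), 10);
          for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("body"));
          }
        }
      }
    }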
Apache Mahout
Scalable machine learning library. Categories: library | Languages: Java | PMC: Apache Mahout
Apache Nutch
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely: Nutch 1.x, a well matured, production ready crawler; 1.x enables fine grained configuration, relying on Apache Hadoop data structures, which are great for batch processing. Nutch 2.x, an emerging alternative taking direct inspiration from 1.x, but which differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora for handling object-to-persistent mappings. This means an extremely flexible model/stack can be implemented for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions. Being pluggable and modular of course has its benefits; Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, etc. Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. Categories: web-framework | Languages: Java | PMC: Apache Nutch
Apache Oozie
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, streaming map-reduce, Pig, Hive, Sqoop and DistCp) as well as system specific jobs (such as Java programs and shell scripts). Categories: big-data | Languages: Java, JavaScript | PMC: Apache Oozie
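A sketch of submitting a workflow from Java with the Oozie client library; the server URL, HDFS application path, and cluster endpoints are placeholder assumptions (the workflow.xml itself lives under the application path and is not shown).

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    public class OozieSubmitExample {
      public static void main(String[] args) throws Exception {
        // The Oozie server URL and all paths/hosts below are placeholders.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/workflows/daily-load");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow defined by the workflow.xml under APP_PATH.
        String jobId = oozie.run(conf);
        System.out.println("Workflow job " + jobId + " is " + oozie.getJobInfo(jobId).getStatus());
      }
    }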
Apache Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. Pig's infrastructure layer consists of a compiler that produces sequences of map-reduce programs. Pig's language layer consists of a textual language called Pig Latin, which has the following key properties: * Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain. * Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency. * Extensibility. Users can create their own functions to do special-purpose processing. Categories: database | Languages: Java | PMC: Apache Pig
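To give a flavour of Pig Latin as a data-flow language, the sketch below embeds a small script via the PigServer Java API; the input path, field layout, and output path are placeholder assumptions.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigLatinExample {
      public static void main(String[] args) throws Exception {
        // Run the Pig Latin data flow on the cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("logs = LOAD '/data/access_logs' USING PigStorage('\\t') "
            + "AS (user:chararray, url:chararray);");
        pig.registerQuery("by_url = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH by_url GENERATE group AS url, COUNT(logs) AS n;");
        // Pig compiles this data flow into a sequence of map-reduce jobs.
        pig.store("hits", "/output/url_hits");
      }
    }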
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing. It offers high-level APIs in Java, Scala and Python as well as a rich set of libraries including stream processing, machine learning, and graph analytics. Categories: big-data | Languages: Java, Scala, Python | PMC: Apache Spark
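A short word-count sketch with Spark's Java RDD API (the Iterator-returning flatMap matches the Spark 2.x signature; the HDFS paths are placeholder assumptions).

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          JavaRDD<String> lines = sc.textFile("hdfs:///data/input");

          // Classic word count expressed with Spark's high-level RDD API.
          JavaPairRDD<String, Integer> counts = lines
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
              .mapToPair(word -> new Tuple2<>(word, 1))
              .reduceByKey((a, b) -> a + b);

          counts.saveAsTextFile("hdfs:///data/word_counts");
        }
      }
    }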
Apache Sqoop
Apache Sqoop (TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Categories: big-data | Languages: Java | PMC: Apache Sqoop
Apache Storm
Apache Storm is a distributed real-time computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing real-time computation. Categories: big-data | Languages: Java | PMC: Apache Storm
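The core Storm primitives are spouts, bolts, and stream groupings wired into a topology; the sketch below shows that wiring (SentenceSpout, SplitBolt and CountBolt are hypothetical user-defined components, and the imports assume the org.apache.storm packages used from Storm 1.0 onward).

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class WordCountTopology {
      public static void main(String[] args) throws Exception {
        // SentenceSpout, SplitBolt and CountBolt are hypothetical user-defined classes.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 2);
        builder.setBolt("split", new SplitBolt(), 4).shuffleGrouping("sentences");
        // Route each word to the same counting task so counts stay consistent.
        builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
      }
    }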
Apache ZooKeeper
Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination. Categories: database | Languages: Java | PMC: Apache ZooKeeper
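A common coordination building block is group membership via ephemeral znodes; the sketch below registers a worker and lists the current members. The connection string, the /workers parent node (assumed to already exist), and the payload are placeholder assumptions.

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperMembershipExample {
      public static void main(String[] args) throws Exception {
        // Connect to the ensemble; the no-op lambda ignores watch events.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {});

        // An ephemeral, sequential znode disappears automatically if this client's session dies,
        // which is the usual building block for group membership and leader election.
        String path = zk.create("/workers/worker-",
            "host-a:9000".getBytes(StandardCharsets.UTF_8),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Registered as " + path);

        List<String> members = zk.getChildren("/workers", false);
        System.out.println("Current members: " + members);
        zk.close();
      }
    }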