Hadoop++ is a non-invasive optimization of Hadoop MapReduce. It improves query and join performance by supplying custom implementations of functions such as split inside the Hadoop framework, without changing Hadoop itself. The project is led by Professor Jens Dittrich at Saarland University, Germany. The project homepage is http://infosys.uni-saarland.de/hadoop?#.php.
Hadoop++ optimizes Hadoop in three ways: Trojan index, Trojan join, and Trojan layout.
1. Trojan Index
The core of Trojan index is to organize the data into splits, each consisting of data, index, header, and footer, in that order. The footer acts as the split delimiter, so the last footer must sit at the very end of the file. MapReduce does the sorting when the index is built. At query time, the split function starts from the end of the file and uses the footer information to locate each split. The itemize function then uses the index to jump directly to the records that satisfy the query's range condition.
Compared with database technology, Trojan index is similar to an index-organized table.
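The split layout and the backward footer walk described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Hadoop++ implementation: the byte formats (RECORD, INDEX_ENTRY, HEADER, FOOTER), the sparse index with a fixed STRIDE, and the helper names build_split and range_query are all hypothetical. Only the names split and itemize come from the description above.

```python
import struct
from bisect import bisect_left

# All formats below are hypothetical, chosen only to keep the sketch small.
RECORD = struct.Struct("<i12s")     # key + fixed-width payload
INDEX_ENTRY = struct.Struct("<ii")  # (key, record number) for every STRIDE-th record
HEADER = struct.Struct("<ii")       # (record count, index entry count)
FOOTER = struct.Struct("<i")        # byte length of the whole split, footer included
STRIDE = 2

def build_split(records):
    """Lay out one split as data, index, header, footer, with data sorted by key."""
    records = sorted(records)
    data = b"".join(RECORD.pack(k, p.encode()) for k, p in records)
    entries = [(records[i][0], i) for i in range(0, len(records), STRIDE)]
    index = b"".join(INDEX_ENTRY.pack(k, i) for k, i in entries)
    body = data + index + HEADER.pack(len(records), len(entries))
    return body + FOOTER.pack(len(body) + FOOTER.size)

def split(blob):
    """Walk the footers backwards from the end of the file; return (start, end) pairs."""
    bounds, end = [], len(blob)
    while end > 0:
        (size,) = FOOTER.unpack_from(blob, end - FOOTER.size)
        bounds.append((end - size, end))
        end -= size
    return bounds[::-1]

def itemize(blob, start, end, lo, hi):
    """Yield the records of one split whose key lies in [lo, hi]."""
    header_at = end - FOOTER.size - HEADER.size
    n_records, n_entries = HEADER.unpack_from(blob, header_at)
    if n_records == 0:
        return
    index_at = header_at - n_entries * INDEX_ENTRY.size
    entries = [INDEX_ENTRY.unpack_from(blob, index_at + i * INDEX_ENTRY.size)
               for i in range(n_entries)]
    keys = [k for k, _ in entries]
    # The run before the first index key >= lo may still hold matching records.
    first = entries[max(bisect_left(keys, lo) - 1, 0)][1]
    for n in range(first, n_records):
        key, payload = RECORD.unpack_from(blob, start + n * RECORD.size)
        if key > hi:
            break  # data is sorted, so nothing further can match
        if key >= lo:
            yield key, payload.rstrip(b"\0").decode()

def range_query(blob, lo, hi):
    """Run itemize over every split found by the backward footer walk."""
    return [rec for s, e in split(blob) for rec in itemize(blob, s, e, lo, hi)]
```

Concatenating the output of two build_split calls gives a file with two splits. range_query then reads only the index and the matching run of each split instead of scanning every record.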
2. Trojan Join
Trojan join co-partitions related records from multiple tables into the same split according to the join attribute and organizes them in a structure similar to that of Trojan index. The records produced by itemize already contain the attributes of both sides of the join, so the query no longer needs the usual map, shuffle, and reduce steps to compute the join on the join attribute.
Compared with database technology, Trojan join is similar to multi-table clustering.
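The co-partitioning idea can be illustrated with a small Python sketch. It assumes rows are plain dictionaries and that both tables share one join column name. The functions co_partition and itemize_join, and the use of crc32 as the partitioning hash, are hypothetical choices for this example rather than the Hadoop++ code.

```python
from collections import defaultdict
from zlib import crc32

def co_partition(left, right, key, n_splits):
    """Place rows of both tables that share a join key in the same split.

    Each split maps a join key to a pair (left rows, right rows).
    """
    splits = [defaultdict(lambda: ([], [])) for _ in range(n_splits)]
    for side, table in enumerate((left, right)):
        for row in table:
            k = row[key]
            # Hypothetical partitioner: a stable hash of the join key.
            target = crc32(repr(k).encode()) % n_splits
            splits[target][k][side].append(row)
    return splits

def itemize_join(one_split):
    """Emit joined rows from a single split; no other split is consulted."""
    for k in sorted(one_split):
        left_rows, right_rows = one_split[k]
        for l in left_rows:
            for r in right_rows:
                yield {**l, **r}

customers = [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bob"}]
orders = [{"cid": 1, "item": "pen"}, {"cid": 1, "item": "ink"}, {"cid": 3, "item": "cup"}]

splits = co_partition(customers, orders, "cid", n_splits=2)
joined = [row for s in splits for row in itemize_join(s)]
```

Because matching rows are already co-located at load time, each split can be joined independently, which is why no shuffle is needed at query time in this sketch.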
3. Trojan Layout
Similar to PAX, Trojan layout organizes the data inside a block so that attributes frequently accessed together by queries are grouped together. Different replicas use different layouts. The optimal layout for each is computed from the workload, using an algorithm similar to the knapsack algorithm.
Compared with database technology, Trojan layout is similar to vertical partitioning. The highlight is that different replicas use different vertical partitions.
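To make the per-replica idea concrete, here is a minimal Python sketch that picks a vertical partitioning for each replica from its share of the workload. It swaps in a brute-force search over all groupings in place of the knapsack-style algorithm mentioned above, so it only works for a handful of attributes. The cost model (a query reads every column group it touches in full), the tie-break on fewer groups, and all names and numbers are assumptions made for illustration.

```python
def read_cost(layout, query, width):
    """Bytes read per row: every column group the query touches is read whole."""
    return sum(sum(width[a] for a in group)
               for group in layout if set(group) & query)

def partitions(attrs):
    """Enumerate every way to split attrs into column groups."""
    if not attrs:
        yield []
        return
    head, rest = attrs[0], attrs[1:]
    for part in partitions(rest):
        yield [[head]] + part                      # head in a group of its own
        for i in range(len(part)):                 # or head joins an existing group
            yield part[:i] + [[head] + part[i]] + part[i + 1:]

def best_layout(attrs, queries, width):
    """Brute-force the cheapest layout; ties go to the layout with fewer groups."""
    return min(partitions(attrs),
               key=lambda lay: (sum(read_cost(lay, q, width) for q in queries),
                                len(lay)))

# Hypothetical schema (attribute widths in bytes) and two workload groups.
attrs = ["id", "name", "price", "qty"]
width = {"id": 4, "name": 32, "price": 8, "qty": 4}
workload_a = [{"id", "name"}, {"id", "name"}]
workload_b = [{"price", "qty"}, {"price"}]

layout_replica_1 = best_layout(attrs, workload_a, width)
layout_replica_2 = best_layout(attrs, workload_b, width)
```

With these inputs the two replicas end up with different column groupings, each tuned to the queries routed to it. A real system would also have to account for tuple reconstruction cost and for far more attributes than brute force can handle.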