My Chinese name is Yu Zhihong. In the HBase community, more people know me as Ted Yu. I have been a member of the HBase PMC since June 2011.
At the HBTC 2012 conference, I was pleasantly surprised by the wide adoption of HBase in China. Well-known companies such as Taobao, Huawei, Intel and IBM all recognize HBase. HBase has become a de facto NoSQL solution in the United States, and the trend is even more pronounced in China.
I am delighted to see these companies' contributions to HBase. I was impressed by Huawei's secondary indexing, which came up in the presentations of the NoSQL track, and users have started asking when the solution will be open sourced. I think Huawei's management should open the source code of its solution to catch the wave and let Huawei gain more recognition in the HBase community. One of the complexities of open source development is that any closed solution is at risk of being eliminated by intense competition from other players.
Let me explain the need for open source. Software development has changed tremendously since Linux established itself as the de facto operating system for enterprise clusters (and later for cloud computing platforms). However large a single company's internal team is, its resources cannot match those of the open source community. The reason is that a solution that serves the needs of many communities will be more general, and the collective wisdom of the community exceeds that of in-house talent. Developers, especially those who are new to the open source movement, should actively seek out and take seriously the feedback from the code review process. Other developers, especially those who focus on a specific area, can often spot design flaws or problems that everyone else has overlooked. We should think of the open source process as a way of raising the solution to a higher level.
Along this line of thinking, I would like to emphasize the importance of good communication, especially written English. The HBase community spans many countries around the world. Our contributors come from the United States, India, China, Russia, Turkey and Germany. Because English is the working language, those who can express their ideas in clear English have an advantage in persuading the community to accept their approach.
My talk at the HBTC 2012 conference was titled "Enterprise-Oriented HBase". In my opinion, more and more enterprises will adopt HBase, so it needs to provide the following features.
1. Multi-tenancy.
eBay, Yahoo, Taobao and other companies run large internal clusters that serve multiple teams. Guaranteeing SLAs in a shared cluster is one of the features users need most. There are several ways to meet this requirement. One is to group region servers so that each team's workload has dedicated resources. However, this leads to low hardware utilization across the cluster, and because usage differs from team to team, it is sometimes difficult to find the right partitioning. Another approach is to prioritize and distribute requests so that each region server receives a roughly even workload. This approach is more difficult to implement.
2. Snapshots.
Oracle databases provide rollback at the transaction, table, and database level. Providing snapshots, especially online snapshots, is critical to earning the trust of businesses that place high demands on data durability. Developers at Cloudera, Salesforce and Hortonworks are actively working on this feature. The first phase of the implementation focuses on preparing the underlying architecture. Online snapshots have been integrated into version 0.96 and are now being backported to version 0.94.
3. Multiple write-ahead logs (WALs).
Currently, there is only one WAL per region server, which becomes a bottleneck under write-heavy workloads. Some companies run multiple region servers on the same physical machine when that machine has a more powerful hardware configuration. Multiple WALs will mitigate this problem. This feature requires careful design: we must take into account the impact on the existing WAL logic, on replication, and the additional load placed on the NameNode. There are also efforts to serve writes before the cluster has fully recovered from a failure, and these rely on WAL support as well. See HBASE-5699. This feature is also related to incremental snapshots, which take into account the WAL entries of the selected tables. See the discussion in HBASE-7912.
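As a rough illustration of the idea only (this is not the design discussed in HBASE-5699), the sketch below spreads the regions of one region server across a fixed number of logs by hashing the region name. `NUM_WALS`, `wal_for_region`, and the region names are hypothetical and exist only for this example.

```python
import zlib
from collections import Counter

NUM_WALS = 4  # hypothetical number of WALs on one region server


def wal_for_region(region_name: str, num_wals: int = NUM_WALS) -> int:
    """Map a region to a WAL index using a stable hash, so that every
    edit for a given region always lands in the same log."""
    return zlib.crc32(region_name.encode("utf-8")) % num_wals


# Hypothetical region names hosted by one region server.
regions = [f"usertable,row{i:04d}" for i in range(1000)]

# Count how many regions are assigned to each log.
load = Counter(wal_for_region(r) for r in regions)
print(sorted(load.items()))
```

Keeping all edits of a region in a single log preserves their relative order, which is one of the constraints a real design would presumably have to respect alongside replication and NameNode load.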
4. Flexible handling of key values.
Currently, the row, column family, column qualifier, timestamp, and value of each KeyValue are stored in one contiguous byte array. This limits access speed and wastes a lot of memory because of redundancy. Block encoding in HBase 0.94 eases this problem, but it is only the first step towards a final solution. HBASE-4676 introduces the Cell interface, which no longer requires the components of a KeyValue (row, column, etc.) to be stored together, and so takes this a step further. Coupled with prefix trie encoding, it greatly improves the utilization of the block cache. Another related task is to reduce the number of copies made of the same data as it moves through the region server (block cache, IPC threads). See HBASE-7898.
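To make the redundancy concrete, here is a minimal sketch that assumes a simplified key layout (length-prefixed row and family, then qualifier, then an 8-byte timestamp) and applies plain prefix encoding between consecutive keys. It is not the HBase on-disk format and not the prefix trie encoder; the function names and sample data are made up for illustration.

```python
import struct


def encode_key(row: bytes, family: bytes, qualifier: bytes, ts: int) -> bytes:
    """Pack row, family, qualifier and timestamp into one contiguous key
    (simplified layout, assumed for this example)."""
    return (struct.pack(">H", len(row)) + row
            + struct.pack(">B", len(family)) + family
            + qualifier
            + struct.pack(">q", ts))


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(keys):
    """Store each key as (shared_prefix_length, suffix) relative to the
    previous key, so repeated row/family bytes are written only once."""
    out, prev = [], b""
    for k in keys:
        p = common_prefix_len(prev, k)
        out.append((p, k[p:]))
        prev = k
    return out


def prefix_decode(encoded):
    """Rebuild the full keys from (shared_prefix_length, suffix) pairs."""
    keys, prev = [], b""
    for p, suffix in encoded:
        k = prev[:p] + suffix
        keys.append(k)
        prev = k
    return keys


# Three columns of the same row: row and family bytes repeat in every key.
keys = [encode_key(b"user0001", b"info", q, 1356998400000)
        for q in (b"city", b"email", b"name")]
encoded = prefix_encode(keys)
raw_bytes = sum(len(k) for k in keys)
suffix_bytes = sum(len(s) for _, s in encoded)
print(raw_bytes, suffix_bytes)
```

The fewer bytes each block needs for repeated row and family data, the more cells fit into the same amount of block cache, which is the effect described above.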
5. Cross-region transactions.
Currently, HBase only supports transactions that are confined to a single region. This severely restricts schema and application design. Users have been asking for more general transaction support. Several open source projects loosen this restriction, such as Yahoo's Omid. I hope the HBase core will integrate this capability.
6. Secondary indexes.
Users with a relational database background naturally expect to get higher query performance by using secondary indexes. Although it is difficult to define exactly what a secondary index means in a NoSQL database, we can come up with models that serve many, if not all, use cases. Higher query performance will encourage users to migrate from relational databases to HBase.
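One possible model, shown here as a toy in-memory sketch rather than any actual HBase or Huawei implementation, keeps a second mapping from a column value back to the row keys that contain it. The class `IndexedTable` and its methods are hypothetical names used only for this example.

```python
from collections import defaultdict


class IndexedTable:
    """Toy in-memory table with one secondary index on a chosen column."""

    def __init__(self, indexed_column: str):
        self.indexed_column = indexed_column
        self.rows = {}                  # row key -> {column: value}
        self.index = defaultdict(set)   # column value -> set of row keys

    def put(self, row_key: str, columns: dict) -> None:
        """Write columns to a row and keep the index in step with the data."""
        old = self.rows.get(row_key, {})
        merged = {**old, **columns}
        old_val = old.get(self.indexed_column)
        new_val = merged.get(self.indexed_column)
        if old_val != new_val:
            if old_val is not None:
                self.index[old_val].discard(row_key)
            if new_val is not None:
                self.index[new_val].add(row_key)
        self.rows[row_key] = merged

    def scan_by_value(self, value):
        """Without an index: inspect every row in the table."""
        return sorted(k for k, cols in self.rows.items()
                      if cols.get(self.indexed_column) == value)

    def lookup_by_value(self, value):
        """With the index: go straight to the matching row keys."""
        return sorted(self.index.get(value, ()))


table = IndexedTable("city")
table.put("u1", {"city": "Hangzhou", "name": "A"})
table.put("u2", {"city": "Shenzhen"})
table.put("u3", {"city": "Hangzhou"})
table.put("u2", {"city": "Hangzhou"})   # update moves u2 in the index
print(table.lookup_by_value("Hangzhou"))
```

The sketch also shows where the difficulty lies: every write has to update both the data and the index, and in a distributed store those two updates may land on different servers.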
7. Enhanced security.
Some applications store sensitive data in HBase, which calls for stricter security rules. HBASE-7897 adds tag support to the Cell interface, laying the groundwork for this goal. According to performance data, embedding ACLs as tags in KeyValues performs better than using a dedicated ACL column family. Core contributor Andrew Purtell provides a reference implementation in "HBASE-6222: Add per-KeyValue security".
8. Ease of use.
Unlike MongoDB, HBase requires tuning after initial deployment. We should make HBase parameters easier to tune. For example, when the workload is read-heavy, more heap should be given to the block cache. When the workload is write-heavy, the memstore should get more heap.
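A minimal sketch of such a rule of thumb is given below. It assumes that block cache and memstore together may use at most a fixed share of the heap (`budget`, set to 0.8 here as an assumption) and that each gets at least a small `floor`. The function name and the proportional heuristic are invented for illustration. In HBase releases of this period the two fractions would, to my understanding, map to `hfile.block.cache.size` and `hbase.regionserver.global.memstore.upperLimit`.

```python
def suggest_heap_split(read_fraction: float, budget: float = 0.8,
                       floor: float = 0.1):
    """Split `budget` of the heap between block cache and memstore in
    proportion to the share of reads, giving each side at least `floor`.

    Hypothetical heuristic, not an HBase feature.
    Returns (block_cache_fraction, memstore_fraction).
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    if 2 * floor > budget:
        raise ValueError("floor is too large for the given budget")
    spare = budget - 2 * floor
    block_cache = floor + spare * read_fraction
    memstore = budget - block_cache
    return round(block_cache, 2), round(memstore, 2)


# Read-heavy, balanced, and write-heavy workloads.
for reads in (0.9, 0.5, 0.1):
    print(reads, suggest_heap_split(reads))
```

A proportional split is only the simplest possible policy; a real tuning aid would also need to look at cache hit ratios and flush frequency before changing anything.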
I would also like to mention several projects built on top of HBase. Panthera from Intel provides a SQL interface that enables better analytics on Hadoop/HBase. Wasp from Taobao is an implementation of Google Megastore. Impala from Cloudera is being tested and is in the process of being open sourced. Phoenix from Salesforce provides a SQL interface for querying HBase. I believe a winner will soon emerge among these interactive technologies, one that makes HBase more user-friendly. As for what the winner will look like, I can think of several traits: open source, strong community support, and the potential to become an Apache Incubator project.