Application Practice of HBase at Xiaomi, by Tri Jianwei


On March 25, 2014, the CSDN online training session "Application Practice of HBase at Xiaomi" concluded successfully. The trainer, Tri Jianwei of Xiaomi, explained that as Xiaomi's business gradually expanded, and especially with the arrival of the big data era, the original relational database, MySQL, gradually became unable to meet its needs, so moving to NoSQL was the natural step.

CSDN Online Training is a live, interactive online technical training series designed for technical practitioners. It invites front-line engineers from across the industry to share the problems they have encountered in their work and the solutions they found, and at the same time to introduce new technologies, ideas, and solutions.

The courses are built around the qualities of being "classic, practical, systematic, forward-looking, and professional." Through video lectures, document sharing, whiteboard sharing, screen sharing, live instructor Q&A, and other forms of learning, the training helps front-line engineers use fragments of free time to strengthen their hands-on skills, raise their level of practice, and communicate and interact with technical experts.

Because the training time was limited, the lecturer did not have time to answer many of the questions raised during the Q&A session, so CSDN has prepared this Q&A summary of the training to help attendees review and consolidate the technical points covered, master them faster, and avoid detours. Some of the Q&A follows; for more questions, see the interactive discussion thread for this event: http://bbs.csdn.net/topics/390742064.

Q: Is your cluster deployed on community Hadoop/HBase or on CDH?

We currently use the community HBase 0.94 branch.

Q: What does Xiaomi's base platform team do?

It is responsible for developing Xiaomi's storage and computing platforms.

Q: Is Hive's performance worse than hand-written MapReduce?

Hive's advantage is that large-scale data analysis and processing can be done in SQL, so the learning cost is lower. The MapReduce jobs that Hive generates are optimized, and sometimes they are even more efficient than the MapReduce jobs one writes by hand. There are also cases where a poorly written HiveQL statement leads to inefficiency; those require analyzing the logic of the generated MapReduce job case by case.

Q: I have an HBase cluster that serves both reads and writes. Writes peak once a day, and after about a month of smooth operation, every query becomes very slow. Restarting never solves the problem when this happens, but a manual compaction and split did solve it. My question is why. Could you help analyze this?

There can be many reasons for slow queries. Compaction merges HFiles and removes deleted and expired data, which helps query efficiency greatly; after a split, the resulting regions are compacted, which also improves query efficiency. In general, restarting the cluster has no direct connection to query efficiency. Beyond that, HBase read performance is mainly related to the ratio of memory to disk, since disk reads have much higher latency. Are your data accesses completely random, or do they mostly touch recently written data? If you mostly read recent data, the probability of hitting memory is high, and read efficiency will not fall off quickly as the data volume grows. If reads are completely random over a large data set, the share of reads that must go to disk grows, read performance degrades more visibly, and disk utilization (for example, the %util column of iostat) is probably very high when reads are slow.
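
As a concrete illustration, here is a minimal sketch of triggering the manual major compaction that the answer credits with fixing the slowdown. It uses the modern HBase 2.x Java client (the 0.94-era HBaseAdmin calls differ slightly), and the table name is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TriggerCompaction {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // A major compaction rewrites all HFiles of each store into one file,
      // dropping deleted and expired cells along the way. The request is
      // queued on the region servers and runs asynchronously.
      admin.majorCompact(TableName.valueOf("my_table")); // placeholder table name
    }
  }
}
```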

Q: What were the biggest problems you encountered when using HBase, and how did you solve them, bit by bit?

We have encountered many challenges, for example around high availability and performance. We solved them gradually, mainly by investing effort in understanding the code, optimizing the implementation, adding more debugging information to clarify issues, and writing fault summaries.

Q: How do you tune GC when running HBase?

Using the GC logs, we pay close attention to -Xmn, -XX:SurvivorRatio, -XX:MaxTenuringThreshold, and the number of concurrent GC threads. Parameter tuning only mitigates GC problems; ultimately the focus is on reducing memory garbage and fragmentation at the code level.
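
As a hedged illustration only (these are standard HotSpot flags, not Xiaomi's actual settings, and every value is a placeholder to be tuned against your own GC logs), the flags mentioned above could be set for a region server in hbase-env.sh like this:

```sh
# CMS-era (JDK 6/7) GC settings for an HBase region server; placeholder values.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms16g -Xmx16g -Xmn2g \
  -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=3 \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:ParallelGCThreads=8 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"
```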

Q: Which JDK version are you using now?

JDK 1.6 (a 1.6.0_3x update); we have not officially adopted 1.7.

Q: Earlier you talked about a number of clusters wasting resources. What work has Xiaomi done on energy saving?

For offline services, we build large shared offline clusters so that businesses can share resources. We also gather CPU and disk utilization statistics to look for optimization opportunities.

Q: How do you implement secondary indexes in HBase?

Local secondary indexes rely on the atomicity of transactions within a single region, combined with a key-prefix region split policy (HBase's KeyPrefixRegionSplitPolicy, or its delimiter-based variant DelimitedKeyPrefixRegionSplitPolicy), so that a split never separates a data row from its index rows. Global secondary indexes are built on global cross-row transactions; we have experimented with global secondary indexes whose principle follows Google's Percolator.
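
To make the local case concrete, here is a minimal, hypothetical sketch (modern 2.x client API; the key layout, table, and column names are illustrative, not Xiaomi's actual design). The data row and its index row share the same key prefix, so with a prefix-based split policy they always stay in the same region, where they can be applied atomically (for example, via HBase's MultiRowMutationEndpoint coprocessor).

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class LocalIndexExample {
  static final byte[] CF = Bytes.toBytes("d");

  // Data row key:  <userId>\0d\0<recordId>
  static Put dataPut(String userId, String recordId, String city) {
    Put p = new Put(Bytes.toBytes(userId + "\u0000d\u0000" + recordId));
    p.addColumn(CF, Bytes.toBytes("city"), Bytes.toBytes(city));
    return p;
  }

  // Index row key: <userId>\0i\0city\0<city>\0<recordId>
  // The value is empty: the key itself carries all of the index information,
  // and a prefix scan of "<userId>\0i\0city\0<city>" answers the index lookup.
  static Put indexPut(String userId, String recordId, String city) {
    Put p = new Put(Bytes.toBytes(
        userId + "\u0000i\u0000city\u0000" + city + "\u0000" + recordId));
    p.addColumn(CF, Bytes.toBytes("x"), new byte[0]);
    return p;
  }
}
```

Because both keys start with the user id, configuring the split policy with that prefix guarantees the pair is never split across regions.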

Q: Can you introduce the HBase compaction optimizations?

We have planned some compaction optimization work; see https://issues.apache.org/jira/browse/HBase-9528 for details.

Q: If a cluster has reached 5,000 regions, bringing regions online or offline takes a long time each time. Has Xiaomi optimized region onlining time?

For cluster upgrades we do a rolling update. Before each upgrade shuts down a region server, a script moves its regions to other region servers and flushes the in-memory data of those regions, which reduces the HLog replay time afterwards. Concurrent region server restarts will be added later.
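
A minimal sketch of the drain step described above (modern 2.x client API; HBase itself ships scripted tooling for this, such as bin/graceful_stop.sh, and this is not Xiaomi's actual script):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionInfo;

public class DrainRegionServer {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // e.g. "host1.example.com,16020,1394012470629"
      ServerName source = ServerName.valueOf(args[0]);
      for (RegionInfo region : admin.getRegions(source)) {
        // Flush first so the memstore is persisted to HFiles: after the
        // restart there is little or no WAL (HLog) left to replay.
        admin.flushRegion(region.getRegionName());
        // Move the region off this server (destination chosen by the master).
        admin.move(region.getEncodedNameAsBytes());
      }
    }
  }
}
```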

Q: Do all machines in Xiaomi's clusters have the same configuration? What is a typical configuration (CPU core count, memory, disks, disk speed)?

For some businesses that read much more than they write, we have tried SSDs. For a typical machine configuration, see page 5 of the slides. We use both custom-built machines and machines purchased from vendors such as Lenovo and Huawei.

Q: What are the advantages of Xiaomi's structured storage service?

It is built on HBase, so it is highly scalable and highly available, and it supports both server-side and client-side access modes.

Q: How fast are your company's cluster response times? Can you give a general idea?

Random writes take roughly 2-5 ms; random reads take roughly 3-10 ms.

Q: HBase's real-time read performance is not great. What improvement plans are there?

Read performance mainly comes down to the cache hit rate; as long as the hit rate is high, real-time read performance is good. We have optimized HBase's block cache eviction algorithm, and keeping hot data cached also helps the hit rate. When a read request penetrates down to the HDFS layer or the physical disk below it, actual read performance may depend on the underlying disk I/O capability. At the HDFS level we have implemented the hedged read feature to reduce read request latency, and a multi-block reader is on the development plan. We have not yet done any analysis or tuning of the OS page cache hit rate.
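
For reference, hedged reads (HDFS-5776) are enabled through two HDFS client settings, which an HBase deployment can set in its configuration. A minimal sketch follows; the numeric values are placeholders, not Xiaomi's production settings.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HedgedReadConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Size of the thread pool used for speculative (hedged) reads; 0 disables them.
    conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
    // If the first replica has not answered within this delay, issue a second
    // read against a different replica and take whichever returns first.
    conf.setLong("dfs.client.hedged.read.threshold.millis", 50);
    return conf;
  }
}
```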

Q: Does YARN in Hadoop 2 have any performance impact on HBase? Does it work well with Spark?

On the first question: do you mean running HBase on YARN, or running MapReduce jobs on YARN that process HBase data? We have no experience with the former; the latter should show no obvious difference from MR1.

On the second question: Spark can currently run on YARN and can process HBase data, but Spark 0.9.0's support for secure (Kerberos) clusters is not yet complete.

Q: How are data collected and stored for operations monitoring?

Cluster metrics are exposed through JMX; we collect them periodically with a program and then store them in OpenTSDB.
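
A minimal, hypothetical sketch of that collection pattern (not Xiaomi's collector): poll one metric over JMX, then ship it to OpenTSDB. The JMX port, MBean name, and attribute name below are illustrative; list what your HBase version actually exposes (for example with jconsole) before hard-coding them.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxMetricPoller {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://regionserver-host:10102/jmxrmi"); // illustrative
    try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
      MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
      ObjectName bean = new ObjectName(
          "Hadoop:service=HBase,name=RegionServer,sub=Server");   // illustrative
      Object value = mbsc.getAttribute(bean, "readRequestCount"); // illustrative
      System.out.println("readRequestCount=" + value);
      // A real collector would run this on a timer and write each sample to
      // OpenTSDB, e.g. through its HTTP /api/put endpoint.
    }
  }
}
```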

Q: Can the new HLog write model guarantee strong consistency?

Yes, it is guaranteed: a write handler waits until the synced maxTxid of the underlying AsyncSyncer is at least as large as its own txid before returning.
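
A hedged sketch of that guarantee (illustrative Java, not HBase's actual code): each write handler blocks until the background syncer thread has advanced the synced txid past the handler's own.

```java
public class SyncPoint {
  private long syncedTxid = 0;

  // Called by a write handler after appending its edit with id `txid`:
  // block until the syncer has made that edit durable.
  public synchronized void waitUntilSynced(long txid) throws InterruptedException {
    while (syncedTxid < txid) {
      wait(); // woken by the syncer after each successful sync
    }
  }

  // Called by the async syncer thread after fsyncing the log up to `txid`.
  public synchronized void markSynced(long txid) {
    if (txid > syncedTxid) {
      syncedTxid = txid;
      notifyAll();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    SyncPoint sp = new SyncPoint();
    Thread syncer = new Thread(() -> {
      for (long t = 1; t <= 5; t++) {
        try { Thread.sleep(10); } catch (InterruptedException e) { return; }
        sp.markSynced(t); // simulate the background sync advancing
      }
    });
    syncer.start();
    sp.waitUntilSynced(5); // returns only once txid 5 is durable
    System.out.println("write acknowledged only after its txid was synced");
  }
}
```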

Q: When Xiaomi was selecting a database, did you consider MongoDB? Why did you finally choose HBase and drop MongoDB?

HBase has advantages in scalability, reliability, and fault tolerance, and it is better suited to large-scale data scenarios.

Q: A question about HBase versions: if a cell holds too many versions, will read performance degrade? For example, 10,000 versions? (The requirement comes from needing to store an id list in one cell.)

A whole row is returned in a single RPC, so if the row grows too large it can hurt read performance. At present we prefer tall, narrow ("lanky") rows.
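
A hedged sketch of the tall-row alternative (modern 2.x API; the table, family, and key layout are illustrative): instead of 10,000 versions of one cell, give each list element its own row and cap versions at one.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class TallTableSchema {
  public static TableDescriptor build() {
    return TableDescriptorBuilder.newBuilder(TableName.valueOf("id_list"))
        .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("d"))
            .setMaxVersions(1) // one version per cell; history lives in rows instead
            .build())
        .build();
  }

  // One list element per row: <listId>\0<elementId>. Reading the whole list
  // becomes a prefix scan over many short rows rather than one huge
  // multi-version cell shipped back in a single RPC.
  static byte[] rowKey(String listId, String elementId) {
    return Bytes.toBytes(listId + "\u0000" + elementId);
  }
}
```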

Original link: http://www.csdn.net/article/2014-04-01/2819083-HBase-Hadoop
