Baidu Technology Salon (Phase 1)-2. Experience in Internet Application Service Expansion

Source address: http://www.infoq.com/cn/presentations/maruyue-ls-data-processing

 

At the 2nd event of the Baidu Technology Salon (May 15), we invited Ma Ruyue, a senior distributed systems engineer at Baidu, and Wang Di, technical director of FreeWheel's core system, to speak on two topics: distributed systems and service expansion. This article briefly summarizes their talks and provides the presentation slides for download.

Contributing to the development of Hadoop

In his talk, Ma Ruyue introduced Baidu's large-scale data storage, data analysis, and data indexing, covering the following:

  • Large-scale data storage

    • Lustre and HDFS
    • System Structure
    • HDFS advantages and disadvantages
  • Large-scale data analysis
    • MPI and MapReduce
    • MapReduce conceptual model and implementation model
    • MapReduce implementation in Hadoop
  • Large-scale data indexing
    • Comparison between MySQL and HBase
    • HBase details
  • Baidu's problems, countermeasures and principles in the above three aspects
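
The MapReduce conceptual model listed above can be illustrated with a small in-memory word count. This is only a sketch of the map, shuffle, and reduce phases, not Hadoop's implementation and not code from the talk; the function names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input record."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(key, values):
    """Reduce phase: combine all values that share one key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run mapper over every record, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    # Keys are sorted before reducing, mirroring the sorted reducer input
    # that MapReduce frameworks provide.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    lines = ["HDFS stores blocks", "MapReduce reads blocks"]
    print(run_mapreduce(lines, map_words, reduce_counts))
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; here all three phases run in one process to keep the model visible.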

Ma Ruyue noted that Baidu currently processes a very large volume of data: more than 20 PB is stored, more than 10 TB is added every day, more than 1 PB is processed every day, and more than 10,000 jobs are submitted every day. The file system is HDFS and the data store is HBase. The cluster has more than 2,000 server nodes, each with 2 × 4 cores. One of the hardest problems at present is the NameNode bottleneck: because a large number of (small) files must be stored, the pressure on the NameNode is very high. The team recently purchased 48 GB of memory, but this is expected to last only until the end of the year, at which point 96 GB may have to be purchased as a stopgap. Baidu has therefore done a lot of research on a distributed NameNode. Ma Ruyue recommends:

If you are interested in this area, take a look at the Ceph file system in Linux 2.6.34, a petabyte-scale distributed file system.
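
To see why a large number of small files strains the NameNode, a rough estimate helps. The sketch below assumes the commonly quoted rule of thumb that each namespace object (a file or a block) costs on the order of 150 bytes of NameNode heap. That figure is an assumption for illustration, not a number given in the talk, and the function names are hypothetical.

```python
# Assumption: rough rule of thumb for heap used per namespace object.
# This is not a figure from the talk; real usage varies by Hadoop version.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files, blocks_per_file=1,
                        bytes_per_object=BYTES_PER_OBJECT):
    """Estimate heap needed to track num_files, each with blocks_per_file blocks."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * bytes_per_object


def max_files_for_heap(heap_bytes, blocks_per_file=1,
                       bytes_per_object=BYTES_PER_OBJECT):
    """Estimate how many files fit in a given amount of NameNode heap."""
    return heap_bytes // ((1 + blocks_per_file) * bytes_per_object)


if __name__ == "__main__":
    gib = 1024 ** 3
    for heap_gib in (48, 96):
        files = max_files_for_heap(heap_gib * gib)
        print(f"{heap_gib} GB heap: roughly {files:,} single-block files")
```

Under this assumption the file count grows only linearly with memory, so doubling from 48 GB to 96 GB merely doubles the headroom, which is consistent with treating the upgrade as a stopgap and looking at a distributed NameNode instead.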

Finally, Ma Ruyue listed several problems and directions that Baidu is currently working on. He suggested that anyone who wants to make a contribution to Hadoop consider these directions, which are also current hot topics:

  • Distributed improvement of the HDFS NameNode
  • Asynchronous read/write for the HDFS DataNode
  • Distributed improvement of the MapReduce JobTracker
  • New job and task scheduler for MapReduce
  • Hadoop C++ extension framework for MapReduce

Some attendees were very interested in the Hadoop C++ extension. Ma Ruyue explained how Baidu works with Hadoop:

We periodically pick a stable release of official Hadoop and do our custom development on top of it. After a while, if we find that the official version has added many new features and is much better than ours, we open a new branch and port our features over. Our engineers are developing the C++ extension for Hadoop (HCE). We branched off at around version 0.19. So far, we have found that trunk still has not overtaken Baidu's own version, so we have not ported to it. HCE is developed on our version, so moving it to trunk would be difficult, would require some adjustments, and would take some time. Our engineers completed a version just last week and will be able to provide a link for trial use.

Data-driven

Wang Di is the technical director of FreeWheel's core system. Since the company was founded, he has been involved in the architecture design of its core advertising system, and he has witnessed FreeWheel grow from an initial 20 ad servers, hundreds of thousands of visits per day, and less than 1 GB of logs per day to today's 60 ad servers, 50 million ad requests per day, 8 log processing servers, and GB-scale log processing every 4 hours. Over the past three years, traffic has increased 20-fold. He shared the following experience and principles:

  • Application Service Extension

    • Stateless Application Service
    • Replication and multi-layer Cache
  • Data warehouse extension
    • De-normalization/Normalization
    • Roll up/Data availability
    • Benchmarking and Query Optimization
    • Split-loading/sharding
  • Operating Principles
    • 50% running load limit & N+1 data centers
    • Monitoring and Response
    • Multi-phase deployment
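
As one illustration of the "split-loading/sharding" item above, the sketch below routes records to shards with a stable hash of a key. It is a generic technique shown under assumptions, not FreeWheel's actual scheme; `shard_for`, `partition`, and the key format are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(key, num_shards):
    """Map a key to a shard index using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys, num_shards):
    """Group keys by the shard they are assigned to."""
    shards = defaultdict(list)
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return dict(shards)


if __name__ == "__main__":
    # Hypothetical keys; any stable identifier would work.
    keys = [f"network-{i}" for i in range(10)]
    for shard, members in sorted(partition(keys, 4).items()):
        print(shard, members)
```

A cryptographic hash is used here only because Python's built-in `hash()` for strings changes between processes; a simple modulo scheme like this also reshuffles most keys when the shard count changes, which is why production systems often prefer consistent hashing or a lookup table.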

Many of these practices grew out of FreeWheel's specific business model and day-to-day work. They are not necessarily the "best", but they are the most suitable for the team. For example, a system load of 50% is treated as the signal to optimize and expand. Likewise, the team builds its testing around automated regression tests rather than TDD-style unit tests.
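
The 50% rule can be expressed as a simple capacity check. The sketch below is an illustration under assumptions: the per-server capacity figures are made up, and the reading of "N+1 data centers" as "the remaining sites can carry the full load when one site is lost" is an interpretation, not something stated in the talk.

```python
import math


def servers_needed(peak_rps, rps_per_server, load_limit=0.5):
    """Servers required so that peak traffic keeps each server at or below the limit."""
    return math.ceil(peak_rps / (rps_per_server * load_limit))


def needs_expansion(current_rps, servers, rps_per_server, load_limit=0.5):
    """True once utilization reaches the limit, the signal to optimize or expand."""
    return current_rps / (servers * rps_per_server) >= load_limit


def survives_datacenter_loss(total_rps, dc_capacities):
    """Assumed N+1 reading: losing the largest site still leaves enough capacity."""
    return sum(dc_capacities) - max(dc_capacities) >= total_rps


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(servers_needed(peak_rps=1000, rps_per_server=100))
    print(needs_expansion(current_rps=500, servers=10, rps_per_server=100))
    print(survives_datacenter_loss(1000, [600, 600, 600]))
```

Running at half capacity leaves room for traffic spikes and for taking servers out of rotation during deployment, at the cost of provisioning roughly twice the hardware that average load would require.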

Some attendees asked how test cases are organized in regression testing. Wang Di explained:

For example, suppose we have 700 test cases and need QA to prepare some data. We store the data locally in an SQL file, and store the request and the expected result locally as files in the same way. When the framework runs, it loads them into the database, and after the service has finished, it extracts the results from the database.
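
The workflow Wang Di describes could be sketched roughly as follows. This is a minimal illustration, not FreeWheel's framework: the `RegressionCase` fields stand in for the per-case SQL, request, and expectation files, an in-memory SQLite database stands in for the real database, and `lookup_ad` is a hypothetical service under test.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class RegressionCase:
    name: str
    setup_sql: str  # contents of the case's SQL fixture file
    request: str    # contents of the stored request file
    expected: str   # contents of the stored expectation file


def run_case(case, service):
    """Load the fixture into a fresh database, call the service, compare results."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(case.setup_sql)
        actual = service(conn, case.request)
        return actual == case.expected
    finally:
        conn.close()


def lookup_ad(conn, request):
    """Hypothetical service under test: return the creative for a slot."""
    row = conn.execute(
        "SELECT creative FROM ads WHERE slot = ?", (request,)
    ).fetchone()
    return row[0] if row else "none"


if __name__ == "__main__":
    fixture = (
        "CREATE TABLE ads (slot TEXT, creative TEXT);"
        "INSERT INTO ads VALUES ('top', 'banner-1');"
    )
    cases = [
        RegressionCase("hit", fixture, "top", "banner-1"),
        RegressionCase("miss", fixture, "side", "none"),
    ]
    for case in cases:
        print(case.name, "PASS" if run_case(case, lookup_ad) else "FAIL")
```

Giving every case its own fresh database keeps the cases independent, so hundreds of them can be run in any order.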
