"Cloud Pioneer" star Ring TDH: Performance significantly ahead of open source HADOOP2 technology Architecture Appreciation

Source: Internet
Author: User
Keywords: big data, machine learning, data mining, Hadoop

Star Ring Technology's core development team took part in deploying the country's earliest Hadoop clusters. Team leader Sun Yuanhao has many years of experience in world-class software development; during his time at Intel he rose to CTO of the Data Center Software Division for Asia Pacific. In recent years the team has focused on big data and enterprise-grade Hadoop products, and it has extensive hands-on experience with production deployments in telecommunications, finance, transportation, government and other sectors, making it a pioneer and practitioner of China's core big data technologies in enterprise applications.

Transwarp Data Hub (TDH) is the one-stop Hadoop distribution with the most deployment cases in China and a leading piece of big data infrastructure software at home and abroad, with performance significantly ahead of open-source Hadoop 2. TDH serves enterprises of widely varying sizes and data volumes. Through in-memory computing, efficient indexing, execution optimization and high fault tolerance, a single platform can process anywhere from 10GB to 100PB of data and deliver faster performance than existing technologies at every order of magnitude. Enterprise customers no longer need a hybrid architecture: TDH can scale out dynamically, without downtime, as customer data grows, avoiding the intractable problems of MPP or mixed-architecture data migration. The following is an interview with Sun Yuanhao, founder and CTO of Star Ring Information Technology (Shanghai) Co., Ltd.

CSDN: First, please introduce yourself, the company and the technical team. What is the company currently focused on?

Sun Yuanhao: Hello, my name is Sun Yuanhao. I am a co-founder of Star Ring Information Technology (Shanghai) Co., Ltd., with more than ten years of experience in the IT industry spanning BIOS, driver, operating system, compiler and distributed systems development. Star Ring Technology is a high-tech company in the big data field dedicated to developing big data infrastructure software. Its platform product, Transwarp Data Hub (TDH), is a one-stop Hadoop and Spark big data platform that provides complete SQL support, rich R-language mining capabilities and faster performance. We call it a one-stop platform because TDH handles data both large and small with performance faster than traditional data processing technologies, so users do not need to migrate between multiple platforms or maintain a hybrid architecture.


Sun Yuanhao, founder and CTO, Star Ring Information Technology (Shanghai) Co., Ltd.

Star Ring Technology's development team comes from well-known technology companies such as Intel, Microsoft, IBM, NVIDIA and Baidu, as well as well-known universities including Nanjing University, Fudan University, Shanghai Jiao Tong University, the University of Science and Technology of China and Princeton University. It also includes members who gave up generous packages at well-known overseas companies to return home and start the business. Star Ring Technology's core team took part in deploying the country's earliest enterprise-class Hadoop clusters.

At present the company is focused on product development and team building. We continue to strengthen our R&D investment, and we welcome passionate young people who enjoy building big data infrastructure software to join Star Ring.

CSDN: Why did you choose the big data industry to start a business? What kind of market opportunity did you see, and what was the original intention? Is there a story behind it you can share?

Sun Yuanhao: The IT industry goes through a technological revolution every 5 to 10 years. Distributed systems and big data technologies are rebuilding the entire data-processing software ecosystem from the bottom up, and they are being adopted at a rapid pace, from the early Internet companies to today's mainstream enterprises that are trying out and using these new technologies. In our past work bringing Hadoop into traditional enterprises, we also discovered some of Hadoop's own weaknesses and the difficulties enterprises face in applying these new technologies. Few enterprises have the financial strength to buy thousands of servers to complete a single task; they need technology that is more functional and more cost-effective.

This is especially true in China. Because of the huge user base, Chinese enterprises generally have an order of magnitude more data than their foreign counterparts, and their application scenarios are also very complex; few foreign products can run in China without modification. In China's telecom operators, banks, transportation and other fields, data volume and complexity far exceed those of comparable foreign enterprises, and a new generation of data processing technology is needed to solve these problems. That is the original intention of Star Ring Technology: to provide big data infrastructure software that solves these problems.

Another important piece of background is that most of the core database systems in domestic enterprises come from foreign companies. We expect that over the next 10 years a number of domestic companies with excellent products will emerge in the enterprise data center and gradually replace the position of foreign vendors. With the help of Apache Hadoop and Spark, we can stand at the same starting line as foreign companies, develop products in parallel and compete with them. We have the confidence and the ability to build excellent products and provide better service to Chinese customers.

CSDN: Your TDH is the Hadoop distribution with the most deployment cases in China. Can you walk us through the TDH technical architecture in detail: how this version was built, what technologies it uses, and what differentiates it from other distributions at home and abroad?

Sun Yuanhao: TDH has now reached its third major version. The current TDH 3.3 consists of four parts: the Transwarp Hadoop foundation, the Transwarp Inceptor interactive analysis engine, the Transwarp Hyperbase real-time database, and the Transwarp Stream stream-processing engine. The following figure shows the components of this version in more detail.


Our improvements to Hadoop itself focus mainly on HDFS and YARN. On HDFS we provide high-speed erasure coding, which suits near-line storage applications: it reduces disk capacity requirements while improving fault tolerance. This feature was designed and implemented mainly for customers with petabyte-scale data, such as call detail records in telecommunications, sensor data in the traffic and power industries, and historical transactions in banking. The main improvements to YARN allow it to manage resources such as CPU and memory more completely and to support Spark and MapReduce applications on the cluster more effectively. Because our products, including Spark, run on YARN by default, YARN is one of the core parts of TDH.
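To make the near-line storage argument concrete, here is a minimal back-of-the-envelope sketch comparing raw capacity needs under 3-way replication and under Reed-Solomon erasure coding. The RS(10,4) parameters are purely illustrative assumptions; the interview does not state TDH's actual coding scheme.

```python
# Rough storage-overhead comparison: 3-way replication vs. Reed-Solomon
# erasure coding. Parameters are illustrative, not TDH's actual settings.

def replication_overhead(copies: int = 3) -> float:
    """Raw bytes stored per logical byte under n-way replication."""
    return float(copies)

def erasure_overhead(data_blocks: int = 10, parity_blocks: int = 4) -> float:
    """Raw bytes stored per logical byte under RS(k, m) erasure coding."""
    return (data_blocks + parity_blocks) / data_blocks

if __name__ == "__main__":
    logical_pb = 1.0  # 1 PB of logical data
    print(f"3x replication : {logical_pb * replication_overhead():.2f} PB raw")
    print(f"RS(10,4) coding: {logical_pb * erasure_overhead():.2f} PB raw")
    # 3-way replication tolerates the loss of 2 copies; RS(10,4) tolerates
    # the loss of any 4 blocks per stripe while using less than half the
    # raw capacity, which is why it suits cold, near-line data.
```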

On top of Hadoop we offer three products. The current flagship is Inceptor, an interactive analysis and mining engine based on Spark. It has a three-tier architecture (see the figure below): the bottom tier is a distributed cache (Transwarp Holodesk) that can be built on memory or SSD; the middle tier is the Apache Spark compute engine; and the top tier consists of the SQL '99 and PL compilers, a statistical algorithm library and a machine learning algorithm library, along with a complete R-language access interface. The main characteristics of this engine are high performance, complete SQL support and good support for the R language.
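The Inceptor engine itself is proprietary, but the workflow it describes (pin a table in a fast cache tier, then query it interactively with SQL) can be sketched with stock Spark SQL. In the sketch below the table name, file path and columns are hypothetical.

```python
# Generic PySpark sketch of the workflow described above: load a table,
# pin it in a cache tier, then query it interactively with SQL.
# This uses stock Spark SQL, not Transwarp Inceptor; names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-analysis-sketch").getOrCreate()

# Load a (hypothetical) fact table from HDFS and register it for SQL access.
orders = spark.read.parquet("hdfs:///warehouse/orders")   # hypothetical path
orders.createOrReplaceTempView("orders")
spark.catalog.cacheTable("orders")        # keep the working set in memory

# Interactive SQL over the cached data.
top_regions = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM orders
    WHERE order_date >= '2014-01-01'
    GROUP BY region
    ORDER BY total_amount DESC
    LIMIT 10
""")
top_regions.show()
```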


We have done a lot of optimization on Spark itself and on the SQL engine. The optimizations to Spark focus on DAG execution scheduling and shuffle, so that Spark can handle large data volumes; the introduction of Holodesk and indexing also required improvements to Spark. The SQL engine we developed can automatically recognize HiveQL, SQL 1999 and PL syntax, and we have implemented various optimizations for it, including a cost-based optimizer (CBO). Compared with other Hadoop distributions, when data is on disk Inceptor is 2 to 5 times faster than standard Hadoop; when data is in distributed memory or on SSD, there is generally a 5 to 10 times speedup. When a popular reporting tool such as Tableau is connected to Inceptor, we typically load the full data set to be analyzed (usually at the TB level) into memory or SSD, and the experience is very smooth, real interactive data analysis.
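As a small illustration of what a cost-based optimizer does (not Inceptor's actual CBO), the toy sketch below chooses a left-deep join order by comparing estimated intermediate-result sizes derived from table statistics. The table names, row counts and selectivities are invented for the example.

```python
# Toy illustration of cost-based join ordering: pick the order of joins by
# comparing estimated intermediate-result cardinalities. The statistics and
# cost model are invented; a real CBO is far more sophisticated.
from itertools import permutations

# Hypothetical table statistics: row counts and join selectivities.
row_counts = {"orders": 10_000_000, "customers": 1_000_000, "regions": 100}
selectivity = {("orders", "customers"): 1e-6,
               ("customers", "regions"): 1e-2}

def plan_cost(order):
    """Sum of estimated intermediate-result sizes for a left-deep join order."""
    cost, acc_rows, joined = 0.0, row_counts[order[0]], {order[0]}
    for table in order[1:]:
        # Find a join predicate linking the new table to the tables joined
        # so far; if none exists, this step is a cross join (selectivity 1).
        sel = next((s for (a, b), s in selectivity.items()
                    if (a in joined) != (b in joined) and table in (a, b)), 1.0)
        acc_rows = acc_rows * row_counts[table] * sel
        cost += acc_rows
        joined.add(table)
    return cost

best = min(permutations(row_counts), key=plan_cost)
print("cheapest join order:", " JOIN ".join(best))
```

Running it shows that joining the small dimension tables first keeps intermediate results small, while starting with the large fact table (or forcing a cross join) is orders of magnitude more expensive, which is exactly the kind of decision a CBO automates.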

Compared with Cloudera Impala we perform well: of the 99 TPC-DS queries, we are faster than Impala on most and slower on 9. Our bigger advantage, however, is the range of SQL we can handle. Some SQL produces intermediate results with a low aggregation ratio, so the intermediate results become very large; because of the defects of its Dremel-like architecture, Impala cannot handle this scenario effectively, which is why Impala often fails to return results when the data distribution changes or the data volume is large.

Inceptor supports building a columnar store (called the Holodesk columnar store) on SSDs. This is a distributed cache that is fundamentally different from Tachyon, and its storage format defaults to a table structure with local indexes. Memory, SSD and mechanical hard disks differ in speed by roughly 100:10:1, and for the same capacity the price ratio of memory, SSD and hard disk is also about 100:10:1. Our tests found that using SSD in place of memory as Inceptor's columnar cache does not significantly reduce performance, so for the same price you can buy 10 times the SSD capacity as cache: you get performance close to a pure in-memory cache while handling data sets 10 times larger than a pure in-memory database or Apache Spark can. No other Hadoop distribution currently has this capability.
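A minimal sketch of the price/capacity argument above, using the 100:10:1 ratios as quoted in the interview; the absolute budget figure is a placeholder.

```python
# Back-of-the-envelope version of the price/capacity argument above.
# The 100:10:1 speed and price ratios come from the interview; the budget
# figure is a placeholder.
budget = 100_000                                          # arbitrary budget units
price_per_tb = {"memory": 100.0, "ssd": 10.0, "hdd": 1.0}  # relative prices
relative_speed = {"memory": 100, "ssd": 10, "hdd": 1}

for tier, price in price_per_tb.items():
    capacity_tb = budget / price
    print(f"{tier:>6}: ~{capacity_tb:,.0f} TB of cache, "
          f"relative raw access speed {relative_speed[tier]}")

# For the same budget, SSD holds 10x the data of memory at 1/10 the raw
# speed; the interview's point is that measured Inceptor query performance
# on the SSD cache nonetheless stays close to the in-memory case.
```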

The second advantage is complete SQL support. We currently support the complete SQL 1999 standard and are implementing more complex PL syntax, including stored procedures, functions and cursors. Cloudera recently released the Impala 2.0 roadmap; the SQL functionality it plans to deliver by the end of 2014 has been available in our Inceptor since the first half of this year. Completeness of SQL support matters more than performance: a large number of data warehouse and data mart applications use the more complex SQL99 syntax, and without support for that syntax it is simply not feasible to migrate existing applications to Hadoop.
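As an example of the kind of SQL:1999-era syntax that typical warehouse reports depend on (and that early SQL-on-Hadoop engines often lacked), the query below uses a common table expression plus ROLLUP aggregation. The schema is hypothetical and the query is shown on stock Spark SQL rather than Inceptor.

```python
# Example of SQL:1999-era warehouse syntax: a common table expression (WITH)
# plus ROLLUP subtotals. The schema is hypothetical, and this runs on stock
# Spark SQL, not Inceptor.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql99-sketch").getOrCreate()

spark.createDataFrame(
    [("east", "2014-Q1", 120.0), ("east", "2014-Q2", 150.0),
     ("west", "2014-Q1",  90.0), ("west", "2014-Q2", 130.0)],
    ["region", "quarter", "revenue"],
).createOrReplaceTempView("sales")

report = spark.sql("""
    WITH quarterly AS (
        SELECT region, quarter, SUM(revenue) AS revenue
        FROM sales
        GROUP BY region, quarter
    )
    SELECT region, quarter, SUM(revenue) AS revenue
    FROM quarterly
    GROUP BY ROLLUP (region, quarter)   -- per-quarter, per-region and grand totals
    ORDER BY region, quarter
""")
report.show()
```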

The third advantage is integration with the R language. We provide an R interface that can directly call the underlying machine learning and data mining algorithms, or run existing serial R algorithms in parallel over distributed data sets. Compared with a single-machine R installation, the amount of data that can be processed is much larger: you can analyze the full data set rather than a sample. R is a very powerful language for data mining and statistics, and it also includes a powerful plotting library. At some customers we have used R to build an online recommendation system, and the new algorithm is more accurate than traditional recommendation approaches based on SQL statistics and collaborative filtering; we expect to publish a case report in the near future.
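Transwarp's R interface is proprietary; as a rough analogue of "running an existing serial algorithm in parallel over a distributed data set", here is a PySpark sketch that applies an ordinary single-machine routine to each customer group, with the groups distributed across the cluster. The column names and the per-group "model" are hypothetical.

```python
# Rough analogue (in PySpark, not Transwarp's R interface) of running an
# existing single-machine routine in parallel over a distributed data set:
# fit one simple per-customer summary, with the groups spread across the
# cluster. Column names and the routine are hypothetical.
import statistics
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distributed-apply-sketch").getOrCreate()
sc = spark.sparkContext

# (customer_id, purchase_amount) pairs; in practice these would be read
# from HDFS rather than parallelize().
purchases = sc.parallelize([
    ("c1", 10.0), ("c1", 12.0), ("c1", 11.5),
    ("c2", 200.0), ("c2", 180.0), ("c2", 220.0),
])

def fit_per_customer(amounts):
    """The 'existing serial algorithm': here just mean and stdev per group."""
    values = list(amounts)
    return {"mean": statistics.mean(values),
            "stdev": statistics.pstdev(values)}

# groupByKey distributes the per-customer groups; each group is then handled
# by the ordinary single-machine function above, so the full data set can be
# analyzed instead of a sample.
models = purchases.groupByKey().mapValues(fit_per_customer)
print(models.collect())
```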

The second product on Hadoop is Hyperbase, a real-time NoSQL database based on Apache HBase with multiple indexing techniques, distributed transaction processing, full-text search and a graph database. Hyperbase can effectively support enterprise online OLTP applications, high-concurrency OLAP applications, batch applications, full-text search, and high-concurrency graph retrieval. Another particular advantage is that our Inceptor SQL engine supports Hyperbase, so users can access data in Hyperbase with SQL99. There is some performance loss compared with using the API directly, but the concurrency and execution performance of SQL are still excellent.
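Hyperbase's SQL layer and indexes are proprietary; for contrast, here is what direct API access to an HBase-style table looks like with the open-source happybase Thrift client. The host, table and column names are hypothetical.

```python
# For contrast with the SQL access path described above: direct API access
# to an HBase-style table using the open-source happybase Thrift client.
# Host, table and column names are hypothetical; Hyperbase's SQL layer and
# secondary indexes are proprietary and not shown here.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")  # hypothetical host
table = connection.table("user_profiles")                      # hypothetical table

# Write one row: row key plus column-family:qualifier -> value pairs.
table.put(b"user:1001", {b"info:name": b"Alice", b"info:city": b"Shanghai"})

# Point read by row key.
print(table.row(b"user:1001"))

# Prefix scan. With plain HBase, anything beyond row-key access means a full
# scan or hand-built secondary indexes; that is the gap a SQL engine with
# indexing, as described above, is meant to close.
for key, data in table.scan(row_prefix=b"user:"):
    print(key, data)
```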

We currently use Hyperbase to build scalable online operational databases or real-time analytical databases (ODS, operational data store) for enterprises. A product with a feature set similar to Hyperbase is the HBase-based database from the Silicon Valley startup Splice Machine, whose SQL engine was adapted from Derby. Salesforce's open-source Phoenix also provides some SQL functionality, but its maturity is far from sufficient to support real-world applications.

The third product on Hadoop is Transwarp Stream, a real-time streaming engine based on Spark Streaming that integrates the Kafka distributed queue, provides rich stream-computing capabilities, and supports complex application logic. The current version mainly provides a stable, 7x24 stream-processing framework. The feature we are developing now is using SQL to describe stream-processing logic, making it easier for developers to build new streaming applications. Compared with Apache Storm, Spark Streaming creatively slices real-time streaming data by time and processes the data within each short interval as a batch; when the interval is as small as 100 milliseconds, the effect approaches that of a pure streaming system. A great advantage of Spark Streaming is that it makes it easy to run complex analysis tasks, even streaming machine learning, on stream data; this is very difficult to implement on an event-driven stream-processing system such as Storm.
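A minimal stock Spark Streaming sketch of the micro-batch model described above (not Transwarp Stream itself): incoming events are sliced into short time batches and each batch is processed with ordinary Spark operations. The one-second interval and the socket source are illustrative choices.

```python
# Minimal stock Spark Streaming sketch of the micro-batch model described
# above (not Transwarp Stream): incoming lines are sliced into short time
# batches, and each batch is processed with ordinary Spark operations.
# The 1-second interval and socket source are illustrative choices.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="micro-batch-sketch")
ssc = StreamingContext(sc, batchDuration=1)       # one micro-batch per second

lines = ssc.socketTextStream("localhost", 9999)   # hypothetical source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's word counts

ssc.start()
ssc.awaitTermination()
```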

CSDN: What are TDH's main application scenarios? What about current domestic usage and customer scale? Can you share some actual application cases?

Sun Yuanhao: Since Star Ring was founded, nearly a hundred customers have deployed our software, and dozens of them have signed contracts and gone into production; that does not include the customers our team served previously. Here are a few example scenarios.

1. Operator traffic management and analysis: At one of our customers, daily traffic data is around 2TB to 5TB. The data is copied to HDFS, and with our Inceptor interactive analysis engine the operator runs hundreds of complex data cleaning and reporting jobs; the total time is 2 to 3 times faster than a minicomputer cluster with DB2 on similar hardware. In this application the SQL is so complex that it is difficult to translate into the SQL currently supported by open-source Hadoop; the advantage of Star Ring's Inceptor SQL engine is that it runs this complex SQL easily and with higher performance.

2. Log analysis for large web server farms: We built a log analysis system for a CDN vendor whose caching devices serve a large domestic web server cluster. These web servers record up to 800GB of click logs every 5 minutes, with peak hit rates of up to 9 million per second. Every 5 minutes we load the data into memory, compute the site's hotspot URLs, and feed this information back to the front-end CDN cache servers to increase the cache hit rate; the improved hit rate increased the CDN vendor's revenue. This system runs 7x24 without interruption. (A sketch of the hotspot-URL computation follows this list.)

3. Real-time analysis of traffic checkpoint video surveillance: Another 7x24 case is in the intelligent transportation industry, where we use Transwarp Stream to analyze, raise alerts on, and compute statistics (such as real-time road conditions) from the video surveillance data collected at traffic checkpoints across a province. Identifying vehicles across the province that have missed their annual inspection or are running cloned plates takes about 300 milliseconds, so real-time alerts can be issued; drivers had better get their annual inspections done on time.

4. IPTV ratings statistics and on-demand recommendation: A well-known domestic IPTV operator uses our products to build a real-time ratings and video-on-demand recommendation system. It collects users' remote-control operations in real time, provides a real-time ratings leaderboard, and implements the VOD recommendation service using content-based recommendation and collaborative filtering algorithms.
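A minimal sketch of the hotspot-URL step in case 2, written in generic PySpark rather than the production system; the log path and the space-delimited format with the URL in the seventh field are assumptions for the example.

```python
# Minimal sketch of the hotspot-URL step from case 2: load the latest
# 5-minute batch of click logs and compute the most-requested URLs.
# Generic PySpark, not the production system; the log path and the
# space-delimited format with the URL in the 7th field are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hotspot-url-sketch").getOrCreate()
sc = spark.sparkContext

logs = sc.textFile("hdfs:///cdn/logs/latest-5min/")   # hypothetical path

TOP_N = 100
hot_urls = (logs.map(lambda line: line.split(" "))
                .filter(lambda fields: len(fields) > 6)
                .map(lambda fields: (fields[6], 1))    # URL field, assumed position
                .reduceByKey(lambda a, b: a + b)
                .takeOrdered(TOP_N, key=lambda kv: -kv[1]))

# In the case described above, this list would then be pushed back to the
# front-end cache servers to raise the cache hit rate.
for url, hits in hot_urls:
    print(url, hits)
```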

CSDN: What do customers care most about in TDH at the moment, and what solutions do you offer?

Sun Yuanhao: Current TDH customers are quite satisfied with TDH's performance, its level of SQL support and our service. What some customers are not satisfied with is using TDH only for SQL statistical analysis; they want to know what new applications TDH can enable. Exploring new applications is a process, and also a process of market education. In response, on the one hand we have set up a team dedicated to helping customers explore new applications; for example, in offline analysis we are developing more statistical and machine learning algorithms to support new applications and to help customers apply new machine learning algorithms for in-depth data analysis. On the other hand, we are building an ecosystem, working with more partners and supporting them in developing new application solutions.

CSDN: As a distributed architecture expert, do you have any good experience to share about building a large-scale Hadoop cluster?

Sun Yuanhao: That is a big question that deserves an article of its own; I will write one later!

Cloud Edge: China "Cloud Pioneer" series of reports

No. | Company | Founded | CEO/CTO | Official Weibo | Product/Direction
1 | Yun Yu | 2012 | Chen Ben | | website adaptation
2 | Friends | 2010 | Yao Hongyu | @Friends | C, C++, Java product development
3 | Aggregation Data | 2010 | Zole | @Aggregated Data | mobile data services
4 | Anchora | 2009 | Lu Weimin | | MoPaaS and InPaaS
5 | Fast Enough | 2012 | Chiang Shuo Miao | @Fast Enough Technology | cloud storage
6 | Evans Hai Fai | 2012 | Wu Kai | @Evans | OpenStack public cloud
7 | Sohu Cloud | 2011 | Chu Yingbo | | SendCloud
8 | Lenovo Cloud Storage | 2009 | Luo Jinjin | | cloud storage
9 | Nanjing | 2012 | She Janxia | | big data real-time analysis
10 | Shanghai | 2012 | Golden Sword | | cloud management, cloud storage
11 | Guo Technology | 2010 | Ji Kai | @National Cloud Technology | cloud operating system
12 | SSO365 | 2012 | Jian | | cloud security, cloud identity authentication
13 | Cloudil Cloud Project | 2001 | | @Century Ding Li | communication operators
14 | Multiple Backup | 2013 | Hu Maohua | @Wooden Wave | cloud backup
15 | Shanghai Wang Software | 2011 | | | cloud-based site-building software supermarket
16 | Cloud Wisdom | 2009 | Yinjin | @Monitoring Treasure | cloud monitoring, big-data-based APM
17 | Shenzhen Zeyun | 2012 | He Gianbin | | high-performance storage systems
18 | Shenzhen Wisdom Crown | 2004 | Lu Huili | | hand-vein biometric identification and virtualization
19 | Beijing Vauan Technology | 2009 | Cao Xuewu | @Vauan | mobile video technology provider
20 | Star Ring Information Technology | 2013 | Sun Yuanhao | @Star Ring Tech | data analysis platform

Note: updated September 5, 2014; continuously updated.