"Pioneer": How Simin Software Built Its Big Data Technology Platform, and Hands-On Impala Experience


In this interview, Simin Data's Liu Chengzhong says that in the enterprise big data market, the game of earning high profits through a technology monopoly is outdated. The cost of technology will keep falling; that is the general trend. The future market giants will be companies whose technology is very good but whose service is even better. From the user's point of view, the first concern is how to make data valuable. After that come the questions of what technology the solution depends on, whether it can be applied quickly, and whether it can accommodate later expansion. Compared with the technology, the first point is the harder one.

In fact, today's enterprise customers, especially in the big data field, also need long-term partners. Beyond buying technology-intensive products, they need to work with big data experts to study how to extract value from data, to bring the technology company's experience to bear on the enterprise's existing business, and to explore new data application scenarios. This is what customers need most, and it is what Simin Data is good at. Only then comes the question of which technical solution to use. An experienced big data company does more than process big data: it should help enterprise customers find data, bring it in, integrate it into a reasonable data model, decide how to present it, and finally have it reflected in the enterprise's day-to-day decisions, forming a healthy closed loop of operations, data, and decisions.

Simin Data embraces open source while keeping to its own independent R&D route, and it offers both products and solutions. To stay compatible with the major distributions, Simin's base big data platform remains fully standardized and open. On the platform the company mainly tracks bug fixes and patches, resolves dependencies among components, and releases stable, validated platform versions. To learn how the Simin Software big data technology platform was built, we interviewed Liu Chengzhong, technology manager at Simin Software. The interview follows.


Liu Chengzhong, technology manager at Simin Software

Simin Data: team, positioning, and advantages

CSDN: Please introduce yourself and Simin Data. What does the technical team look like today?

Liu Chengzhong: I graduated from the computer science department of Beihang University in 2008 and then worked at VMware's China R&D center for more than four years, on network virtualization and on optimizing live migration of virtual machines. After that I was responsible for distributed system design at Miaozhen Systems, a leading domestic advertising technology company. I am now R&D manager at Simin Data, responsible for developing the big data technology platform. Simin Data is a new domestic big data technology company. Our core technical team mostly has a background in computer science, mathematics, or informatics, and more than 90% are graduates of Tsinghua, Beida, Beihang, BUPT, CMU, and other universities at home and abroad. It is arguably the highest concentration of such talent among domestic big data technology companies.

CSDN: What is the market situation, at home and abroad, for enterprise big data application, implementation, and analysis? Where does Simin Data position itself, and what are its unique advantages?

Liu Chengzhong: Traditional data application and analysis has basically belonged to global software giants such as IBM, HP, and Oracle. Their solutions start from a single machine and then scale up, and they focus on performance. Closed-source technology created technology monopolies that earned large profits over the past few decades. The technology upgrade we face now comes from the explosion of information across society, which has caused a surge in the amount of data to be processed. Meanwhile, computer architecture has seen no revolutionary change such as quantum computing, and the way to expand is still largely x86 machines, coping with data growth through linear scaling. That created the opportunity for the Hadoop technology route, which grew out of Google's internal practice, was then designed by engineers at Yahoo, and has since become popular.

Basically, the emerging big data solutions are built on a relatively simple and inexpensive distributed file system (HDFS) and are designed around the key fact that moving large amounts of data is expensive. They gain performance and scalability from the architecture, and compared with traditional solutions they scale better at lower cost. Traditional software vendors are of course also trying to adapt to the trend: some, such as Oracle, integrate their existing products with Hadoop tools, while others partner with a commercial Hadoop distribution to build a complete solution, as EMC has with MapR. So in enterprise data application and analysis, and in interactive analysis especially, the trend is a contest between transformed traditional business software and emerging commercial products based on open source standards. It is worth noting, though, that a whole product family has formed around the Hadoop community, and this route is unstoppable. It is therefore very hard to rebuild Hadoop's underlying infrastructure from scratch and win broad support for it. That Alibaba's technical team has persisted with OceanBase for so many years is remarkable, and I personally respect it.

Simin Data embraces open source while keeping to independent R&D, and we provide both products and solutions. Simin's base big data platform will remain fully standardized and open so that it stays compatible with the major distributions. On the platform we mainly track bug fixes and patches, resolve dependencies among components, and release stable, validated versions. On this basis we provide the functional components that enterprise use requires, including operations and maintenance management, task management, user auditing, access security, access control, and a real-time analysis engine.

We also develop a rich set of top-level applications. Our real-time analysis engine is the first hybrid engine to integrate MPP with iterative computation; it hides the complexity of the components beneath it and offers a consistent SQL interface to the applications above. The data mining platform is dedicated to letting ordinary business people create and train models easily, so that they can readily take on the role of data scientists. The visual presentation platform lets customers quickly build HTML5 reports in the style of a data cube and experience the power of data directly. The Data Factory offers industry-leading real-time incremental synchronization of big data. In short, Simin builds a standardized platform on a solid technical foundation and pairs it with strong application development capability to help customers realize the value of their data.

Figure: Simin big data product diagram


What users care about most

CSDN: How are Simin Data's users distributed? Are there any heavyweight customers?

Liu Chengzhong: The users we have served span finance, retail, communications, and other fields. Typical customers include China UnionPay, Postal Savings Bank, CCTV, China Unicom, the National Bureau of Statistics, Suningyun, Gome Online, and Guizhou Power Grid. What these customers have in common is a wealth of data and an urgent need to refine information from it to guide decisions. Broadly, customers fall into two categories. The first simply wants to upgrade its enterprise information architecture, and for them we provide big data platform products. The second faces new data-driven business that needs information technology support, and for them we start from the business and build a complete solution. Simin Data can therefore be described as one of the few providers in the country, and the most complete, able to supply both the base big data platform and the upper-level big data applications closely tied to the business.

CSDN: From the customer's point of view, what are the biggest concerns, and how do you address them?

Liu Chengzhong: Customers care first about how to get value from the data they have, and then about what technology the solution relies on, whether it can be applied quickly, and whether it can accommodate later expansion. The first point is harder than the technology and requires close cooperation with the business side. Simin's technical team has worked in data mining applications for many years and has a wealth of experience helping enterprises improve their business with data. In fact, today's enterprise customers do not want a company that merely sells them products, especially in an emerging technology sector. What Simin prefers to be is a long-term business partner: our technical experts sit with the business side for months, discussing how to make the data valuable, bringing our experience to bear on what the customer already has, and exploring new data scenarios. This is what we are best at, and in our view it is also what customers need most.

The second point is the technical solution, for example how to build a base big data platform for underlying storage and computation. But that is only basic infrastructure, just one part of a real enterprise solution. Getting big data technology into production at a customer involves countless hidden costs: ETL, operations and maintenance management, permissions and auditing, business applications, visual presentation, and many other links all have to be considered. Simin Data is currently the technology company offering the most complete big data solution. Our products cover the full stack, from data migration and the base data platform to data mining applications and data presentation. The benefit is that we can serve customers with maximum consistency, reduce delivery cost, and help them improve their business quickly in the most agile way.

Building the Simin Software big data technology platform

CSDN: Can you share how the Simin Software big data technology platform was built?

Liu Chengzhong: There is a lot of experience and even more lessons; the whole technical team has basically climbed out of pits along the way. I came from Miaozhen Systems, and other colleagues came from internet companies such as eBay, Baidu, and Cool. Speaking for myself: in 2012 my team and I started building a distributed database cluster on PostgreSQL 9.1 (PG), using the common approach of horizontal partitioning, with the goal of querying terabytes of data in seconds on 10 machines. The team had three people at the time and focused mainly on the design of the metadata store, on how to import data into the cluster efficiently, and on using existing file system mechanisms to implement task workflows conveniently. SQL parsing was very weak, however. It could run little more than the simplest SQL, so the range of use was very limited.
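To make the horizontal partitioning idea concrete, here is a minimal sketch in Python. It is not the team's implementation: the interview does not say how rows were assigned to machines, so hash partitioning is an assumption, and `node_for`, `ShardedTable`, and `sum_by_key` are hypothetical names. In-memory lists stand in for the PostgreSQL nodes. The query follows the usual scatter-gather pattern, where each shard aggregates locally and the partial results are merged.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 10  # the interview mentions a 10-machine cluster


def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a partition key to a node index with a stable hash (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


class ShardedTable:
    """Toy in-memory stand-in for one table split across several nodes."""

    def __init__(self, num_nodes: int = NUM_NODES):
        self.num_nodes = num_nodes
        self.shards = [[] for _ in range(num_nodes)]

    def insert(self, key: str, value: int) -> None:
        # Each row lives on exactly one node, chosen by its key.
        self.shards[node_for(key, self.num_nodes)].append((key, value))

    def sum_by_key(self) -> dict:
        """Scatter-gather: aggregate on each shard, then merge the partials."""
        partials = []
        for shard in self.shards:  # a real cluster would run these in parallel
            local = defaultdict(int)
            for key, value in shard:
                local[key] += value
            partials.append(local)
        merged = defaultdict(int)
        for local in partials:
            for key, total in local.items():
                merged[key] += total
        return dict(merged)
```

Because each node scans only its own slice of the data, adding machines shortens the scan, which is how a small cluster can aim at second-level answers over terabytes.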

In 2013, though, we hacked Cloudera Impala to serve as the query engine for the PG cluster, which gave us good SQL coverage with performance no worse than Impala's. At the end of 2012 we implemented our own distributed storage and computing platform in C++ on top of PG and the RabbitMQ message queue. Its modules carry the kind of names programmers like, such as Amoeba. It has been running ever since it went live, processing the massive statistics logs of billions of ad impressions per day and producing both a real-time report and a daily batch report. Looking at it now, it resembles a mix of Storm and Hadoop.
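The "mix of Storm and Hadoop" shape, one event stream feeding both a real-time report and a daily batch report, can be sketched as follows. This is an illustration under stated assumptions, not the C++ platform described in the interview: Python's standard `queue.Queue` stands in for RabbitMQ, the event format (campaign id plus day) is invented, and `ImpressionPipeline` is a hypothetical name.

```python
import queue
from collections import Counter


class ImpressionPipeline:
    """Toy pipeline: one queue feeds a real-time counter and a raw log."""

    def __init__(self):
        self.events = queue.Queue()       # stand-in for a RabbitMQ queue
        self.realtime_counts = Counter()  # updated per event (streaming path)
        self.raw_log = []                 # retained for the daily batch job

    def publish(self, campaign_id: str, day: str) -> None:
        """Producer side: enqueue one ad impression event."""
        self.events.put((campaign_id, day))

    def drain(self) -> None:
        """Consume every queued event, updating the real-time view."""
        while True:
            try:
                campaign_id, day = self.events.get_nowait()
            except queue.Empty:
                break
            self.realtime_counts[campaign_id] += 1
            self.raw_log.append((campaign_id, day))

    def daily_batch_report(self, day: str) -> dict:
        """Batch path: recompute one day's totals from the raw log."""
        return dict(Counter(c for c, d in self.raw_log if d == day))
```

The real-time counters are cheap to read at any moment, while the batch report is recomputed from the retained log, so it can serve as the authoritative daily figure.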

At that time, a Tsinghua University graduate implemented the map and reduce primitives very concisely with template classes. On top of that we built the core batch processing module, developed the message queue flow that passes data through the various operational modules, and finally used a PostgreSQL database to hold the summarized results. We had many experiences like this. In 2012 our KFS cluster ran to hundreds of nodes and hit many problems the KFS development team had never encountered, so we had to maintain our own version. While developing these systems ourselves, we kept watching the Hadoop community's progress, trying releases and running comparisons, and we were not satisfied with Hadoop's stability or debuggability. Only when Hadoop 2.0 came out did we judge that the open source standard had taken shape and begin switching to Hadoop. Since then the technical team has been happy to try the various products of the Hadoop community.
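For readers unfamiliar with the map and reduce primitives mentioned here, a minimal single-process sketch follows. The original was written with C++ template classes and is not shown in the interview; this Python version only illustrates the general technique, and `map_phase`, `reduce_phase`, `run_job`, and the `campaign_id,timestamp` log format are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_phase(records: Iterable[str],
              mapper: Callable[[str], Iterable[Tuple[str, int]]]
              ) -> Dict[str, List[int]]:
    """Apply the mapper to each record and group emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer) -> Dict[str, int]:
    """Run map, then reduce; the output is the kind of summary a results table would hold."""
    return reduce_phase(map_phase(records, mapper), reducer)


def parse_log_line(line: str):
    """Example mapper: emit (campaign_id, 1) for a 'campaign_id,timestamp' line."""
    campaign_id, _timestamp = line.split(",")
    yield campaign_id, 1


def count(_key: str, values: List[int]) -> int:
    """Example reducer: total the emitted ones."""
    return sum(values)
```

Calling `run_job(lines, parse_log_line, count)` on a list of log lines returns impression counts per campaign. Keeping the mapper and reducer as plain functions passed into a generic driver is the Python analogue of parameterizing a template class.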

In general, most of our technical team shares a similar lesson: developing base big data technology is very costly, and building these highly complex systems yourself offers poor value for the money. Now that open source technology is mature, staying closely aligned with mainstream standard technology is the responsible approach for the future, and also the safer one. One example is Hive's SQL support. When Hive first came out, many people in the open source community developed their own parsers to support better SQL syntax, but most of those projects stopped after 2013, because Hive 0.12 and 0.13 advanced quickly. People found that standard Hive worked more easily with the rest of the ecosystem, so they switched back. In my view open source is like an alliance: a call goes out for a specific technical direction, and we build quality things together rather than working in ignorance of one another. The benefit of organizing this way is a much lower risk of being left behind after a technology upgrade.

For Simin, since we provide commercial services, our first consideration is what risks a customer faces in putting such a large information architecture into production. After years of losses from closed-source commercial software, customers understand that they must not be locked into one vendor's platform; if they want to move to another platform, they should be able to switch seamlessly. This requires us to give customers an industry-standard, common technology architecture. Our independently developed products add functionality to the platform rather than modifying it aggressively, so overall portability is unaffected, which benefits customers in later technology upgrades and reduces their risk.

