CSDN interview: Commercial storage in the big data age

Last Update:2014-07-06 Source: Internet

Author: User

Tags hadoop ecosystem


Address: http://www.csdn.net/article/2014-06-03/2820044-cloud-emc-hadoop
Abstract: EMC, a leading global vendor of information storage and management products, recently announced the acquisition of DSSD to strengthen and consolidate its leadership position in the industry. We recently had the honor of interviewing Zhang Anzhan of EMC China, who shared his views on big data, commercial storage, and Spark.

Speaking of big data, Zhang Anzhan believes that it essentially comes down to two fundamental questions. First, the data is large, so how do we store it? Second, the data is large, so how do we analyze it? The first question is for storage vendors to answer by building more scalable storage systems that meet the needs of ultra-large-scale data storage. The second is big data analysis. With the vigorous development and maturing of distributed computing and storage clusters, represented by the Hadoop ecosystem, big data analysis has become more efficient and accurate. Data mining that used to be done offline can now be done online, and online mining can even generate recommendations for users based on their current behavior within a few minutes.

Zhang Anzhan is a senior engineer in the EMC China Excellence R&D Group. He graduated from Nankai University and joined EMC after graduation as a software engineer in the storage department. During his studies he worked mainly on the research and implementation of online reading aggregation for handheld readers. He wrote more than 30 kb of code for it and accumulated rich practical coding experience.

He also studied location-based online advertising and successfully developed a prototype on handheld readers. During campus recruitment he received offers from Baidu, Alibaba, Sogou, EMC, SonicWall, Innovation, and other well-known companies, and kept a record of passing every interview.

After joining EMC, he has been mainly responsible for the research and development of storage system management and monitoring, and has accumulated rich system debugging experience. He took part in designing and implementing support for the storage management protocol SMI-S, and gained practical experience in system performance tuning. He also handles code management and Scrum management for the project team. For the past six months he has focused on building a next-generation commercial storage management framework, whose redesign improves the reliability, availability, scalability, and performance of the storage system. Next month he will join Baidu's Web Search Department as a senior system architecture R&D engineer, responsible for designing and upgrading the service architecture and data storage architecture of the web search product.

On how to learn Hadoop and Spark, he feels that it is necessary to read the source code closely and to learn by comparison. He also considers Scala the coolest language: a good programmer will certainly like Scala. The following is the transcript of the interview with Zhang Anzhan:

CSDN: Could you introduce your current work?

Zhang Anzhan: Currently, my main task is to build the next-generation management and control platform for EMC high-end storage. This is a brand-new platform. It differs from VNX2, released last year: VNX2 is actually split into file and block parts, which use different CPUs and are physically isolated. The platform we are working on is truly unified, so file service and block service can be provided on one node. With the new architecture, the reliability, availability, scalability, and performance of the entire storage system are improved. Traditional storage systems scale up; they cannot scale out. That is why each product model supports a fixed maximum number of hard disks, which also fixes its maximum storage space. To expand capacity beyond that, you have to buy more equipment, which undoubtedly increases IT O&M costs. We are now focusing on overcoming the limitations of traditional architectures and adapting to the new requirements that cloud computing and big data place on storage systems, so that our products continue to lead the development of storage systems in the new environment.
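The capacity argument above can be made concrete with a little arithmetic. The following is only a minimal sketch: the class, the function, and every number in it are hypothetical and are not taken from any EMC product. It assumes a scale-up array whose capacity is capped by its disk slots, and a scale-out cluster whose capacity grows with the number of nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayModel:
    """A hypothetical scale-up array: capacity is capped by its disk slots."""
    max_disks: int
    disk_tb: int

    def max_capacity_tb(self) -> int:
        # The ceiling is fixed once the product model is chosen.
        return self.max_disks * self.disk_tb


def scale_out_capacity_tb(nodes: int, disks_per_node: int, disk_tb: int) -> int:
    """Capacity of a hypothetical scale-out cluster grows with the node count."""
    return nodes * disks_per_node * disk_tb


# Illustrative numbers only.
model = ArrayModel(max_disks=500, disk_tb=4)
print(model.max_capacity_tb())            # fixed ceiling for this model
print(scale_out_capacity_tb(8, 60, 4))    # an 8-node cluster
print(scale_out_capacity_tb(16, 60, 4))   # doubling the nodes doubles capacity
```

In the scale-up case the only way past the ceiling is a second, separately managed system; in the scale-out case the same system grows by adding nodes.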

Unfortunately, this month is my last at EMC. In early July I will join Baidu's Web Search Department as a senior system architecture R&D engineer, responsible for designing and upgrading the service architecture and data storage architecture of the web search product, which includes web crawling, a massive data processing platform, and a distributed retrieval system. That will also be the official start of my big data career.

Understanding of big data

CSDN: How do you understand big data?

Zhang Anzhan: Different people may understand big data differently, depending on their perspective. In the final analysis, however, there are two fundamental questions. First, the data is large, so how do we store it? Second, the data is large, so how do we analyze it? The first question is for storage vendors like us to answer by building more scalable storage systems that meet the needs of ultra-large-scale data storage. The second is big data analysis. With the vigorous development and maturing of distributed computing and storage clusters, represented by the Hadoop ecosystem, big data analysis has become more efficient and accurate. Data mining that used to be done offline can now be done online, and online mining can even generate recommendations for users based on their current behavior within a few minutes.

We can therefore say that the development of these technologies has also given birth to more business models and is changing our lives. For example, with the help of big data analysis, traffic monitoring can notify drivers of violations sooner; hospitals can use more patient data to build better models and arrive at better treatment plans; and the financial industry can recommend the most suitable financial products to users based on their investment behavior. All of these are closely related to our lives. Big data is in the ascendant, and opportunities and challenges coexist. Let our lovely programmers serve the people better!

CSDN: EMC recently acquired DSSD, a startup. What do you think of it?

Zhang Anzhan: EMC is a company that has acquired, or "integrated", many companies. One of the most famous acquisitions in EMC's history was that of VMware in 2003 for over $0.6 billion. These acquisitions also reflect, from another angle, EMC's grasp of and sensitivity to industry trends. Through them EMC has kept strengthening and consolidating its leadership in the industry, which in turn influences the direction of the industry. The DSSD deal is another move in the flash market, following the acquisition of the flash company XtremIO. In fact, when EMC released VNX Rockies, the pinnacle of its mid-range storage line, in 2013, it also released the all-flash array VNX-F, whose highest IOPS reached 110. Although EMC believes that disk arrays will still exist for the foreseeable future, its series of actions shows that it attaches great importance to the flash market. The acquisition of DSSD is also part of this strategy.

That the acquisition was announced at EMC World 2014 is enough to show how important DSSD is. The core team of DSSD comes from ZFS. ZFS is the most advanced file system in the world. Why the name ZFS? Because Z is the last letter of the English alphabet: after it, no other file system is needed. Let us wait and see what the development team led by Andy Bechtolsheim delivers when EMC brings DSSD to market in 2015. Andy co-founded Sun while he was a PhD student at Stanford, so there is good reason to believe that he will surprise the flash memory market.

Opportunities and challenges of traditional commercial storage

CSDN: EMC World 2014 mentioned support for OpenStack in EMC storage products. Can you talk about the specifics?

Zhang Anzhan: The question is how commercial storage systems can be integrated into a cloud computing cluster environment. After all, cloud computing provides three basic services: computing resources, network resources, and storage resources. The EMC storage department focuses on how to integrate our storage products into OpenStack so that OpenStack can use them seamlessly. Because EMC's storage products are relatively independent of one another, each product line may provide its own OpenStack support. Technically speaking, we implement an OpenStack Cinder driver, which implements a set of OpenStack APIs so that OpenStack can use the storage resources on the storage system. Here I have to mention ViPR, EMC's software-defined storage implementation. ViPR 2.0 will become the core data platform for all of EMC's storage in the future. By adding support for the OpenStack Cinder plug-in, ViPR can be compatible with a wider range of third-party storage systems and product drivers. EMC believes that ViPR 2.0 can now handle 80% of all existing storage capacity.
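To give a feel for what "implementing a Cinder driver" means, here is a minimal, self-contained sketch. It deliberately does not import Cinder: `FakeArrayClient` and `SketchVolumeDriver` are hypothetical names, the array API is invented for illustration, and the connection data is simplified. Only the method names (`create_volume`, `delete_volume`, `initialize_connection`, `get_volume_stats`) follow the interface that real Cinder volume drivers implement. It is not EMC's or ViPR's driver.

```python
import uuid


class FakeArrayClient:
    """Hypothetical stand-in for a storage array's management API."""

    def __init__(self, total_gb):
        self.total_gb = total_gb
        self.luns = {}  # LUN id -> size in GB

    def create_lun(self, size_gb):
        if size_gb > self.free_gb():
            raise RuntimeError("not enough free capacity on the array")
        lun_id = str(uuid.uuid4())
        self.luns[lun_id] = size_gb
        return lun_id

    def delete_lun(self, lun_id):
        self.luns.pop(lun_id, None)

    def free_gb(self):
        return self.total_gb - sum(self.luns.values())


class SketchVolumeDriver:
    """Mirrors the shape of a Cinder volume driver without importing Cinder."""

    def __init__(self, client):
        self.client = client
        self.volume_to_lun = {}  # OpenStack volume id -> array LUN id

    def create_volume(self, volume):
        # Translate "create a volume" into "create a LUN" on the array.
        lun_id = self.client.create_lun(volume["size"])
        self.volume_to_lun[volume["id"]] = lun_id
        return {"provider_location": lun_id}

    def delete_volume(self, volume):
        lun_id = self.volume_to_lun.pop(volume["id"], None)
        if lun_id is not None:
            self.client.delete_lun(lun_id)

    def initialize_connection(self, volume, connector):
        # Tell the compute host how to reach the LUN (iSCSI is assumed here;
        # the "data" fields are simplified for illustration).
        return {
            "driver_volume_type": "iscsi",
            "data": {"target_lun": self.volume_to_lun[volume["id"]],
                     "initiator": connector["initiator"]},
        }

    def get_volume_stats(self, refresh=False):
        # The scheduler uses figures like these to pick a backend.
        return {"total_capacity_gb": self.client.total_gb,
                "free_capacity_gb": self.client.free_gb()}


driver = SketchVolumeDriver(FakeArrayClient(total_gb=100))
driver.create_volume({"id": "vol-1", "size": 10})
print(driver.get_volume_stats())
```

The point of the design is that OpenStack only ever talks to the driver's methods; everything vendor-specific stays behind the client object, which is why each product line can ship its own driver.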

From a business-value standpoint, however, EMC supports OpenStack in order to integrate our storage products into OpenStack better, so that our storage services are fully used. In fact, as with Intel's active promotion of many open source projects, the ultimate goal is to make these open source projects run better on the company's own core software and hardware platforms. Of course, it is undeniable that the backing of these large companies has played a very positive role in the projects. With the people and resources that large companies invest, these open source projects can all develop better in their respective fields.

CSDN: What are the opportunities and challenges for traditional commercial storage in the cloud computing era?

Zhang Anzhan: To answer this question, I have to mention EMC's current third platform strategy. Simply put, the second platform is the traditional data center, where EMC has already taken the lead. The third platform is built on mobile devices, cloud services, social networks, and big data. Today's technological development can be said to have redefined many things, just like the "Redefine" theme of EMC World 2014. In this platform transition, some companies are doomed to be eliminated, while others will ride the crest of the new wave. EMC's traditional storage department will certainly be affected, but no one can say for sure what the impact will be. We are now also redesigning our product architecture, and many modules are being rebuilt to better meet the needs of the third platform.

Close reading of source code is required

CSDN: Do you have any suggestions for students learning Hadoop and Spark?

Zhang Anzhan: The most important things in learning are interest and passion. You should not learn a technology simply because it is popular; if you do, you may always be chasing technologies without truly improving yourself technically. From my own experience with Hadoop and Spark, it is necessary to read the source code closely, especially when you need to do secondary development. Of course, blogs are a very important channel, but a blog post is what the blogger has distilled and summarized from his own knowledge. He may understand 80% of the implementation, yet the post may only convey 60%. So after reading an article carefully, you must still go deep into the source code, read it closely, and compare. For example, when I was learning HDFS, every time I read a module I compared it side by side with our company's product implementation. HDFS servers are divided into NameNode and DataNode, while our products are divided into control path and data path. So at the level of the overall architecture the two can be mapped onto each other, although one is a distributed storage system and the other is a centralized storage system.
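The NameNode/DataNode split, and its resemblance to a control path and a data path, can be illustrated with a toy model. This is only a sketch under strong simplifying assumptions: the classes and helper functions are hypothetical, they are not HDFS's API, and replication, leases, and failure handling are all left out. The metadata server only records which node holds which block, while the bytes themselves go directly between the client and the data nodes.

```python
class DataNode:
    """Data path: stores raw blocks and knows nothing about file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Control path: keeps only metadata (file -> ordered block locations)."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.files = {}  # path -> list of (block id, DataNode)

    def allocate_block(self, path):
        blocks = self.files.setdefault(path, [])
        index = len(blocks)
        node = self.datanodes[index % len(self.datanodes)]  # round-robin placement
        blocks.append((f"{path}#{index}", node))
        return blocks[-1]

    def locate(self, path):
        return list(self.files[path])


def write_file(namenode, path, data, block_size=4):
    """Client side: ask the NameNode where to write, then send bytes to DataNodes."""
    for start in range(0, len(data), block_size):
        block_id, node = namenode.allocate_block(path)
        node.write_block(block_id, data[start:start + block_size])


def read_file(namenode, path):
    """Client side: look up block locations, then read from the DataNodes."""
    return b"".join(node.read_block(bid) for bid, node in namenode.locate(path))


nodes = [DataNode("dn1"), DataNode("dn2")]
nn = NameNode(nodes)
write_file(nn, "/demo.txt", b"hello hdfs")
print(read_file(nn, "/demo.txt"))
```

Note that no file data ever passes through `NameNode`; that separation is what makes the comparison with a control path and a data path natural.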

The same goes for learning Spark, and the Spark source code is undoubtedly more concise. I believe you can learn a lot from the source code. We all know that Spark is implemented in Scala. I think Scala is the coolest language. A good programmer will certainly like Scala.

Conclusion:

In the interview, we could feel Zhang Anzhan's passion, wisdom, and talent. As he wrote in his latest blog post, "Redefine: Change in the changing world", the development of science and technology has redefined technology; it affects your life and changes you and me.

Click Zhang Anzhan's blog to view more technical articles!

HDFS HA: Historical evolution of high-reliability distributed storage system solutions

HDFS tracing: Logic process and source code analysis of HDFS operations

HDFS tracing: Lease, fault-tolerance processing during read/write, and main data structures of the NN

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
