A big data "veteran" talks about building big data infrastructure

Martin Leach has been busy with big data. He served as CIO at the Broad Institute, a joint institute of MIT and Harvard University, where he was responsible for storing 13 PB of data and for the supercomputers used to process it. He and his team made a remarkable contribution to the human genome mapping effort.

Before joining the institute, he led a team that supported research and development at the pharmaceutical giant Merck. His new job is vice president of IT research at the biotech company Biogen, where the team already includes several data scientists. That team runs a large data analysis process in support of Biogen's research and development.

Before he left the nonprofit Broad Institute, our editors interviewed Leach. He described the difficulties CIOs face with big data, and the technology and skills needed to handle it. Leach says investment in big data analysis has risen from the initial $2 million trillion to $4 million trillion, and experts who can work with open source tools are scarce. The data analysts who are valued least are often the ones who find truly useful data for the business.

Q: As an adviser to CIOs, what advice do you usually give on building a big data infrastructure?

Leach: The first step is to identify what the enterprise's big data project plan is. What is the biggest requirement for the project? That is the most important question at the initial stage, not which technology to use or what needs to be purchased.

Q: At the Broad Institute, what were the biggest needs for big data projects?

Leach: The biggest demand at the time was to handle the generation, ingestion and storage of data internally. There was a competition between public institutions, such as the Broad, and the private sector to see who could map the human genome first. Because of this external pressure, we were thinking about how to make the project faster. We could slow down, abandon the project, or find a quicker way to carry it out.

This was definitely a challenge for me, especially since I don't know much about biotechnology. They outsource some of the experiments and have the generated data sent back, and all of a sudden they have a terabyte-scale transfer on their hands, and they have questions: "What kind of data do I have on that hard drive? How do I get at this data? Where do I put the data while I compute on it? How do I compute on it?" What I see among life scientists is that they have a very strong demand for data processing, and their first questions are: "How do I handle this data? Where should I put it?"

Q: Where is the data stored?

Leach: Many companies keep it in-house. Some companies put it in the cloud, but the amount of data there is small and it is not used much. Data in the life sciences usually includes genetics and genomics, drug information or patient records, and there is a lot of concern about storing it outside the firewall.

So, once you're sure why you need the data, the next job is to think about how to store it. After that comes the question of how to compute on it. Does it need to be processed on internal machines, or in the cloud, such as on Amazon, as needed? This raises another question: why the data needs to be handled internally in the first place.

Q: Is it easy to get data?

Leach: The actual acquisition process is not simple. Depending on transfer speeds, some companies pull the data down from the cloud. Some ship hard drives. There are a lot of questions involved. For example, you get data from Boston, but your data center is in North Carolina, and the question I need to solve is how to get tens of billions of bytes of data to the server through the corporate network. So what do I do?
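A rough back-of-the-envelope calculation helps put that question in numbers. The sketch below is not from the interview: the link speeds, the 70% efficiency factor, the 50 GB example size, and the function names are all illustrative assumptions. Only the 1 TB and 13 PB figures echo sizes mentioned elsewhere in the article.

```python
"""Back-of-the-envelope transfer-time estimate (illustrative assumptions only)."""


def transfer_time_seconds(size_bytes: float, link_mbps: float,
                          efficiency: float = 0.7) -> float:
    """Return the estimated seconds needed to move size_bytes over a link.

    link_mbps is the nominal link speed in megabits per second.
    efficiency is the assumed fraction of that speed actually achieved
    (protocol overhead, competing traffic); 0.7 is a guess, not a measurement.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return size_bytes * 8 / usable_bits_per_second


def format_duration(seconds: float) -> str:
    """Render a duration in minutes, hours, or days, whichever fits best."""
    if seconds < 3600:
        return f"{seconds / 60:.1f} minutes"
    if seconds < 86400:
        return f"{seconds / 3600:.1f} hours"
    return f"{seconds / 86400:.1f} days"


if __name__ == "__main__":
    # Example sizes: 50 GB is an assumed stand-in for "tens of billions of bytes".
    sizes = {"50 GB": 50e9, "1 TB": 1e12, "13 PB": 13e15}
    # Hypothetical link speeds in megabits per second.
    links = {"100 Mbps": 100, "1 Gbps": 1_000, "10 Gbps": 10_000}
    for size_name, size_bytes in sizes.items():
        for link_name, mbps in links.items():
            estimate = format_duration(transfer_time_seconds(size_bytes, mbps))
            print(f"{size_name} over {link_name}: about {estimate}")
```

When an estimate like this comes out in days or longer, shipping hard drives, as some of the companies Leach mentions do, can start to look like the more practical option.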

Q: How do companies handle data acquisition?

Leach: In some cases the data is just sitting on a pile of hard disks, and the business is passive about getting it onto a server. In other cases, companies try to move the data over their internal networks, which in turn hurts those networks, because the data is travelling over a typical enterprise network rather than a data center network. Others work closely with the IT department.

This depends in part on how the other parts of the enterprise work with IT. I think the limiting factor is getting other departments to work better with IT, and making sure IT is flexible enough. Such projects are not traditional, standard IT infrastructure. If you try to build big data on an Oracle database, Oracle will advise you to buy additional hardware, but you also need database experts who understand not only conventional relational databases but also NoSQL systems such as CouchDB and MongoDB.

The next step is to find a group of highly qualified people who are skilled with open source technology such as Hadoop and OpenStack. Talent is vital to the team, and I often hear colleagues complain: "Where do I find real talent?"

Q: In which fields should CIOs look for talent?

Leach: I learned from the CTO at eBay that one important field is economics. Economists like to dig for gold in data, and they like to use data to solve deep-seated problems. A group of economists who suddenly discovered big data said: "Wow, we've never dealt with data at this scale."

Q: So you look for people who enjoy data mining, rather than insisting on experience with open source tools?

Leach: I once saw a group of physicists working in big data. The staff at the Large Hadron Collider have to immerse themselves every day in the petabyte-scale data the machine generates. Economists, physicists and those who like derivatives are typical data analysts: they like data. I'm going to look for the right people among economists, because I hadn't taken them seriously enough before.

Q: What is the biggest misconception that some companies have about big data?

Leach: I don't think a lot of companies realize how carefully they need to handle data from the outset. If you spend too little time on data management, annotation and organization, it affects how you can use the data later. One statistic we saw showed that five months after a project is completed, no one looks at its data anymore. What do you do with your data from the past two years? Erase it? Reorganize it? Given the current drop in storage costs, we can afford to keep this data.

Q: Is that what you mean when you say that people who are just starting to face big data tend to be short-sighted?

Leach: It's not just the IT department that's short-sighted; it's the same with the people who collect the data. IT departments are responsible for data collection, and from an IT point of view they do not think about the long term, while the collectors focus only on current data, or on the data they collected themselves.

Q: To achieve the goals of big data you need to collect enough data, and the more you collect, the more accurate the predictions. Is that the right way to understand it?

Leach: Yes. Big data is only "big" if you can really handle it.

(Responsible editor: The good of the Legacy)
