Understanding DMTK, Microsoft's Open-Source Core Machine Learning Technology

On November 9, Google Research released TensorFlow, its second-generation open-source machine learning library. Google says that for building and training neural networks, TensorFlow is up to five times faster than its first-generation system, and that it runs on CPUs and GPUs across desktop, server, and mobile platforms. TensorFlow drew wide attention from developers.

On the same day, Microsoft Research Asia open-sourced its Distributed Machine Learning Toolkit, DMTK. The open-source release contains what is described as the world's largest topic model together with a distributed word vector model, and is said to exceed comparable models by several orders of magnitude. Some developers were surprised: why would Microsoft open-source such core technology?

So what is the DMTK distributed machine learning toolkit? The answer starts with the history of its development. Tie-Yan Liu, who leads DMTK's research and development, is a principal researcher in the artificial intelligence group at Microsoft Research Asia and a doctoral advisor at Carnegie Mellon University (CMU). He told reporters that three major trends have emerged in machine learning worldwide in recent years: larger-scale machine learning, deeper machine learning, and more interactive machine learning. All three are driven by the rise of big data and cloud computing.

Microsoft Research Asia began developing the DMTK distributed machine learning system two years ago. First, DMTK meets the demands of large-scale machine learning through distributed deployment. With the spread of cloud computing and high-performance processors, machine learning has moved from single-machine environments to multi-machine environments and clusters. Distributed machine learning deploys learning algorithms on inexpensive clusters, extending the computing power of a single machine to as many as tens of thousands of servers.

The open-source DMTK provides a simple and efficient distributed machine learning framework consisting of a parameter server and a client software development kit (SDK). With just a few lines of code, developers can extend their own machine learning algorithms from a standalone environment to a multi-machine or cluster environment. This significantly lowers the barrier to entry: university researchers and commercial machine learning developers alike can use the open-source DMTK to scale the computing environment and resources of their algorithms, enabling large-scale machine learning on big data.
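The parameter-server idea described above can be illustrated with a small sketch. This is not DMTK's SDK: the `ParameterServer` class, its `get` and `add` methods, and the `worker` and `train` helpers are hypothetical names invented for this example, and threads in one process stand in for machines in a cluster. Each worker repeatedly pulls the shared parameters, computes a gradient on its own data shard, and pushes an update back without waiting for the others.

```python
import threading

import numpy as np


class ParameterServer:
    """Toy in-process stand-in for a parameter server (hypothetical API)."""

    def __init__(self, dim):
        self._weights = np.zeros(dim)
        self._lock = threading.Lock()

    def get(self):
        # Workers pull a copy of the current global parameters.
        with self._lock:
            return self._weights.copy()

    def add(self, delta):
        # Workers push an update; the server merges it into the global state.
        with self._lock:
            self._weights += delta


def worker(server, features, targets, steps, lr):
    """Run gradient descent for linear regression on one data shard."""
    for _ in range(steps):
        w = server.get()
        grad = features.T @ (features @ w - targets) / len(targets)
        server.add(-lr * grad)


def train(num_workers=4, steps=200, lr=0.1, seed=0):
    """Train on synthetic, noiseless data split across several workers."""
    rng = np.random.default_rng(seed)
    true_w = np.array([2.0, -3.0])
    features = rng.normal(size=(400, 2))
    targets = features @ true_w
    server = ParameterServer(dim=2)
    shards = zip(np.array_split(features, num_workers),
                 np.array_split(targets, num_workers))
    threads = [threading.Thread(target=worker, args=(server, x, y, steps, lr))
               for x, y in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server.get(), true_w
```

In this sketch the training loop inside `worker` is the same loop one would write for a single machine, except that reads and writes of the model go through `get` and `add`. That separation is what allows a single-machine algorithm to be moved to many workers with few changes.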

Second, DMTK also ships machine learning algorithms designed for deeper, faster learning. The current open-source version contains two distinctive algorithms: the LightLDA topic model and the distributed word vector model.

What is a topic model? The Internet and social platforms have produced enormous volumes of text, and mining that text with machine learning can extract the topics it covers, which is a foundation for machine text understanding. According to Tie-Yan Liu, the LightLDA algorithm in DMTK is the only machine learning algorithm in the world that can train more than 1 million topics, and it can train a topic model of that size with only 20 servers (a little over 300 CPU cores), leaving similar systems far behind.

Last year, the AliasLDA algorithm, which received the best paper award at the KDD data mining conference, used up to 10,000 CPU cores to train 2,000 topics. LightLDA trains models several orders of magnitude larger with far fewer computing resources than AliasLDA, because it uses an original, highly efficient sampling method whose cost is independent of the number of topics. As a result, training more topics does not require more computing resources. It is understood that LightLDA has helped many of Microsoft's key products achieve a leap in performance.
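To see how the cost of drawing a sample can be independent of the number of topics, consider the alias method, a well-known technique for constant-time sampling from a discrete distribution. The article does not describe LightLDA's sampler in detail, so the sketch below is only an illustration of the general idea and not a reproduction of LightLDA; the function names are made up for this example. Building the table takes time proportional to the number of outcomes K, but each later draw takes a fixed number of operations regardless of K.

```python
import random


def build_alias_table(weights):
    """Preprocess K non-negative weights in O(K) for O(1) draws."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        lo = small.pop()
        hi = large.pop()
        prob[lo] = scaled[lo]
        alias[lo] = hi
        # The large bucket donates the mass needed to fill the small one.
        scaled[hi] -= 1.0 - scaled[lo]
        (small if scaled[hi] < 1.0 else large).append(hi)
    for i in small + large:
        # Leftover buckets are full up to rounding error.
        prob[i] = 1.0
    return prob, alias


def draw(prob, alias, rng=random):
    """Draw one index in constant time: pick a bucket, then flip a coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


def implied_distribution(prob, alias):
    """Recover the exact distribution the table encodes (for checking)."""
    n = len(prob)
    out = [0.0] * n
    for i in range(n):
        out[i] += prob[i] / n
        out[alias[i]] += (1.0 - prob[i]) / n
    return out
```

For example, `build_alias_table([1.0, 2.0, 3.0, 4.0])` produces a table whose implied distribution assigns the four outcomes probabilities proportional to their weights, and `draw` touches only one bucket per sample no matter how many outcomes there are.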

The other, perhaps more striking, algorithm is the distributed word vector training model, which measures the "distance" between two words. Put simply, search engines have traditionally relied on exact matching of search keywords: if a page contains the same words as the query, a link to that page appears in the results. But applications such as ad display, topic exploration, and vertical search need matching at the semantic level, that is, matching by semantic relevance. A word vector model mines text data to learn thousands of numeric features (dimensions) for each word, so that each word becomes a vector with thousands of dimensions. The mathematical distance between two word vectors then effectively characterizes the semantic relatedness of the two words.
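As a small illustration of measuring the distance between word vectors, the sketch below uses cosine distance, a common choice for this purpose. The article does not say which distance DMTK uses, so that choice is an assumption, and the four-dimensional vectors are made-up values rather than trained embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Smaller values mean the two words are more closely related."""
    return 1.0 - cosine_similarity(u, v)


def nearest(word, vectors):
    """Return the other word whose vector is closest to the given word."""
    return min((w for w in vectors if w != word),
               key=lambda w: cosine_distance(vectors[word], vectors[w]))


# Made-up vectors for illustration only; real models use far more dimensions.
toy_vectors = {
    "car": [0.9, 0.1, 0.0, 0.3],
    "automobile": [0.8, 0.2, 0.1, 0.3],
    "banana": [0.0, 0.9, 0.7, 0.1],
}
```

With these toy values, `nearest("car", toy_vectors)` selects "automobile" even though the two strings share no keyword, which is the kind of semantic match that exact keyword matching cannot provide.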

The distributed word vector model in DMTK is presented as the only multi-machine version of a word vector model. It extends training from the computing resources of a single machine to multiple machines or clusters, and so learns word vectors faster and more efficiently. The distributed word vector model moves "search" toward "exploration", which is expected to bring disruptive change to search and related industries.

It is understood that DMTK has been applied to Microsoft's Bing search engine, advertising, Xiaoice, and other online products to achieve more interactive machine learning. Take Microsoft Xiaoice, a chatbot: conversations between human users and Xiaoice average 18 rounds, compared with an average of just 1.5 to 2 rounds for the most advanced comparable chatbots. This means that Xiaoice "hits" relevant words in conversation with humans far more often than similar technologies do, creating a better interactive machine learning experience.

After its release, DMTK stayed in the top 10 on GitHub for a week. Traffic to the official DMTK website has passed the million mark, the DMTK executables have been downloaded in large numbers, and GitHub developers gave DMTK thousands of stars within a week. Many comparable open-source projects do not attract this level of attention in years.

What is the difference between TensorFlow and DMTK, both open-sourced on the same day? The reporter learned that the version of TensorFlow Google has currently open-sourced is a single-machine deep learning tool that does not support distributed computing, whereas the open-source version of Microsoft's DMTK supports deployment on distributed, heterogeneous, asynchronous computing clusters. In addition, TensorFlow is mainly a system implementation and does not include algorithmic innovation, while DMTK offers both, so it can train models that are orders of magnitude larger with fewer resources.

So why are the tech giants open-sourcing their machine learning technology? One reason is to promote the adoption of machine learning applications as a whole, opening up new opportunities for the AI and robotics industries through open-source high-end algorithms and software. The other is to build a deeper ecosystem around their software and algorithms, and to shape the next-generation industry landscape from a strategic vantage point. (Text: Ningchuang; first published on Titanium Media)