Li Hang: New Trends in Machine Learning - Learning from Human-Computer Interaction


Li Hang, chief scientist at Noah's Ark Lab, delivered a keynote speech.


(Photo: Li Hang, chief scientist at Noah's Ark Lab)

Li Hang said: so far, across the various fields of AI research, we have found that the most effective approach may be the data-driven one. Using machine learning, we can make our machines more intelligent.

At the same time, Li Hang pointed out that a key question in machine learning is how much data we actually need, and the answer is a great deal: even to learn a simple binary classifier, we may need many thousands of labeled examples.

The following is the speech by Li Hang, chief scientist at Noah's Ark Lab:

Li Hang: Good morning, everyone. I am very happy to have this opportunity to communicate with you. This is the first time I have attended this conference. I do research myself; my research areas are natural language processing, information retrieval, and data mining. Dr. Sun and some of my colleagues work in the same fields.

Today, I will share with you a new trend in machine learning that I have recently observed: obtaining data from human-computer interaction makes learning more effective and lets us build more intelligent systems. We all agree that intelligence is an inevitable trend in the development of computer science; our computers are becoming more and more intelligent. In this process, we must have a very powerful means. So far, across the fields of artificial intelligence, we find that the most powerful means may be the data-driven one: machine learning can make our machines more intelligent. I wrote a blog post titled "Machine learning is changing our work and life", which explains why intelligence must be based on large-scale data and must be driven by statistical machine learning. Let me also make a small advertisement here: this year I published a book on statistical learning methods, and one purpose of that book is to help software developers quickly master these methods and quickly build intelligent systems.

Why does machine learning require a large amount of data, and how much data is sufficient? Let us look at the new trend in machine learning of obtaining more data through human-computer interaction, which includes log data mining, crowdsourcing, and the currently popular human-machine collaborative computing. Finally, I will describe how to use such large amounts of data to build a very intelligent system.

We all know that statistical machine learning is based on data, and the most important step is to collect data. High-quality, large-scale data can help us build a very intelligent system. This raises a very simple question: how much data do we need to build an intelligent system? It is an important question, and there are many studies of it in machine learning. For example, an important research topic in statistical learning theory is so-called sample complexity: how many samples, how much training data, do I need to learn a model well? This is a difficult question, and even for those of us who have worked on it for a long time it is not always easy to answer. Suppose, for example, that we want to build a binary classifier, the most basic model in machine learning, to determine whether an image contains a face or not, the so-called face detection problem. There is a theoretical result for this problem, often called the Occam's Razor theorem. Its conclusion is that the number of samples needed to learn a binary classifier is tied to the learning accuracy: the higher the accuracy we demand, the more samples we need. It is also tied to the complexity of the model to be learned: if the model is very complex, many samples are needed. Concretely, the number of samples must be at least a certain quantity determined by the accuracy we require, the confidence with which we want our judgment to be reliable, and the model complexity. For example, if we want to learn a binary classifier whose model complexity is 100, the theorem tells us that more than 50,000 training examples may be needed to learn such a classifier well. That is a very large volume: we need a lot of data to complete this task.
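Li Hang does not write the bound out, but a common textbook form of this kind of sample-complexity result, given here only as an illustration, is

\[
m \;\ge\; \frac{c}{\varepsilon}\Big( d\,\ln\frac{1}{\varepsilon} + \ln\frac{1}{\delta} \Big),
\]

where \(m\) is the number of training samples, \(\varepsilon\) the target error, \(\delta\) the allowed failure probability, \(d\) a measure of model complexity, and \(c\) a constant. Taking \(c = 1\), \(d = 100\), and \(\varepsilon = \delta = 0.01\) gives \(m \ge 100\,(100\ln 100 + \ln 100) \approx 4.7 \times 10^4\), in line with the figure of more than 50,000 examples mentioned above.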

In general, suppose the model to be learned has K parameters; the number of parameters is a common indicator of model complexity. As a rule of thumb, we need training samples numbering at least hundreds of times the number of model parameters to learn the model well. The applications we build today are often very complex in what they need to do, so the number of model parameters is very large, sometimes in the millions. If learning a model well requires hundreds of times that many samples, then machine learning needs an enormous number of training samples, and these must be samples usable for learning, not just arbitrary data we happen to collect. What do we usually do? Take face detection on photos from a digital camera: the reason it works is that we have spent money to hire many professionals to manually label photos, marking whether a face is present for people of different ages, races, and genders, under different lighting and resolution conditions. Only with a large amount of such labeled data, data that truly covers the various situations, can we effectively learn a face detection model. So in practice we need a lot of high-quality data to build an intelligent system, and we face a great challenge: how can we collect such high-quality data? Here a new trend in machine learning has emerged: through interaction between humans and machines, we hope to collect a large amount of high-quality data during the interaction process. This has become a new trend in machine learning that deserves everyone's attention.
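Stated as a formula, this rule of thumb (the exact multiplier is not given in the talk) says the number of labeled examples \(N\) should scale with the parameter count \(K\) as

\[
N \;\gtrsim\; c \cdot K, \qquad c \approx 10^2 \text{ or more},
\]

so a model with \(K = 10^6\) parameters would call for on the order of \(10^8\) labeled examples under this heuristic.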

We hope to use various clever methods to collect data through human-computer interaction, such as log data mining, crowdsourcing, and human-machine collaborative computing. A hot topic in current research is how to build better mechanisms to effectively collect large amounts of high-quality machine learning training data from users. Consider Internet search: the search engine records everything in its logs. For example, when a user submits a query, the system returns a list of URLs, and the user's clicks on those URLs are recorded as log data. This data is very useful and very helpful for improving the search engine's ranking. Hundreds of millions of users use the search engine every day, the number of submitted queries is of an even larger order of magnitude, and different users submit different queries and click different URLs. From this information we can work out what users actually want from the search engine, and such a large volume of queries and clicks is closely tied to the quality of the search engine and of its relevance ranking. This kind of log data mining is widely used in search engines, and in other applications people are likewise trying to collect different types of log data to improve the application concerned. In general, users do not spend any extra effort: they simply use the application, and if we record the usage process well, the data we take back is user feedback, implicit feedback. Using this feedback data to improve the current application is a reasonable and natural idea, and if the system is based on machine learning, log data can be used to improve its performance. Of course, user behavior data is often noisy, so we need to consider how to remove noise and improve the quality of the log data.
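As a concrete illustration of this kind of log mining, here is a minimal Python sketch that turns raw click records into weak relevance labels; the record format, field names, and thresholds are hypothetical assumptions, not the pipeline of any particular search engine.

```python
# Minimal sketch: turn (query, shown URLs, clicked URL) click records into
# weak relevance labels. The record format and thresholds are hypothetical.
from collections import Counter

raw_log = [
    {"query": "sdcc", "shown": ["u1", "u2", "u3"], "clicked": "u2"},
    {"query": "sdcc", "shown": ["u1", "u2", "u3"], "clicked": "u2"},
    {"query": "sdcc", "shown": ["u1", "u2", "u3"], "clicked": "u1"},
]

def to_training_triples(log, min_impressions=2, ctr_threshold=0.5):
    """Aggregate clicks per (query, url); aggregation over many users is a
    crude but standard way to suppress the noise in individual clicks."""
    clicks = Counter((r["query"], r["clicked"]) for r in log)
    shown = Counter((r["query"], u) for r in log for u in r["shown"])
    triples = []
    for (q, u), n_shown in shown.items():
        if n_shown < min_impressions:
            continue                                  # too little evidence, skip
        ctr = clicks[(q, u)] / n_shown                # click-through rate
        triples.append((q, u, 1 if ctr >= ctr_threshold else 0))
    return triples

print(to_training_triples(raw_log))
# e.g. [('sdcc', 'u1', 0), ('sdcc', 'u2', 1), ('sdcc', 'u3', 0)]
```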

Another example you may know is Amazon Mechanical Turk. If I need data labeled, I post the labeling task on this marketplace, where there are many registered members, the so-called workers. The workers can see the various labeling tasks on offer and pick the ones they are interested in according to their interests, preferences, and abilities. Some earn compensation through labeling, some use it to earn extra income in their spare time, and some treat it as entertainment, as learning, or simply as a way to pass the time; it is very popular. Amazon Mechanical Turk has millions of registered workers, and a huge amount of labeling work is done there every day. The labeling comes in many forms; for example, the image labeling needed for face detection can become a task on Amazon Mechanical Turk: a large number of pictures are given to workers, who judge whether each picture contains a human face. This is often simple for people, a few seconds per label, yet often very difficult for machines. If a large number of workers help label a large amount of image data, it can help us quickly build an intelligent system. This use of so-called crowdsourcing effectively helps the developers of smart systems collect large amounts of data, and it can often achieve our goal at a very small cost. The Internet provides exactly this possibility: a marketplace where we can quickly recruit suitable workers to complete these labeling tasks. That is the essence of crowdsourcing. Platforms represented by Mechanical Turk link the people who need labeling with the people willing to do it, and provide a place where all kinds of labeling work can be carried out.
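The talk does not go into how to keep crowd labels reliable, but a standard and simple safeguard, shown here only as an illustration, is to assign each item to several workers and take a majority vote:

```python
# A small illustration (not from the talk) of one standard way to deal with
# noisy crowdsourced labels: ask several workers per item and take a majority vote.
from collections import Counter

worker_labels = {
    "img_001": ["face", "face", "no_face"],      # three hypothetical worker answers
    "img_002": ["no_face", "no_face", "no_face"],
}

def majority_vote(labels):
    """Return the most frequent label and the fraction of workers who agreed."""
    winner, count = Counter(labels).most_common(1)[0]
    return winner, count / len(labels)

for item, labels in worker_labels.items():
    label, agreement = majority_vote(labels)
    print(item, label, f"agreement={agreement:.2f}")
```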

Another way to collect data is through games, playing games with a purpose. A famous example is the ESP game. Two players are shown the same picture at the same time and asked to label it simultaneously. If they enter the same keyword, they both score; if their labels do not match, neither scores. Because both players want to score, each tries to label the image as accurately as possible, and by common sense the two will converge on sensible, commonly understood tags for the image. These tags can then be used by machine learning algorithms. Google has used this method for image search.
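A toy sketch of the scoring rule just described, with a hypothetical point value, might look like this:

```python
# Toy sketch of the ESP-game scoring rule described above: two players tag the
# same image independently and both score only when a tag matches.
def esp_round(tags_player_a, tags_player_b):
    """Return the agreed tags; in the real game these become labels for the image."""
    agreed = set(t.lower() for t in tags_player_a) & set(t.lower() for t in tags_player_b)
    score = 10 * len(agreed)            # hypothetical scoring: points per agreed tag
    return agreed, score

agreed, score = esp_round(["dog", "grass", "park"], ["Dog", "frisbee"])
print(agreed, score)    # {'dog'} 10
```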

Another example is reCAPTCHA. When logging on to a website we are often asked to enter a verification code, and many websites use the reCAPTCHA system, which shows us two words to type. The user does not necessarily know that only one of the words is the actual verification test: it has been deliberately distorted so that a machine cannot read it, and it tells the system whether the current user is a human or a machine. The other word comes from a scanned document that the OCR system could not recognize, because OCR is a hard problem; reCAPTCHA presents it as if it were another part of the verification code and lets the user type it in. In this way, while passing the human test, users provide a large amount of OCR training data, which helps turn old books into digital text. A huge number of users thus indirectly participate in this data work on the Internet and help improve OCR accuracy.
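A simplified sketch of that mechanism (the real reCAPTCHA service is of course far more involved) could look like this:

```python
# Toy sketch of the reCAPTCHA idea described above: one word has a known answer
# (the real human test); the other is a word OCR could not read. If the user gets
# the known word right, their answer for the unknown word is kept as a human
# transcription. All details here are simplified assumptions.
def check_recaptcha(user_input_known, user_input_unknown, known_answer, transcriptions):
    if user_input_known.strip().lower() != known_answer.lower():
        return False                                           # treat the user as a possible bot
    transcriptions.append(user_input_unknown.strip().lower())  # harvest OCR training data
    return True

collected = []
ok = check_recaptcha("morning", "upon", known_answer="morning", transcriptions=collected)
print(ok, collected)   # True ['upon']
```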

Luis von Ahn, who proposed the reCAPTCHA and ESP games I just mentioned, is a very well-known scientist of this generation and has put forward many interesting methods. He has taken these ideas further and proposed human computation: we can regard people as computers. There are then two kinds of computers in the world, machine computers and human computers, each with its own strengths. Each should do what it is good at, and the two should work collaboratively and complement each other, so that many tasks can be completed better. That is the main idea of human-machine collaborative computing.

So there are three ways to help us collect data: log data mining, crowdsourcing, and human-machine collaborative computing. In log data mining the user does not feel that he is contributing data; he simply uses the system, and in the process provides data that helps improve it. In crowdsourcing and human-machine collaborative computing the user is aware of participating, whether motivated by payment or by the satisfaction of helping others, playing a game, or doing something else; either way, this helps improve the system by providing it with more useful data.

We now have a variety of methods to help us collect a large amount of high-quality, useful data. Can such data really help us do a lot of things? The answer is yes. If we design the data collection mechanism well and design the machine learning method well, we can combine the two to build a smart system. Let me give you an example from my previous work at Microsoft Research Asia. At the time we were working on Internet search, and we hoped to collect a large amount of query and click log data from users to help us learn the ranking model, or relevance model, used in search. The problem to be solved is that during search, the user's query should match the content of a web page semantically, but the surface words often do not match, and in that case a search engine that ranks pages by keyword matching suffers. For example, a user may query "sdcc" while the relevant web page is written in Chinese as the Chinese software developer conference: the two are semantically related, but one is in English and the other in Chinese, so at the word level they do not match. What we want in search is to automatically learn a relevance model that computes the semantic matching relationship between queries and web pages.

Our idea was to use a large amount of click data to carry out this learning, since a huge amount of click data can be collected in a search engine. We can formulate the problem as follows. There are two spaces: a query space containing a great deal of query data, and a web page (text) space containing the documents. Similarity can be computed within each space; for example, the similarity of two queries can be computed from their word overlap, and likewise a text can be represented as a vector so that the similarity of two texts can be computed. What is the really valuable data? The click data collected in the logs links the two spaces: if query Q1 is clicked against text D2, we know that Q1 and D2 are associated. Given such data, we hope to learn a model that maps all the queries and all the texts into a new space, in which a relevance function, distance, or similarity can be found automatically. In the new space, for any query we can find the texts similar to it, and for any text we can find the queries similar to it. Originally the two kinds of data were heterogeneous; in the new space they can be handled uniformly, and what we need to learn are these mappings. With click data, we should be able to judge the relationship between queries and texts well. This framework actually contains the most basic models of traditional information retrieval, such as the VSM, as special cases: those models do not learn the two mappings automatically, but are equivalent to two manually defined, simple mappings. What we want is a more general model, and the automatic learning approach driven by click data turns out to be more effective and more general than the traditional information retrieval models. As you can imagine, through this learning we can learn semantic matches between queries and documents: even when the literal match between two expressions is poor, we can learn that the two are highly relevant. As a result, trained on click data in this way, this model outperforms traditional models such as BM25.
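To make the idea concrete, here is a minimal numpy sketch of learning two linear mappings into a shared latent space from clicked (query, document) pairs; the synthetic data, dimensions, and hinge-loss training rule are illustrative assumptions, not the specific model Li Hang's team built.

```python
# Sketch: learn mappings Wq, Wd that project query vectors and document vectors
# into one latent space, so that clicked (query, document) pairs score higher
# than randomly paired ones. Data and training rule are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
dim_q, dim_d, dim_latent, n_pairs = 50, 80, 10, 200

# Synthetic clicked pairs: a query and its clicked document share a hidden
# "meaning" vector z plus noise, so a semantic match exists even though the
# two raw feature spaces are different.
Z = rng.normal(size=(n_pairs, dim_latent))
Q = Z @ rng.normal(size=(dim_latent, dim_q)) + 0.1 * rng.normal(size=(n_pairs, dim_q))
D = Z @ rng.normal(size=(dim_latent, dim_d)) + 0.1 * rng.normal(size=(n_pairs, dim_d))

# Two linear mappings into the shared latent space, learned from the clicks.
Wq = rng.normal(scale=0.1, size=(dim_q, dim_latent))
Wd = rng.normal(scale=0.1, size=(dim_d, dim_latent))

lr = 0.01
for step in range(300):
    neg = rng.permutation(n_pairs)                    # shuffled documents as negatives
    pos_score = np.sum((Q @ Wq) * (D @ Wd), axis=1)
    neg_score = np.sum((Q @ Wq) * (D[neg] @ Wd), axis=1)
    margin = 1.0 - pos_score + neg_score              # hinge: clicked pair should win by 1
    active = margin > 0
    diff = (D[neg] - D)[active]                       # gradient only for violating pairs
    grad_Wq = Q[active].T @ (diff @ Wd) / n_pairs
    grad_Wd = diff.T @ (Q[active] @ Wq) / n_pairs
    Wq -= lr * grad_Wq
    Wd -= lr * grad_Wd
    if step % 100 == 0:
        print(f"step {step:3d}  mean hinge loss {np.maximum(margin, 0).mean():.3f}")
```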

This method can also be used for other things, for example the image annotation we saw today. If many images have been labeled by many people, we again have two different types of data, text and images: one image might be labeled "hook" and another "fishing". There is a text similarity in the text space and an image similarity in the image space, and the labels tell us which images and which texts are associated. Following the learning method just described, we can map images and texts into a new space in which, according to their semantics, "hook" and "fishing" end up close together; in this way we learn a similarity between heterogeneous data. This can be made very large scale, with an enormous number of images. We have not explicitly taught the system any semantics or image content, yet with such a large amount of labeled data we can in effect learn what is in the images.

To summarize: how much data does machine learning need? One conclusion is that we need a great deal. Even to learn a binary classifier, we may need many thousands of labeled examples, and this poses a great challenge. How can we meet it? Recently there have been new trends in machine learning, including log data mining, crowdsourcing, and human-machine collaborative computing, in which we skillfully design various mechanisms to obtain large amounts of high-quality data from users. We have seen that such large-scale, high-quality data can be used for Internet search or image tagging. What are the main problems to be solved here? First, we need to devise very clever mechanisms, such as the ESP game mechanism: if the mechanism is well designed, users can participate in data collection easily. ESP provides a game mechanism in which users take part voluntarily; because the game rules are set so cleverly, the user is induced to provide data, since he must provide a good label in order to score. With such clever designs, users can be encouraged to provide large amounts of high-quality data. The second problem is how to find enough users who can help us label such data, and how to ensure that the data quality is high and useful to us, because if the data quality is poor, the learned model will be poor. We therefore hope for clever designs that satisfy these conditions and yield a lot of high-quality data. Finally, with this data in hand, we must consider how to build effective learning methods that can process very large amounts of data and produce high-performance models, so that we can reach our goal of making systems more intelligent.

Finally, I would like to thank my colleague Yang Qiang from Noah's Ark Lab; many of these methods were discussed with him, and much of the work was done by our interns. I would also like to thank the organizing committee for providing this opportunity to communicate with you. Thank you.
