Machine Learning is Facing a Reproducibility Crisis

In a recent chat, a friend mentioned that the machine learning models his startup builds are so hard to keep track of that the team runs into serious problems when building on each other's work or sharing models with customers. Even the author of a model sometimes can't retrain it and get similar results! He hoped I could suggest a solution, but I had to admit that I run into the same problem in my own work. It may be hard to understand for people who haven't worked with machine learning, but when it comes to tracking changes and rebuilding models, we are still in the dark ages. Sometimes it feels like our programming work has gone back to the era before source code version control.

I started programming for a living in the mid-90s, when the standard tool for tracking and coordinating source code was Microsoft's Visual SourceSafe (VSS). It had no atomic check-ins, so multiple people could not edit the same file, and the network copy had to be scanned every night to catch mysterious corruption; even then, there was no guarantee the database would still be intact the next morning. I still felt lucky, though: one company I interviewed with managed its code versions with sticky notes on a wall, and another hung all of its files on a tree so developers could take one down, modify it, and put it back!

My point is that when it comes to version control, I am not squeamish. I have survived some truly terrible systems, and if I had to, I could cobble together a version control system out of rsync and a few precautions. All of that said, I can say with a clear conscience that machine learning is by far the worst environment for collaboration and change tracking I have ever come across.

To explain why, let's look at the life cycle of a typical machine learning model:

A researcher (she, in this story) decides to try a new image classification architecture.

She copies and pastes some code from a previous project to handle loading the dataset.

The dataset lives in one of her network folders. It was probably downloaded from ImageNet, but it is unclear exactly which copy. At some point, someone deleted the images that weren't in JPEG format, or made other minor changes, but never recorded them. (The sketch after this list shows one way to fingerprint a dataset so that this kind of silent drift is at least detectable.)

She tries a lot of slightly different ideas, fixing bugs and tweaking the algorithm. She makes these changes on her own machine, then copies the whole source tree to the GPU cluster and kicks off a full training run.

She runs many different training jobs, and since each one takes days or weeks, she often keeps editing the code on her local machine in the meantime.

Bugs may surface late in a long cluster run, which means she edits one file, copies it out to all of the machines, and resumes training.

She may take partially trained weights produced by one run and use them as the starting point for a new run with different code.

She keeps the model weights and evaluation scores from all of her runs, and when there is no time for more experiments she picks one set of weights as the final model. Those weights may have been produced by code that differs from whatever is currently on her development machine.

She probably checks her final code into source control, in her own personal folder.

She publishes her experimental results, along with the code and the trained weights.
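The dataset step is a good example of how easily provenance evaporates: files in a shared folder get deleted or tweaked and nobody notices until accuracy quietly shifts. As a minimal sketch of a countermeasure (this is not any standard tool, and the path in the usage line is hypothetical), hashing every file in a dataset directory yields a single fingerprint that a training run can record, so silent changes at least become detectable:

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(root: str) -> str:
    """Hash every file under `root` (relative names plus contents)
    into one digest. If anyone deletes, renames, or edits a file,
    the digest changes, so each training run can record exactly
    which state of the dataset it saw."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(root)).encode())
            digest.update(path.read_bytes())  # fine for a sketch; chunk for huge files
    return digest.hexdigest()

# Hypothetical usage: store this string alongside the run's other metadata.
print(dataset_fingerprint("/data/imagenet-subset"))
```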

For a conscientious researcher this is actually an optimistic scenario, but you can already see how hard it would be for somebody else to step in, repeat every one of these steps, and get the same result. Each of the points above is an opportunity for inconsistencies to creep in. What makes it even trickier is that ML frameworks trade exact numeric determinism for performance, so even if someone could magically repeat every step, the final results would still differ slightly!
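To make that last point concrete: even pinning every random seed you control only narrows the run-to-run gap, because many GPU kernels are nondeterministic by design for speed. Here is a hedged sketch of the usual seed-pinning ritual (the TensorFlow calls mentioned in the comments are the real ones for TF 1 and TF 2, but which apply depends on your setup):

```python
import os
import random

import numpy as np

def pin_seeds(seed: int = 42) -> None:
    """Pin every random number generator we control.

    This narrows run-to-run variation but does not eliminate it:
    GPU kernels (e.g. cuDNN convolutions, parallel reductions) can
    still sum floating-point values in a different order each run.
    """
    random.seed(seed)       # Python's built-in RNG
    np.random.seed(seed)    # NumPy's global RNG
    # Only affects child processes; the current interpreter's string
    # hashing was already fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Framework-level seeds, if a framework is in use: in TensorFlow 2
    # this would be tf.random.set_seed(seed); in TensorFlow 1,
    # tf.set_random_seed(seed).

pin_seeds(42)
```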

In many real situations, the researcher won't have accurately recorded, or won't remember, exactly what she did, so even she is unlikely to be able to reproduce the model. Even if she can, the frameworks the model code depends on change over time, sometimes radically, so she may need to snapshot the whole system just to keep it working. The ML researchers I've talked with are very generous with their time when it comes to helping others reproduce their results, but even with the original author's assistance it often takes months.

Why does this matter? Several friends have come to me for help after trying, and failing, to reproduce a published model as a baseline for their own papers. If they can't get the same accuracy the original authors reported, how can they tell whether their new approach is an improvement? The prospect of depending on a model in a production system when it can't be rebuilt as requirements or platforms change is also clearly worrying: at that point the model's technical debt stops being a high-interest credit card and starts looking like a deal with a loan shark. The problem holds back research experimentation too, because if changes to code and training data are hard to revert, trying different parameters carries more risk, just as programming without source control raises the cost of experimenting with changes.

Development in this area isn't all hopeless, though; some communities have put real effort into reproducibility. One of my favorites is the TensorFlow Benchmarks project led by Toby Boyd. His team's mission is not only to show exactly how to train models quickly from scratch on a variety of platforms, but also to make sure those models train to the expected accuracy. I have watched him sweat over hitting those accuracy targets, because a change in any of the steps I listed above can affect the final result, and even with the original author's help there is no easy way to debug and trace the root cause. It also looks like a never-ending job, since any change in TensorFlow, in GPU drivers, or even in the dataset can affect accuracy in subtle ways. Along the way, Toby's team has helped us find and fix bugs that TensorFlow changes introduced into models, and has traced problems back to external dependencies, but it is hard to scale this kind of effort up to larger sets of platforms and models.

I also know of other teams who are serious about using models in production and are putting a lot of time and energy into making their training reproducible, but the problem is that all of this work still depends on manual effort. We have nothing like source control, or even established best practices, for recording a training process so that it can be run again successfully in the future. I don't have a solution myself, but I do want to lay out some principles here that I believe any successful approach will need to follow:

Researchers must be able to try out new ideas easily, without paying a heavy "process tax." If they can't, they won't use the system. Ideally, it should actually boost their productivity.

If a researcher has an accident (let's hope never) or leaves to start a company, someone else should be able to take over the models she created the next day, continue training, and get the same results.

There should be a way to package up just what is needed to train one particular model and share it publicly, without revealing parts of the change history the author wants to keep private.

To reproduce results, the code, the training data, and the whole platform need to be recorded accurately.
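As a sketch of what recording "code, data, and platform" might mean in practice, here is one hypothetical shape for a per-run manifest, assuming the code lives in git and reusing the dataset fingerprint idea from earlier; none of this is an established standard, just the kind of record the principle above calls for:

```python
import json
import platform
import subprocess
import sys

def run_manifest(data_fingerprint: str, hyperparams: dict) -> dict:
    """Capture enough context to attempt re-creating a training run later."""
    return {
        # Exact code state (assumes the working tree is a git checkout).
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        # Exact data state, e.g. from dataset_fingerprint() above.
        "dataset_sha256": data_fingerprint,
        "hyperparams": hyperparams,
        # The platform: interpreter, OS, and installed package versions.
        "python": sys.version,
        "platform": platform.platform(),
        "packages": subprocess.check_output(
            [sys.executable, "-m", "pip", "freeze"], text=True).splitlines(),
    }

# Hypothetical usage: write the manifest next to the saved weights.
with open("run_manifest.json", "w") as f:
    json.dump(run_manifest("<dataset digest>", {"lr": 0.01, "epochs": 90}),
              f, indent=2)
```

Even a simple record like this would cover several of the failure points in the life cycle above, though it still says nothing about GPU drivers or framework-internal nondeterminism.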

I have seen some interesting responses to these challenges from open source projects and startups. Personally, I would love to spend all my time working on these problems, but I don't expect them to be completely solved any time soon. Whatever ideas we come up with will require changing the way we all work with models, just as source control changed everyone's individual coding process. Reaching consensus on best practices, and educating the community about them, will be just as important as the tools we create. I can't wait to see that day come!

Original: https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis/amp/
