Read this article before you reproduce a deep reinforcement learning paper!


Last year, OpenAI and DeepMind teamed up on one of the coolest experiments of the time: instead of training agents with a classical reward signal, they used a new approach to reinforcement learning based on human feedback. There is a blog post dedicated to the experiment, Learning from Human Preferences, and the original paper is "Deep Reinforcement Learning from Human Preferences".

Link:https://arxiv.org/pdf/1706.03741.pdf

With a bit of deep reinforcement learning, you too can train an agent to do a backflip.

I've seen the suggestion that reproducing papers is a great way to improve your machine learning skills, so I decided to try it myself. Learning from Human Preferences really was an interesting project to reproduce, and I'm glad I did it, but looking back, the experience was quite different from what I expected.

If you also want to reproduce a paper, here are some notes on deep reinforcement learning:

· · ·

First of all, reinforcement learning is usually more complicated than you expect.

A large part of the reason is that reinforcement learning is very sensitive. There are a lot of details that need to be handled correctly, and if they aren't, it's hard to diagnose what's going wrong.

Scenario 1: After the basic implementation was complete, training just wouldn't succeed. I had all kinds of ideas about what was wrong, but it turned out to be reward normalization and an off-by-one error in the pixel data at a key stage. Even knowing in hindsight what was wrong, I couldn't see an obvious path to finding it: the accuracy of the reward-predictor network on pixel data looked genuinely good, and it took a long time of carefully examining the reward predictor before I noticed the reward normalization error. Finding the cause was partly accidental: I only got onto the right path after noticing a small inconsistency.

Scenario 2: While doing the final code cleanup, I realized I had implemented dropout wrong. The reward-predictor network takes a pair of video clips as input, and each clip is processed identically by two networks with shared weights. If you add dropout to each network and accidentally forget to give both networks the same random mask, dropout drops different units in each network, so the two clips are no longer processed identically. Although the prediction accuracy of the network looked exactly the same, the bug completely broke training of the overall system.

Which one is the broken one? I honestly couldn't tell either.
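As an illustration, here is a hedged sketch (TF 1.x API; the layer sizes and variable names are mine, not from the original code) of one way to make both branches of a weight-sharing network see the same dropout mask:

    import tensorflow as tf

    keep_prob = 0.5

    def predictor_branch(clip, dropout_mask):
        # Both branches share weights via variable scope reuse.
        with tf.variable_scope('reward_predictor', reuse=tf.AUTO_REUSE):
            h = tf.layers.dense(clip, 64, activation=tf.nn.relu)
            # Apply one shared mask rather than calling tf.nn.dropout separately
            # in each branch, which would drop different units per branch.
            h = h * dropout_mask / keep_prob
            return tf.layers.dense(h, 1)

    clip_a = tf.placeholder(tf.float32, [None, 128])
    clip_b = tf.placeholder(tf.float32, [None, 128])

    # Build the dropout mask once and reuse it for both clips in the pair.
    mask_shape = tf.stack([tf.shape(clip_a)[0], 64])
    dropout_mask = tf.cast(tf.random_uniform(mask_shape) < keep_prob, tf.float32)

    reward_a = predictor_branch(clip_a, dropout_mask)
    reward_b = predictor_branch(clip_b, dropout_mask)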

I think this kind of thing happens often (see, for example, Deep Reinforcement Learning Doesn't Work Yet). My takeaway is that when you start a reinforcement learning project, you should expect to get stuck the way you get stuck on a math problem. It's not like my usual programming experience, where when you're stuck there's usually a clear lead to follow and you can get unstuck within days. It's more like trying to solve a puzzle with no obvious way to make progress, where the only option is to keep trying things until you find the key piece of evidence or the crucial insight that leads to the answer.

So the conclusion is: pay close attention to small details whenever you're confused.

This project had many sticking points where the only clue was attention to seemingly trivial details. For example, at one point I found that using the difference between successive frames as features made a big improvement. It was tempting to just push on with the new features, but I was confused about why they had such a big impact in the simple environment I was using. It was only by sitting with that confusion that I discovered that taking frame differences against a zero background was what made the normalization problem show up.

I'm not sure how to make people aware of this, but my best guess at the moment is:

    • Learn to recognize what confusion feels like. There are many different flavors of "something's not right." Sometimes you know the code is hard to read. Sometimes you worry you're wasting time on the wrong thing. But sometimes you see something you didn't expect: confusion. Being able to recognize that exact flavor of discomfort is important, because it's how you spot problems.
    • Develop the habit of following up on confusion. Some kinds of discomfort can be temporarily ignored (for example, code smells during prototyping), but confusion can't. When you feel confused, it's important to make the effort to find the cause.

Also, be prepared to get stuck for weeks at a time. If you keep going, pay attention to the small details, and stay confident, you can reach the other side.

· · ·

Speaking of how this differed from my past programming experience, the second big lesson was the change in mindset needed to work with long iteration times.

Debugging seems to involve four basic steps:

    • Gather evidence about what the problem might be
    • Form hypotheses about the problem (based on the evidence gathered so far)
    • Choose the most probable hypothesis, implement the fix, and see what happens
    • Repeat the above steps until the problem disappears

In most of the programming I'd done before, I was used to quick feedback. If something doesn't work, you can make a change and see the difference within seconds or minutes. Gathering evidence is easy.

In fact, when feedback is quick, gathering evidence is much cheaper than forming hypotheses. Why spend 15 minutes carefully thinking through everything that could be causing the symptom when you can test the first idea that comes to mind in a fraction of that time? In other words: when feedback is quick, you can get away with trying things instead of thinking hard.

But when each attempt takes 10 hours, a try-things-and-see strategy wastes an enormous amount of time. The last run failed? Well, I'll admit it happens: let's just run it again and check. Come back the next morning: still broken? Maybe it was a fluke, run it one more time. A week later, and you still haven't solved the problem.

Running several experiments at once and trying different things in parallel helps to some extent. But (a) unless you have access to a cluster, you can end up paying a lot for cloud compute (see below), and (b) because of the reinforcement learning difficulties described above, if you iterate too quickly you may never figure out exactly what evidence you actually need.

Moving from lots of experiments and a little thinking to a few experiments and a lot of thinking is a key productivity shift. When debugging with long iteration times, you really need to pour time into the hypothesis-forming step: think about what all the possibilities are, how likely each seems on its own, and how likely each seems in light of everything you've observed so far. Spend as much time on this as you can, even if it takes 30 minutes or an hour. Once you've fleshed out the hypothesis space as thoroughly as possible, work out which evidence would best distinguish between the different possibilities before you start running experiments.

(If you're treating this as a hobby project, thinking carefully matters even more. If you only work an hour a day on the project and each iteration takes a day, the number of runs you get per week is a scarce commodity you have to make the most of. Squeezing out time every day just to think about how to improve the runs can feel stressful. So I changed my approach: spend several days just thinking, without launching anything, until I was confident in my hypothesis about what the problem actually was.)

To do more thinking, keeping a detailed work log is essential. When each step of progress takes less than a few hours, skipping the log doesn't matter much, but with anything longer you'll easily forget what you've already tried and end up going in circles. The log format I settled on is:

Log 1: What specific output am I working on right now?

Log 2: Thinking out loud: hypotheses about the current problem, what to do next.

Log 3: A record of the current runs, with a short reminder of which question each run is meant to answer.

Log 4: Results of the runs (TensorBoard graphs, any other important observations), categorized by run type (e.g., by the environment the agent is being trained in).

At first my logs were relatively sparse, but by the end of the project my attitude had shifted towards "record everything I'm thinking." The time cost is high, but I think it's worth it, partly because some debugging requires cross-referencing results and ideas that are days or weeks apart, and partly because (at least as far as I can tell) the wholesale shift to writing thoughts down is itself a big upgrade in effective thinking.

A typical log

· · ·

To get the maximum mileage out of the experimental runs, I did two things during the project that may pay off later.

First, record all the metrics you can, to maximize the amount of evidence collected on each run. There are obvious metrics such as training/validation accuracy, but it's also worth spending a good chunk of time brainstorming which other metrics might matter for diagnosing potential problems.

I'm partly making this suggestion because of hindsight bias: I now know which metrics I should have started recording earlier. It's hard to predict in advance which metrics will turn out to be useful. Still, strategies that might help are:

For every important component in the system, think about what you could measure. If there's a database, measure how fast it's growing. If there's a queue, measure how fast items are being processed.

For every complex procedure, measure how long its different parts take. If you have a training loop, measure how long each batch takes to run. If you have a complex inference procedure, measure the time spent in each sub-step. These timings help enormously with later performance tuning, and sometimes expose bugs that are otherwise hard to spot. (For example, if one stage keeps getting slower, it could be a memory leak.)
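As a concrete illustration, here is a minimal per-batch timing sketch (the function is a placeholder; only the timing pattern is the point):

    import time

    def run_batch():
        # placeholder for one real training step
        time.sleep(0.01)

    for step in range(100):
        start = time.time()
        run_batch()
        batch_seconds = time.time() - start
        # Log batch_seconds (e.g. to TensorBoard). A slow upward drift in this
        # number is often an early symptom of a problem such as a memory leak.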

Similarly, consider tracking the memory usage of different components. Small memory leaks can point to all kinds of problems.
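One lightweight way to do this, assuming the psutil package is available (an extra dependency, not something from the original project), is to log the resident memory of the process alongside your other metrics:

    import psutil

    def log_memory_usage(step):
        # Resident set size of the current process, in megabytes.
        rss_mb = psutil.Process().memory_info().rss / 1e6
        print('step %d: resident memory %.1f MB' % (step, rss_mb))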

Another strategy is to look at what other people measure. In the context of deep reinforcement learning, John Schulman has some good suggestions in his Nuts and Bolts of Deep RL talk (https://www.youtube.com/watch?v=8ecdack9kaq). For policy-gradient methods, I found policy entropy to be a good indicator of whether training is healthy: it's much more sensitive than per-episode reward.

Entropy curves for unhealthy and healthy policies. Failure mode 1 (left): entropy converges to a constant (a random subset of actions is chosen); failure mode 2 (middle): entropy converges to zero (the same action is chosen every time). Right: policy entropy from a successful Pong training run.
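Here is a hedged sketch (TF 1.x; the action count and tensor names are just examples) of computing mean policy entropy for a discrete policy so it can be logged each step:

    import tensorflow as tf

    logits = tf.placeholder(tf.float32, [None, 6])        # e.g. 6 actions in Pong
    probs = tf.nn.softmax(logits)
    log_probs = tf.nn.log_softmax(logits)
    entropy = -tf.reduce_sum(probs * log_probs, axis=1)   # per-state entropy
    mean_entropy = tf.reduce_mean(entropy)                # scalar to log every step
    tf.summary.scalar('policy_entropy', mean_entropy)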

When you see something suspicious in the metrics you've recorded, remember to notice the confusion rather than dismiss it as something unimportant, like an inefficient data structure somewhere. (I ignored a tiny but inexplicable decay in frames per second, and that hid a multithreading bug for several months.)

Debugging is much easier if you can see all the metrics in one place. I like to use TensorBoard as much as possible. Logging arbitrary metrics with raw TensorFlow is awkward, so consider easy-tf-log (https://github.com/mrahtz/easy-tf-log), which provides a simple tflog(key, value) interface with no extra setup.
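A minimal sketch of what that might look like, based on the tflog(key, value) interface described above (check the project's README for the exact, current API):

    import easy_tf_log

    for step in range(100):
        loss = 1.0 / (step + 1)          # stand-in for a real training loss
        easy_tf_log.tflog('loss', loss)  # shows up in TensorBoard without extra setup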

The second thing that seems worthwhile is spending time trying to predict failures in advance.

Thanks to hindsight bias, the cause of a failure is often obvious after the fact. What's really frustrating is when the failure mode was obvious before you even looked at what it was: you start training a model, come back the next day, see that it failed, and before you've even investigated the cause you realize, "Oh, that must be because I forgot to set the frobulator."

The surprising thing is that sometimes you can trigger that half-hindsight realization in advance. It takes a conscious effort to stop and think for five minutes before starting a run. The script I found most useful is:

1. Ask yourself, "How surprised would I be if this run failed?"

2. If the answer is "not very surprised," put yourself in the future scenario where the run has already failed and ask, "If it did fail, what went wrong?"

3. Fix whatever came to mind.

4. Repeat until the answer to question 1 is "very surprised" (or at least "as surprised as I can get").

There will always be failures you can't predict, and sometimes you'll still overlook obviously avoidable mistakes, but this approach at least seems to cut down on the really silly mistakes you'd otherwise make by not thinking ahead.

· · ·

Finally, the most surprising things about this project were how long it took and how much compute it needed.

I initially estimated that, as a hobby project, it would take 3 months. It actually took about 8. (And the initial estimate was meant to be pessimistic!) Part of this came from underestimating how long each phase would take, but the biggest underestimate was failing to account for other things coming up outside the project. It's hard to say how well this generalizes, but for a hobby project, doubling your initial (already pessimistic) time estimate is probably a good rule of thumb.

Even more surprising was how the time was actually spent in each phase. The schedule for the main phases in my original project plan was basically this:

And this is how the time was actually spent at each stage:

It wasn't writing the code that took so long, it was debugging it. In fact, getting things working in a supposedly simple environment took about four times the initially expected implementation time. (This was the first hobby project where I tracked hours continuously, but the experience matched past machine learning projects.)

(Note: design even the "simple" reinforcement learning environment carefully from the start. In particular, think carefully about (a) whether your reward really conveys the right information about the task, and (b) whether the reward depends only on previous observations or also on the current action. The latter matters if you're doing any kind of reward prediction, for example with a critic.)

The other surprise was the total amount of compute required. I was lucky to have access to my university's cluster. The machines are CPU-only, but that was fine for some of the work. For work that needed a GPU (such as fast iteration on some small component), or when the cluster was too busy, I experimented with two cloud services: Google Compute Engine virtual machines (https://console.cloud.google.com/projectselector/compute/instances?supportedpurview=project&pli=1) and FloydHub (https://www.floydhub.com/).

Google Compute Engine is fine if you only want shell access to a GPU machine, but I ended up doing most of my work on FloydHub. FloydHub is basically a cloud compute service aimed at machine learning: you run floyd run python awesomecode.py, and FloydHub initializes a container, uploads your code, and runs it. Two key factors make FloydHub powerful:

    • Containers come preinstalled with GPU drivers and common libraries. (Even in 2018, I still spent a couple of hours on a Google Compute Engine VM wrestling with CUDA versions while updating TensorFlow.)
    • Every run is automatically archived. For each run, the code used, the command used to run it, all the command-line output, and all the output data are saved automatically and indexed through a web interface.

FloydHub's web interface. Top: an index of historical runs, and an overview of a single run. Bottom: the code used for each run and any data output by the run are automatically archived.

I can't overstate how important the second point is. For any project, a long-term, detailed record of what you ran, and the ability to reproduce past experiments, is absolutely essential. Version control software helps, but (a) managing large outputs is painful, and (b) it requires real diligence. (For example, if you launch some runs, then make a small change and launch another, when you look at the results of the first runs, will it be clear which code they used?) You can take careful notes or build your own system, but on FloydHub it just happens without any effort on your part.

Other things I like about FloydHub:

    • Containers shut down automatically when the run finishes. No worrying about checking whether a run is done or remembering to turn off a virtual machine.
    • Billing is more up-front than Google Cloud: you pay for, say, 10 hours of compute, and that time is debited as your machines run. This makes keeping to a weekly budget much easier.

One annoyance I ran into with FloydHub is that you can't customize the containers. If your code has a lot of dependencies, you have to install them at the start of every run, which limits iteration speed on short runs. You can work around it by creating a "dataset" containing the filesystem changes made by installing the dependencies (see e.g. create_floyd_base.sh), and then copying those files out of the dataset at the start of every run. It's awkward, but still better than fighting GPU drivers.

FloydHub is a little more expensive than Google Cloud: $1.20 per hour for a K80 GPU machine, versus about $0.85 per hour for a similarly specced machine on Google Cloud (less if you don't need a full 61 GB of RAM). Unless your budget is really tight, I think the extra convenience of FloydHub is worth the price. Google Cloud only becomes more cost-effective when you run a large number of computations in parallel, since you can pack several onto one large virtual machine.

(A third option is Google's new Colaboratory service, which essentially gives you a Jupyter notebook with free access to a K80 GPU machine. Don't be put off by the Jupyter part: you can execute arbitrary commands, and you can set up shell access if you really want it. The biggest drawbacks are that your code doesn't keep running if you close the browser window, and there's a limit on how long the container hosting the notebook runs before it's reset. So it's not suitable for long runs, but it's handy for quick prototyping on a GPU.)

The project used a total of:

    • 150 hours of GPU time and 7,700 hours (wall-clock time × cores) of CPU time on Google Compute Engine,
    • 292 hours of GPU time on FloydHub,
    • and 1,500 hours of CPU time (wall-clock time, 4 to 16 cores) on my university cluster.

I was dismayed to find that over the 8 months the project took, this added up to about $850 ($200 on FloydHub and $650 on Google Compute Engine).

Some of that is because I'm clumsy (see the slow-iteration discussion above). Some of it is because reinforcement learning is still so sample-inefficient that runs take a long time (up to 10 hours each time a Pong agent is trained).

But a big chunk of it came from a nasty surprise near the end of the project: reinforcement learning can be so unstable that you have to repeat runs with several different random seeds to be confident about performance.

For example, once I thought everything was basically done, I ran end-to-end tests in the environment. But even though I had been using the simplest environment the whole time, training a dot to move to the center of a square, I still ran into big problems. So I went back to FloydHub, made adjustments, and ran three replicas, and it turned out that hyperparameters I had thought were excellent only succeeded in one of the three runs.

Two out of three random seeds failing (red/blue) is not uncommon.

To give an intuitive sense of the amount of compute needed:

    • With A3C and 16 workers, Pong takes about 10 hours to train;
    • that's 160 hours of CPU time;
    • and running 3 random seeds takes 480 CPU-hours (20 days).

On the cost side:

    • FloydHub charges about $0.50 per hour for an 8-core machine;
    • so each 10-hour run costs about $5;
    • and running 3 random seeds at the same time costs about $15 per experiment.

That's like paying for three sandwiches every time you want to test an idea.

Again, as Deep Reinforcement Learning Doesn't Work Yet points out, this kind of instability seems to be considered normal and acceptable. In fact, even "five random seeds (a common reporting standard) may not be enough to demonstrate significant results, since with careful selection you can get non-overlapping confidence intervals."

(Suddenly, the $25,000 of AWS credits that the OpenAI Scholars program plans to offer doesn't look so crazy; that's probably about the amount you'd have to give someone so they didn't have to worry about compute at all.)

The point is: if you want to tackle a deep reinforcement learning project, make sure you know what you're getting into. Make sure you're prepared to spend the time and money it takes.

· · ·

Overall, reproducing a reinforcement learning paper was a fun hobby project. But looking back at which skills it actually built, I've also been wondering whether spending the past few months on a reproduction was really worth it.

On the one hand, I feel my machine-learning research skills didn't improve much (and in hindsight, that was actually my goal), while my ability to apply machine learning did. A considerable part of research seems to be producing lots of interesting, concrete ideas that make the time spent feel worthwhile, and producing interesting ideas seems to be a matter of (a) having a large vocabulary of available concepts and (b) having a keen nose for good ideas or directions (for example, what kind of work would be useful to the community). For those goals, I think a better exercise is to read influential papers and write summaries and critical analyses of them.

So the main conclusion I take from this project is that whether you want to improve your engineering skills or your research skills, it's worth thinking carefully about which one you're after. It's not that you can't do both; but if one of them is your weak point, it's better to look for a project aimed specifically at improving it.

If you want to improve both skills, it may be better to read a lot of papers, find one that interests you and that has clean code available, and try to implement or extend it.

· · ·

If you do want to take on a deep reinforcement learning project, here are some details to pay attention to.

Finding papers to reproduce

Look for papers with a relatively narrow focus; avoid papers that require several pieces of machinery to work together.

Reinforcement Learning

    • If you're doing a project that uses a reinforcement learning algorithm as one component of a larger system, don't try to write the reinforcement learning algorithm yourself. It's an interesting challenge and you can learn a lot, but reinforcement learning is unstable enough that you'll never be sure whether your system fails because your reinforcement learning algorithm has a bug or because the rest of your system has a bug.
    • Before doing anything else, check whether you can use an existing baseline implementation of the algorithm in your environment to make agent training easier.
    • Don't forget to normalize observations; they're likely to be used everywhere.
    • As soon as you think you have something working, write an end-to-end test. Successful training can be more fragile than you expect.
    • If you use OpenAI Gym environments, note that with the -v0 environments there's a 25% chance the current action is ignored and the previous action is repeated instead (to make the environment less deterministic). Use the -v4 environments if you don't want that much randomness. Also note that the default environments only give you every 4th frame from the emulator, consistent with the early DeepMind papers. If you don't want that, use the NoFrameskip environments; these are fully deterministic and render exactly what the emulator produces, for example PongNoFrameskip-v4. (A minimal usage sketch follows this list.)
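A hedged sketch using the classic gym API (with the Atari extras installed; later gym versions changed the reset/step signatures):

    import gym

    env = gym.make('PongNoFrameskip-v4')   # fully deterministic, every frame rendered
    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()          # random actions, just to exercise the environment
        obs, reward, done, info = env.step(action)
    env.close()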

General Machine Learning

    • End-to-end tests of training take a long time to run, so a big refactoring later will cost you a lot of time in reruns. Better to get things right the first time than to hack something together and plan to refactor it later.
    • When it takes 20 seconds just to initialize a module, wasting a run on something like a syntax error is a real headache. If you don't like IDEs, or you can only edit in a shell, it's worth taking the time to set up a linter for your editor. (For Vim, I like ALE with Pylint and Flake8. Although Flake8 is more of a style checker, it can catch problems Pylint misses, such as passing the wrong arguments to a function.) Either way, spend a little time on linter tooling so that stupid errors are caught before a run.
    • It's not only dropout you have to be careful with when implementing weight-sharing networks; batch normalization needs the same care. Don't forget that batch normalization keeps its normalization statistics in extra variables, and those need to be matched between the shared copies too.
    • Seeing memory spikes during runs? Your validation batch size might be too large.
    • If you see something strange happening when you use Adam as the optimizer, it could be due to Adam's momentum term. Try an optimizer without momentum, such as RMSprop, or disable the momentum by setting β1 = 0. (A sketch follows this list.)
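A hedged sketch (TF 1.x; the learning rate is just an example) of the two options, switching optimizer or zeroing Adam's momentum term:

    import tensorflow as tf

    optimizer_a = tf.train.RMSPropOptimizer(learning_rate=1e-4)          # no Adam-style momentum
    optimizer_b = tf.train.AdamOptimizer(learning_rate=1e-4, beta1=0.0)  # Adam with momentum disabled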

TensorFlow

    • If you want to see what's happening at some node in the middle of the graph, use tf.Print, which prints its input values every time the graph runs.
    • If you're saving checkpoints purely for inference, you can save a lot of space by leaving out the optimizer parameters.
    • session.run() has significant overhead. Try to batch your calls.
    • If you run multiple TensorFlow instances on the same machine and get GPU out-of-memory errors, it's most likely because one of the instances is trying to reserve all of the GPU memory, not because your model is too large. This is TensorFlow's default behavior; you need to tell TensorFlow to allocate memory only on demand; see the allow_growth option.
    • If you want to access one graph from many things running at once, for example from multiple processes, there appears to be a lock that only lets one process at a time do anything with it. This seems to be distinct from Python's global interpreter lock, which TensorFlow releases before doing heavy work. I'm not sure about the details and didn't have time to debug it thoroughly, but if you hit the same situation, it's easier to use multiple processes and replicate the graph in each process using distributed TensorFlow.
    • With Python you don't have to worry about integer overflow, but with TensorFlow you need to be extra careful:
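Here is my own hedged illustration of the point (TF 1.x; not the original example):

    import tensorflow as tf

    print(250 + 10)   # Python: 260, integers never overflow

    a = tf.constant(250, dtype=tf.uint8)
    b = tf.constant(10, dtype=tf.uint8)
    with tf.Session() as sess:
        print(sess.run(a + b))   # 4: the uint8 result wraps around silently instead of raising an error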

When you can't run everything on the GPU, be careful about falling back to the CPU with allow_soft_placement: if you accidentally write an op that can't run on the GPU, it will be moved to the CPU smoothly. For example:
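A hedged sketch (TF 1.x) of enabling the fallback:

    import tensorflow as tf

    config = tf.ConfigProto(allow_soft_placement=True)
    a = tf.matmul(tf.ones([2, 2]), tf.ones([2, 2]))
    with tf.Session(config=config) as sess:
        print(sess.run(a))  # placed on the GPU if possible, otherwise falls back to the CPU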

I don't know offhand how many ops like this can't run on the GPU, but to be safe, you can also switch them to the CPU manually, for example:
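A hedged sketch (TF 1.x; the op is just an example) of pinning part of the graph to the CPU with a device scope:

    import tensorflow as tf

    x = tf.random_uniform([8])
    with tf.device('/cpu:0'):
        # Ops created in this scope are placed on the CPU even when a GPU is available.
        idx = tf.where(x > 0.5)

    with tf.Session() as sess:
        print(sess.run(idx))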

A Healthy Mind

    • Don't get addicted to TensorBoard. I'm serious. It's a perfect example of addiction to unpredictable rewards: most of the time you check how your run is going and it's just chugging along, but every once in a while, when you check, you hit the jackpot! That's super exciting. If you feel the urge to check TensorBoard every few minutes, it's time to set yourself a rule about a reasonable checking interval.

· · ·

If you've made it all the way through this article, great!

If you also want to enter the field of deep reinforcement learning, here are some resources for you to refer to when you get started:

    • Andrej Karpathy's Deep Reinforcement Learning: Pong from Pixels is a good introduction for building motivation and intuition.
    • For more of the theory behind reinforcement learning, check out David Silver's lectures. They don't have much to do with deep reinforcement learning (reinforcement learning based on neural networks), but they at least teach you a lot of the vocabulary you'll need to understand the papers.
    • John Schulman's Nuts and Bolts of Deep RL talk has lots of practical recommendations for problems you'll run into later.

To find out what's going on in the field of deep reinforcement learning, take a look at these things:

    • Alex Irpan's Deep Reinforcement Learning Doesn't Work Yet gives a good overview of the current state of the field.
    • Vlad Mnih's Recent Advances and Frontiers in Deep RL has many practical examples of work addressing the problems raised in Alex's article.
    • Sergey Levine's Deep Robotic Learning talk focuses on improving robot generalization and sample efficiency.
    • Pieter Abbeel's NIPS keynote, Deep Learning for Robotics, mentions many of the latest deep reinforcement learning techniques.
