[Reinforcement Learning] Cross-entropy Method

The Cross-entropy Method (CEM) takes its name from cross-entropy, but it is not a gradient-based algorithm; rather, it is built on Monte Carlo sampling and evolution strategies. CEM can be used not only for estimation but also as an effective optimization algorithm; like evolutionary algorithms (EAs), it is completely gradient-free.

Here is Wikipedia's description of the cross-entropy method [1]:

The Cross-entropy (CE) method is a Monte Carlo method for importance sampling and optimization. It is applicable to both combinatorial and continuous problems, with either a static or noisy objective.

The iterative training process of the CEM algorithm can be divided into two stages:

    • Draw samples from the current sampling probability distribution;
    • Update the sampling distribution by minimizing the cross-entropy between it and the target probability distribution.

Importance Sampling

The problem CEM sets out to solve can be stated as follows: suppose we need to estimate the expectation of a function $H(x)$ (for example, the indicator of an event) under the density $f(x;u)$:

$$\mathbb{E}_{u}[H(x)]=\int H(x)\,f(x;u)\,dx$$

The simplest approach is naïve Monte Carlo sampling: draw samples $x_{i}$ from the true probability density function $f(x;u)$ and estimate the expectation by averaging:

$$\mathbb{E}_{u}[H(x)]\approx\frac{1}{N}\sum_{i=1}^{N}H(x_{i})$$

But if $H(x)$ is the indicator of a rare event, naïve Monte Carlo simulation needs an enormous number of samples to estimate the expectation accurately. To address this problem, the CEM algorithm introduces importance sampling.

The main idea of importance sampling [2] is as follows:

Instead of sampling from $f(x;u)$ directly, first draw samples from a proposal distribution $f(x;v)$ that is close to the target distribution (where $v$ is called the reference parameter); the estimator then becomes:

$$\mathbb{E}_{u}[H(x)]\approx\frac{1}{N}\sum_{i=1}^{N}H(x_{i})\frac{f(x_{i};u)}{f(x_{i};v)},\qquad x_{i}\sim f(x;v)$$

The goal now becomes finding an optimal sampling density $f(x;v^{*})$ that allows the expectation to be estimated accurately from only a few samples. In each iteration, CEM selects the better samples (the elite samples) and uses them to update the parameter $v$ of the sampling density, shrinking the gap, measured by the KL divergence (relative entropy), between the current sampling density $f(x;v)$ and the optimal one $f(x;v^{*})$. (Note: the original paper ends up minimizing only the cross-entropy term of the KL divergence, hence the name CEM.)
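
To make the weighted estimator concrete, here is a small sketch (my own illustration, not from the original post): estimating the rare-event probability $P(X>4)$ for $X\sim\mathcal{N}(0,1)$ using the proposal $\mathcal{N}(4,1)$ centered on the rare region. Here $H(x)$ is the indicator of $x>4$:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    N = 10_000

    # Naive Monte Carlo: almost no sample lands in the rare region x > 4,
    # so the estimate is extremely noisy (often exactly 0).
    x_naive = rng.normal(0.0, 1.0, N)
    p_naive = np.mean(x_naive > 4.0)

    # Importance sampling: draw from the proposal f(x; v) = N(4, 1), then
    # weight each sample by the likelihood ratio f(x; u) / f(x; v).
    x_is = rng.normal(4.0, 1.0, N)
    w = norm.pdf(x_is, loc=0.0, scale=1.0) / norm.pdf(x_is, loc=4.0, scale=1.0)
    p_is = np.mean((x_is > 4.0) * w)

    print(p_naive, p_is, norm.sf(4.0))   # exact value is about 3.17e-5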

Pseudo-code of CEM

Below is CEM pseudo-code, in Python form, that uses a Gaussian distribution as the sampling probability distribution:

    import numpy as np

    def cem(evaluate, mu, sigma, max_its, N, Ne, epsilon):
        # Step 1: initialization.
        # mu, sigma: parameters of the Gaussian sampling distribution;
        # max_its: maximum number of iterations; N: number of samples per
        # iteration; Ne: number of elite samples; epsilon: threshold on the
        # sampling variance used as the stopping criterion.
        for t in range(max_its):
            # Step 2: random sampling from the Gaussian distribution.
            X = np.random.normal(mu, sigma, N)
            # Step 3: evaluate the samples.
            S = evaluate(X)
            # Step 4: importance sampling -- refit the distribution to the
            # Ne elite samples (the samples with the highest scores).
            elite = X[np.argsort(S)[-Ne:]]
            mu = elite.mean()
            sigma = elite.std()
            if sigma < epsilon:   # sampling variance has collapsed: converged
                break
        # Step 5: return the mean of the elite samples.
        return mu
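
As a quick sanity check (my own illustrative example, not part of the original post), maximizing the hypothetical objective $-(x-3)^2$ should drive the returned mean toward 3:

    # Hypothetical usage: maximize -(x - 3)^2, whose optimum is at x = 3.
    best = cem(lambda X: -(X - 3.0) ** 2, mu=0.0, sigma=5.0,
               max_its=100, N=100, Ne=10, epsilon=1e-3)
    print(best)   # prints a value close to 3.0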

CEM and RL

Note: the following is quoted from the blog "Evolution Strategy Optimization Algorithm CEM (Cross-Entropy Method)" [3].

CEM can also be used to solve Markov decision processes, i.e., reinforcement learning problems. Reinforcement learning can be viewed as a dynamic programming process: selecting an action in a state is like selecting an edge at a node, so the whole process is a path-planning problem from the initial state to the terminal state in which we want the path that maximizes the return. With this in mind, we can model the problem with CEM: let a complete trajectory be one sample $x=(s_0,a_0,s_1,a_1,\dots,s_n,a_n)$ with total return $S(x)=\sum_{i=0}^{n} r(s_i,a_i)$; the goal is to maximize $S(x)$. How do we draw such samples? We can construct a matrix $P$ whose rows correspond to states and whose columns correspond to actions, so that $P_{ij}$ is the probability of taking action $a_j$ in state $s_i$. By repeatedly sampling trajectories from $P$, selecting the samples with higher $S(x)$ to update $P$, and iterating, we eventually find the optimal matrix $\hat{P}$.
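
A minimal sketch of this scheme (the chain MDP below is a hypothetical toy environment of my own, used only to make the sampling-and-refitting loop concrete):

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, horizon = 5, 2, 10

    def rollout(P):
        # Sample one trajectory from the policy matrix P; in this toy chain
        # MDP, action 1 moves right toward the rewarding last state.
        s, total, visited = 0, 0.0, []
        for _ in range(horizon):
            a = rng.choice(n_actions, p=P[s])   # row s of P is a distribution over actions
            visited.append((s, a))
            s = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            total += 1.0 if s == n_states - 1 else -0.1
        return visited, total

    P = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform initial policy
    N, Ne = 100, 10
    for _ in range(30):
        samples = [rollout(P) for _ in range(N)]
        elites = sorted(samples, key=lambda sr: sr[1])[-Ne:]   # highest-return paths
        counts = np.ones((n_states, n_actions))   # add-one smoothing keeps P positive
        for visited, _ in elites:
            for s, a in visited:
                counts[s, a] += 1
        P = counts / counts.sum(axis=1, keepdims=True)   # refit row distributions

    print(np.round(P, 2))   # action 1 should dominate in every state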

This is a reinforcement learning approach similar to policy iteration: the matrix $P$ gives the probability of each action at each step and thus constitutes a policy, but the parameter update uses no gradients. From another point of view, it can also be regarded as a value-iteration method: $P$ then plays the role of the $Q$ matrix in classic Q-learning. The difference is that the element in row $i$, column $j$ of the $Q$ matrix represents the expected future return of taking action $a_j$ in state $s_i$ and is updated via the Bellman equation, whereas $P$ stores probability values and is updated via cross-entropy.
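
For contrast, here is a single tabular Q-learning step (the standard Bellman update, shown only to highlight the difference; the transition values are made up):

    # One tabular Q-learning update for comparison: Q[s, a] stores an
    # expected-return estimate, not a probability, and is moved toward the
    # Bellman target r + gamma * max_a' Q[s', a'].
    import numpy as np

    alpha, gamma = 0.1, 0.9             # learning rate and discount factor
    Q = np.zeros((5, 2))                # same 5-state, 2-action layout as above
    s, a, r, s_next = 0, 1, -0.1, 1     # one observed transition (illustrative)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])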

References

[1] Wikipedia: Cross-entropy method

[2] Wikipedia: Importance sampling

[3] Blog: Evolution Strategy Optimization Algorithm CEM (Cross-Entropy Method)
