OpenAI Gym Learning

Observations

The previous post introduced OpenAI Gym's CartPole (inverted pendulum) demo. If you want to do better than taking a random action at each step, it helps to actually understand how actions affect the environment.
The environment's step function returns exactly the information we need. step returns four values: observation, reward, done, and info. Specifically:

Observation (object): an environment-specific object describing your observation of the environment, such as camera pixel data, a robot's joint angles and angular velocities, or the board state in a board game.
Reward (float): the amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase the total reward.
Done (boolean): whether it is time to reset the environment. Most tasks are divided into well-defined episodes, and done being True indicates the episode has terminated.
Info (dict): diagnostic information useful for debugging. It can occasionally be useful for learning, but official evaluations of an agent are not allowed to use this information for learning.

This is a typical implementation of the agent-environment loop: at each timestep, the agent chooses an action, and the environment returns an observation and a reward.
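As a minimal illustration of one turn of that loop (a hedged sketch using the classic Gym API and the CartPole environment used throughout this post):

import gym

env = gym.make('CartPole-v0')
observation = env.reset()                  # reset returns the initial observation
action = env.action_space.sample()         # the agent picks an action (random here)
observation, reward, done, info = env.step(action)  # the environment responds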

The process starts by calling reset, which returns an initial observation. So a more proper way to write the code from the previous post is to respect the done flag:

import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print('Episode finished after {} timesteps'.format(t + 1))
            break

When done is True, the controller has failed and the episode ends. The return of an episode can be measured by how long it lasts, i.e. t+1 timesteps: the longer the pole stays up, the higher the return. With the random action choice above, the average return is about 20.
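To check that figure, here is a hedged sketch (classic Gym API, rendering omitted) that estimates the average episode length of the random policy; the count of 100 episodes is an arbitrary choice:

import gym

env = gym.make('CartPole-v0')
lengths = []
for _ in range(100):
    env.reset()
    done = False
    t = 0
    while not done:
        _, reward, done, _ = env.step(env.action_space.sample())
        t += 1
    lengths.append(t)
print(sum(lengths) / len(lengths))  # roughly 20 for CartPole-v0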

[ 0.00753165  0.8075176  -0.15841931 -1.63740717]
[ 0.023682    1.00410306 -0.19116745 -1.97497356]
Episode finished after ... timesteps
[-0.01027234 -0.00503277  0.01774634  0.01849733]
[-0.01037299 -0.20040467  0.01811628  0.31672619]
[-0.01438109 -0.00554538  0.02445081  0.02981111]
[-0.01449199  0.18921755  0.02504703 -0.25505814]
[-0.01070764  0.38397309  0.01994587 -0.53973677]
[-0.00302818  0.57880906  0.00915113 -0.8260689 ]
[ 0.008548    0.77380468 -0.00737025 -1.11585968]
[ 0.02402409  0.9690226  -0.02968744 -1.41084543]
[ 0.04340455  1.16449982 -0.05790435 -1.71265888]
[ 0.06669454  1.36023677 -0.09215753 -2.0227866 ]
[ 0.09389928  1.55618414 -0.13261326 -2.34251638]
[ 0.12502296  1.75222707 -0.17946359 -2.67287294]
Episode finished after ... timesteps
Spaces

In the examples above, we have been sampling random actions from the environment's action space. But what are those actions, exactly? Every environment comes with first-class Space objects that describe its valid actions and observations:

import gym
env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)
print(env.observation_space)
#> Box(4,)

The Discrete space allows a fixed range of non-negative integers, so in this case the valid actions are 0 and 1. The Box space represents an n-dimensional box, so valid observations here are arrays of 4 numbers. We can also check the Box's bounds:

print(env.observation_space.high)
#> array([ 2.4       ,         inf,  0.20943951,         inf])
print(env.observation_space.low)
#> array([-2.4       ,        -inf, -0.20943951,        -inf])

This kind of introspection can help you write generic code that works for many different environments. Box and Discrete are the most common Spaces; you can sample from a Space or check that something belongs to it:

from gym import spaces
space = spaces.Discrete(8)  # set with 8 elements {0, 1, 2, ..., 7}
x = space.sample()
assert space.contains(x)
assert space.n == 8
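Building on this, here is a hedged sketch of what such introspection-driven generic code might look like; describe_space is an illustrative helper of my own, not part of Gym:

import gym
from gym import spaces

def describe_space(space):
    # Dispatch on the concrete Space type, so the same code
    # works whatever environment produced the space.
    if isinstance(space, spaces.Discrete):
        return 'Discrete with {} actions'.format(space.n)
    if isinstance(space, spaces.Box):
        return 'Box of shape {}'.format(space.shape)
    return str(space)

env = gym.make('CartPole-v0')
print(describe_space(env.action_space))       # Discrete with 2 actions
print(describe_space(env.observation_space))  # Box of shape (4,)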

For CartPole-v0, one of the two actions applies force to the cart to the left, and the other applies force to the right.
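A quick way to see this for yourself (a hedged sketch that assumes the standard CartPole observation layout, where index 1 is the cart velocity):

import gym

env = gym.make('CartPole-v0')
env.reset()
obs, _, _, _ = env.step(0)  # action 0: push the cart to the left
print(obs[1])               # cart velocity, negative after a left push
env.reset()
obs, _, _, _ = env.step(1)  # action 1: push the cart to the right
print(obs[1])               # positive after a right push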

Environments

Gym's main purpose is to provide a large collection of environments that expose a common interface and are versioned so that comparisons remain meaningful. To see which environments your installation provides:

from gym import envs
print(envs.registry.all())
#> [EnvSpec(PredictActionsCartpole-v0), EnvSpec(Asteroids-ramDeterministic-v0), EnvSpec(Asteroids-ramDeterministic-v3), EnvSpec(Gopher-ramDeterministic-v3), EnvSpec(Gopher-ramDeterministic-v0), EnvSpec(DoubleDunk-ramDeterministic-v3), EnvSpec(DoubleDunk-ramDeterministic-v0), EnvSpec(Tennis-ramNoFrameskip-v3), EnvSpec(RoadRunner-ramDeterministic-v0), EnvSpec(Robotank-ram-v3), EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(Gopher-ram-v3), EnvSpec(Gopher-ram-v0), ...
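If the full specs are too noisy, a hedged sketch that prints only the registered ids (EnvSpec objects carry an id attribute in classic Gym):

from gym import envs

for spec in envs.registry.all():
    print(spec.id)  # e.g. CartPole-v0, CartPole-v1, ...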
