"This notebooks is part of [AI for Beginners Curriculum](http://aka.ms/ai-beginners). It has been inspired by [this blog post](https://medium.com/swlh/policy-gradient-reinforcement-learning-with-keras-57ca6ed32555), [official TensorFlow documentation](https://www.tensorflow.org/tutorials/reinforcement_learning/actor_critic) and [this Keras RL example](https://keras.io/examples/rl/actor_critic_cartpole/).\n",
"\n",
"In this example, we will use RL to train a model to balance a pole on a cart that can move left and right on horizontal scale. We will use [OpenAI Gym](https://www.gymlibrary.ml/) environment to simulate the pole.\n",
"\n",
"> **Note**: You can run this lesson's code locally (eg. from Visual Studio Code), in which case the simulation will open in a new window. When running the code online, you may need to make some tweaks to the code, as described [here](https://towardsdatascience.com/rendering-openai-gym-envs-on-binder-and-google-colab-536f99391cc7).\n",
"\n",
"We will start by making sure Gym is installed:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Defaulting to user installation because normal site-packages is not writeable\n",
"Requirement already satisfied: gym in /home/leo/.local/lib/python3.10/site-packages (0.25.0)\n",
"/home/leo/.local/lib/python3.10/site-packages/gym/core.py:329: DeprecationWarning: \u001b[33mWARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.\u001b[0m\n",
" deprecation(\n",
"/home/leo/.local/lib/python3.10/site-packages/gym/wrappers/step_api_compatibility.py:39: DeprecationWarning: \u001b[33mWARN: Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.\u001b[0m\n",
"Let's see how the simulation works. The following loop runs the simulation, until `env.step` does not return the termination flag `done`. We will randomly chose actions using `env.action_space.sample()`, which means the experiment will probably fail very fast (CartPole environment terminates when the speed of CartPole, its position or angle are outside certain limits).\n",
"\n",
"> Simulation will open in the new window. You can run the code several times and see how it behaves."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/leo/.local/lib/python3.10/site-packages/gym/core.py:57: DeprecationWarning: \u001b[33mWARN: You are calling render method, but you didn't specified the argument render_mode at environment initialization. To maintain backward compatibility, the environment will render in human mode.\n",
"If you want to render in human mode, initialize the environment in this way: gym.make('EnvName', render_mode='human') and don't call the render method.\n",
"See here for more information: https://www.gymlibrary.ml/content/api/\u001b[0m\n",
" obs, rew, done, info = env.step(env.action_space.sample())\n",
" total_reward += rew\n",
" print(f\"{obs} -> {rew}\")\n",
"print(f\"Total reward: {total_reward}\")\n",
"\n",
"env.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Youn can notice that observations contain 4 numbers. They are:\n",
"- Position of cart\n",
"- Velocity of cart\n",
"- Angle of pole\n",
"- Rotation rate of pole\n",
"\n",
"`rew` is the reward we receive at each step. You can see that in CartPole environment you are rewarded 1 point for each simulation step, and the goal is to maximize total reward, i.e. the time CartPole is able to balance without falling.\n",
"\n",
"During reinforcement learning, our goal is to train a **policy** $\\pi$, that for each state $s$ will tell us which action $a$ to take, so essentially $a = \\pi(s)$.\n",
"\n",
"If you want probabilistic solution, you can think of policy as returning a set of probabilities for each action, i.e. $\\pi(a|s)$ would mean a probability that we should take action $a$ at state $s$.\n",
"\n",
"## Policy Gradient Method\n",
"\n",
"In simplest RL algorithm, called **Policy Gradient**, we will train a neural network to predict the next action."
"We will train the network by running many experiments, and updating our network after each run. Let's define a function that will run the experiment and return the results (so-called **trace**) - all states, actions (and their recommended probabilities), and rewards:"
"You can run one episode with untrained network and observe that total reward (AKA length of episode) is very low:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total reward: 24.0\n"
]
}
],
"source": [
"s,a,p,r = run_episode()\n",
"print(f\"Total reward: {np.sum(r)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the tricky aspects of policy gradient algorithm is to use **discounted rewards**. The idea is that we compute the vector of total rewards at each step of the game, and during this process we discount the early rewards using some coefficient $gamma$. We also normalize the resulting vector, because we will use it as weight to affect our training: "
"Now let's do the actual training! We will run 300 episodes, and at each episode we will do the following:\n",
"\n",
"1. Run the experiment and collect the trace\n",
"1. Calculate the difference (`gradients`) between the actions taken, and by predicted probabilities. The less the difference is, the more we are sure that we have taken the right action.\n",
"1. Calculate discounted rewards and multiply gradients by discounted rewards - that will make sure that steps with higher rewards will make more effect on the final result than lower-rewarded ones\n",
"1. Expected target actions for our neural network would be partly taken from the predicted probabilities during the run, and partly from calculated gradients. We will use `alpha` parameter to determine to which extent gradients and rewards are taken into account - this is called *learning rate* of reinforcement algorithm.\n",
"1. Finally, we train our network on states and expected actions, and repeat the process "
" # loss = train_on_batch4(states,actions,probs,rewards)\n",
" train_on_batch(states,target)\n",
" history.append(np.sum(rewards))\n",
" if epoch%100==0:\n",
" print(f\"{epoch} -> {np.sum(rewards)}\")\n",
"\n",
"plt.plot(history)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's run the episode with rendering to see the result:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/leo/.local/lib/python3.10/site-packages/gym/core.py:57: DeprecationWarning: \u001b[33mWARN: You are calling render method, but you didn't specified the argument render_mode at environment initialization. To maintain backward compatibility, the environment will render in human mode.\n",
"If you want to render in human mode, initialize the environment in this way: gym.make('EnvName', render_mode='human') and don't call the render method.\n",
"See here for more information: https://www.gymlibrary.ml/content/api/\u001b[0m\n",
"Hopefully, you can see that pole can now balance pretty well!\n",
"\n",
"## Actor-Critic Model\n",
"\n",
"Actor-Critic model is the further development of policy gradients, in which we build a neural network to learn both the policy and estimated rewards. The network will have two outputs (or you can view it as two separate networks):\n",
"* **Actor** will recommend the action to take by giving us the state probability distribution, as in policy gradient model\n",
"* **Critic** would estimate what the reward would be from those actions. It returns total estimated rewards in the future at the given state.\n",
"Now we will run the main training loop. We will use manual network training process by computing proper loss functions and updating network parameters:"
" if running_reward > 195: # Condition to consider the task solved\n",
" print(\"Solved at episode {}!\".format(episode_count))\n",
" break\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's run the episode and see how good our model is:"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [],
"source": [
"_ = run_episode(render=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's close the environment."
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [],
"source": [
"env.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Takeaway\n",
"\n",
"We have seen two RL algorithms in this demo: simple policy gradient, and more sophisticated actor-critic. You can see that those algorithms operate with abstract notions of state, action and reward - thus they can be applied to very different environments.\n",
"\n",
"Reinforcement learning allows us to learn the best strategy to solve the problem just by looking at the final reward. The fact that we do not need labelled datasets allows us to repeat simulations many times to optimize our models. However, there are still many challenges in RL, which you may learn if you decide to focus more on this interesting area of AI. "