Add Reinforcement Learning lesson 2

pull/38/head
Dmitri Soshnikov 3 years ago
parent b8449713fa
commit 0dd1d6b84a

@@ -1,55 +1,280 @@
# CartPole Skating
The problem we solved in the previous lesson might seem like a toy problem, not really applicable to real-life scenarios. This is not the case, because many real-world problems share the same setting - including playing chess or Go. They are similar because we also have a board with given rules and a **discrete state**.
In this lesson we will apply the same principles of Q-Learning to a problem with **continuous state**, i.e. a state that is given by one or more real numbers. We will deal with the following problem:
## [Pre-lecture quiz](link-to-quiz-app) 45
> **Problem**: If Peter wants to escape from the wolf, he needs to be able to move faster than the wolf. We will see how Peter can learn to skate, in particular, to keep his balance, using Q-Learning.
We will use a simplified version of balancing known as the **CartPole** problem. In the CartPole world, we have a horizontal slider that can move left or right, and the goal is to balance a pole standing on top of it.
<img src="images/cartpole.png" width="200"/>
## Prerequisites
In this lesson, we will be using a library called **OpenAI Gym** to simulate different **environments**. It is preferred to run this lesson's code locally (e.g. from Visual Studio Code), in which case the simulation will open in a new window. When running the code online, you may need to make some tweaks to the code, as described [here](https://towardsdatascience.com/rendering-openai-gym-envs-on-binder-and-google-colab-536f99391cc7).
## OpenAI Gym
In the previous lesson, the rules of the game and the state were given by the `Board` class, which we defined ourselves. Here we will use a special **simulation environment**, which will simulate the physics behind the balancing pole. One of the most popular simulation environments for training Reinforcement Learning algorithms is called [Gym](https://gym.openai.com/), which is maintained by [OpenAI](https://openai.com/). By using Gym we can create different **environments**: from cartpole simulation to Atari games.
> **Note**: You can see other environments available from OpenAI Gym [here](https://gym.openai.com/envs/#classic_control).
First, let's install Gym and import the required libraries:
```python
import sys
!{sys.executable} -m pip install gym

import gym
import matplotlib.pyplot as plt
import numpy as np
import random
```
## CartPole Environment
To work with the CartPole balancing problem, we need to initialize the corresponding environment. Each environment is associated with:
* **Observation space** that defines the structure of information that we receive from the environment. For the cartpole problem, we receive the position of the cart, its velocity, and some other values.
* **Action space** that defines possible actions. In our case the action space is discrete, and consists of two actions - **left** and **right**.
```python
env = gym.make("CartPole-v1")
print(env.action_space)
print(env.observation_space)
print(env.action_space.sample())
```
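In the accompanying notebook, this prints something like the following (the sampled action will vary):
```text
Discrete(2)
Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
1
```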
To see how the environment works, let's run a short simulation for 100 steps. At each step, we provide one of the actions to be taken - in this simulation we just randomly select an action from `action_space`. Run the code below and see what it leads to.
> **Note**: Remember that it is preferred to run this code on local Python installation!
```python
env.reset()

for i in range(100):
    env.render()
    env.step(env.action_space.sample())
env.close()
```
You should see something similar to this:
![](images/cartpole-nobalance.gif)
During simulation, we need to get observations in order to decide how to act. In fact, the `step` function returns the current observation, a reward value, and the `done` flag that indicates whether it makes sense to continue the simulation or not:
```python
env.reset()

done = False
while not done:
    env.render()
    obs, rew, done, info = env.step(env.action_space.sample())
    print(f"{obs} -> {rew}")
env.close()
```
You will end up seeing something like this in the notebook output:
```text
[ 0.03403272 -0.24301182 0.02669811 0.2895829 ] -> 1.0
[ 0.02917248 -0.04828055 0.03248977 0.00543839] -> 1.0
[ 0.02820687 0.14636075 0.03259854 -0.27681916] -> 1.0
[ 0.03113408 0.34100283 0.02706215 -0.55904489] -> 1.0
[ 0.03795414 0.53573468 0.01588125 -0.84308041] -> 1.0
...
[ 0.17299878 0.15868546 -0.20754175 -0.55975453] -> 1.0
[ 0.17617249 0.35602306 -0.21873684 -0.90998894] -> 1.0
```
The observation vector that is returned at each step of the simulation contains the following values:
* Position of cart
* Velocity of cart
* Angle of pole
* Rotation rate of pole
We can get the min and max values of those numbers:
```python
print(env.observation_space.low)
print(env.observation_space.high)
```
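The output (from a notebook run) shows that cart velocity and pole rotation rate are effectively unbounded:
```text
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
```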
You may also notice that the reward value on each simulation step is always 1. This is because our goal is to survive as long as possible, i.e. keep the pole in a reasonably vertical position for the longest period of time.
> In fact, the CartPole simulation is considered solved if we manage to get an average reward of 195 over 100 consecutive trials.
## State Discretization
In Q-Learning, we need to build a Q-Table that defines what to do at each state. To be able to do this, we need the state to be **discrete**; more precisely, it should contain a finite number of discrete values. Thus, we need to somehow **discretize** our observations, mapping them to a finite set of states.
There are a few ways we can do this:
* If we know the interval of a certain value, we can divide this interval into a number of **bins**, and then replace the value by the number of the bin it belongs to. This can be done using the numpy [`digitize`](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) method. In this case, we will know the state size precisely, because it will depend on the number of bins we select for discretization.
* We can use linear interpolation to bring values to some finite interval (say, from -20 to 20), and then convert numbers to integers by rounding them. This gives us a bit less control over the size of the state, especially if we do not know the exact ranges of input values. For example, in our case 2 out of 4 values do not have upper/lower bounds on their values, which may result in an infinite number of states.
In our example, we will go with the second approach. As you may notice later, despite the undefined upper/lower bounds, those variables rarely take values outside of certain finite intervals, so states with extreme values will be very rare.
Here is a function that takes the observation from our model and produces a tuple of 4 integer values:
```python
def discretize(x):
    return tuple((x/np.array([0.25, 0.25, 0.01, 0.1])).astype(np.int))
```
Let's also explore the other discretization method, using bins:
```python
def create_bins(i,num):
    return np.arange(num+1)*(i[1]-i[0])/num+i[0]

print("Sample bins for interval (-5,5) with 10 bins\n",create_bins((-5,5),10))

ints = [(-5,5),(-2,2),(-0.5,0.5),(-2,2)] # intervals of values for each parameter
nbins = [20,20,10,10] # number of bins for each parameter
bins = [create_bins(ints[i],nbins[i]) for i in range(4)]

def discretize_bins(x):
    return tuple(np.digitize(x[i],bins[i]) for i in range(4))
```
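The `print` call above produces:
```text
Sample bins for interval (-5,5) with 10 bins
 [-5. -4. -3. -2. -1.  0.  1.  2.  3.  4.  5.]
```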
Let's now run a short simulation and observe those discrete environment values. Feel free to try both `discretize` and `discretize_bins` and see if there is a difference.
> **Note**: `discretize_bins` returns the bin number, which is 0-based, thus for values of input variable around 0 it returns the number from the middle of the interval (10). In `discretize`, we did not care about the range of output values, allowing them to be negative, thus the state values are not shifted, and 0 corresponds to 0.
```python
env.reset()

done = False
while not done:
    #env.render()
    obs, rew, done, info = env.step(env.action_space.sample())
    #print(discretize_bins(obs))
    print(discretize(obs))
env.close()
```
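A sample run from the notebook produces output like this (your values will differ):
```text
(0, 0, 3, 3)
(0, 0, 4, 0)
(0, 0, 4, 3)
...
(0, -3, 19, 17)
(0, -4, 23, 20)
```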
> **Note**: Uncomment the line starting with `env.render` if you want to see how the environment executes. Otherwise you can execute it in the background, which is faster. We will use this "invisible" execution during our Q-Learning process.
## Q-Table Structure
In our previous lesson, the state was a simple pair of numbers from 0 to 8, and thus it was convenient to represent the Q-Table by a numpy tensor with shape 8x8x2. If we use bins discretization, the size of our state vector is also known, so we can use the same approach and represent the state by an array of shape 20x20x10x10x2 (here 2 is the dimension of the action space, and the first dimensions correspond to the number of bins we have selected to use for each of the parameters in the observation space).
However, sometimes the precise dimensions of the observation space are not known. In the case of the `discretize` function, we may never be sure that our state stays within certain limits, because some of the original values are not bounded. Thus, we will use a slightly different approach and represent the Q-Table by a dictionary. We will use the pair *(state,action)* as the dictionary key, and the value will correspond to the Q-Table entry value.
```python
Q = {}
actions = (0,1)

def qvalues(state):
    return [Q.get((state,a),0) for a in actions]
```
Here we also define a function `qvalues`, which returns a list of Q-Table values for a given state that correspond to all possible actions. If the entry is not present in the Q-Table, we will return 0 as the default.
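As a quick illustration (not part of the lesson code), you can check that unseen states fall back to the default values, assuming `env` and `discretize` are already defined as above:
```python
# Before training, any discretized state that is not yet in the dictionary
# simply returns the default Q-values of 0 for both actions.
s = discretize(env.reset())
print(qvalues(s))   # -> [0, 0]
# During training, entries will be stored as Q[(state, action)] = value.
```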
## Let's Start Q-Learning!
Now we are ready to teach Peter to balance! First, let's set some hyperparameters:
```python
# hyperparameters
alpha = 0.3
gamma = 0.9
epsilon = 0.90
```
Here, `alpha` is the **learning rate** that defines to what extent we should adjust the current values of the Q-Table at each step. In the previous lesson we started with 1 and then decreased `alpha` to lower values during training. In this example we will keep it constant just for simplicity, and you can experiment with adjusting `alpha` values later.
`gamma` is the **discount factor** that shows to which extent we should prioritize future reward over current reward.
`epsilon` is the **exploration/exploitation factor** that determines whether we should prefer exploration to exploitation or vice versa. In our algorithm, we will select the next action according to Q-Table values with probability `epsilon`, and in the remaining cases we will execute a random action. This will allow us to explore areas of the search space that we have never seen before.
> In terms of balancing - choosing random action (exploration) would act as a random punch in the wrong direction, and the pole would have to learn how to recover the balance from those "mistakes"
We will also make two improvements to our algorithm from the previous lesson:
* Calculating average cumulative reward over a number of simulations. We will print the progress every 5000 iterations, and we will average out our cumulative reward over that period of time. It means that if we get more than 195 points - we can consider the problem solved, with even higher quality than required.
* We will calculate maximum average cumulative result `Qmax`, and we will store the Q-Table corresponding to that result. When you run the training you will notice that sometimes the average cumulative result starts to drop, and we want to keep the values of Q-Table that correspond to the best model observed during training.
We will also collect all cumulative rewards for each simulation in the `rewards` vector for further plotting.
```python
def probs(v,eps=1e-4):
    v = v-v.min()+eps
    v = v/v.sum()
    return v

Qmax = 0
cum_rewards = []
rewards = []
for epoch in range(100000):
    obs = env.reset()
    done = False
    cum_reward=0
    # == do the simulation ==
    while not done:
        s = discretize(obs)
        if random.random()<epsilon:
            # exploitation - choose the action according to Q-Table probabilities
            v = probs(np.array(qvalues(s)))
            a = random.choices(actions,weights=v)[0]
        else:
            # exploration - randomly choose the action
            a = np.random.randint(env.action_space.n)

        obs, rew, done, info = env.step(a)
        cum_reward+=rew
        ns = discretize(obs)
        Q[(s,a)] = (1 - alpha) * Q.get((s,a),0) + alpha * (rew + gamma * max(qvalues(ns)))
    cum_rewards.append(cum_reward)
    rewards.append(cum_reward)
    # == Periodically print results and calculate average reward ==
    if epoch%5000==0:
        print(f"{epoch}: {np.average(cum_rewards)}, alpha={alpha}, epsilon={epsilon}")
        if np.average(cum_rewards) > Qmax:
            Qmax = np.average(cum_rewards)
            Qbest = Q
        cum_rewards=[]
```
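A sample run from the accompanying notebook printed progress like this (your numbers will differ from run to run):
```text
0: 36.0, alpha=0.3, epsilon=0.9
5000: 79.9322, alpha=0.3, epsilon=0.9
10000: 136.5074, alpha=0.3, epsilon=0.9
...
90000: 330.4744, alpha=0.3, epsilon=0.9
95000: 309.4724, alpha=0.3, epsilon=0.9
```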
What you may notice from those results:
* We are very close to achieving the goal of getting 195 cumulative reward over 100+ consecutive runs of the simulation, or we may have actually achieved it! Even if we get smaller numbers, we still do not know, because we average over 5000 runs, and only 100 runs are required by the formal criteria.
* Sometimes the reward starts to drop, which means that we can "destroy" already learned values in the Q-Table with ones that make the situation worse
To make learning more stable, it makes sense to adjust some of our hyperparameters during training. In particular:
* For **learning rate**, `alpha`, we may start with values close to 1, and then keep decreasing the parameter. With time, we will be getting good probability values in Q-Table, and thus we should be adjusting them slightly, and not overwriting completely with new values.
* We may want to increase `epsilon` slowly, in order to explore less and exploit more. It probably makes sense to start with a lower value of `epsilon` and move up to almost 1; a possible schedule is sketched below.
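One simple way to do this (a sketch with arbitrary, hypothetical decay values, not part of the lesson code) is to adjust both parameters once per reporting interval, for example inside the `if epoch%5000==0:` block:
```python
# Hypothetical schedules: decay alpha towards a floor, grow epsilon towards a cap.
# Inside the training loop you could call: alpha, epsilon = adjust_hyperparameters(alpha, epsilon)
def adjust_hyperparameters(alpha, epsilon, alpha_min=0.1, epsilon_max=0.99):
    alpha = max(alpha_min, alpha * 0.95)        # trust accumulated Q-values more over time
    epsilon = min(epsilon_max, epsilon + 0.01)  # exploit more, explore less
    return alpha, epsilon
```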
> **Task 1**: Play with hyperparameter values and see if you can achieve higher cumulative reward. Are you getting above 195?
> **Task 2**: To formally solve the problem, you need to get 195 average reward across 100 consecutive runs. Measure that during training and make sure that you have formally solved the problem!
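One possible way to measure this (a sketch, assuming the `rewards` list built in the training loop above) is to check the mean of the last 100 episodes after each one, inside the training loop:
```python
# Sketch: the formal "solved" criterion over the last 100 consecutive episodes.
if len(rewards) >= 100 and np.mean(rewards[-100:]) >= 195:
    print(f"Solved: average reward over the last 100 episodes is {np.mean(rewards[-100:]):.1f}")
```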
## Seeing the Result in Action
Now it would be interesting to actually see how the trained model behaves. Let's run the simulation, and we will be following the same action selection strategy as during training: sampling according to the probability distribution in Q-Table:
```python
obs = env.reset()
done = False
while not done:
    s = discretize(obs)
    env.render()
    v = probs(np.array(qvalues(s)))
    a = random.choices(actions,weights=v)[0]
    obs,_,done,_ = env.step(a)
env.close()
```
You should see something like this:
![](images/cartpole-balance.gif)
> **Task 3**: Here, we were using the final copy of the Q-Table, which may not be the best one. Remember that we have stored the best-performing Q-Table in the `Qbest` variable! Try the same example with the best-performing Q-Table by copying `Qbest` over to `Q` and see if you notice the difference.
> **Task 4**: Here we were not selecting the best action on each step, but rather sampling with the corresponding probability distribution. Would it make more sense to always select the best action, with the highest Q-Table value? This can be easily done by using the `np.argmax` function to find out the action number corresponding to the highest Q-Table value. Implement this strategy and see if it improves the balancing.
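A minimal sketch of the greedy strategy from Task 4, replacing the sampling lines inside the simulation loop above:
```python
# Instead of sampling from the probability distribution, always take the
# action with the highest Q-value for the current discretized state.
a = int(np.argmax(qvalues(s)))
obs, _, done, _ = env.step(a)
```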
## [Post-lecture quiz](link-to-quiz-app) 46
## Assignment: [Train Mountain Car](assignment.md)
## Conclusion
We have now learnt how to train agents to achieve good results just by providing them a reward function that defines the desired state of the game, and by giving them an opportunity to intelligently explore the search space. We have successfully applied the Q-Learning algorithm in the cases of discrete and continuous environments, but with discrete actions. In the area of reinforcement learning, we need to further study situations where the action space is also continuous, and where the observation space is much more complex, such as the image from an Atari game screen. In those problems we often need to use more powerful machine learning techniques, such as neural networks, in order to achieve good results. Those more advanced topics are the subject of a more advanced Deep Reinforcement Learning course.

@@ -1,9 +1,43 @@
# Train Mountain Car
[OpenAI Gym](http://gym.openai.com) has been designed in such a way that all environments provide the same API - i.e. the same methods `reset`, `step` and `render`, and the same abstractions of **action space** and **observation space**. Thus it should be possible to adapt the same reinforcement learning algorithms to different environments with minimal code changes.
## Mountain Car Environment
[Mountain Car environment](https://gym.openai.com/envs/MountainCar-v0/) contains a car stuck in a valley:
<img src="images/mountaincar.png" width="300"/>
The goal is to get out of the valley and capture the flag, by doing at each step one of the following actions:
| Value | Meaning |
|---|---|
| 0 | Accelerate to the left |
| 1 | Do not accelerate |
| 2 | Accelerate to the right |
The main trick of this problem is, however, that the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.
Observation space consists of just two values:
| Num | Observation | Min | Max |
|-----|--------------|-----|-----|
| 0 | Car Position | -1.2| 0.6 |
| 1 | Car Velocity | -0.07 | 0.07 |
The reward system for the mountain car is rather tricky:
* Reward of 0 is awarded if the agent reached the flag (position = 0.5) on top of the mountain.
* Reward of -1 is awarded if the position of the agent is less than 0.5.
The episode terminates if the car position is more than 0.5, or if the episode length is greater than 200.
## Instructions
Adapt our reinforcement learning algorithm to solve the mountain car problem. Start with the existing [notebook.ipynb](notebook.ipynb) code, substitute the new environment, change the state discretization functions, and try to make the existing algorithm train with minimal code modifications. Optimize the result by adjusting hyperparameters.
> **Note**: Hyperparameter adjustment is likely to be needed to make the algorithm converge.
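As a starting point (a sketch only; the bin counts and helper below are assumptions, not part of the assignment code), you can create the environment and discretize its two observation values using the bounds from the table above:
```python
import gym
import numpy as np

env = gym.make("MountainCar-v0")
print(env.action_space)       # Discrete(3)
print(env.observation_space)  # a 2-dimensional Box with the position/velocity bounds above

# One possible discretization: split position and velocity into 20 bins each,
# using the known bounds from the observation space.
bounds = [(-1.2, 0.6), (-0.07, 0.07)]
def discretize(x, nbins=20):
    return tuple(int(np.digitize(x[i], np.linspace(lo, hi, nbins)))
                 for i, (lo, hi) in enumerate(bounds))
```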
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | Q-Learning algorithm is successfully adapted from the CartPole example, with minimal code modifications, and is able to solve the problem of capturing the flag under 200 steps. | A new Q-Learning algorithm has been adopted from the Internet, but is well-documented; or the existing algorithm has been adopted but does not reach the desired results | Student was not able to successfully adopt any algorithm, but has made substantial steps towards the solution (implemented state discretization, Q-Table data structure, etc.) |

Binary file not shown.

After

Width:  |  Height:  |  Size: 41 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 43 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 926 B

Binary file not shown.

After

Width:  |  Height:  |  Size: 8.0 KiB

@@ -0,0 +1,569 @@
{
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"orig_nbformat": 4,
"kernelspec": {
"name": "python376jvsc74a57bd01bafa348e6d68fe7a95d0101ce77c423dc9bd1325b58efa1f5b661fc9bc9be22",
"display_name": "Python 3.7.6 64-bit ('base': conda)"
},
"interpreter": {
"hash": "c77bccf6af5544921fca6eddbefe5e7c44ddf71c61b63c74bd828ca1d0e389a0"
}
},
"nbformat": 4,
"nbformat_minor": 2,
"cells": [
{
"source": [
"## CartPole Skating\n",
"\n",
"The problem we have been solving in the previous lesson might seem like a toy problem, not really applicable for real life scenarios. This is not the case, because many real world problems are like that - including playing chess or go. They are similar, because we also have a board with given rules and **discrete state**.\n",
"\n",
"In this lesson we will apply the same principles of Q-Learning to a problem with **continuous state**, i.e. a state that is given by one or more real numbers. We will deal with the following problem:\n",
"\n",
"> **Problem**: If Peter wants to escape from the wolf, he needs to be able to move faster than him. We will see how Peter can learn to skate, in particular, to keep balance, using Q-Learning.\n",
"\n",
"We will use a simplified version of balancing known as **CartPole** problem. In cartpole world, we have a horizontal slider that can move left or right, and the goal is to balance a pole staying on top of it.\n",
"\n",
"<img src=\"images/cartpole.png\" width=\"200\"/>\n",
"\n",
"## OpenAI Gym\n",
"\n",
"In the previous lesson, the rules of the game and the state were given by `Board` class, which we defined ourselves. Here we will use a special **sumulation environment**, which will simulate the physics behind the balancing pole. One of the most popular simulation environments for training Reinforcement Learning algorithms is called [Gym](https://gym.openai.com/), which is maintained by [OpenAI](https://openai.com/). By using gym we can create difference **environments**: from cartpole simulation to Atari games. \n",
"\n",
"First, let's install the gym and import required libraries:"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Collecting gym\n Downloading gym-0.18.3.tar.gz (1.6 MB)\nRequirement already satisfied: scipy in c:\\winapp\\conda\\lib\\site-packages (from gym) (1.4.1)\nRequirement already satisfied: numpy>=1.10.4 in c:\\winapp\\conda\\lib\\site-packages (from gym) (1.18.1)\nCollecting pyglet<=1.5.15,>=1.4.0\n Downloading pyglet-1.5.15-py3-none-any.whl (1.1 MB)\nRequirement already satisfied: Pillow<=8.2.0 in c:\\winapp\\conda\\lib\\site-packages (from gym) (8.2.0)\nCollecting cloudpickle<1.7.0,>=1.2.0\n Downloading cloudpickle-1.6.0-py3-none-any.whl (23 kB)\nBuilding wheels for collected packages: gym\n Building wheel for gym (setup.py): started\n Building wheel for gym (setup.py): finished with status 'done'\n Created wheel for gym: filename=gym-0.18.3-py3-none-any.whl size=1657521 sha256=23743ca1a46d6268b5aed87007ab14cf9597d1f0e9c38417574741617f6bb13f\n Stored in directory: c:\\users\\dmitr\\appdata\\local\\pip\\cache\\wheels\\1a\\ec\\6d\\705d53925f481ab70fd48ec7728558745eeae14dfda3b49c99\nSuccessfully built gym\nInstalling collected packages: pyglet, cloudpickle, gym\nSuccessfully installed cloudpickle-1.6.0 gym-0.18.3 pyglet-1.5.15\n"
]
}
],
"source": [
"import sys\n",
"!{sys.executable} -m pip install gym "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import gym\n",
"import matplotlib.pyplot as plt\n",
"from IPython import display\n",
"import numpy as np\n",
"import random"
]
},
{
"source": [
"## CartPole Environment\n",
"\n",
"To work with CartPole balancing problem, we need to initialize corresponding environment. Each environment is associated with:\n",
"* **Observation space** that defines the structure of information that we receive from the environment. For cartpole problem, we receive position of the pole, velocity and some other values.\n",
"* **Action space** that defines possible actions. In our case action space is discrete, and consists of two actions - **left** and **right**."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Discrete(2)\nBox(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)\n1\n"
]
}
],
"source": [
"env = gym.make(\"CartPole-v1\")\n",
"print(env.action_space)\n",
"print(env.observation_space)\n",
"print(env.action_space.sample())"
]
},
{
"source": [
"To see how the environment works, let's run a short simulation for 100 steps. At each step, we provide one of the actions to be taken - in this simulation we just randomly select an action from `action_space`. Run the code below and see what it leads to.\n",
"\n",
"> **Note**: It is preferred to run this code locally (eg. from Visual Studio Code), in which case the simulation will open in a new window. When running the code online, you may need to make some tweaks to the code, as described [here](https://towardsdatascience.com/rendering-openai-gym-envs-on-binder-and-google-colab-536f99391cc7)."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"c:\\winapp\\conda\\lib\\site-packages\\gym\\logger.py:30: UserWarning: \u001b[33mWARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.\u001b[0m\n warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))\n"
]
}
],
"source": [
"env.reset()\n",
"\n",
"for i in range(100):\n",
" env.render()\n",
" env.step(env.action_space.sample())\n",
"env.close()"
]
},
{
"source": [
"During simulation, we need to get observatons in order to decide how to act. In fact, `step` function returns us back current observations, reward function, and the `done` flag that indicates whether it makes sense to continue the simulation or not:"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[ 0.03403272 -0.24301182 0.02669811 0.2895829 ] -> 1.0\n",
"[ 0.02917248 -0.04828055 0.03248977 0.00543839] -> 1.0\n",
"[ 0.02820687 0.14636075 0.03259854 -0.27681916] -> 1.0\n",
"[ 0.03113408 0.34100283 0.02706215 -0.55904489] -> 1.0\n",
"[ 0.03795414 0.53573468 0.01588125 -0.84308041] -> 1.0\n",
"[ 4.86688335e-02 7.30636325e-01 -9.80354340e-04 -1.13072712e+00] -> 1.0\n",
"[ 0.06328156 0.9257711 -0.0235949 -1.42371736] -> 1.0\n",
"[ 0.08179698 1.1211767 -0.05206924 -1.72368043] -> 1.0\n",
"[ 0.10422052 0.92668783 -0.08654285 -1.44764396] -> 1.0\n",
"[ 0.12275427 0.73273015 -0.11549573 -1.18320812] -> 1.0\n",
"[ 0.13740888 0.53928047 -0.13915989 -0.9288471 ] -> 1.0\n",
"[ 0.14819448 0.34628356 -0.15773684 -0.68293142] -> 1.0\n",
"[ 0.15512016 0.54320334 -0.17139546 -1.0208266 ] -> 1.0\n",
"[ 0.16598422 0.35072788 -0.191812 -0.78648764] -> 1.0\n",
"[ 0.17299878 0.15868546 -0.20754175 -0.55975453] -> 1.0\n",
"[ 0.17617249 0.35602306 -0.21873684 -0.90998894] -> 1.0\n"
]
}
],
"source": [
"env.reset()\n",
"\n",
"done = False\n",
"while not done:\n",
" env.render()\n",
" obs, rew, done, info = env.step(env.action_space.sample())\n",
" print(f\"{obs} -> {rew}\")\n",
"env.close()"
]
},
{
"source": [
"The observation vector that is returned at each step of the simulation contains the following values:\n",
"* Position of cart\n",
"* Velocity of cart\n",
"* Angle of pole\n",
"* Rotation rate of pole\n",
"\n",
"We can get min and max value of those numbers:\n"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]\n[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]\n"
]
}
],
"source": [
"print(env.observation_space.low)\n",
"print(env.observation_space.high)"
]
},
{
"source": [
"You may also notice that reward value on each simulation step is always 1. This is because our goal is to survive as long as possible, i.e. keep the pole to a reasonably vertical position for the longest period of time.\n",
"\n",
"> In fact, CartPole simulation is considered solved if we manage to get the average reward of 195 over 100 consecutive trials."
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"## State Discretization\n",
"\n",
"In Q=Learning, we need to build Q-Table that defines what to do at each state. To be able to do this, we need state to be **discreet**, more precisely, it should contain finite number of disctete values. Thus, we need somehow to **discretize** our observations, mapping them to finite set of states.\n",
"\n",
"There are a few ways we can do this:\n",
"* If we know the interval of a certain value, we can divide this interval into a number of **bins**, and then replace the value by the number of bin that it belongs to. This can be done using numpy [`digitize`](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) method. In this case, we will precisely know the state size, because it will depend on the number of bins we select for digitalization.\n",
"* We can use linear interpolation to bring values to some finite interval (say, from -20 to 20), and then convert numbers to integers by rounding them. This gives us a bit less control on the size of the state, especially if we do not know the exact ranges of input values. For example, in our case 2 out of 4 values do not have upper/lower bounds on their values, which may result in the infinite number of states.\n",
"\n",
"In our example, we will go with the second approach. As you may notice later, despite undefined upper/lower bounds, those value rarely take values outside of certain finite intervals, thus those states with extreme values will be very rare.\n",
"\n",
"Here is the function that will take the observation from our model, and produces a tuple of 4 integer values:"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def discretize(x):\n",
" return tuple((x/np.array([0.25, 0.25, 0.01, 0.1])).astype(np.int))"
]
},
{
"source": [
"Let's also explore other discretization method using bins:"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Sample bins for interval (-5,5) with 10 bins\n [-5. -4. -3. -2. -1. 0. 1. 2. 3. 4. 5.]\n"
]
}
],
"source": [
"def create_bins(i,num):\n",
" return np.arange(num+1)*(i[1]-i[0])/num+i[0]\n",
"\n",
"print(\"Sample bins for interval (-5,5) with 10 bins\\n\",create_bins((-5,5),10))\n",
"\n",
"ints = [(-5,5),(-2,2),(-0.5,0.5),(-2,2)] # intervals of values for each parameter\n",
"nbins = [20,20,10,10] # number of bins for each parameter\n",
"bins = [create_bins(ints[i],nbins[i]) for i in range(4)]\n",
"\n",
"def discretize_bins(x):\n",
" return tuple(np.digitize(x[i],bins[i]) for i in range(4))"
]
},
{
"source": [
"Let's now run a short simulation and observe those discrete environemnt values. Feel free to try both `discretize` and `discretize_bins` and see if there is a difference.\n",
"\n",
"> **Note**: `discretize_bins` returns the bin number, which is 0-based, thus for values of input variable around 0 it returns the number from the middle of the interval (10). In `discretize`, we did not care about the range of output values, allowing them to be negative, thus the state values are not shifted, and 0 corresponds to 0."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"(0, 0, 3, 3)\n(0, 0, 4, 0)\n(0, 0, 4, 3)\n(0, 0, 4, 0)\n(0, 0, 4, 3)\n(0, 0, 5, 0)\n(0, 0, 5, 3)\n(0, -1, 6, 7)\n(0, -2, 7, 10)\n(0, -3, 10, 13)\n(0, -3, 12, 16)\n(0, -4, 15, 19)\n(0, -3, 19, 17)\n(0, -4, 23, 20)\n"
]
}
],
"source": [
"env.reset()\n",
"\n",
"done = False\n",
"while not done:\n",
" #env.render()\n",
" obs, rew, done, info = env.step(env.action_space.sample())\n",
" #print(discretize_bins(obs))\n",
" print(discretize(obs))\n",
"env.close()"
]
},
{
"source": [
"## Q-Table Structure\n",
"\n",
"In our previous lesson, the state was a simple pair of numbers from 0 to 8, and thus it was convenient to represent Q-Table by numpy tensor with shape 8x8x2. If we use bins discretization, the size of our state vector is also known, so we can use the same approach and represent state by an array of shape 20x20x10x10x2 (here 2 is the dimension of action space, and first dimensions correspond to the number of bins we have selected to use for each of the parameters in observation space).\n",
"\n",
"However, sometimes precise dimensions of the observation space are not known. In case of `discretize` function, we may never be sure that our state stays within certain limits, because some of the original values are not boind. Thus, we will use slightly different approach and represent Q-Table by a dictionary. We will use the pair *(state,action)* as the dictionary key, and the value would correspond to Q-Table entry value. "
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"Q = {}\n",
"actions = (0,1)\n",
"\n",
"def qvalues(state):\n",
" return [Q.get((state,a),0) for a in actions]"
]
},
{
"source": [
"Here we also define a function `qvalues`, which returns a list of Q-Table values for a given state that correspond to all possible actions. If the entry is not present in the Q-Table, we will return 0 as the default.\n",
"\n",
"## Let's Start Q-Learning!\n",
"\n",
"Now we are ready to teach Peter to balance! First, let's set some hyperparameterers:"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"# hyperparameters\n",
"alpha = 0.3\n",
"gamma = 0.90\n",
"epsilon = 0.9"
]
},
{
"source": [
"Here, `alpha` is the **learning rate** that defines to which extent we should adjust the current values of Q-Table at each step. In previous lesson we have started with 1, and then decreased `alpha` to lower values during training. In this example we will keep it constant just for simplicity, and you can experiment with adjusting `alpha` values later.\n",
"\n",
"`gamma` is the **discount factor** that shows to which extent we should prioritize future reward over current reward.\n",
"\n",
"`epsilon` is the **exploration/exploitation factor** that determines whether we should prefer exploration to exploitation or vice versa. In our algorithm, we will in `epsilon` percent of the cases select the next action according to Q-Table values, and in the remaining number of cases we will execute random action. This will allow us to explore the areas of search space that we have never seen before. \n",
"\n",
"> In terms of balancing - chosing random action (exploration) would act as a random punch in the wrong direction, and the pole would have to learn how to recover the balance from those \"mistakes\"\n",
"\n",
"We would also make two improvements to our algorithm from the previous lesson:\n",
"\n",
"* Calculating average cumulative reward over a number of simulations. We will print the progress each 5000 iterations, and we will average out our cumulative reward over that period of time. It means that if we get more than 195 point - we can consider the problem solved, with even higher quality than required.\n",
"* We will calculate maximim average cumulative result `Qmax`, and we will store the Q-Table corresponding to that result. When you run the training you will notice that sometimes the average cumulative result starts to drop, and we want to keep the values of Q-Table that correspond to the best model observed during training.\n",
"\n",
"We will also collect all cumulative rewards at each simulaiton at `rewards` vector for further plotting."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"0: 36.0, alpha=0.3, epsilon=0.9\n",
"5000: 79.9322, alpha=0.3, epsilon=0.9\n",
"10000: 136.5074, alpha=0.3, epsilon=0.9\n",
"15000: 166.0206, alpha=0.3, epsilon=0.9\n",
"20000: 179.561, alpha=0.3, epsilon=0.9\n",
"25000: 195.6424, alpha=0.3, epsilon=0.9\n",
"30000: 213.3106, alpha=0.3, epsilon=0.9\n",
"35000: 227.8582, alpha=0.3, epsilon=0.9\n",
"40000: 230.849, alpha=0.3, epsilon=0.9\n",
"45000: 246.6194, alpha=0.3, epsilon=0.9\n",
"50000: 270.2226, alpha=0.3, epsilon=0.9\n",
"55000: 266.2084, alpha=0.3, epsilon=0.9\n",
"60000: 281.3548, alpha=0.3, epsilon=0.9\n",
"65000: 285.2666, alpha=0.3, epsilon=0.9\n",
"70000: 298.7658, alpha=0.3, epsilon=0.9\n",
"75000: 314.9734, alpha=0.3, epsilon=0.9\n",
"80000: 325.5224, alpha=0.3, epsilon=0.9\n",
"85000: 325.1302, alpha=0.3, epsilon=0.9\n",
"90000: 330.4744, alpha=0.3, epsilon=0.9\n",
"95000: 309.4724, alpha=0.3, epsilon=0.9\n"
]
}
],
"source": [
"def probs(v,eps=1e-4):\n",
" v = v-v.min()+eps\n",
" v = v/v.sum()\n",
" return v\n",
"\n",
"random.seed(13)\n",
"\n",
"Qmax = 0\n",
"cum_rewards = []\n",
"rewards = []\n",
"for epoch in range(100000):\n",
" obs = env.reset()\n",
" done = False\n",
" cum_reward=0\n",
" # == do the simulation ==\n",
" while not done:\n",
" s = discretize(obs)\n",
" if random.random()<epsilon:\n",
" # exploitation - chose the action according to Q-Table probabilities\n",
" v = probs(np.array(qvalues(s)))\n",
" a = random.choices(actions,weights=v)[0]\n",
" else:\n",
" # exploration - randomly chose the action\n",
" a = np.random.randint(env.action_space.n)\n",
"\n",
" obs, rew, done, info = env.step(a)\n",
" cum_reward+=rew\n",
" ns = discretize(obs)\n",
" Q[(s,a)] = (1 - alpha) * Q.get((s,a),0) + alpha * (rew + gamma * max(qvalues(ns)))\n",
" cum_rewards.append(cum_reward)\n",
" rewards.append(cum_reward)\n",
" # == Periodically print results and calculate average reward ==\n",
" if epoch%5000==0:\n",
" print(f\"{epoch}: {np.average(cum_rewards)}, alpha={alpha}, epsilon={epsilon}\")\n",
" if np.average(cum_rewards) > Qmax:\n",
" Qmax = np.average(cum_rewards)\n",
" Qbest = Q\n",
" cum_rewards=[]"
]
},
{
"source": [
"What you may notice from those results:\n",
"* We are very close achieving the goal of getting 195 cumulative reward over 100+ consecutive runs of the simulation, or we may have actually achieved it! Even if we get smaller numbers, we still do not know, because we average over 5000 runs, and only 100 runs is required in the formal criteria.\n",
"* Sometimes the reward start to drop, which means that we can \"destroy\" already learnt values in Q-Table with the ones that make situation worse\n",
"\n",
"To make learning more stable, it makes sense to adjust some of our hyperparameters during training. In particular:\n",
"* For **learning rate**, `alpha`, we may start with values close to 1, and then keep decreasing the parameter. With time, we will be getting good probability values in Q-Table, and thus we should be adjusting them slightly, and not overwriting completely with new values.\n",
"* We may want to increase the `eplilon` slowly, in order to be exploring less, and expliting more. It probably makes sense to start with lower value of `epsilon`, and move up to almost 1\n",
"\n",
"> **Task 1**: Play with hyperparameter values and see if you can achieve higher cumulative reward. Are you getting above 195?\n",
"\n",
"> **Task 2**: To formally solve the problem, you need to get 195 average reward across 100 consecutive runs. Measure that during training and make sure that you have formally solved the problem!"
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"## Seeing the Result in Action\n",
"\n",
"Now it would be interesting to actually see how the trained model behaves. Let's run the simulation, and we will be following the same action selection strategy as during training: sampling according to the probability distribution in Q-Table: "
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"obs = env.reset()\n",
"done = False\n",
"while not done:\n",
" s = discretize(obs)\n",
" env.render()\n",
" v = probs(np.array(qvalues(s)))\n",
" a = random.choices(actions,weights=v)[0]\n",
" obs,_,done,_ = env.step(a)\n",
"env.close()"
]
},
{
"source": [
"> **Task 3**: Here, we were using the final copy of Q-Table, which may not be the best one. Remember that we have stored the best-performing Q-Table into `Qbest` variable! Try the same example with the best-performing Q-Table by copying `Qbest` over to `Q` and see if you notice the difference.\n",
"\n",
"> **Task 4**: Here we were not selecting the best action on each step, but rather sampling with corresponding probability distribution. Would it make more sense to always select the best action, with highest Q-Table value? This can be easily done by using `np.argmax` function to find out the action number corresponding to highers Q-Table value. Implement this strategy and see if it improves the balancing.\n",
"\n",
"## Saving result to animated GIF\n",
"\n",
"If you want to impress your friends, you may want to send them the animated GIF picture of the balancing pole. To do this, we can invoke `env.render` to produce an image frame, and then save those to animated GIF using PIL library:"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"159\n"
]
}
],
"source": [
"from PIL import Image\n",
"obs = env.reset()\n",
"done = False\n",
"i=0\n",
"ims = []\n",
"while not done:\n",
" s = discretize(obs)\n",
" img=env.render(mode='rgb_array')\n",
" ims.append(Image.fromarray(img))\n",
" v = probs(np.array([Qbest.get((s,a),0) for a in actions]))\n",
" a = random.choices(actions,weights=v)[0]\n",
" obs,_,done,_ = env.step(a)\n",
" i+=1\n",
"env.close()\n",
"ims[0].save('images/cartpole-balance.gif',save_all=True,append_images=ims[1::2],loop=0,duration=5)\n",
"print(i)"
]
},
{
"source": [
"## Conclusion\n",
"\n",
"We have now learnt how to train agents to achieve good results just by providing them a reward function that defines the desired state of the game, and by giving it an opportinity to intellegently explore the search space. We have successfully applied Q-Learning algorithm in the cases of discrete and continuous environments, but with discrete actions. In the are of reinforcement learning, we need to further study situations where action state is also continuous, and when observation space is much more complex, such as the image from Atarti game screen. In those problems we often need to use more powerful machine learning techniques, such as neural networks, in order to achieve good results. Those more advanced topics are the subject of more advanced Deep Reinforcement Learning course."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
]
}