diff --git a/8-Reinforcement/2-Gym/README.md b/8-Reinforcement/2-Gym/README.md
index 787cff628..21c3269a0 100644
--- a/8-Reinforcement/2-Gym/README.md
+++ b/8-Reinforcement/2-Gym/README.md
@@ -236,6 +236,27 @@ What you may notice from those results:
 * We are very close achieving the goal of getting 195 cumulative reward over 100+ consecutive runs of the simulation, or we may have actually achieved it! Even if we get smaller numbers, we still do not know, because we average over 5000 runs, and only 100 runs is required in the formal criteria.
 * Sometimes the reward start to drop, which means that we can "destroy" already learnt values in Q-Table with the ones that make situation worse
 
+This is more clearly visible if we plot training progress.
+
+## Plotting Training Progress
+
+During training, we have collected the cumulative reward value at each of the iterations into `rewards` vector. Here is how it looks when we plot it against the iteration number:
+
+![](images/train_progress_raw.png)
+
+From this graph, it is not possible to tell anything, because due to the nature of stochastic training process the length of training sessions varies greatly. To make more sense of this graph, we can calculate **running average** over series of experiments, let's say 100. This can be done conveniently using `np.convolve`:
+
+```python
+def running_average(x,window):
+    return np.convolve(x,np.ones(window)/window,mode='valid')
+
+plt.plot(running_average(rewards,100))
+```
+
+![](images/train_progress_runav.png)
+
+## Varying Hyperparameters
+
 To make learning more stable, it makes sense to adjust some of our hyperparameters during training. In particular:
 * For **learning rate**, `alpha`, we may start with values close to 1, and then keep decreasing the parameter. With time, we will be getting good probability values in Q-Table, and thus we should be adjusting them slightly, and not overwriting completely with new values.
 * We may want to increase the `eplilon` slowly, in order to be exploring less, and expliting more. It probably makes sense to start with lower value of `epsilon`, and move up to almost 1
@@ -247,9 +268,6 @@ To make learning more stable, it makes sense to adjust some of our hyperparamete
 ## Seeing the Result in Action
 
 Now it would be interesting to actually see how the trained model behaves. Let's run the simulation, and we will be following the same action selection strategy as during training: sampling according to the probability distribution in Q-Table:
-## 🚀Challenge
-
-Add a challenge for students to work on collaboratively in class to enhance the project
 
 ```python
 obs = env.reset()
@@ -277,4 +295,4 @@ You should see something like this:
 
 ## Conclusion
 
-We have now learnt how to train agents to achieve good results just by providing them a reward function that defines the desired state of the game, and by giving it an opportinity to intellegently explore the search space. We have successfully applied Q-Learning algorithm in the cases of discrete and continuous environments, but with discrete actions. In the are of reinforcement learning, we need to further study situations where action state is also continuous, and when observation space is much more complex, such as the image from Atarti game screen. In those problems we often need to use more powerful machine learning techniques, such as neural networks, in order to achieve good results. Those more advanced topics are the subject of more advanced Deep Reinforcement Learning course.
\ No newline at end of file
+We have now learnt how to train agents to achieve good results just by providing them a reward function that defines the desired state of the game, and by giving it an opportunity to intelligently explore the search space. We have successfully applied Q-Learning algorithm in the cases of discrete and continuous environments, but with discrete actions.
In the are of reinforcement learning, we need to further study situations where action state is also continuous, and when observation space is much more complex, such as the image from Atari game screen. In those problems we often need to use more powerful machine learning techniques, such as neural networks, in order to achieve good results. Those more advanced topics are the subject of more advanced Deep Reinforcement Learning course. \ No newline at end of file diff --git a/8-Reinforcement/2-Gym/images/cartpole-balance.gif b/8-Reinforcement/2-Gym/images/cartpole-balance.gif index 836845a8b..52dbd2043 100644 Binary files a/8-Reinforcement/2-Gym/images/cartpole-balance.gif and b/8-Reinforcement/2-Gym/images/cartpole-balance.gif differ diff --git a/8-Reinforcement/2-Gym/images/train_progress_raw.png b/8-Reinforcement/2-Gym/images/train_progress_raw.png new file mode 100644 index 000000000..16a698228 Binary files /dev/null and b/8-Reinforcement/2-Gym/images/train_progress_raw.png differ diff --git a/8-Reinforcement/2-Gym/images/train_progress_runav.png b/8-Reinforcement/2-Gym/images/train_progress_runav.png new file mode 100644 index 000000000..4eccb762c Binary files /dev/null and b/8-Reinforcement/2-Gym/images/train_progress_runav.png differ diff --git a/8-Reinforcement/2-Gym/notebook.ipynb b/8-Reinforcement/2-Gym/notebook.ipynb index 30951e7bd..9f421aace 100644 --- a/8-Reinforcement/2-Gym/notebook.ipynb +++ b/8-Reinforcement/2-Gym/notebook.ipynb @@ -10,15 +10,15 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.6" + "version": "3.7.4" }, "orig_nbformat": 4, "kernelspec": { - "name": "python376jvsc74a57bd01bafa348e6d68fe7a95d0101ce77c423dc9bd1325b58efa1f5b661fc9bc9be22", - "display_name": "Python 3.7.6 64-bit ('base': conda)" + "name": "python3", + "display_name": "Python 3.7.4 64-bit ('base': conda)" }, "interpreter": { - "hash": "c77bccf6af5544921fca6eddbefe5e7c44ddf71c61b63c74bd828ca1d0e389a0" + "hash": 
"86193a1ab0ba47eac1c69c1756090baa3b420b3eea7d4aafab8b85f8b312f0c5" } }, "nbformat": 4, @@ -56,7 +56,7 @@ "output_type": "stream", "name": "stdout", "text": [ - "Collecting gym\n Downloading gym-0.18.3.tar.gz (1.6 MB)\nRequirement already satisfied: scipy in c:\\winapp\\conda\\lib\\site-packages (from gym) (1.4.1)\nRequirement already satisfied: numpy>=1.10.4 in c:\\winapp\\conda\\lib\\site-packages (from gym) (1.18.1)\nCollecting pyglet<=1.5.15,>=1.4.0\n Downloading pyglet-1.5.15-py3-none-any.whl (1.1 MB)\nRequirement already satisfied: Pillow<=8.2.0 in c:\\winapp\\conda\\lib\\site-packages (from gym) (8.2.0)\nCollecting cloudpickle<1.7.0,>=1.2.0\n Downloading cloudpickle-1.6.0-py3-none-any.whl (23 kB)\nBuilding wheels for collected packages: gym\n Building wheel for gym (setup.py): started\n Building wheel for gym (setup.py): finished with status 'done'\n Created wheel for gym: filename=gym-0.18.3-py3-none-any.whl size=1657521 sha256=23743ca1a46d6268b5aed87007ab14cf9597d1f0e9c38417574741617f6bb13f\n Stored in directory: c:\\users\\dmitr\\appdata\\local\\pip\\cache\\wheels\\1a\\ec\\6d\\705d53925f481ab70fd48ec7728558745eeae14dfda3b49c99\nSuccessfully built gym\nInstalling collected packages: pyglet, cloudpickle, gym\nSuccessfully installed cloudpickle-1.6.0 gym-0.18.3 pyglet-1.5.15\n" + "Requirement already satisfied: gym in c:\\winapp\\miniconda3\\lib\\site-packages (0.18.3)\nRequirement already satisfied: pyglet<=1.5.15,>=1.4.0 in c:\\winapp\\miniconda3\\lib\\site-packages (from gym) (1.5.15)\nRequirement already satisfied: cloudpickle<1.7.0,>=1.2.0 in c:\\winapp\\miniconda3\\lib\\site-packages (from gym) (1.2.2)\nRequirement already satisfied: Pillow<=8.2.0 in c:\\winapp\\miniconda3\\lib\\site-packages (from gym) (7.2.0)\nRequirement already satisfied: scipy in c:\\winapp\\miniconda3\\lib\\site-packages (from gym) (1.6.1)\nRequirement already satisfied: numpy>=1.10.4 in c:\\winapp\\miniconda3\\lib\\site-packages (from gym) (1.19.5)\n" ] } ], @@ -67,7 +67,7 @@ }, 
{ "cell_type": "code", - "execution_count": 2, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -91,7 +91,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, "metadata": {}, "outputs": [ { @@ -120,14 +120,14 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 3, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ - "c:\\winapp\\conda\\lib\\site-packages\\gym\\logger.py:30: UserWarning: \u001b[33mWARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.\u001b[0m\n warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))\n" + "C:\\winapp\\miniconda3\\lib\\site-packages\\gym\\logger.py:30: UserWarning: \u001b[33mWARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.\u001b[0m\n warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))\n" ] } ], @@ -142,36 +142,34 @@ }, { "source": [ - "During simulation, we need to get observatons in order to decide how to act. In fact, `step` function returns us back current observations, reward function, and the `done` flag that indicates whether it makes sense to continue the simulation or not:" + "During simulation, we need to get observations in order to decide how to act. 
In fact, `step` function returns us back current observations, reward function, and the `done` flag that indicates whether it makes sense to continue the simulation or not:" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ - "[ 0.03403272 -0.24301182 0.02669811 0.2895829 ] -> 1.0\n", - "[ 0.02917248 -0.04828055 0.03248977 0.00543839] -> 1.0\n", - "[ 0.02820687 0.14636075 0.03259854 -0.27681916] -> 1.0\n", - "[ 0.03113408 0.34100283 0.02706215 -0.55904489] -> 1.0\n", - "[ 0.03795414 0.53573468 0.01588125 -0.84308041] -> 1.0\n", - "[ 4.86688335e-02 7.30636325e-01 -9.80354340e-04 -1.13072712e+00] -> 1.0\n", - "[ 0.06328156 0.9257711 -0.0235949 -1.42371736] -> 1.0\n", - "[ 0.08179698 1.1211767 -0.05206924 -1.72368043] -> 1.0\n", - "[ 0.10422052 0.92668783 -0.08654285 -1.44764396] -> 1.0\n", - "[ 0.12275427 0.73273015 -0.11549573 -1.18320812] -> 1.0\n", - "[ 0.13740888 0.53928047 -0.13915989 -0.9288471 ] -> 1.0\n", - "[ 0.14819448 0.34628356 -0.15773684 -0.68293142] -> 1.0\n", - "[ 0.15512016 0.54320334 -0.17139546 -1.0208266 ] -> 1.0\n", - "[ 0.16598422 0.35072788 -0.191812 -0.78648764] -> 1.0\n", - "[ 0.17299878 0.15868546 -0.20754175 -0.55975453] -> 1.0\n", - "[ 0.17617249 0.35602306 -0.21873684 -0.90998894] -> 1.0\n" + "[-0.01781364 0.16446158 0.00575593 -0.26601863] -> 1.0\n", + "[-1.45244123e-02 3.59500908e-01 4.35556587e-04 -5.56880543e-01] -> 1.0\n", + "[-0.00733439 0.55461674 -0.01070205 -0.84942621] -> 1.0\n", + "[ 0.00375794 0.35964236 -0.02769058 -0.56012774] -> 1.0\n", + "[ 0.01095079 0.16491978 -0.03889313 -0.27629582] -> 1.0\n", + "[ 0.01424918 0.36057441 -0.04441905 -0.58098753] -> 1.0\n", + "[ 0.02146067 0.5562897 -0.0560388 -0.8873258 ] -> 1.0\n", + "[ 0.03258647 0.75212567 -0.07378532 -1.19708542] -> 1.0\n", + "[ 0.04762898 0.55803219 -0.09772702 -0.92841056] -> 1.0\n", + "[ 0.05878962 0.75432799 
-0.11629524 -1.25013537] -> 1.0\n", + "[ 0.07387618 0.56087255 -0.14129794 -0.99602608] -> 1.0\n", + "[ 0.08509363 0.75757231 -0.16121846 -1.32953877] -> 1.0\n", + "[ 0.10024508 0.56480922 -0.18780924 -1.09133681] -> 1.0\n", + "[ 0.11154126 0.76184222 -0.20963598 -1.43658114] -> 1.0\n" ] } ], @@ -201,7 +199,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -245,7 +243,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -262,7 +260,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "metadata": {}, "outputs": [ { @@ -298,14 +296,14 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 8, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ - "(0, 0, 3, 3)\n(0, 0, 4, 0)\n(0, 0, 4, 3)\n(0, 0, 4, 0)\n(0, 0, 4, 3)\n(0, 0, 5, 0)\n(0, 0, 5, 3)\n(0, -1, 6, 7)\n(0, -2, 7, 10)\n(0, -3, 10, 13)\n(0, -3, 12, 16)\n(0, -4, 15, 19)\n(0, -3, 19, 17)\n(0, -4, 23, 20)\n" + "(0, 0, -2, -2)\n(0, 1, -2, -5)\n(0, 2, -3, -8)\n(0, 3, -5, -11)\n(0, 3, -7, -14)\n(0, 4, -10, -17)\n(0, 3, -14, -15)\n(0, 3, -17, -12)\n(0, 3, -20, -16)\n(0, 4, -23, -19)\n" ] } ], @@ -327,14 +325,14 @@ "\n", "In our previous lesson, the state was a simple pair of numbers from 0 to 8, and thus it was convenient to represent Q-Table by numpy tensor with shape 8x8x2. If we use bins discretization, the size of our state vector is also known, so we can use the same approach and represent state by an array of shape 20x20x10x10x2 (here 2 is the dimension of action space, and first dimensions correspond to the number of bins we have selected to use for each of the parameters in observation space).\n", "\n", - "However, sometimes precise dimensions of the observation space are not known. 
In case of `discretize` function, we may never be sure that our state stays within certain limits, because some of the original values are not boind. Thus, we will use slightly different approach and represent Q-Table by a dictionary. We will use the pair *(state,action)* as the dictionary key, and the value would correspond to Q-Table entry value. " + "However, sometimes precise dimensions of the observation space are not known. In case of `discretize` function, we may never be sure that our state stays within certain limits, because some of the original values are not bound. Thus, we will use slightly different approach and represent Q-Table by a dictionary. We will use the pair *(state,action)* as the dictionary key, and the value would correspond to Q-Table entry value. " ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ @@ -358,7 +356,7 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ @@ -376,47 +374,47 @@ "\n", "`epsilon` is the **exploration/exploitation factor** that determines whether we should prefer exploration to exploitation or vice versa. In our algorithm, we will in `epsilon` percent of the cases select the next action according to Q-Table values, and in the remaining number of cases we will execute random action. This will allow us to explore the areas of search space that we have never seen before. 
\n", "\n", - "> In terms of balancing - chosing random action (exploration) would act as a random punch in the wrong direction, and the pole would have to learn how to recover the balance from those \"mistakes\"\n", + "> In terms of balancing - choosing random action (exploration) would act as a random punch in the wrong direction, and the pole would have to learn how to recover the balance from those \"mistakes\"\n", "\n", "We would also make two improvements to our algorithm from the previous lesson:\n", "\n", "* Calculating average cumulative reward over a number of simulations. We will print the progress each 5000 iterations, and we will average out our cumulative reward over that period of time. It means that if we get more than 195 point - we can consider the problem solved, with even higher quality than required.\n", - "* We will calculate maximim average cumulative result `Qmax`, and we will store the Q-Table corresponding to that result. When you run the training you will notice that sometimes the average cumulative result starts to drop, and we want to keep the values of Q-Table that correspond to the best model observed during training.\n", + "* We will calculate maximum average cumulative result `Qmax`, and we will store the Q-Table corresponding to that result. When you run the training, you will notice that sometimes the average cumulative result starts to drop, and we want to keep the values of Q-Table that correspond to the best model observed during training.\n", "\n", - "We will also collect all cumulative rewards at each simulaiton at `rewards` vector for further plotting." + "We will also collect all cumulative rewards at each simulation in `rewards` vector for further plotting." 
], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 14, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ - "0: 36.0, alpha=0.3, epsilon=0.9\n", - "5000: 79.9322, alpha=0.3, epsilon=0.9\n", - "10000: 136.5074, alpha=0.3, epsilon=0.9\n", - "15000: 166.0206, alpha=0.3, epsilon=0.9\n", - "20000: 179.561, alpha=0.3, epsilon=0.9\n", - "25000: 195.6424, alpha=0.3, epsilon=0.9\n", - "30000: 213.3106, alpha=0.3, epsilon=0.9\n", - "35000: 227.8582, alpha=0.3, epsilon=0.9\n", - "40000: 230.849, alpha=0.3, epsilon=0.9\n", - "45000: 246.6194, alpha=0.3, epsilon=0.9\n", - "50000: 270.2226, alpha=0.3, epsilon=0.9\n", - "55000: 266.2084, alpha=0.3, epsilon=0.9\n", - "60000: 281.3548, alpha=0.3, epsilon=0.9\n", - "65000: 285.2666, alpha=0.3, epsilon=0.9\n", - "70000: 298.7658, alpha=0.3, epsilon=0.9\n", - "75000: 314.9734, alpha=0.3, epsilon=0.9\n", - "80000: 325.5224, alpha=0.3, epsilon=0.9\n", - "85000: 325.1302, alpha=0.3, epsilon=0.9\n", - "90000: 330.4744, alpha=0.3, epsilon=0.9\n", - "95000: 309.4724, alpha=0.3, epsilon=0.9\n" + "0: 22.0, alpha=0.3, epsilon=0.9\n", + "5000: 70.1384, alpha=0.3, epsilon=0.9\n", + "10000: 121.8586, alpha=0.3, epsilon=0.9\n", + "15000: 149.6368, alpha=0.3, epsilon=0.9\n", + "20000: 168.2782, alpha=0.3, epsilon=0.9\n", + "25000: 196.7356, alpha=0.3, epsilon=0.9\n", + "30000: 220.7614, alpha=0.3, epsilon=0.9\n", + "35000: 233.2138, alpha=0.3, epsilon=0.9\n", + "40000: 248.22, alpha=0.3, epsilon=0.9\n", + "45000: 264.636, alpha=0.3, epsilon=0.9\n", + "50000: 276.926, alpha=0.3, epsilon=0.9\n", + "55000: 277.9438, alpha=0.3, epsilon=0.9\n", + "60000: 248.881, alpha=0.3, epsilon=0.9\n", + "65000: 272.529, alpha=0.3, epsilon=0.9\n", + "70000: 281.7972, alpha=0.3, epsilon=0.9\n", + "75000: 284.2844, alpha=0.3, epsilon=0.9\n", + "80000: 269.667, alpha=0.3, epsilon=0.9\n", + "85000: 273.8652, alpha=0.3, epsilon=0.9\n", + "90000: 278.2466, alpha=0.3, 
epsilon=0.9\n", + "95000: 269.1736, alpha=0.3, epsilon=0.9\n" ] } ], @@ -427,6 +425,8 @@ " return v\n", "\n", "random.seed(13)\n", + "np.random.seed(13)\n", + "env.seed(13)\n", "\n", "Qmax = 0\n", "cum_rewards = []\n", @@ -467,19 +467,99 @@ "* We are very close achieving the goal of getting 195 cumulative reward over 100+ consecutive runs of the simulation, or we may have actually achieved it! Even if we get smaller numbers, we still do not know, because we average over 5000 runs, and only 100 runs is required in the formal criteria.\n", "* Sometimes the reward start to drop, which means that we can \"destroy\" already learnt values in Q-Table with the ones that make situation worse\n", "\n", - "To make learning more stable, it makes sense to adjust some of our hyperparameters during training. In particular:\n", - "* For **learning rate**, `alpha`, we may start with values close to 1, and then keep decreasing the parameter. With time, we will be getting good probability values in Q-Table, and thus we should be adjusting them slightly, and not overwriting completely with new values.\n", - "* We may want to increase the `eplilon` slowly, in order to be exploring less, and expliting more. It probably makes sense to start with lower value of `epsilon`, and move up to almost 1\n", + "This is more clearly visible if we plot training progress.\n", "\n", - "> **Task 1**: Play with hyperparameter values and see if you can achieve higher cumulative reward. Are you getting above 195?\n", + "## Plotting Training Progress\n", "\n", - "> **Task 2**: To formally solve the problem, you need to get 195 average reward across 100 consecutive runs. Measure that during training and make sure that you have formally solved the problem!" + "During training, we have collected the cumulative reward value at each of the iterations into `rewards` vector. 
Here is how it looks when we plot it against the iteration number:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[]" + ] + }, + "metadata": {}, + "execution_count": 20 + }, + { + "output_type": "display_data", + "data": { + "text/plain": "
", + "image/svg+xml": "\r\n\r\n\r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n\r\n", + "image/png": "\n" + }, + "metadata": { + "needs_background": "light" + } + } + ], + "source": [ + "plt.plot(rewards)" + ] + }, + { + "source": [ + "From this graph, it is not possible to tell anything, because due to the nature of stochastic training process the length of training sessions varies greatly. To make more sense of this graph, we can calculate **running average** over series of experiments, let's say 100. 
This can be done conveniently using `np.convolve`:" ], "cell_type": "markdown", "metadata": {} }, { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[]" + ] + }, + "metadata": {}, + "execution_count": 22 + }, + { + "output_type": "display_data", + "data": { + "text/plain": "
", + "image/svg+xml": "\r\n\r\n\r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n\r\n", + "image/png": "\n" + }, + "metadata": { + "needs_background": "light" + } + } + ], "source": [ + "def running_average(x,window):\n", + " return np.convolve(x,np.ones(window)/window,mode='valid')\n", + "\n", + "plt.plot(running_average(rewards,100))" + ] + }, + { + "source": [ + "## Varying Hyperparameters\n", + "\n", + "To make learning more stable, it makes sense to adjust some of our hyperparameters during training. In particular:\n", + "* For **learning rate**, `alpha`, we may start with values close to 1, and then keep decreasing the parameter. 
With time, we will be getting good probability values in Q-Table, and thus we should be adjusting them slightly, and not overwriting completely with new values.\n", + "* We may want to increase the `eplilon` slowly, in order to be exploring less, and expliting more. It probably makes sense to start with lower value of `epsilon`, and move up to almost 1\n", + "\n", + "> **Task 1**: Play with hyperparameter values and see if you can achieve higher cumulative reward. Are you getting above 195?\n", + "\n", + "> **Task 2**: To formally solve the problem, you need to get 195 average reward across 100 consecutive runs. Measure that during training and make sure that you have formally solved the problem!\n", + "\n", "## Seeing the Result in Action\n", "\n", "Now it would be interesting to actually see how the trained model behaves. Let's run the simulation, and we will be following the same action selection strategy as during training: sampling according to the probability distribution in Q-Table: " @@ -489,7 +569,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -519,14 +599,14 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 26, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ - "159\n" + "360\n" ] } ], @@ -553,7 +633,7 @@ "source": [ "## Conclusion\n", "\n", - "We have now learnt how to train agents to achieve good results just by providing them a reward function that defines the desired state of the game, and by giving it an opportinity to intellegently explore the search space. We have successfully applied Q-Learning algorithm in the cases of discrete and continuous environments, but with discrete actions. In the are of reinforcement learning, we need to further study situations where action state is also continuous, and when observation space is much more complex, such as the image from Atarti game screen. 
In those problems we often need to use more powerful machine learning techniques, such as neural networks, in order to achieve good results. Those more advanced topics are the subject of more advanced Deep Reinforcement Learning course." + "We have now learnt how to train agents to achieve good results just by providing them a reward function that defines the desired state of the game, and by giving it an opportunity to intelligently explore the search space. We have successfully applied Q-Learning algorithm in the cases of discrete and continuous environments, but with discrete actions. In the are of reinforcement learning, we need to further study situations where action state is also continuous, and when observation space is much more complex, such as the image from Atari game screen. In those problems we often need to use more powerful machine learning techniques, such as neural networks, in order to achieve good results. Those more advanced topics are the subject of more advanced Deep Reinforcement Learning course." ], "cell_type": "markdown", "metadata": {}