editing solution and notebook for readability and to not dupe lesson

pull/46/head
Jen Looper 3 years ago
parent 85689001e3
commit a602a750f5

@ -118,7 +118,7 @@
],
"cell_type": "code",
"metadata": {},
"execution_count": 6,
"execution_count": null,
"outputs": []
},
{
@ -237,7 +237,7 @@
},
{
"source": [
"We add small amount `eps` to the original vector in order to avoid division by 0 in the initial case, when all components of the vector are identical.\n",
"We add a small amount of `eps` to the original vector in order to avoid division by 0 in the initial case, when all components of the vector are identical.\n",
"\n",
"We will run the actual learning algorithm for 5000 experiments, also called **epochs**: "
],
@ -354,7 +354,7 @@
},
{
"source": [
"## Investigating Learning Process"
"## Investigating the Learning Process"
],
"cell_type": "markdown",
"metadata": {}

@ -28,7 +28,7 @@
"source": [
"# Peter and the Wolf: Reinforcement Learning Primer\n",
"\n",
"In this tutorial, we will learn how to apply Reinforcement learning to a problem of path finding. The setting is inspired by [Peter and the Wolf](https://en.wikipedia.org/wiki/Peter_and_the_Wolf) musical fairy tale by Russian composer [Segei Prokofiev](https://en.wikipedia.org/wiki/Sergei_Prokofiev). It is a story about young pioneer Peter, who bravely goes out of his house to the forest clearing to chase the wolf. We will train machine learning algorithms that will help Peter to explore the surroinding area and build an optimal navigation map.\n",
"In this tutorial, we will learn how to apply reinforcement learning to a path-finding problem. The setting is inspired by [Peter and the Wolf](https://en.wikipedia.org/wiki/Peter_and_the_Wolf), a musical fairy tale by the Russian composer [Sergei Prokofiev](https://en.wikipedia.org/wiki/Sergei_Prokofiev). It is a story about the young pioneer Peter, who bravely goes out of his house to the forest clearing to chase the wolf. We will train machine learning algorithms that will help Peter explore the surrounding area and build an optimal navigation map.\n",
"\n",
"First, let's import some useful libraries:"
],
@ -128,13 +128,11 @@
},
{
"source": [
"The strategy of our agent (Peter) is defined by so-called **policy**. A policy is a function that returns the action at any given state. In our case, the state of the problem is represented by the board, including the current position of the player. \n",
"\n",
"The goal of reinforcement learning is to eventually learn a good policy that will allow us to solve the problem efficiently. However, as a baseline, let's consider the simplest policy called **random walk**.\n",
"The strategy of our agent (Peter) is defined by a so-called **policy**. Let's consider the simplest policy called **random walk**.\n",
"\n",
"## Random walk\n",
"\n",
"Let's first solve our problem by implementing a random walk strategy. With random walk, we will randomly chose the next action from allowed ones, until we reach the apple. "
"Let's first solve our problem by implementing a random walk strategy."
],
"cell_type": "markdown",
"metadata": {}
@ -223,7 +221,7 @@
"source": [
"## Reward Function\n",
"\n",
"To make out policy more intelligent, we need to understand which moves are \"better\" than others. To do this, we need to define our goal. The goal can be defined in terms of **reward function**, that will return some score value for each state. The higher the number - the better is the reward function\n",
"To make our policy more intelligent, we need to understand which moves are \"better\" than others.\n",
"\n"
],
"cell_type": "markdown",
@ -253,13 +251,9 @@
},
{
"source": [
"Interesting thing about reward function is that in most of the cases *we are only given substantial reward at the end of the game*. It means that out algorithm should somehow remember \"good\" steps that lead to positive reward at the end, and increase their importance. Similarly, all moves that lead to bad results should be discouraged.\n",
"\n",
"## Q-Learning\n",
"\n",
"An algorithm that we will discuss here is called **Q-Learning**. In this algorithm, the policy is defined by a function (or a data structure) called **Q-Table**. It records the \"goodness\" of each of the actions in a given state, i.e. $Q : {S\\times A}\\to\\mathbb{R}$, where $S$ is a set of states, $A$ is the set of actions.\n",
"\n",
"It is called Q-Table because it is often convenient to represent it as a table, or multi-dimensional array. Since our board has dimentions `width` x `height`, we can represent Q-Table by a numpy array with shape `width` x `height` x `len(actions)`:"
"Build a Q-Table, a multi-dimensional array. Since our board has dimensions `width` x `height`, we can represent the Q-Table by a numpy array with shape `width` x `height` x `len(actions)`:"
],
"cell_type": "markdown",
"metadata": {}
@ -275,7 +269,7 @@
},
{
"source": [
"Notice that we initially initialize all values of Q-Table with equal value, in our case - 0.25. That corresponds to the \"random walk\" policy, because all moves in each state are equally good. We can pass the Q-Table to the `plot` function in order to visualize the table on the board:"
"Pass the Q-Table to the `plot` function in order to visualize the table on the board:"
],
"cell_type": "markdown",
"metadata": {}
@ -301,38 +295,11 @@
"m.plot(Q)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"source": [
"In the center of each cell there is an \"arrow\" that indicates the preferred direction of movement. Since all directions are equal, a dot is displayed.\n",
"\n",
"Now we need to run the simulation, explore our environment, and learn better distribution of Q-Table values, which will allow us to find the path to the apple much faster.\n",
"\n",
"## Essence of Q-Learning: Bellman Equation\n",
"\n",
"Once we start moving, each action will have a corresponding reward, i.e. we can theoretically select the next action based on the highest immediate reward. However, in most of the states the move will not achieve our goal or reaching the apple, and thus we cannot immediately decide which direction is better.\n",
"\n",
"> It is not the immediate result that matters, but rather the final result, which we will obtain at the end of the simulation.\n",
"\n",
"In order to account for this delayed reward, we need to use the principles of **[dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming)**, which allows us to think about out problem recursively.\n",
"## Essence of Q-Learning: Bellman Equation and Learning Algorithm\n",
"\n",
"Suppose we are now at the state $s$, and we want to move to the next state $s'$. By doing so, we will receive the immediate reward $r(s,a)$, defined by reward function, plus some future reward. If we suppose that our Q-Table correctly reflects the \"attractiveness\" of each action, then at state $s'$ we will chose an action $a'$ that corresponds to maximum value of $Q(s',a')$. Thus, the best possible future reward we could get at state $s'$ will be defined as $\\max_{a'}Q(s',a')$ (maximum here is computed over all possible actions $a'$ at state $s'$). \n",
"\n",
"This gives the **Bellman formula** for calculating the value of Q-Table at state $s$, given action $a$:\n",
"\n",
"$$Q(s,a) = r(s,a) + \\gamma \\max_{a'} Q(s',a')$$\n",
"\n",
"Here $\\gamma$ is so-called **discount factor** that determines to which extent you should prefer current reward over the future reward and vice versa.\n",
"\n",
"## Learning Algorithm\n",
"\n",
"Given the equation above, we can now write a pseudo-code for our leaning algorithm:\n",
"Write the pseudo-code for our learning algorithm:\n",
"\n",
"* Initialize Q-Table Q with equal numbers for all states and actions\n",
"* Set learning rate $\\alpha\\leftarrow 1$\n",
@ -349,9 +316,7 @@
"\n",
"## Exploit vs. Explore\n",
"\n",
"In the algorithm above, we did not specify how exactly we should chose an action at step 2.1. If we are choosing the action randomly, we will randomly **explore** the environment, and we are quite likely to die often, and also explore such areas where we would not normally go. An alternative approach would be to **exploit** the Q-Table values that we already know, and thus to chose the best action (with highers Q-Table value) at state $s$. This, however, will prevent us from exploring other states, and quite likely we might not find the optimal solution.\n",
"\n",
"Thus, the best approach is to balance between exploration and exploitation. This can be easily done by choosing the action at state $s$ with probabilities proportional to values in Q-Table. In the beginning, when Q-Table values are all the same, it would correspond to random selection, but as we learn more about our environment, we would be more likely to follow the optimal route, however, choosing the unexplored path once in a while.\n",
"The best approach is to balance exploration and exploitation. As we learn more about our environment, we become more likely to follow the optimal route, while still choosing an unexplored path once in a while.\n",
"\n",
"## Python Implementation\n",
"\n",
@ -374,7 +339,7 @@
},
{
"source": [
"We add small amount `eps` to the original vector in order to avoid division by 0 in the initial case, when all components of the vector are identical.\n",
"We add a small amount of `eps` to the original vector in order to avoid division by 0 in the initial case, when all components of the vector are identical.\n",
"\n",
"We will run the actual learning algorithm for 5000 experiments, also called **epochs**: "
],
@ -431,7 +396,7 @@
},
{
"source": [
"After executing this algorithm, Q-Table should be updated with values that define the attractiveness of different actions at each step. We can try to visualize Q-Table by plotting a vector at each cell that will point in the desired direction of movement. For simplicity, we draw small circle instead of arrow head."
"After executing this algorithm, the Q-Table should be updated with values that define the attractiveness of different actions at each step. Visualize the table here:"
],
"cell_type": "markdown",
"metadata": {}
@ -494,13 +459,11 @@
},
{
"source": [
"If you try the code above several times, you may notice that sometimes it just \"hangs\", and you need to press STOP button in the notebook to interrupt it. This happens because there could be situations when two states \"point\" to each other in terms of optimal Q-Value, in which case the agents ends up moving between those states indefinitely.\n",
"If you try the code above several times, you may notice that sometimes it just \"hangs\", and you need to press the STOP button in the notebook to interrupt it. \n",
"\n",
"> **Task 1:** Modify the `walk` function to limit the maximum path length to a certain number of steps (say, 100), and watch the code above return this value from time to time.\n",
"\n",
"> **Task 2:** Modify the `walk` function so that it does not go back to the places where is has already been previously. This will prevent `walk` from looping, however, the agent can still end up being \"trapped\" in a location from which it is unable to escape.\n",
"\n",
"Better navigation policy would be the one that we have used during training, which combines exploitation and exploration. In this policy, we will select each action with a certain probability, proportional to the values in Q-Table. This strategy may still result in the agent returning back to the position it has already explored, but, as you can see from the code below, it results in very short average path to the desired location (remember that `print_statistics` runs the simulation 100 times): "
"> **Task 2:** Modify the `walk` function so that it does not go back to places where it has already been. This will prevent `walk` from looping; however, the agent can still end up \"trapped\" in a location from which it is unable to escape. "
],
"cell_type": "markdown",
"metadata": {}
@ -531,9 +494,7 @@
},
{
"source": [
"## Investigating Learning Process\n",
"\n",
"As we have mentioned, the learning process is a balance between exploration and exploration of gained knowledge about the structure of problem space. We have seen that the result of learning (the ability to help an agent to find short path to the goal) has improved, but it is also interesting to observe how the average path length behaves during the learning process: "
"## Investigating the Learning Process"
],
"cell_type": "markdown",
"metadata": {}
@ -585,9 +546,9 @@
{
"source": [
"## Exercise\n",
"#### More Realistic Peter and the Wolf World\n",
"#### A More Realistic Peter and the Wolf World\n",
"\n",
"In our situation, Peter was able to move around almost without getting tired or hungry. In more realistic world, we has to sit down and rest from time to time, and also to feed himself. Let's make our world more realistic, by implementing the following rules:\n",
"In our situation, Peter was able to move around almost without getting tired or hungry. In a more realistic world, he has to sit down and rest from time to time, and also to feed himself. Let's make our world more realistic by implementing the following rules:\n",
"\n",
"1. By moving from one place to another, Peter loses **energy** and gains some **fatigue**.\n",
"2. Peter can gain more energy by eating apples.\n",
@ -603,13 +564,6 @@
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
]
}
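The Q-Learning loop that this commit streamlines can be sketched end-to-end as follows. This is a minimal illustration, not the notebook's actual code: the 8x8 board, the reward values, and the hyperparameters `alpha`, `gamma`, and `eps` are all assumptions made for the sketch.

```python
import numpy as np

# Illustrative board: Peter starts at (0, 0), the "apple" is the goal cell.
width, height = 8, 8
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # left, right, up, down
goal = (width - 1, height - 1)

# Q-Table: one value per (x, y, action); uniform start = random-walk policy
Q = np.ones((width, height, len(actions))) * 0.25

alpha, gamma, eps = 0.3, 0.9, 1e-4  # assumed hyperparameters
rng = np.random.default_rng(13)

def probs(v):
    # Turn Q-values into selection probabilities (explore/exploit balance);
    # eps avoids division by 0 when all components are identical.
    v = v - v.min() + eps
    return v / v.sum()

def step(pos, a):
    # Apply an action, clipping to the board; assumed rewards:
    # +10 for reaching the goal, -0.1 per ordinary move.
    x = min(max(pos[0] + actions[a][0], 0), width - 1)
    y = min(max(pos[1] + actions[a][1], 0), height - 1)
    new = (x, y)
    return new, (10.0 if new == goal else -0.1), new == goal

for epoch in range(5000):
    pos = (0, 0)
    for _ in range(100):  # cap episode length, per Task 1
        a = rng.choice(len(actions), p=probs(Q[pos[0], pos[1]]))
        new, r, done = step(pos, a)
        # Bellman update: Q(s,a) <- (1-a)Q(s,a) + a(r + gamma max_a' Q(s',a'))
        Q[pos[0], pos[1], a] = (1 - alpha) * Q[pos[0], pos[1], a] + \
            alpha * (r + gamma * Q[new[0], new[1]].max())
        pos = new
        if done:
            break
```

A greedy walk over the trained table (always taking `np.argmax(Q[x, y])`) should then find a short path from start to goal, modulo the looping issue that Task 1 and Task 2 address.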