@ -31,7 +31,7 @@ Each cell in this board can either be:
* **an apple**, which represents something Peter would be glad to find in order to feed himself
* **a wolf**, which is dangerous and should be avoided
There is a separate Python module, [`rlboard.py`](rlboard.py), which contains the code to work with this environment. Because this code is not important for understanding our concepts, we will just import the module and use it to create the sample board:
There is a separate Python module, [`rlboard.py`](rlboard.py), which contains the code to work with this environment. Because this code is not important for understanding our concepts, we will just import the module and use it to create the sample board (code block 1):
```python
from rlboard import *
@ -45,7 +45,8 @@ This code should print the picture of the environment similar to the one above.
## Actions and Policy
In our example, Peter's goal would be to find an apple, while avoiding the wolf and other obstacles. To do this, he can essentially walk around until he finds and apple. Therefore, at any position he can chose between one of the following actions: up, down, left and right. We will define those actions as a dictionary, and map them to pairs of corresponding coordinate changes. For example, moving right (`R`) would correspond to a pair `(1,0)`.
In our example, Peter's goal would be to find an apple, while avoiding the wolf and other obstacles. To do this, he can essentially walk around until he finds an apple. Therefore, at any position he can choose between one of the following actions: up, down, left and right. We will define those actions as a dictionary, and map them to pairs of corresponding coordinate changes. For example, moving right (`R`) would correspond to a pair `(1,0)`. (code block 2)
action_idx = { a : i for i,a in enumerate(actions.keys()) }
@ -57,7 +58,7 @@ The goal of reinforcement learning is to eventually learn a good policy that wil
## Random walk
Let's first solve our problem by implementing a random walk strategy. With random walk, we will randomly chose the next action from allowed ones, until we reach the apple.
Let's first solve our problem by implementing a random walk strategy. With random walk, we will randomly choose the next action from the allowed ones, until we reach the apple (code block 3).
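As a rough illustration of the idea, here is a self-contained sketch of a random walk on a plain grid. It does not use `rlboard`: the board size, the apple position, the step cap, and the `random_walk_length` helper are all assumptions made up for this example.

```python
import random

# Moves as coordinate changes, following the convention that "R" is (1, 0).
actions = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}

def random_walk_length(width, height, start, apple, rng, max_steps=100_000):
    """Walk randomly from `start` until `apple` is reached; return the step count.

    Returns -1 if the apple is not reached within `max_steps` (hypothetical cap).
    """
    x, y = start
    for n in range(max_steps):
        if (x, y) == apple:
            return n
        # Keep only the moves that stay on the board, then pick one at random.
        allowed = [(dx, dy) for dx, dy in actions.values()
                   if 0 <= x + dx < width and 0 <= y + dy < height]
        dx, dy = rng.choice(allowed)
        x, y = x + dx, y + dy
    return -1

rng = random.Random(0)  # fixed seed, only to make the demo repeatable
lengths = [random_walk_length(8, 8, (0, 0), (5, 5), rng) for _ in range(100)]
print(f"average path length: {sum(lengths) / len(lengths):.1f}")
```

Every run needs at least 10 steps here, because that is the shortest route from `(0, 0)` to `(5, 5)`, but a random walk typically takes far longer.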
The call to `walk` should return us the length of corresponding path, which can vary from one run to another. We can run the walk experiment a number of times (say, 100), and print the resulting statistics:
The call to `walk` should return the length of the corresponding path, which can vary from one run to another. We can run the walk experiment a number of times (say, 100), and print the resulting statistics (code block 4):
```python
def print_statistics(policy):
@ -111,7 +112,7 @@ You can also see how Peter's movement looks like during random walk:
## Reward Function
To make out policy more intelligent, we need to understand which moves are "better" than others. To do this, we need to define our goal. The goal can be defined in terms of **reward function**, that will return some score value for each state. The higher the number - the better is the reward function
To make our policy more intelligent, we need to understand which moves are "better" than others. To do this, we need to define our goal. The goal can be defined in terms of a **reward function** that will return some score value for each state. The higher the number, the better the reward. (code block 5)
```python
move_reward = -0.1
@ -130,13 +131,13 @@ def reward(m,pos=None):
return move_reward
```
Interesting thing about reward function is that in most of the cases *we are only given substantial reward at the end of the game*. It means that out algorithm should somehow remember "good" steps that lead to positive reward at the end, and increase their importance. Similarly, all moves that lead to bad results should be discouraged.
An interesting thing about the reward function is that in most cases *we are only given substantial reward at the end of the game*. It means that our algorithm should somehow remember "good" steps that lead to a positive reward at the end, and increase their importance. Similarly, all moves that lead to bad results should be discouraged.
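To see what a delayed reward means in numbers, the sketch below adds up the rewards along a whole path, using the `move_reward`, `goal_reward`, and `end_reward` values from this lesson's reward function. The `total_reward` helper and the `"apple"` / `"wolf"` outcome labels are assumptions introduced only for illustration.

```python
move_reward = -0.1
goal_reward = 10
end_reward = -10

def total_reward(n_moves, outcome):
    """Sum of rewards for a path of `n_moves` steps whose last step ends in `outcome`.

    `outcome` is "apple" or "wolf"; every earlier step earns only `move_reward`.
    """
    final = goal_reward if outcome == "apple" else end_reward
    return (n_moves - 1) * move_reward + final

print(total_reward(5, "apple"))   # short path to the apple
print(total_reward(50, "apple"))  # long path to the apple
print(total_reward(5, "wolf"))    # short path that ends at the wolf
```

Almost all of the difference between the three totals comes from the last step, which is why the final reward has to be propagated back to the earlier moves.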
## Q-Learning
An algorithm that we will discuss here is called **Q-Learning**. In this algorithm, the policy is defined by a function (or a data structure) called **Q-Table**. It records the "goodness" of each of the actions in a given state.
It is called Q-Table because it is often convenient to represent it as a table, or multi-dimensional array. Since our board has dimensions `width` x `height`, we can represent Q-Table by a numpy array with shape `width` x `height` x `len(actions)`:
It is called Q-Table because it is often convenient to represent it as a table, or multi-dimensional array. Since our board has dimensions `width` x `height`, we can represent Q-Table by a numpy array with shape `width` x `height` x `len(actions)`: (code block 6)
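A minimal sketch of such an array is shown here. The board size of 8 x 8, the order of the action keys, and the equal starting value of `1/len(actions)` are assumptions chosen for the example, not values taken from `rlboard`.

```python
import numpy as np

width, height = 8, 8  # assumed board size
actions = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}
action_idx = {a: i for i, a in enumerate(actions.keys())}

# One value per (x, y, action); start with equal values for every action.
Q = np.ones((width, height, len(actions)), dtype=float) / len(actions)

print(Q.shape)
print(Q[2, 3])                    # values of all four actions at cell (2, 3)
print(Q[2, 3, action_idx["R"]])   # value of moving right from cell (2, 3)
```

Indexing with `Q[x, y]` returns the vector of action values for one cell, which is the form the rest of the lesson works with.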
@ -190,7 +191,7 @@ In the algorithm above, we did not specify how exactly we should choose an actio
Thus, the best approach is to balance between exploration and exploitation. This can be done by choosing the action at state *s* with probabilities proportional to values in Q-Table. In the beginning, when Q-Table values are all the same, it would correspond to a random selection, but as we learn more about our environment, we would be more likely to follow the optimal route while allowing the agent to choose the unexplored path once in a while.
## Python Implementation
Now we are ready to implement the learning algorithm. Before that, we also need some function that will convert arbitrary numbers in the Q-Table into a vector of probabilities for corresponding actions:
Now we are ready to implement the learning algorithm. Before that, we also need some function that will convert arbitrary numbers in the Q-Table into a vector of probabilities for corresponding actions: (code block 7)
```python
def probs(v,eps=1e-4):
@ -201,7 +202,7 @@ def probs(v,eps=1e-4):
We add a small `eps` to the original vector in order to avoid division by 0 in the initial case, when all components of the vector are identical.
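To see how `probs` balances exploration and exploitation, the following sketch applies it to two hand-written vectors of Q-values and then samples actions from the result. The `choose_action` helper, the example vectors, and the order of the action names are assumptions made for this illustration.

```python
import random
import numpy as np

actions = ["U", "D", "L", "R"]  # assumed order of the action keys

def probs(v, eps=1e-4):
    v = v - v.min() + eps   # shift so the smallest value becomes eps
    v = v / v.sum()         # normalize so the values sum to 1
    return v

def choose_action(q_values, rng):
    """Hypothetical helper: sample an action with the weights given by probs."""
    return rng.choices(actions, weights=probs(q_values))[0]

rng = random.Random(0)  # fixed seed, only to make the demo repeatable
untrained = np.array([0.25, 0.25, 0.25, 0.25])
trained = np.array([0.1, 0.2, 0.1, 2.0])

print(probs(untrained))  # identical inputs give a uniform distribution
print(probs(trained))    # the largest Q-value gets most of the probability

picks = [choose_action(trained, rng) for _ in range(1000)]
print({a: picks.count(a) for a in actions})
```

With the trained vector, `R` is chosen most of the time, yet the other actions still appear occasionally, which is the exploration part of the trade-off.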
The actual learning algorithm we will run for 5000 experiments, also called **epochs**:
We will run the actual learning algorithm for 5000 experiments, also called **epochs**: (code block 8)
```python
for epoch in range(5000):
@ -236,7 +237,7 @@ After executing this algorithm, Q-Table should be updated with values that defin
## Checking the Policy
Since Q-Table lists the "attractiveness" of each action at each state, it is quite easy to use it to define the efficient navigation in our world. In the simplest case, we can select the action corresponding to the highest Q-Table value:
Since Q-Table lists the "attractiveness" of each action at each state, it is quite easy to use it to define the efficient navigation in our world. In the simplest case, we can select the action corresponding to the highest Q-Table value: (code block 9)
```python
def qpolicy_strict(m):
@ -258,7 +259,7 @@ walk(m,qpolicy_strict)
## Navigation
Better navigation policy would be the one that we have used during training, which combines exploitation and exploration. In this policy, we will select each action with a certain probability, proportional to the values in Q-Table. This strategy may still result in the agent returning back to the position it has already explored, but, as you can see from the code below, it results in very short average path to the desired location (remember that `print_statistics` runs the simulation 100 times):
A better navigation policy would be the one that we used during training, which combines exploitation and exploration. In this policy, we will select each action with a certain probability, proportional to the values in the Q-Table. This strategy may still result in the agent returning to a position it has already explored, but, as you can see from the code below, it results in a very short average path to the desired location (remember that `print_statistics` runs the simulation 100 times): (code block 10)
"# Peter and the Wolf: Reinforcement Learning Primer\n",
"\n",
"In this tutorial, we will learn how to apply Reinforcement learning to a problem of path finding. The setting is inspired by [Peter and the Wolf](https://en.wikipedia.org/wiki/Peter_and_the_Wolf) musical fairy tale by Russian composer [Segei Prokofiev](https://en.wikipedia.org/wiki/Sergei_Prokofiev). It is a story about young pioneer Peter, who bravely goes out of his house to the forest clearing to chase the wolf. We will train machine learning algorithms that will help Peter to explore the surroinding area and build an optimal navigation map.\n",
"In this tutorial, we will learn how to apply Reinforcement learning to a problem of path finding. The setting is inspired by [Peter and the Wolf](https://en.wikipedia.org/wiki/Peter_and_the_Wolf) musical fairy tale by Russian composer [Sergei Prokofiev](https://en.wikipedia.org/wiki/Sergei_Prokofiev). It is a story about young pioneer Peter, who bravely goes out of his house to the forest clearing to chase a wolf. We will train machine learning algorithms that will help Peter to explore the surrounding area and build an optimal navigation map.\n",
"\n",
"First, let's import a bunch of userful libraries:"
"First, let's import a bunch of useful libraries:"
],
"cell_type": "markdown",
"metadata": {}
@ -58,23 +58,16 @@
"For simplicity, let's consider Peter's world to be a square board of size `width` x `height`. Each cell in this board can either be:\n",
"* **ground**, on which Peter and other creatures can walk\n",
"* **water**, on which you obviously cannot walk\n",
"* **a tree** or **grass** - a place where you cat take some rest\n",
"* **a tree** or **grass** - a place where you can rest\n",
"* **an apple**, which represents something Peter would be glad to find in order to feed himself\n",
"* **a wolf**, which is dangerous and should be avoided\n",
"\n",
"To work with the environment, we will define a class called `Board`. In order not to clutter this notebook too much, we have moved all code to work with the board into separate `rlboard` module, which we will now import. You may look inside this module to get more details about the internals of the implementation."
"action_idx = { a : i for i,a in enumerate(actions.keys()) }"
"# code block 2"
]
},
{
"source": [
"The strategy of our agent (Peter) is defined by so-called **policy**. A policy is a function that returns the action at any given state. In our case, the state of the problem is represented by the board, including the current position of the player. \n",
"The strategy of our agent (Peter) is defined by a so-called **policy**. A policy is a function that returns the action at any given state. In our case, the state of the problem is represented by the board, including the current position of the player. \n",
"\n",
"The goal of reinforcement learning is to eventually learn a good policy that will allow us to solve the problem efficiently. However, as a baseline, let's consider the simplest policy called **random walk**.\n",
"\n",
@ -139,57 +128,14 @@
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": []
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"18"
]
},
"metadata": {},
"execution_count": 5
}
],
"source": [
"def random_policy(m):\n",
" return random.choice(list(actions))\n",
"\n",
"def walk(m,policy,start_position=None):\n",
" n = 0 # number of steps\n",
" # set initial position\n",
" if start_position:\n",
" m.human = start_position \n",
" else:\n",
" m.random_start()\n",
" while True:\n",
" if m.at() == Board.Cell.apple:\n",
" return n # success!\n",
" if m.at() in [Board.Cell.wolf, Board.Cell.water]:\n",
" return -1 # eaten by wolf or drowned\n",
" while True:\n",
" a = actions[policy(m)]\n",
" new_pos = m.move_pos(m.human,a)\n",
" if m.is_valid(new_pos) and m.at(new_pos)!=Board.Cell.water:\n",
" m.move(a) # do the actual move\n",
" break\n",
" n+=1\n",
"\n",
"walk(m,random_policy)"
]
},
{
"source": [
"Let's run random walk experiment several times and see the average number of steps taken:"
"# Let's run a random walk experiment several times and see the average number of steps taken: code block 3"
"To make out policy more intelligent, we need to understand which moves are \"better\" than others. To do this, we need to define our goal. The goal can be defined in terms of **reward function**, that will return some score value for each state. The higher the number - the better is the reward function\n",
"To make our policy more intelligent, we need to understand which moves are \"better\" than others. To do this, we need to define our goal. The goal can be defined in terms of **reward function**, that will return some score value for each state. The higher the number - the better the reward function is.\n",
"\n"
],
"cell_type": "markdown",
@ -235,25 +170,12 @@
"metadata": {},
"outputs": [],
"source": [
"move_reward = -0.1\n",
"goal_reward = 10\n",
"end_reward = -10\n",
"\n",
"def reward(m,pos=None):\n",
" pos = pos or m.human\n",
" if not m.is_valid(pos):\n",
" return end_reward\n",
" x = m.at(pos)\n",
" if x==Board.Cell.water or x == Board.Cell.wolf:\n",
" return end_reward\n",
" if x==Board.Cell.apple:\n",
" return goal_reward\n",
" return move_reward"
"#code block 5"
]
},
{
"source": [
"Interesting thing about reward function is that in most of the cases *we are only given substantial reward at the end of the game*. It means that out algorithm should somehow remember \"good\" steps that lead to positive reward at the end, and increase their importance. Similarly, all moves that lead to bad results should be discouraged.\n",
"An interesting thing about the reward function is that in most of the cases *we are only given substantial reward at the end of the game*. It means that our algorithm should somehow remember \"good\" steps that lead to a positive reward at the end and increase their importance. Similarly, all moves that lead to bad results should be discouraged.\n",
"In the center of each cell there is an \"arrow\" that indicates the preferred direction of movement. Since all directions are equal, a dot is displayed.\n",
@ -320,7 +235,7 @@
"\n",
"> It is not the immediate result that matters, but rather the final result, which we will obtain at the end of the simulation.\n",
"\n",
"In order to account for this delayed reward, we need to use the principles of **[dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming)**, which allows us to think about out problem recursively.\n",
"In order to account for this delayed reward, we need to use the principles of **[dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming)**, which allows us to think about our problem recursively.\n",
"\n",
"Suppose we are now at the state $s$, and we want to move to the next state $s'$. By doing so, we will receive the immediate reward $r(s,a)$, defined by reward function, plus some future reward. If we suppose that our Q-Table correctly reflects the \"attractiveness\" of each action, then at state $s'$ we will chose an action $a'$ that corresponds to maximum value of $Q(s',a')$. Thus, the best possible future reward we could get at state $s'$ will be defined as $\\max_{a'}Q(s',a')$ (maximum here is computed over all possible actions $a'$ at state $s'$). \n",
"\n",
@ -366,10 +281,7 @@
"metadata": {},
"outputs": [],
"source": [
"def probs(v,eps=1e-4):\n",
" v = v-v.min()+eps\n",
" v = v/v.sum()\n",
" return v"
"# code block 7"
]
},
{
@ -400,41 +312,17 @@
"\n",
"lpath = []\n",
"\n",
"for epoch in range(10000):\n",
" clear_output(wait=True)\n",
" print(f\"Epoch = {epoch}\",end='')\n",
"\n",
" # Pick initial point\n",
" m.random_start()\n",
" \n",
" # Start travelling\n",
" n=0\n",
" cum_reward = 0\n",
" while True:\n",
" x,y = m.human\n",
" v = probs(Q[x,y])\n",
" a = random.choices(list(actions),weights=v)[0]\n",
"After executing this algorithm, Q-Table should be updated with values that define the attractiveness of different actions at each step. We can try to visualize Q-Table by plotting a vector at each cell that will point in the desired direction of movement. For simplicity, we draw small circle instead of arrow head."
"After executing this algorithm, the Q-Table should be updated with values that define the attractiveness of different actions at each step. We can try to visualize the Q-Table by plotting a vector at each cell that will point in the desired direction of movement. For simplicity, we draw a small circle instead of arrow head."
],
"cell_type": "markdown",
"metadata": {}
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
@ -483,18 +371,12 @@
}
],
"source": [
"def qpolicy_strict(m):\n",
" x,y = m.human\n",
" v = probs(Q[x,y])\n",
" a = list(actions)[np.argmax(v)]\n",
" return a\n",
"\n",
"walk(m,qpolicy_strict)"
"# code block 9"
]
},
{
"source": [
"If you try the code above several times, you may notice that sometimes it just \"hangs\", and you need to press STOP button in the notebook to interrupt it. This happens because there could be situations when two states \"point\" to each other in terms of optimal Q-Value, in which case the agents ends up moving between those states indefinitely.\n",
"If you try the code above several times, you may notice that sometimes it just \"hangs\", and you need to press the STOP button in the notebook to interrupt it. This happens because there could be situations when two states \"point\" to each other in terms of optimal Q-Value, in which case the agents ends up moving between those states indefinitely.\n",
"\n",
"> **Task 1:** Modify the `walk` function to limit the maximum length of path by a certain number of steps (say, 100), and watch the code above return this value from time to time.\n",
"\n",
@ -502,8 +384,10 @@
"\n",
"Better navigation policy would be the one that we have used during training, which combines exploitation and exploration. In this policy, we will select each action with a certain probability, proportional to the values in Q-Table. This strategy may still result in the agent returning back to the position it has already explored, but, as you can see from the code below, it results in very short average path to the desired location (remember that `print_statistics` runs the simulation 100 times): "
],
"cell_type": "markdown",
"metadata": {}
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
@ -520,13 +404,7 @@
],
"source": [
"\n",
"def qpolicy(m):\n",
" x,y = m.human\n",
" v = probs(Q[x,y])\n",
" a = random.choices(list(actions),weights=v)[0]\n",