Add reinforcement learning, part 1

pull/36/head
Dmitri Soshnikov 3 years ago
parent 7163906895
commit 13d2a03710

@ -1,55 +0,0 @@
# [Lesson Topic]
Add a sketchnote if possible/appropriate
![Embed a video here if available](video-url)
## [Pre-lecture quiz](link-to-quiz-app)
Describe what we will learn
### Introduction
Describe what will be covered
> Notes
### Prerequisite
What steps should have been covered before this lesson?
### Preparation
Preparatory steps to start this lesson
---
[Step through content in blocks]
## [Topic 1]
### Task:
Work together to progressively enhance your codebase to build the project with shared code:
```html
code blocks
```
✅ Knowledge Check - use this moment to stretch students' knowledge with open questions
## [Topic 2]
## [Topic 3]
## 🚀Challenge
Add a challenge for students to work on collaboratively in class to enhance the project
Optional: add a screenshot of the completed lesson's UI if appropriate
## [Post-lecture quiz](link-to-quiz-app)
## Review & Self Study
## Assignment [Assignment Name](assignment.md)

File diff suppressed because one or more lines are too long

@ -0,0 +1,288 @@
# Introduction to Reinforcement Learning and Q-Learning
## [Pre-lecture quiz](link-to-quiz-app)
In this lesson, we will explore the world of **[Peter and the Wolf](https://en.wikipedia.org/wiki/Peter_and_the_Wolf)**, inspired by a musical fairy tale by a Russian composer, [Sergei Prokofiev](https://en.wikipedia.org/wiki/Sergei_Prokofiev). We will use **Reinforcement Learning** to let Peter explore his environment, collect tasty apples and avoid meeting the wolf.
### Prerequisites and Setup
In this lesson, we will be experimenting with some code in Python, so you are expected to be able to run the Jupyter Notebook code from this lesson, either on your computer or somewhere in the cloud.
You can open [the lesson notebook](MazeLearner.ipynb) and continue reading the material there, or continue reading here, and run the code in your favorite Python environment.
> **Note:** If you are opening this code from the cloud, you also need to fetch the [`rlboard.py`](rlboard.py) file, because the notebook code uses it. Put it in the same directory as the notebook.
## Introduction
**Reinforcement Learning** (RL) is a learning technique that allows us to learn an optimal behavior of an **agent** in some **environment** by running many experiments. An agent in this environment should have some **goal**, defined by a **reward function**.
## The Environment
For simplicity, let's consider Peter's world to be a square board of size `width` x `height`, like this:
![Peter's Environment](images/environment.png)
Each cell in this board can either be:
* **ground**, on which Peter and other creatures can walk
* **water**, on which you obviously cannot walk
* **a tree** or **grass** - a place where you can take some rest
* **an apple**, which represents something Peter would be glad to find in order to feed himself
* **a wolf**, which is dangerous and should be avoided
There is a separate Python module, [`rlboard.py`](rlboard.py), which contains the code to work with this environment. Because this code is not important for understanding our concepts, we will just import the module and use it to create the sample board:
```python
from rlboard import *
width, height = 8,8
m = Board(width,height)
m.randomize(seed=13)
m.plot()
```
This code should print a picture of the environment similar to the one above.
## Actions and Policy
In our example, Peter's goal would be to find an apple, while avoiding the wolf and other obstacles. To do this, he can essentially walk around until he finds an apple. Therefore, at any position, he can choose one of the following actions: up, down, left and right. We will define those actions as a dictionary, and map them to pairs of corresponding coordinate changes. For example, moving right (`R`) corresponds to the pair `(1,0)`.
```python
actions = { "U" : (0,-1), "D" : (0,1), "L" : (-1,0), "R" : (1,0) }
action_idx = { a : i for i,a in enumerate(actions.keys()) }
```
The strategy of our agent (Peter) is defined by a so-called **policy**. A policy is a function that returns the action at any given state. In our case, the state of the problem is represented by the board, including the current position of the player.
The goal of reinforcement learning is to eventually learn a good policy that will allow us to solve the problem efficiently. However, as a baseline, let's consider the simplest policy called **random walk**.
## Random walk
Let's first solve our problem by implementing a random walk strategy. With random walk, we will randomly choose the next action from the allowed ones, until we reach the apple.
```python
def random_policy(m):
    return random.choice(list(actions))

def walk(m,policy,start_position=None):
    n = 0 # number of steps
    # set initial position
    if start_position:
        m.human = start_position
    else:
        m.random_start()
    while True:
        if m.at() == Board.Cell.apple:
            return n # success!
        if m.at() in [Board.Cell.wolf, Board.Cell.water]:
            return -1 # eaten by wolf or drowned
        while True:
            a = actions[policy(m)]
            new_pos = m.move_pos(m.human,a)
            if m.is_valid(new_pos) and m.at(new_pos)!=Board.Cell.water:
                m.move(a) # do the actual move
                break
        n+=1

walk(m,random_policy)
```
The call to `walk` should return the length of the corresponding path, which can vary from one run to another. We can run the walk experiment a number of times (say, 100), and print the resulting statistics:
```python
def print_statistics(policy):
    s,w,n = 0,0,0
    for _ in range(100):
        z = walk(m,policy)
        if z<0:
            w+=1
        else:
            s += z
            n += 1
    print(f"Average path length = {s/n}, eaten by wolf: {w} times")

print_statistics(random_policy)
```
Note that the average length of a path is around 30-40 steps, which is quite a lot, given the fact that the average distance to the nearest apple is around 5-6 steps.
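As a rough sanity check of that claim, here is a small sketch (not part of the lesson notebook) that estimates the average Manhattan distance from a random empty cell to the nearest apple, using the `Board` API from `rlboard.py`. Manhattan distance is only an approximation, since it ignores water and other obstacles:

```python
def nearest_apple_distance(m, pos):
    # Manhattan distance from pos to the closest apple on the board
    apples = [(x, y) for x in range(m.width) for y in range(m.height)
              if m.matrix[x, y] == Board.Cell.apple]
    return min(abs(pos[0] - x) + abs(pos[1] - y) for x, y in apples)

# average the distance over a number of random starting positions
dists = []
for _ in range(100):
    m.random_start()                       # place Peter on a random empty cell
    dists.append(nearest_apple_distance(m, m.human))
print(f"Average distance to nearest apple: {sum(dists)/len(dists):.1f}")
```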
You can also see what Peter's movement looks like during the random walk:
![Peter's Random Walk](images/random_walk.gif)
## Reward Function
To make our policy more intelligent, we need to understand which moves are "better" than others. To do this, we need to define our goal. The goal can be defined in terms of a **reward function** that returns some score value for each state. The higher the number, the better the reward.
```python
move_reward = -0.1
goal_reward = 10
end_reward = -10
def reward(m,pos=None):
    pos = pos or m.human
    if not m.is_valid(pos):
        return end_reward
    x = m.at(pos)
    if x==Board.Cell.water or x == Board.Cell.wolf:
        return end_reward
    if x==Board.Cell.apple:
        return goal_reward
    return move_reward
```
An interesting thing about the reward function is that in most cases *we are only given a substantial reward at the end of the game*. This means that our algorithm should somehow remember "good" steps that lead to a positive reward at the end, and increase their importance. Similarly, all moves that lead to bad results should be discouraged.
## Q-Learning
The algorithm that we will discuss here is called **Q-Learning**. In this algorithm, the policy is defined by a function (or a data structure) called a **Q-Table**. It records the "goodness" of each of the actions in a given state.
It is called a Q-Table because it is often convenient to represent it as a table, or multi-dimensional array. Since our board has dimensions `width` x `height`, we can represent the Q-Table as a numpy array with shape `width` x `height` x `len(actions)`:
```python
Q = np.ones((width,height,len(actions)),dtype=float)*1.0/len(actions)
```
Notice that we initialize all values of the Q-Table with an equal value, in our case 0.25. This corresponds to the "random walk" policy, because all moves in each state are equally good. We can pass the Q-Table to the `plot` function in order to visualize the table on the board: `m.plot(Q)`.
![Peter's Environment](images/env_init.png)
In the center of each cell there is an "arrow" that indicates the preferred direction of movement. Since all directions are equal, a dot is displayed.
Now we need to run the simulation, explore our environment, and learn a better distribution of Q-Table values, which will allow us to find the path to the apple much faster.
## Essence of Q-Learning: Bellman Equation
Once we start moving, each action will have a corresponding reward, i.e. we could theoretically select the next action based on the highest immediate reward. However, in most states the move will not achieve our goal of reaching the apple, and thus we cannot immediately decide which direction is better.
> It is not the immediate result that matters, but rather the final result, which we will obtain at the end of the simulation.
In order to account for this delayed reward, we need to use the principles of **[dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming)**, which allow us to think about our problem recursively.
Suppose we are now at the state *s*, and we want to move to the next state *s'*. By doing so, we will receive the immediate reward *r(s,a)*, defined by the reward function, plus some future reward. If we suppose that our Q-Table correctly reflects the "attractiveness" of each action, then at state *s'* we will choose the action *a'* that corresponds to the maximum value of *Q(s',a')*. Thus, the best possible future reward we could get at state *s* is defined as `max`<sub>a'</sub>*Q(s',a')* (the maximum here is computed over all possible actions *a'* at state *s'*).
This gives the **Bellman formula** for calculating the value of Q-Table at state *s*, given action *a*:
<img src="images/bellmaneq.gif"/>
Here γ is the so-called **discount factor** that determines to what extent you should prefer the current reward over future rewards, and vice versa.
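To make the formula concrete, here is a minimal sketch of its right-hand side in Python, using the `Q`, `reward` and `actions` objects defined above. The function name `bellman_value` is just for illustration, and the learning rate is only introduced in the next section:

```python
gamma = 0.5  # discount factor: how much future reward matters compared to the immediate one

def bellman_value(m, Q, s, a):
    """Right-hand side of the Bellman formula for state s=(x,y) and action a (sketch only)."""
    x, y = s
    dx, dy = actions[a]
    nx, ny = x + dx, y + dy              # state s' reached by taking action a
    r = reward(m, (nx, ny))              # immediate reward r(s,a)
    if not m.is_valid((nx, ny)):
        return r                         # stepping off the board: no future reward
    return r + gamma * Q[nx, ny].max()   # r(s,a) + γ · max over a' of Q(s',a')
```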
## Learning Algorithm
Given the equation above, we can now write pseudo-code for our learning algorithm:
* Initialize Q-Table Q with equal numbers for all states and actions
* Set learning rate α ← 1
* Repeat simulation many times
1. Start at random position
1. Repeat
1. Select an action *a* at state *s*
  2. Execute the action by moving to a new state *s'*
  3. If we encounter the end-of-game condition, or the total reward is too small, exit the simulation
4. Compute reward *r* at the new state
5. Update Q-Function according to Bellman equation: *Q(s,a)* ← *(1-α)Q(s,a)+α(r+γ max<sub>a'</sub>Q(s',a'))*
6. *s* ← *s'*
7. Update total reward and decrease α.
## Exploit vs. Explore
In the algorithm above, we did not specify how exactly we should choose an action at step 2.1. If we choose the action randomly, we will randomly **explore** the environment, we are quite likely to die often, and we will also explore areas where we would not normally go. An alternative approach would be to **exploit** the Q-Table values that we already know, and thus choose the best action (the one with the highest Q-Table value) at state *s*. This, however, will prevent us from exploring other states, and quite likely we will not find the optimal solution.
Thus, the best approach is to balance exploration and exploitation. This can be done by choosing the action at state *s* with probabilities proportional to the values in the Q-Table. In the beginning, when Q-Table values are all the same, this corresponds to random selection, but as we learn more about our environment, we become more likely to follow the optimal route, while still choosing an unexplored path once in a while.
## Python Implementation
Now we are ready to implement the learning algorithm. Before that, we also need a function that will convert arbitrary numbers in the Q-Table into a vector of probabilities for the corresponding actions:
```python
def probs(v,eps=1e-4):
    v = v-v.min()+eps
    v = v/v.sum()
    return v
```
We add a small amount `eps` to the original vector in order to avoid division by 0 in the initial case, when all components of the vector are identical.
We will run the actual learning algorithm for 5000 experiments, also called **epochs**:
```python
lpath = []  # keep track of the path length in each epoch

for epoch in range(5000):
    # Pick initial point
    m.random_start()
    # Start travelling
    n=0
    cum_reward = 0
    while True:
        x,y = m.human
        v = probs(Q[x,y])
        a = random.choices(list(actions),weights=v)[0]
        dpos = actions[a]
        m.move(dpos)
        r = reward(m)
        cum_reward += r
        if r==end_reward or cum_reward < -1000:
            lpath.append(n)
            break
        alpha = np.exp(-n / 10e5)
        gamma = 0.5
        ai = action_idx[a]
        Q[x,y,ai] = (1 - alpha) * Q[x,y,ai] + alpha * (r + gamma * Q[x+dpos[0], y+dpos[1]].max())
        n+=1
```
After executing this algorithm, the Q-Table should be updated with values that define the attractiveness of different actions at each step. We can try to visualize the Q-Table by plotting a vector at each cell that points in the desired direction of movement. For simplicity, we draw a small circle instead of an arrow head.
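Assuming the same `plot` helper from `rlboard.py` that we used for the initial visualization above, this picture can be produced by passing the learned Q-Table to it:

```python
m.plot(Q)
```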
<img src="images/learned.png"/>
## Checking the Policy
Since the Q-Table lists the "attractiveness" of each action at each state, it is quite easy to use it to define efficient navigation in our world. In the simplest case, we can just select the action corresponding to the highest Q-Table value:
```python
def qpolicy_strict(m):
    x,y = m.human
    v = probs(Q[x,y])
    a = list(actions)[np.argmax(v)]
    return a

walk(m,qpolicy_strict)
```
If you try the code above several times, you may notice that sometimes it just "hangs", and you need to press the STOP button in the notebook to interrupt it. This happens because there can be situations when two states "point" to each other in terms of optimal Q-Value, in which case the agent ends up moving between those states indefinitely.
## 🚀Challenge
> **Task 1:** Modify the `walk` function to limit the maximum length of path by a certain number of steps (say, 100), and watch the code above return this value from time to time.
> **Task 2:** Modify the `walk` function so that it does not go back to the places where it has already been. This will prevent `walk` from looping; however, the agent can still end up being "trapped" in a location from which it is unable to escape.
## Navigation
A better navigation policy would be the one we used during training, which combines exploitation and exploration. In this policy, we will select each action with a certain probability, proportional to the values in the Q-Table. This strategy may still result in the agent returning to a position it has already explored, but, as you can see from the code below, it results in a very short average path to the desired location (remember that `print_statistics` runs the simulation 100 times):
```python
def qpolicy(m):
    x,y = m.human
    v = probs(Q[x,y])
    a = random.choices(list(actions),weights=v)[0]
    return a

print_statistics(qpolicy)
```
After running this code, you should get a much smaller average path length than before, in the range of 3-6.
## Investigating Learning Process
As mentioned, the learning process is a balance between exploration and exploitation of the gained knowledge about the structure of the problem space. We have seen that the result of learning (the ability to help an agent find a short path to the goal) has improved, but it is also interesting to observe how the average path length behaves during the learning process:
<img src="images/lpathlen1.png"/>
What we see here is that at first the average path length increased. This is probably due to the fact that when we know nothing about the environment, we are likely to get trapped in bad states (water or the wolf). As we learn more and start using this knowledge, we can explore the environment for longer, but we still do not know very well where the apples are.
Once we learn enough, it becomes easier for the agent to achieve the goal, and the path length starts to decrease. However, we are still open to exploration, so we often diverge from the best path and explore new options, making the path longer than optimal.
We also observe on this graph that at some point the length increased abruptly. This indicates the stochastic nature of the process, and that we can at some point "spoil" the Q-Table coefficients by overwriting them with new values. This should ideally be minimized by decreasing the learning rate (i.e. towards the end of training, we only adjust Q-Table values by a small amount).
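For reference, a graph like the one above can be produced from the `lpath` list collected during training. Here is a minimal sketch, assuming a simple moving-average smoothing (the window size of 50 is an arbitrary choice):

```python
import matplotlib.pyplot as plt
import numpy as np

window = 50  # smoothing window, chosen arbitrarily
smoothed = np.convolve(lpath, np.ones(window) / window, mode='valid')  # moving average

plt.plot(smoothed)
plt.xlabel('epoch')
plt.ylabel('average path length')
plt.show()
```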
Overall, it is important to remember that the success and quality of the learning process significantly depends on parameters such as the learning rate, learning rate decay, and discount factor. Those are often called **hyperparameters**, to distinguish them from **parameters**, which we optimize during training (e.g. Q-Table coefficients). The process of finding the best hyperparameter values is called **hyperparameter optimization**, and it deserves a separate topic.
## [Post-lecture quiz](link-to-quiz-app)
## Assignment [More Realistic Peter and the Wolf World](assignment.md)

@ -0,0 +1,25 @@
# More Realistic Peter and the Wolf World
In our situation, Peter was able to move around almost without getting tired or hungry. In a more realistic world, he has to sit down and rest from time to time, and also feed himself. Let's make our world more realistic by implementing the following rules:
1. By moving from one place to another, Peter loses **energy** and gains some **fatigue**.
2. Peter can gain more energy by eating apples.
3. Peter can get rid of fatigue by resting under a tree or on the grass (i.e. walking into a board location with a tree or grass - a green field)
4. Peter needs to find and kill the wolf
5. In order to kill the wolf, Peter needs to have certain levels of energy and fatigue, otherwise he loses the battle.
## Instructions
Use the original [MazeLearner.ipynb](MazeLearner.ipynb) notebook as a starting point for your solution.
Modify the reward function above according to the rules of the game, run the reinforcement learning algorithm to learn the best strategy for winning the game, and compare the results of random walk with your algorithm in terms of the number of games won and lost.
> **Note**: In your new world, the state is more complex, and in addition to the human's position it also includes fatigue and energy levels. You may choose to represent the state as a tuple (Board, energy, fatigue), define a class for the state (you may also want to derive it from `Board`), or even modify the original `Board` class inside [rlboard.py](rlboard.py).
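For illustration only, here is one possible sketch of such a state representation; the class name, the fields `energy` and `fatigue`, and the default values are hypothetical placeholders, not part of the assignment code:

```python
class PeterState:
    """One possible (hypothetical) state: the board plus Peter's energy and fatigue levels."""
    def __init__(self, board, energy=10.0, fatigue=0.0):
        self.board = board
        self.energy = energy
        self.fatigue = fatigue

def reward(state, pos=None):
    # Sketch only: combine the positional reward from the lesson with
    # bonuses/penalties for energy and fatigue according to your game rules.
    ...
```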
In your solution, please keep the code responsible for the random walk strategy, and compare the results of your algorithm with random walk at the end.
> **Note**: You may need to adjust hyperparameters to make it work, especially the number of epochs. Because the success of the game (fighting the wolf) is a rare event, you can expect much longer training time.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A notebook is presented with the definition of the new world rules, the Q-Learning algorithm, and some textual explanations. Q-Learning is able to significantly improve the results compared to random walk. | A notebook is presented, Q-Learning is implemented and improves results compared to random walk, but not significantly; or the notebook is poorly documented and the code is not well-structured | Some attempt to re-define the rules of the world is made, but the Q-Learning algorithm does not work, or the reward function is not fully defined |


@ -0,0 +1,194 @@
# Maze simulation environment for Reinforcement Learning tutorial
# by Dmitry Soshnikov
# http://soshnikov.com
import matplotlib.pyplot as plt
import numpy as np
import cv2
import random
import math
def clip(min,max,x):
    if x<min:
        return min
    if x>max:
        return max
    return x

def imload(fname,size):
    img = cv2.imread(fname)
    img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
    img = cv2.resize(img,(size,size),interpolation=cv2.INTER_LANCZOS4)
    img = img / np.max(img)
    return img

def draw_line(dx,dy,size=50):
    p=np.ones((size-2,size-2,3))
    if dx==0:
        dx=0.001
    m = (size-2)//2
    l = math.sqrt(dx*dx+dy*dy)*(size-4)/2
    a = math.atan(dy/dx)
    cv2.line(p,(int(m-l*math.cos(a)),int(m-l*math.sin(a))),(int(m+l*math.cos(a)),int(m+l*math.sin(a))),(0,0,0),1)
    s = -1 if dx<0 else 1
    cv2.circle(p,(int(m+s*l*math.cos(a)),int(m+s*l*math.sin(a))),3,0)
    return p

def probs(v):
    v = v-v.min()
    if (v.sum()>0):
        v = v/v.sum()
    return v

class Board:
    class Cell:
        empty = 0
        water = 1
        wolf = 2
        tree = 3
        apple = 4
    def __init__(self,width,height,size=50):
        self.width = width
        self.height = height
        self.size = size+2
        self.matrix = np.zeros((width,height))
        self.grid_color = (0.6,0.6,0.6)
        self.background_color = (1.0,1.0,1.0)
        self.grid_thickness = 1
        self.grid_line_type = cv2.LINE_AA
        self.pics = {
            "wolf" : imload('images/wolf.png',size-4),
            "apple" : imload('images/apple.png',size-4),
            "human" : imload('images/human.png',size-4)
        }
        self.human = (0,0)
        self.frame_no = 0
    def randomize(self,water_size=5, num_water=3, num_wolves=1, num_trees=5, num_apples=3,seed=None):
        if seed:
            random.seed(seed)
        for _ in range(num_water):
            x = random.randint(0,self.width-1)
            y = random.randint(0,self.height-1)
            for _ in range(water_size):
                self.matrix[x,y] = Board.Cell.water
                x = clip(0,self.width-1,x+random.randint(-1,1))
                y = clip(0,self.height-1,y+random.randint(-1,1))
        for _ in range(num_trees):
            while True:
                x = random.randint(0,self.width-1)
                y = random.randint(0,self.height-1)
                if self.matrix[x,y]==Board.Cell.empty:
                    self.matrix[x,y] = Board.Cell.tree # tree
                    break
        for _ in range(num_wolves):
            while True:
                x = random.randint(0,self.width-1)
                y = random.randint(0,self.height-1)
                if self.matrix[x,y]==Board.Cell.empty:
                    self.matrix[x,y] = Board.Cell.wolf # wolf
                    break
        for _ in range(num_apples):
            while True:
                x = random.randint(0,self.width-1)
                y = random.randint(0,self.height-1)
                if self.matrix[x,y]==Board.Cell.empty:
                    self.matrix[x,y] = Board.Cell.apple
                    break
    def at(self,pos=None):
        if pos:
            return self.matrix[pos[0],pos[1]]
        else:
            return self.matrix[self.human[0],self.human[1]]
    def is_valid(self,pos):
        return pos[0]>=0 and pos[0]<self.width and pos[1]>=0 and pos[1] < self.height
    def move_pos(self, pos, dpos):
        return (pos[0] + dpos[0], pos[1] + dpos[1])
    def move(self,dpos):
        new_pos = self.move_pos(self.human,dpos)
        self.human = new_pos
    def random_pos(self):
        x = random.randint(0,self.width-1)
        y = random.randint(0,self.height-1)
        return (x,y)
    def random_start(self):
        while True:
            pos = self.random_pos()
            if self.at(pos) == Board.Cell.empty:
                self.human = pos
                break
    def image(self,Q=None):
        img = np.zeros((self.height*self.size+1,self.width*self.size+1,3))
        img[:,:,:] = self.background_color
        # Draw water
        for x in range(self.width):
            for y in range(self.height):
                if (x,y) == self.human:
                    ov = self.pics['human']
                    img[self.size*y+2:self.size*y+ov.shape[0]+2,self.size*x+2:self.size*x+2+ov.shape[1],:] = np.minimum(ov,1.0)
                    continue
                if self.matrix[x,y] == Board.Cell.water:
                    img[self.size*y:self.size*(y+1),self.size*x:self.size*(x+1),:] = (0,0,1.0)
                if self.matrix[x,y] == Board.Cell.wolf:
                    ov = self.pics['wolf']
                    img[self.size*y+2:self.size*y+ov.shape[0]+2,self.size*x+2:self.size*x+2+ov.shape[1],:] = np.minimum(ov,1.0)
                if self.matrix[x,y] == Board.Cell.apple: # apple
                    ov = self.pics['apple']
                    img[self.size*y+2:self.size*y+ov.shape[0]+2,self.size*x+2:self.size*x+2+ov.shape[1],:] = np.minimum(ov,1.0)
                if self.matrix[x,y] == Board.Cell.tree: # tree
                    img[self.size*y:self.size*(y+1),self.size*x:self.size*(x+1),:] = (0,1.0,0)
                if self.matrix[x,y] == Board.Cell.empty and Q is not None:
                    p = probs(Q[x,y])
                    dx,dy = 0,0
                    for i,(ddx,ddy) in enumerate([(-1,0),(1,0),(0,-1),(0,1)]):
                        dx += ddx*p[i]
                        dy += ddy*p[i]
                    l = draw_line(dx,dy,self.size)
                    img[self.size*y+2:self.size*y+l.shape[0]+2,self.size*x+2:self.size*x+2+l.shape[1],:] = l
        # Draw grid
        for i in range(self.height+1):
            img[:,i*self.size] = 0.3
            #cv2.line(img,(0,i*self.size),(self.width*self.size,i*self.size), self.grid_color, self.grid_thickness,lineType=self.grid_line_type)
        for j in range(self.width+1):
            img[j*self.size,:] = 0.3
            #cv2.line(img,(j*self.size,0),(j*self.size,self.height*self.size), self.grid_color, self.grid_thickness,lineType=self.grid_line_type)
        return img
    def plot(self,Q=None):
        plt.figure(figsize=(11,6))
        plt.imshow(self.image(Q),interpolation='hanning')
    def saveimage(self,filename,Q=None):
        cv2.imwrite(filename,255*self.image(Q)[...,::-1])
    def walk(self,policy,save_to=None,start=None):
        n = 0
        if start:
            self.human = start
        else:
            self.random_start()
        while True:
            if save_to:
                self.saveimage(save_to.format(self.frame_no))
                self.frame_no+=1
            if self.at() == Board.Cell.apple:
                return n # success!
            if self.at() in [Board.Cell.wolf, Board.Cell.water]:
                return -1 # eaten by wolf or drowned
            while True:
                a = policy(self)
                new_pos = self.move_pos(self.human,a)
                if self.is_valid(new_pos) and self.at(new_pos)!=Board.Cell.water:
                    self.move(a) # do the actual move
                    break
            n+=1

File diff suppressed because one or more lines are too long

@ -0,0 +1,195 @@
# Maze simulation environment for Reinforcement Learning tutorial
# by Dmitry Soshnikov
# http://soshnikov.com
import matplotlib.pyplot as plt
import numpy as np
import cv2
import random
import math
def clip(min,max,x):
    if x<min:
        return min
    if x>max:
        return max
    return x

def imload(fname,size):
    img = cv2.imread(fname)
    img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
    img = cv2.resize(img,(size,size),interpolation=cv2.INTER_LANCZOS4)
    img = img / np.max(img)
    return img

def draw_line(dx,dy,size=50):
    p=np.ones((size-2,size-2,3))
    if dx==0:
        dx=0.001
    m = (size-2)//2
    l = math.sqrt(dx*dx+dy*dy)*(size-4)/2
    a = math.atan(dy/dx)
    cv2.line(p,(int(m-l*math.cos(a)),int(m-l*math.sin(a))),(int(m+l*math.cos(a)),int(m+l*math.sin(a))),(0,0,0),1)
    s = -1 if dx<0 else 1
    cv2.circle(p,(int(m+s*l*math.cos(a)),int(m+s*l*math.sin(a))),3,0)
    return p

def probs(v):
    v = v-v.min()
    if (v.sum()>0):
        v = v/v.sum()
    return v

class Board:
    class Cell:
        empty = 0
        water = 1
        wolf = 2
        tree = 3
        apple = 4
    def __init__(self,width,height,size=50):
        self.width = width
        self.height = height
        self.size = size+2
        self.matrix = np.zeros((width,height))
        self.grid_color = (0.6,0.6,0.6)
        self.background_color = (1.0,1.0,1.0)
        self.grid_thickness = 1
        self.grid_line_type = cv2.LINE_AA
        self.pics = {
            "wolf" : imload('../images/wolf.png',size-4),
            "apple" : imload('../images/apple.png',size-4),
            "human" : imload('../images/human.png',size-4)
        }
        self.human = (0,0)
        self.frame_no = 0
    def randomize(self,water_size=5, num_water=3, num_wolves=1, num_trees=5, num_apples=3,seed=None):
        if seed:
            random.seed(seed)
        for _ in range(num_water):
            x = random.randint(0,self.width-1)
            y = random.randint(0,self.height-1)
            for _ in range(water_size):
                self.matrix[x,y] = Board.Cell.water
                x = clip(0,self.width-1,x+random.randint(-1,1))
                y = clip(0,self.height-1,y+random.randint(-1,1))
        for _ in range(num_trees):
            while True:
                x = random.randint(0,self.width-1)
                y = random.randint(0,self.height-1)
                if self.matrix[x,y]==Board.Cell.empty:
                    self.matrix[x,y] = Board.Cell.tree # tree
                    break
        for _ in range(num_wolves):
            while True:
                x = random.randint(0,self.width-1)
                y = random.randint(0,self.height-1)
                if self.matrix[x,y]==Board.Cell.empty:
                    self.matrix[x,y] = Board.Cell.wolf # wolf
                    break
        for _ in range(num_apples):
            while True:
                x = random.randint(0,self.width-1)
                y = random.randint(0,self.height-1)
                if self.matrix[x,y]==Board.Cell.empty:
                    self.matrix[x,y] = Board.Cell.apple
                    break
    def at(self,pos=None):
        if pos:
            return self.matrix[pos[0],pos[1]]
        else:
            return self.matrix[self.human[0],self.human[1]]
    def is_valid(self,pos):
        return pos[0]>=0 and pos[0]<self.width and pos[1]>=0 and pos[1] < self.height
    def move_pos(self, pos, dpos):
        return (pos[0] + dpos[0], pos[1] + dpos[1])
    def move(self,dpos):
        new_pos = self.move_pos(self.human,dpos)
        if self.is_valid(new_pos):
            self.human = new_pos
    def random_pos(self):
        x = random.randint(0,self.width-1)
        y = random.randint(0,self.height-1)
        return (x,y)
    def random_start(self):
        while True:
            pos = self.random_pos()
            if self.at(pos) == Board.Cell.empty:
                self.human = pos
                break
    def image(self,Q=None):
        img = np.zeros((self.height*self.size+1,self.width*self.size+1,3))
        img[:,:,:] = self.background_color
        # Draw water
        for x in range(self.width):
            for y in range(self.height):
                if (x,y) == self.human:
                    ov = self.pics['human']
                    img[self.size*y+2:self.size*y+ov.shape[0]+2,self.size*x+2:self.size*x+2+ov.shape[1],:] = np.minimum(ov,1.0)
                    continue
                if self.matrix[x,y] == Board.Cell.water:
                    img[self.size*y:self.size*(y+1),self.size*x:self.size*(x+1),:] = (0,0,1.0)
                if self.matrix[x,y] == Board.Cell.wolf:
                    ov = self.pics['wolf']
                    img[self.size*y+2:self.size*y+ov.shape[0]+2,self.size*x+2:self.size*x+2+ov.shape[1],:] = np.minimum(ov,1.0)
                if self.matrix[x,y] == Board.Cell.apple: # apple
                    ov = self.pics['apple']
                    img[self.size*y+2:self.size*y+ov.shape[0]+2,self.size*x+2:self.size*x+2+ov.shape[1],:] = np.minimum(ov,1.0)
                if self.matrix[x,y] == Board.Cell.tree: # tree
                    img[self.size*y:self.size*(y+1),self.size*x:self.size*(x+1),:] = (0,1.0,0)
                if self.matrix[x,y] == Board.Cell.empty and Q is not None:
                    p = probs(Q[x,y])
                    dx,dy = 0,0
                    for i,(ddx,ddy) in enumerate([(-1,0),(1,0),(0,-1),(0,1)]):
                        dx += ddx*p[i]
                        dy += ddy*p[i]
                    l = draw_line(dx,dy,self.size)
                    img[self.size*y+2:self.size*y+l.shape[0]+2,self.size*x+2:self.size*x+2+l.shape[1],:] = l
        # Draw grid
        for i in range(self.height+1):
            img[:,i*self.size] = 0.3
            #cv2.line(img,(0,i*self.size),(self.width*self.size,i*self.size), self.grid_color, self.grid_thickness,lineType=self.grid_line_type)
        for j in range(self.width+1):
            img[j*self.size,:] = 0.3
            #cv2.line(img,(j*self.size,0),(j*self.size,self.height*self.size), self.grid_color, self.grid_thickness,lineType=self.grid_line_type)
        return img
    def plot(self,Q=None):
        plt.figure(figsize=(11,6))
        plt.imshow(self.image(Q),interpolation='hanning')
    def saveimage(self,filename,Q=None):
        cv2.imwrite(filename,255*self.image(Q)[...,::-1])
    def walk(self,policy,save_to=None,start=None):
        n = 0
        if start:
            self.human = start
        else:
            self.random_start()
        while True:
            if save_to:
                self.saveimage(save_to.format(self.frame_no))
                self.frame_no+=1
            if self.at() == Board.Cell.apple:
                return n # success!
            if self.at() in [Board.Cell.wolf, Board.Cell.water]:
                return -1 # eaten by wolf or drowned
            while True:
                a = policy(self)
                new_pos = self.move_pos(self.human,a)
                if self.is_valid(new_pos) and self.at(new_pos)!=Board.Cell.water:
                    self.move(a) # do the actual move
                    break
            n+=1

@ -1,9 +0,0 @@
# [Assignment Name]
## Instructions
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | | | |

@ -1,12 +1,41 @@
# Getting Started with
# Getting Started with Reinforcement Learning
In this section of the curriculum, you will be introduced to ...
[![Intro to Reinforcement Learning](https://img.youtube.com/vi/lDq_en8RNOo/0.jpg)](https://www.youtube.com/watch?v=lDq_en8RNOo)
## Lessons
## Regional Topic: Peter and the Wolf (Russia)
[Peter and the Wolf](https://en.wikipedia.org/wiki/Peter_and_the_Wolf) is a musical fairy tale written by a Russian composer, [Sergei Prokofiev](https://en.wikipedia.org/wiki/Sergei_Prokofiev). It is a story about a young pioneer, Peter, who bravely goes out of his house to the forest clearing to chase the wolf. In this section, we will train machine learning algorithms that will help Peter:
* to explore the surrounding area and build an optimal navigation map
* to learn how to use a skateboard and balance on it, in order to move around faster.
## Introduction to Reinforcement Learning
In previous sections, you have seen two examples of machine learning problems:
* **Supervised**, where we have datasets that show sample solutions to the problem we want to solve. [Classification][Classification] and [regression][Regression] are supervised learning tasks.
* **Unsupervised**, in which we do not have training data. The main example of unsupervised learning is [clustering][Clustering].
In this section, we will introduce you to a new type of learning problem, which does not require labeled training data. There are several types of such problems:
* **[Semi-supervised learning](https://en.wikipedia.org/wiki/Semi-supervised_learning)**, where we have a lot of unlabeled data that can be used to pre-train the model.
* **[Reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning)**, in which the agent learns how to behave by performing a lot of experiments in some simulated environment.
1. [Introduction to](1-intro-to/README.md)
Suppose you want to teach a computer to play a game, such as chess, or [Super Mario](https://en.wikipedia.org/wiki/Super_Mario). For the computer to play a game, we need it to predict which move to make in each of the game states. While this may seem like a classification problem, it is not, because we do not have a dataset with states and corresponding actions. While we may have some data like that (existing chess matches, or recordings of players playing Super Mario), it is likely not to cover a sufficiently large number of possible states.
Instead of looking for existing game data, **reinforcement learning** (RL) is based on the idea of *making the computer play* many times, observing the result. Thus, to apply reinforcement learning, we need two things:
1. **An environment** and **a simulator**, which would allow us to play a game many times. This simulator would define all game rules, possible states and actions.
2. **A reward function**, which would tell us how good we did during each move or game.
The main difference from supervised learning is that in RL we typically do not know whether we win or lose until we finish the game. Thus, we cannot say whether a certain move alone is good or not - we only receive a reward at the end of the game. Our goal is to design algorithms that will allow us to train a model under such uncertain conditions. We will learn about one RL algorithm called **Q-learning**.
## Lessons
1. [Introduction to Reinforcement Learning and Q-Learning](1-qlearning/README.md)
2. [Using gym simulation environment](2-gym/README.md)
## Credits
"Introduction to" was written with ♥️ by [Name](Twitter)
"Introduction to" was written with ♥️ by [Dmitry Soshnikov](http://soshnikov.com)
[Classification]: ../4-Classification/README.md
[Regression]: ../2-Regression/README.md
[Clustering]: ../5-Clustering/README.md
