A Deep Hierarchical Approach To Lifelong Learning In Minecraft


The ability to reuse or transfer knowledge from one task to another in lifelong learning problems, such as Minecraft, is one of the major challenges faced in AI. Reusing knowledge across tasks is crucial to solving tasks efficiently with lower sample complexity. We provide a Reinforcement Learning agent with the ability to transfer knowledge by learning reusable skills, a type of temporally extended action (also known as Options (Sutton et al. 1999)). The agent learns reusable skills using Deep Q Networks (Mnih et al. 2015) to solve tasks in Minecraft, a popular video game which is an unsolved and high-dimensional lifelong learning problem. These reusable skills, which we refer to as Deep Skill Networks (DSNs), are then incorporated into our novel Hierarchical Deep Reinforcement Learning Network (H-DRLN) architecture. The H-DRLN is a hierarchical version of Deep Q-Networks and learns to efficiently solve tasks by reusing knowledge from previously learned DSNs. The H-DRLN exhibits superior performance and lower learning sample complexity (by taking advantage of temporal extension) compared to the regular Deep Q Network (Mnih et al. 2015) in sub-domains of Minecraft. We also show the potential to transfer knowledge between related Minecraft tasks without any additional learning.



1 Introduction



Lifelong learning is defined as the ability to accumulate knowledge across multiple tasks and then reuse or transfer this knowledge in order to solve subsequent tasks Eaton and Ruvolo (2013). This is one of the fundamental learning problems in AI Thrun and Mitchell (1995); Eaton and Ruvolo (2013). There are different ways of performing knowledge transfer across tasks. For example, Ammar et al. (2014) and Ammar et al. (2015) transfer knowledge via a latent basis whereas Brunskill and Li (2014) perform batch optimization across all encountered tasks.



Lifelong learning in real-world domains suffers from the curse of dimensionality. That is, as the state and action spaces increase, it becomes more and more difficult to model and solve new tasks as they are encountered. A challenging, high-dimensional domain that incorporates many of the elements found in lifelong learning is Minecraft (https://minecraft.net/). Minecraft is a popular video game whose goal is to build structures, travel on adventures, hunt for food and avoid zombies. An example screenshot from the game is seen in Figure 1. Minecraft is an open research problem as it is impossible to solve the entire game using a single AI technique. Instead, the solution to Minecraft may lie in solving sub-problems, using a divide-and-conquer approach, and then providing a synergy between the various solutions. Once an agent learns to solve a sub-problem, it has acquired a skill that can then be reused when a similar sub-problem is subsequently encountered.



Many of the tasks that are encountered by an agent in a lifelong learning setting can be naturally decomposed into skill hierarchies Stone et al. (2000, 2005); Bai et al. (2015). In Minecraft for example, consider building a wooden house as seen in Figure 1. This task can be decomposed into sub-tasks (a.k.a. skills) such as chopping trees, sanding the wood, cutting the wood into boards and finally nailing the boards together. Here, the knowledge gained from chopping trees can also be partially reused when cutting the wood into boards. In addition, if the agent receives a new task to build a small city, then the agent can reuse the skills it acquired during the ‘building a house’ task.



In a lifelong learning setting such as Minecraft, learning skills, and learning when to reuse them, is key to increasing exploration, efficiently solving tasks and advancing the capabilities of the Minecraft agent. As mentioned previously, Minecraft and other lifelong learning problems suffer from the curse of dimensionality. Therefore, as the dimensionality of the problem increases, it becomes increasingly non-trivial to learn reasonable skills as well as when to reuse these skills.



Reinforcement Learning (RL) provides a generalized approach to skill learning through the options framework Sutton et al. (1999). Options are Temporally Extended Actions (TEAs) and are also referred to as skills da Silva et al. (2012) and macro-actions Hauskrecht et al. (1998). Options have been shown both theoretically Precup and Sutton (1997); Sutton et al. (1999) and experimentally Mann and Mannor (2013) to speed up the convergence rate of RL algorithms. From here on in, we will refer to options as skills.



Recent work in RL has provided insights into learning reusable skills Mankowitz et al. (2016a, b), but this has been limited to low-dimensional problems. In high-dimensional lifelong learning settings (e.g., Minecraft), learning from visual experiences provides a potential solution to learning reusable skills. With the emergence of Deep Reinforcement Learning, specifically Deep Q-Networks (DQNs) Mnih et al. (2015), RL agents are now equipped with a powerful non-linear function approximator that can learn rich and complex policies. Using these networks, the agent learns policies (or skills) from raw image pixels, requiring less domain-specific knowledge to solve complicated tasks (e.g., Atari video games Mnih et al. (2015)). While different variations of the DQN algorithm exist Van Hasselt et al. (2015); Schaul et al. (2015); Wang et al. (2015); Bellemare et al. (2015), we will refer to the Vanilla DQN Mnih et al. (2015) unless otherwise stated.



In our paper, we focus on learning reusable RL skills using Deep Q Networks Mnih et al. (2015), by solving sub-problems in Minecraft. These reusable skills, which we refer to as Deep Skill Networks (DSNs) are then incorporated into our novel Hierarchical Deep Reinforcement Learning (RL) Network (H-DRLN) architecture. The H-DRLN, which is a hierarchical version of the DQN, learns to solve more complicated tasks by reusing knowledge from the pre-learned DSNs. By taking advantage of temporal extension, the H-DRLN learns to solve tasks with lower sample complexity and superior performance compared to vanilla DQNs.



Contributions: (1) A novel Hierarchical Deep Reinforcement Learning Network (H-DRLN) architecture. (2) We show the potential to learn reusable Deep Skill Networks (DSNs) and perform knowledge transfer of the learned DSNs to a new task to obtain an optimal solution. (3) Empirical results for learning an H-DRLN in a sub-domain of Minecraft which outperforms the vanilla DQN. (4) We empirically verify the improved convergence guarantees for utilizing reusable DSNs (a.k.a. options) within the H-DRLN, compared to the vanilla DQN. (5) The potential to transfer knowledge between related tasks without any additional learning.



2 Background



Reinforcement Learning: The goal of an RL agent is to maximize its expected return by learning a policy $\pi: S \rightarrow \Delta_A$, a mapping from states $s \in S$ to a probability distribution over the action space $A$. At time $t$ the agent observes a state $s_t \in S$, selects an action $a_t \in A$, and receives a bounded reward $r_t \in [0, R_{\max}]$, where $R_{\max}$ is the maximum attainable reward and $\gamma \in [0,1]$ is the discount factor. Following the agent's action choice, it transitions to the next state $s_{t+1} \in S$. We consider infinite horizon problems where the cumulative return at time $t$ is given by $R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$. The action-value function $Q^{\pi}(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$ represents the expected return after observing state $s$ and taking an action $a$ under a policy $\pi$. The optimal action-value function obeys a fundamental recursion known as the Bellman equation,



$$Q^{*}(s_t, a_t) = \mathbb{E}\left[r_t + \gamma \max_{a'} Q^{*}(s_{t+1}, a')\right].$$
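To make the sample-based form of this recursion concrete, here is a minimal tabular Q-learning sketch; the environment interface, step size and discount factor below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, terminal, gamma=0.99, alpha=0.1):
    """One sample-based Bellman backup on a tabular Q-function.

    Q is a (num_states, num_actions) array; gamma and alpha are
    illustrative hyperparameters rather than values from the paper.
    """
    bootstrap = 0.0 if terminal else np.max(Q[s_next])  # max_a' Q(s_{t+1}, a')
    target = r + gamma * bootstrap                      # Bellman target for (s, a)
    Q[s, a] += alpha * (target - Q[s, a])               # move Q(s, a) toward the target
    return Q
```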



Deep Q Networks: The DQN algorithm Mnih et al. (2015) approximates the optimal Q function with a Convolutional Neural Network (CNN) Krizhevsky et al. (2012), by optimizing the network weights such that the expected Temporal Difference (TD) error of the optimal Bellman equation (Equation 1) is minimized:



$$\mathbb{E}_{s_t, a_t, r_t, s_{t+1}} \left\| Q_{\theta}(s_t, a_t) - y_t \right\|_2^2\,, \qquad (1)$$



where



$$y_t = \begin{cases} r_t & \text{if } s_{t+1} \text{ is terminal} \\ r_t + \gamma \max_{a'} Q_{\theta_{target}}(s_{t+1}, a') & \text{otherwise.} \end{cases}$$

Skills, Options, Macro-actions Sutton et al. (1999): A skill $\sigma \in \Sigma$ is a temporally extended control structure defined by a triple $\sigma = \langle I, \pi, \beta \rangle$, where $I$ is the set of states in which the skill can be initiated, $\pi$ is the intra-skill policy and $\beta$ is the probability of terminating the skill in each state. The skill value function $Q_{\Sigma}^{\pi}(s, \sigma) = \mathbb{E}[R_t \mid s_t = s, \sigma]$ represents the expected return from initiating skill $\sigma$ in state $s$ and selecting skills according to the skill policy thereafter. Under these definitions, the optimal skill value function is given by the following equation Stolle and Precup (2002), where $k$ denotes the skill duration and $R_s^{\sigma}$ the discounted cumulative reward accumulated while executing $\sigma$ from $s$:



$$Q_{\Sigma}^{*}(s, \sigma) = \mathbb{E}\left[R_s^{\sigma} + \gamma^{k} \max_{\sigma' \in \Sigma} Q_{\Sigma}^{*}(s', \sigma')\right]. \qquad (2)$$
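For reference, the DQN objective in Equation 1 with the bootstrap target $y_t$ defined above can be sketched as follows; the deep learning framework, tensor shapes and variable names are our assumptions rather than details taken from the paper:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, s, a, r, s_next, terminal, gamma=0.99):
    """TD loss of Equation 1: ||Q_theta(s_t, a_t) - y_t||^2 with a frozen target network.

    s, s_next: float tensors of stacked frames, shape (batch, 4, 84, 84);
    a: long tensor of action indices; r: float rewards; terminal: bool mask.
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q_theta(s_t, a_t)
    with torch.no_grad():
        bootstrap = target_net(s_next).max(dim=1).values      # max_a' Q_target(s_{t+1}, a')
        y = r + gamma * bootstrap * (1.0 - terminal.float())  # y_t = r_t if terminal, else bootstrapped
    return F.mse_loss(q_sa, y)
```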



3 Hierarchical Deep RL Network



The Hierarchical Deep RL Network (H-DRLN) is a new architecture, based on the DQN, that facilitates skill reuse in lifelong learning. In this section, we provide an in-depth description of this network architecture as well as necessary modifications that we implemented in order to convert a vanilla DQN into its hierarchical counterpart.



H-DRLN architecture: A diagram of this architecture is presented in Figure 2. Here, the outputs of the H-DRLN consist of primitive actions (e.g., Left (L), Right (R) and Forward (F)) as well as skills. The H-DRLN learns a policy that determines when to execute primitive actions and when to reuse pre-learned skills. The pre-learned skills are represented with deep networks and are referred to as Deep Skill Networks (DSNs). They are trained a priori on various sub-tasks using the vanilla DQN algorithm and the regular Experience Replay (ER) detailed in Section 2.
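To make the output structure concrete, a minimal sketch of such a network is given below; the convolutional body follows the DQN layout of Mnih et al. (2015), but the framework choice, exact layer sizes and names are our assumptions:

```python
import torch.nn as nn

class HDRLN(nn.Module):
    """Q-network whose outputs rank primitive actions and pre-trained skills (DSNs) side by side."""

    def __init__(self, num_primitive_actions=3, num_skills=1):
        super().__init__()
        # DQN-style convolutional body over four stacked 84x84 frames (Mnih et al. 2015).
        self.body = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        # One Q-value per primitive action plus one per Deep Skill Network.
        self.q_head = nn.Linear(512, num_primitive_actions + num_skills)
        self.num_primitive_actions = num_primitive_actions

    def forward(self, stacked_frames):
        return self.q_head(self.body(stacked_frames))
```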



If the H-DRLN chooses to execute a primitive action $a_t$ at time $t$, then the action is executed for a single timestep. However, if the H-DRLN chooses to execute a skill $\sigma_i$ (and therefore a DSN as shown in Figure 2), then the DSN executes its policy $\pi_{DSN}(s)$ for a duration of $k_i$ timesteps and then gives control back to the H-DRLN.
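A sketch of this control flow, reusing the HDRLN module sketched above, is shown below; the environment interface, the DSN policy callables and the fixed skill durations are our assumptions:

```python
def hdrln_step(env, state, hdrln, dsn_policies, skill_durations, gamma=0.99):
    """Execute one H-DRLN decision: a primitive action for a single timestep, or a
    DSN policy for up to k_i timesteps, returning its discounted in-skill reward."""
    q_values = hdrln(state.unsqueeze(0)).squeeze(0)
    choice = int(q_values.argmax())
    if choice < hdrln.num_primitive_actions:         # primitive action: single timestep
        return env.step(choice)                      # assumed to return (next_state, reward, done)
    skill = choice - hdrln.num_primitive_actions     # skill: hand control to the chosen DSN
    r_tilde, discount, done = 0.0, 1.0, False
    for _ in range(skill_durations[skill]):          # run for at most k_i timesteps
        action = dsn_policies[skill](state)          # intra-skill policy pi_DSN(s)
        state, reward, done = env.step(action)
        r_tilde += discount * reward                 # accumulate the discounted in-skill reward
        discount *= gamma
        if done:                                     # terminate early if the episode ends
            break
    return state, r_tilde, done                      # control returns to the H-DRLN
```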



This gives rise to two modifications that we needed to make in order to incorporate skills into the learning procedure and generate a truly hierarchical deep network: (1) Optimize an objective function that incorporates skills; (2) Construct an ER that stores skill experiences.



Skill Objective Function: As mentioned previously, an H-DRLN extends the vanilla DQN architecture to learn control between primitive actions and skills. The H-DRLN loss function has the same structure as Equation 1; however, instead of minimizing the standard Bellman equation, we minimize the Skill Bellman equation (Equation 2). More specifically, for a skill $\sigma_t$ initiated in state $s_t$ that has a time duration $k$, the H-DRLN target function is given by:



$$y_t = \begin{cases} \sum_{j=0}^{k-1} \gamma^{j} r_{j+t} & \text{if } s_{t+k} \text{ is terminal} \\ \sum_{j=0}^{k-1} \gamma^{j} r_{j+t} + \gamma^{k} \max_{\sigma'} Q_{\theta_{target}}(s_{t+k}, \sigma') & \text{otherwise,} \end{cases}$$



ensuring that temporal extension is preserved.
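A sketch of this target computation, assuming the discounted in-skill reward and the skill duration were recorded when the transition was collected (the variable names are ours):

```python
import torch

def hdrln_skill_target(target_net, r_tilde, s_after_k, k, terminal, gamma=0.99):
    """Bootstrap target for a skill transition: the discounted in-skill return plus,
    if s_{t+k} is not terminal, gamma^k times the best target-network Q-value
    over all outputs (primitive actions and skills).

    r_tilde, k, terminal: per-sample tensors; s_after_k: stacked frames at time t+k.
    """
    with torch.no_grad():
        bootstrap = target_net(s_after_k).max(dim=1).values
    return r_tilde + torch.pow(gamma, k.float()) * bootstrap * (1.0 - terminal.float())
```

Note that a primitive action is recovered as the special case $k = 1$ with $\tilde{r}_t = r_t$, so a single target expression covers both output types.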



Skill Experience Replay: We extend the regular ER Mnih et al. (2015) to incorporate skills and have termed this Skill Experience Replay (S-ER). There are two differences between the standard ER and our S-ER. First, for each sampled skill tuple, we calculate the sum of discounted cumulative rewards generated whilst executing the skill and store this sum in the variable $\tilde{r}$. Second, since the skill is executed for $k$ timesteps, we store the transition to state $s_{t+k}$ rather than $s_{t+1}$. This yields the skill tuple $(s_t, \sigma_t, \tilde{r}_t, s_{t+k})$, where $\sigma_t$ is the skill executed at time $t$.
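A minimal sketch of storing such a skill tuple in the S-ER (the buffer layout, capacity and names below are our assumptions):

```python
import random
from collections import deque

class SkillExperienceReplay:
    """Replay buffer whose entries may span k timesteps:
    (s_t, sigma_t, r_tilde_t, s_{t+k}, k, done)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store_skill(self, s_t, sigma_t, rewards, s_t_plus_k, done, gamma=0.99):
        # r_tilde is the discounted sum of the rewards collected while the skill ran.
        r_tilde = sum((gamma ** j) * r for j, r in enumerate(rewards))
        self.buffer.append((s_t, sigma_t, r_tilde, s_t_plus_k, len(rewards), done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)
```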



4 Experiments



In order to solve new tasks as they are encountered in a lifelong learning scenario, the agent needs to be able to adapt to new game dynamics as well as reuse skills that it has learned from solving previous tasks. In our experiments, we show (1) the ability of the Minecraft agent to learn a DSN on a Minecraft task termed the one-room domain, shown in Figure 3. We then show (2) the ability of the agent to reuse this DSN to solve a new task, termed the two-room domain shown in Figure 5, by learning a Hierarchical Deep RL Network (H-DRLN) which incorporates this DSN as an action output. Finally, we show (3) the potential to transfer knowledge between related tasks without any additional learning.



Deep Network Architecture - The deep network architecture used to represent the DSN and H-DRLN is the same as that of the vanilla DQN architecture Mnih et al. (2015). The H-DRLN however has a different Q-layer with a DSN as an output. State space - As in Mnih et al. (2015), the state space is represented as raw image pixels from the last four image frames, which are combined and down-sampled into an $84 \times 84$ image which is then vectorized. Actions - The action space for the DSN consists of three actions: (1) Move forward (F), (2) Rotate left by $30^{\circ}$ (L) and (3) Rotate right by $30^{\circ}$ (R). The H-DRLN contains the same set of actions as well as the DSN as a fourth action output. Rewards - The agent gets a negative reward that is proportional to its distance from the goal, where the goal is to exit the room. In addition, upon reaching the goal the agent receives a large positive reward. Training - The agent learns in epochs. Each epoch starts from a random location in the domain and terminates after the agent makes 30 (60) steps in the one (two)-room domain. Evaluation - In all of the simulations, we evaluated the agent during training using the current learned architecture every 20k (5k) optimization steps. During evaluation, we averaged the agent's performance over 500 (1k) steps.
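For illustration, a sketch of the state preprocessing described above; the grayscale conversion and the use of OpenCV for resizing are our assumptions (the text only specifies that the last four frames are combined and down-sampled to 84x84):

```python
import numpy as np
import cv2  # assumption: OpenCV is used for resizing; the paper does not name a library

def preprocess(last_four_frames):
    """Turn the last four RGB game frames into a (4, 84, 84) state array."""
    processed = []
    for frame in last_four_frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                    # drop colour channels (assumption)
        small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)  # down-sample to 84x84
        processed.append(small.astype(np.float32) / 255.0)                # scale pixels to [0, 1]
    return np.stack(processed, axis=0)                                    # stacked frames, shape (4, 84, 84)
```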



4.1 Training a DSN



Our first experiment involved training a DSN in the one-room domain (Figure 3). To do so, we used the Vanilla DQN parameters that worked on the Atari domain Mnih et al. (2015) as a starting point and then performed a grid search to find the optimal parameters for learning a DSN for the Minecraft one-room domain. The best parameter settings that we found include: (1) a higher learning ratio (iterations between emulator states, n-replay = 16), (2) a higher learning rate (learning rate = 0.0025) and (3) less exploration (eps_endt = 400K). We implemented these modifications since the standard Minecraft emulator (https://github.com/h2r/burlapcraft) has a slow frame rate (approximately 400 ms per emulator timestep), and these modifications enabled the agent to increase its learning between game states. We also found that a smaller experience replay (replay_memory = 100K) provided improved performance, probably due to our task having a relatively short time horizon (approximately 60 timesteps). The rest of the parameters from the Vanilla DQN remained unchanged, since Minecraft and Atari Mnih et al. (2015) share relatively similar in-game screen resolutions.
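Collected as a configuration sketch (only the four listed values come from the text above; every other setting keeps its Vanilla DQN default, and the key names simply mirror the parameter names mentioned in the paragraph):

```python
# Parameter settings reported above for training a DSN on the one-room domain.
# All unlisted hyperparameters keep the Vanilla DQN (Mnih et al. 2015) defaults.
DSN_TRAINING_CONFIG = {
    "n_replay": 16,            # learning iterations between emulator states (higher learning ratio)
    "learning_rate": 0.0025,   # higher learning rate than the Atari setting
    "eps_endt": 400_000,       # fewer exploration steps before epsilon reaches its final value
    "replay_memory": 100_000,  # smaller experience replay for the short-horizon task
}
```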



Figure 4 presents the results of the learned DSN. As seen in the figure, the DSN learned to effectively solve the task: its average reward is increasing (orange) and the agent is able to solve the task with 100% success (blue) after approximately 31 epochs of training.



4.2 Training a H-DRLN with a DSN



Using the DSN trained on the one-room domain, we incorporated it into the H-DRLN network architecture as an action output and learned to solve a new Minecraft sub-domain, the two-room domain presented in Figure 5. This domain consists of two rooms, where the first room is shown in Figure 5a with its corresponding exit (Figure 5b). Note that the exit of the first room is not identical to the exit of the one-room domain. The second room contains a goal (Figure 5c) that is the same as the goal of the one-room domain (Figure 3b).



Skill Reusability/Knowledge Transfer: We trained the H-DRLN architecture as well as the vanilla DQN on the two-room domain. The average reward per epoch is shown in Figure 6. We make two important observations. (1) The H-DRLN architecture solves the task after a single epoch and generates significantly higher reward compared to the vanilla DQN, as seen in the figure. This is because the H-DRLN makes use of knowledge transfer by reusing the DSN trained on the one-room domain to solve the two-room domain. This DSN is able to identify the exit of the first room (which is different from the exit on which the DSN was trained) and navigates the agent to this exit. The DSN is also able to navigate the agent to the exit of the second room, completing the task. The DSN is a temporally extended action: it lasts for multiple timesteps and therefore increases the exploration of the RL agent, enabling it to learn to solve the task faster than the vanilla DQN. A video showing the performance of the Minecraft agent using the learned H-DRLN in the two-room domain is available online (https://www.youtube.com/watch?v=RwjfE4kc6j8). (2) The vanilla DQN fails to solve the task after 39 epochs. Since the domain contains ambiguous-looking walls, the agent tends to get stuck in sub-optimal local minima. After 39 epochs, the agent completes the task approximately 80% of the time using the H-DRLN, whereas it completes the task approximately 5% of the time using the vanilla DQN.



Temporal Extension combined with Primitive Actions: Figure 7 shows the action distribution of the agent's policy during training. We can see that the H-DRLN learns when to make use of temporal extension by reusing the DSN and when to use primitive actions. The DSN runs for 10 timesteps or terminates early if the agent reaches the goal. Ten timesteps are not sufficient to reach the exit of either room, and therefore the agent needs to rely on primitive actions, as well as the DSN, in order to solve the given task.



Knowledge Transfer without Learning: We then decided to evaluate the DSN (which we trained on the one-room domain) in the two-room domain without performing any additional learning on this network. We found it surprising that the DSN, without any training on the two-room domain, generated a higher reward compared to the vanilla DQN which was specifically trained on the two-room domain for 39 epochs, as shown in Figure 8. The DSN performance is not optimal compared to the H-DRLN architecture as seen in the figure, but it still manages to solve the two-room domain. This is an exciting result as it shows the potential for DSNs to identify and solve related tasks without performing any additional learning.



5 Discussion



We have provided the first results for learning Deep Skill Networks (DSNs) in Minecraft, a lifelong learning domain. The DSNs are learned using a Minecraft-specific variation of the DQN Mnih et al. (2015) algorithm. Our Minecraft agent also learns how to reuse these DSNs on new tasks by utilizing our novel Hierarchical Deep RL Network (H-DRLN) architecture. In addition, we show that the H-DRLN provides superior learning performance and faster convergence compared to the vanilla DQN, by making use of temporal extension Sutton et al. (1999). Our work can also be interpreted as a form of curriculum learning Bengio et al. (2009) for RL. Here, we first train the network to solve relatively simple sub-tasks and then use the knowledge it obtained to solve the composite overall task. We also show the potential to perform knowledge transfer between related tasks without any additional learning. We see this work as a building block towards truly general lifelong learning using hierarchical RL and Deep Networks.



Recently, it has been shown that Deep Networks tend to implicitly capture the hierarchical composition of a given task Zahavy et al. (2016). In future work we plan to utilize this implicit hierarchical composition to learn DSNs. In addition, it is possible to distill much of the knowledge from multiple teacher networks into a single student network Parisotto et al. (2015); Rusu et al. (2015). We wish to perform a similar technique as well as add auxiliary tasks to train the teacher networks (DSNs) Suddarth and Kergosien (1990), ultimately guiding learning in the student network (our H-DRLN).