Throughout our lives, we learn how to behave through trial and error. What if we could teach machines using the same principle? From football-playing robots to chemical reaction simulations, deep reinforcement learning is unlocking new levels of intelligence in machines. By Julian Tan
Imagine training a dog to sit by offering treats every time it follows your command. Over time, the dog learns to associate sitting with rewards, repeating the behaviour whenever you say “sit”. This simple yet powerful method of positive reinforcement is at the heart of dog training—and it works wonders. What if we could use the same approach to train robots? (Ziv, 2017)
This is the essence of Reinforcement Learning (RL). Just like how dogs learn to sit and roll over, robots can be trained to fetch critical data and sniff out new opportunities. But how does this process mirror the training of our canine companions, and how far can we take it?
Reinforcement Learning
In RL, an agent (e.g., a dog) interacts with its environment (e.g., the dog trainer) to learn how to achieve specific goals. Each time the agent takes an action, the environment responds with feedback, either a reward or a penalty, guiding the agent towards better decisions over time (Kaelbling, Littman and Moore, 1996). Just like in dog training, the goal is to maximise rewards by performing the right actions in the right situations.
This interaction is formalised as a Markov Decision Process, built on four main ideas: states, actions, rewards, and policies. Based on its current situation, or state, such as a dog hearing a command, the agent chooses from a set of possible actions. Each action then yields a reward that depends on the state it was taken in: if the dog spins around when commanded to lie down, it receives no reward.
Figure 1: RL feedback loop, showing the agent’s interaction with its environment.
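To make this concrete, here is a minimal Python sketch of a single step of the feedback loop in Figure 1, using the dog-training analogy. The states, actions, and reward values are invented purely for illustration.

```python
# One step of the RL feedback loop: the environment presents a state,
# the agent acts, and the environment responds with a reward.
import random

# Reward table: (state, action) -> reward handed out by the environment (the trainer).
rewards = {
    ("hears 'sit'", "sit"): 1.0,            # correct response earns a treat
    ("hears 'sit'", "spin"): 0.0,           # wrong trick, no treat
    ("hears 'lie down'", "lie down"): 1.0,
    ("hears 'lie down'", "spin"): 0.0,
}
actions = ["sit", "lie down", "spin"]

state = random.choice(["hears 'sit'", "hears 'lie down'"])  # environment presents a state
action = random.choice(actions)                             # the agent picks an action
reward = rewards.get((state, action), 0.0)                  # the environment gives feedback
print(f"state={state!r}, action={action!r}, reward={reward}")
```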
But how does the agent, whether it is a dog or a robot, know which action brings the best rewards? That's where the policy comes in. The policy is the agent's strategy: it maps each state to the action expected to earn the greatest reward. It sits at the heart of the RL process, and learning powerful policies has produced algorithms that surpass humans in games like chess and Go (Silver et al., 2018).
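As a rough sketch of how such a strategy can emerge from rewards alone, the toy example below (a heavily simplified, tabular form of value-based RL, with invented states and rewards) lets the agent try actions at random, records how well each one pays off in each state, and then reads off a policy by picking the best-valued action.

```python
# A toy illustration of learning a policy from trial and error.
import random

states = ["hears 'sit'", "hears 'lie down'"]
actions = ["sit", "lie down", "spin"]
q = {(s, a): 0.0 for s in states for a in actions}  # estimated value of each action in each state

def reward(state, action):
    # The trainer gives a treat only when the action matches the command.
    return 1.0 if state.endswith(f"'{action}'") else 0.0

learning_rate = 0.5
for _ in range(200):               # many rounds of trial and error
    s = random.choice(states)
    a = random.choice(actions)     # explore by trying actions at random
    q[(s, a)] += learning_rate * (reward(s, a) - q[(s, a)])

# The learned policy: in each state, pick the action with the highest estimated value.
policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
print(policy)  # e.g. {"hears 'sit'": 'sit', "hears 'lie down'": 'lie down'}
```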
Deep Reinforcement Learning
RL algorithms, like dogs, struggle with complex tasks involving high-dimensional data. Fortunately, deep learning, which is built on artificial neural networks, is designed to handle exactly those challenges (Wang et al., 2024). Recent advances in areas like language processing (e.g., ChatGPT) have shown the impressive capabilities of deep learning: its layered architecture allows it to capture increasingly intricate patterns (Liesenfeld, Lopez and Dingemanse, 2023).
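The "layered" idea can be sketched in a few lines of Python: each layer is a simple transformation, and stacking several of them turns a high-dimensional input into a compact set of features. The layer sizes and random weights below are placeholders; a real network would learn its weights from data.

```python
# Stacked layers: each applies a linear step and a simple non-linearity,
# and composing them lets the network represent more intricate patterns.
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [100, 64, 32, 10]   # a high-dimensional input squeezed through stacked layers
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    for w in weights:
        x = np.maximum(0.0, x @ w)   # linear step followed by a ReLU non-linearity
    return x

features = forward(rng.standard_normal(100))
print(features.shape)  # (10,): a compact representation of the original 100 numbers
```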
Figure 2: DRL decision process.
When combined, deep learning and reinforcement learning form Deep Reinforcement Learning (DRL). In DRL, a deep neural network takes the place of the traditional policy, processing high-dimensional inputs for complex tasks like controlling a humanoid robot. The network reads the current state (e.g., joint angles), estimates how much reward each possible action (e.g., a joint movement) is likely to bring, and the agent acts on the most promising one. Thanks to deep learning's ability to generalise to new, unseen data, DRL systems are highly adaptable (Wang et al., 2024).
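Here is a hedged sketch of that idea in PyTorch: a small neural network stands in for the policy, reading the robot's state (assumed here to be eight joint angles) and scoring four candidate joint movements so the agent can act on the most promising one. The network sizes and untrained weights are placeholders for illustration only.

```python
# A minimal value network: state in, one score per candidate action out.
import torch
import torch.nn as nn

value_net = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),    # state in: 8 joint angles (assumed for illustration)
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),               # value out: one estimate per candidate joint movement
)

state = torch.randn(8)                        # a made-up reading of the joint angles
action_values = value_net(state)              # estimated value of each action in this state
best_action = torch.argmax(action_values).item()
print(best_action)                            # index of the action the policy would choose
# During training, these estimates are nudged towards the rewards the environment
# actually returns, so the policy improves with experience.
```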
Applications
This adaptability makes DRL effective at complex tasks with unpredictable inputs. For example, humanoid robots controlled by DRL have learnt football skills like running, turning, kicking, and recovering from falls. They have even picked up abilities that were never explicitly programmed, such as intercepting shots at the goal. Furthermore, robots trained this way outperformed robots with manually scripted controllers, executing movements faster (Haarnoja et al., 2024).
DRL was also used to teach flapping-wing drones how to fly. By measuring wing deformation, the drones learnt to adjust the power and orientation of their wing flapping without relying on traditional sensors like accelerometers or gyroscopes. This mirrors the way insects use mechanosensory receptors on their wings to improve flight control (Kim et al., 2024).
Beyond robotics, DRL has shown promise in fields like medicine and chemistry. It has been used to optimise chemotherapy dosing based on tumour growth (Mahmud et al., 2018). DRL has also helped unravel catalytic reaction mechanisms, such as the Haber-Bosch process for producing ammonia, where the algorithm discovered a previously unknown intermediate state that lowers the activation energy required (Lan, Wang and An, 2024).
Deep reinforcement learning has expanded the reach of traditional reinforcement learning, enabling its application in a growing range of areas. As computational power and DRL algorithms continue to improve, this trend will accelerate, unlocking even more impactful solutions.
References
Haarnoja, T., Moran, B., Lever, G., Huang, S. H., Tirumala, D., Humplik, J., Wulfmeier, M., Tunyasuvunakool, S., Siegel, N. Y., Hafner, R., Bloesch, M., Hartikainen, K., Byravan, A., Hasenclever, L., Tassa, Y., Sadeghi, F., Batchelor, N., Casarini, F., Saliceti, S., Game, C., Sreendra, N., Patel, K., Gwira, M., Huber, A., Hurley, N., Nori, F., Hadsell, R. and Heess, N. (2024). ‘Learning agile soccer skills for a bipedal robot with deep reinforcement learning’. Science Robotics, 9 (89), p. eadi8022. doi: 10.1126/scirobotics.adi8022.
Kaelbling, L. P., Littman, M. L. and Moore, A. W. (1996). ‘Reinforcement Learning: A Survey’. Journal of Artificial Intelligence Research, 4, pp. 237–285. doi: 10.1613/jair.301.
Kim, T., Hong, I., Im, S., Rho, S., Kim, M., Roh, Y., Kim, C., Park, J., Lim, D., Lee, D., Lee, S., Lee, Jingoo, Back, I., Cho, J., Hong, M. R., Kang, S., Lee, Joonho, Seo, S., Kim, U., Choi, Y.-M., Koh, J., Han, S. and Kang, D. (2024). ‘Wing-strain-based flight control of flapping-wing drones through reinforcement learning’. Nature Machine Intelligence, 6 (9), pp. 992–1005. doi: 10.1038/s42256-024-00893-9.
Lan, T., Wang, H. and An, Q. (2024). ‘Enabling high throughput deep reinforcement learning with first principles to investigate catalytic reaction mechanisms’. Nature Communications, 15 (1), p. 6281. doi: 10.1038/s41467-024-50531-6.
Liesenfeld, A., Lopez, A. and Dingemanse, M. (2023). ‘Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators’. in Proceedings of the 5th International Conference on Conversational User Interfaces. CUI ’23: ACM conference on Conversational User Interfaces, Eindhoven Netherlands: ACM, pp. 1–6. doi: 10.1145/3571884.3604316.
Mahmud, M., Kaiser, M. S., Hussain, A. and Vassanelli, S. (2018). ‘Applications of Deep Learning and Reinforcement Learning to Biological Data’. IEEE Transactions on Neural Networks and Learning Systems, 29 (6), pp. 2063–2079. doi: 10.1109/TNNLS.2018.2790388.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K. and Hassabis, D. (2018). ‘A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play’. Science, 362 (6419), pp. 1140–1144. doi: 10.1126/science.aar6404.
Wang, X., Wang, S., Liang, X., Zhao, D., Huang, J., Xu, X., Dai, B. and Miao, Q. (2024). ‘Deep Reinforcement Learning: A Survey’. IEEE Transactions on Neural Networks and Learning Systems, 35 (4), pp. 5064–5078. doi: 10.1109/TNNLS.2022.3207346.
Ziv, G. (2017). ‘The effects of using aversive training methods in dogs—A review’. Journal of Veterinary Behavior, 19, pp. 50–60. doi: 10.1016/j.jveb.2017.02.004.