Machine Learning — Reinforcement Learning — Key Elements

Published Aug 01, 2024

Reinforcement learning (RL) is a type of machine learning that involves an agent learning how to take actions in an environment to maximize a cumulative reward signal.

On that basis, there are four key elements of a reinforcement learning system: the environment, the agent, the reward signal, and the learning algorithm.

In their book “Reinforcement Learning: An Introduction” (second edition), Richard S. Sutton and Andrew G. Barto make an interesting point when they write: “Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment.”

Let’s go through them one by one:

Policy — A policy defines the learning agent’s way of behaving at a given time. In Reinforcement Learning (RL), a policy is a set of rules that determines which action an agent should take in a given state: it maps the current state of the environment to a probability distribution over actions. The goal of the RL agent is to learn the optimal policy, the one that maximizes the expected cumulative reward over time.

It corresponds to what in psychology would be called a set of stimulus-response rules or associations. In some cases, the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process.

There are two types of policies in RL:

Deterministic policy: A deterministic policy is a policy that maps each state to a specific action. For example, a deterministic policy might always select the action with the highest expected reward for a given state.

Stochastic policy: A stochastic policy is a policy that maps each state to a probability distribution over actions. For example, a stochastic policy might sample its action at random according to that distribution, so repeated visits to the same state can produce different actions.

The choice of policy depends on the nature of the problem and the available information about the environment.

In some cases, a deterministic policy may be more appropriate, while in others, a stochastic policy may be necessary to explore the environment effectively. In a nutshell, the policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behaviour.
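To make the distinction concrete, here is a minimal Python sketch of the two policy types. The three-state environment and the table of action values are invented purely for illustration; the deterministic policy always picks the highest-valued action, while the stochastic policy samples from a softmax distribution over the same values.

```python
import numpy as np

# Hypothetical example: 3 states, 2 actions, and a made-up table of action values.
n_states, n_actions = 3, 2
Q = np.array([[1.0, 0.5],
              [0.2, 0.8],
              [0.6, 0.6]])

def deterministic_policy(state):
    """Always pick the action with the highest estimated value in this state."""
    return int(np.argmax(Q[state]))

def stochastic_policy(state, temperature=1.0):
    """Sample an action from a softmax distribution over the estimated values."""
    prefs = Q[state] / temperature
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return int(np.random.choice(n_actions, p=probs))

print(deterministic_policy(0))                    # always 0 for this table
print([stochastic_policy(0) for _ in range(5)])   # varies from run to run
```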

Next is the Reward Signal — A reward signal defines the goal of a reinforcement learning problem. On each time step, the environment sends the reinforcement learning agent a single number called the reward. The agent’s sole objective is to maximize the total reward it receives over the long run. The reward signal thus defines what the good and bad events are for the agent.

The reward signal is a scalar value that can be positive, negative, or zero. It represents the immediate feedback that the agent receives from the environment after taking an action. The reward signal can be a function of various factors, such as the current state of the environment, the action taken by the agent, and the goals of the RL system.

The reward signal is a crucial component of the RL system since it guides the agent’s behaviour towards achieving the desired outcome. The agent learns to associate certain actions with positive or negative rewards, and it adjusts its policy accordingly. The agent’s goal is to maximize the cumulative reward over time, which requires a balance between exploring new actions and exploiting previously learned information.

A well-designed reward signal can help the agent to learn quickly and efficiently, while a poorly designed reward signal can lead to suboptimal behaviour or even failure of the RL system. In general, reward signals may be stochastic functions of the state of the environment and the actions taken.
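As a concrete (and entirely hypothetical) illustration of a reward signal, here is a toy one-dimensional environment in which the agent walks left or right: reaching the goal yields a positive reward, stepping off the start yields a negative reward, and every other step yields zero.

```python
# Toy illustration of a scalar reward signal. The states, actions, and reward
# values are assumptions made up for this sketch, not part of any real library.
class WalkEnv:
    def __init__(self, goal=5):
        self.goal = goal
        self.pos = 0          # the agent starts at position 0

    def step(self, action):   # action is +1 (step right) or -1 (step left)
        self.pos += action
        if self.pos == self.goal:
            reward = 1.0      # good event: the goal was reached
        elif self.pos < 0:
            reward = -1.0     # bad event: the agent walked off the start
        else:
            reward = 0.0      # neutral step
        done = reward != 0.0
        return self.pos, reward, done

env = WalkEnv()
state, reward, done = env.step(+1)
print(state, reward, done)    # 1 0.0 False
```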

Next, we have a Value Function: whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run.

In Reinforcement Learning (RL), the value function is a function that estimates the expected cumulative reward that an agent can obtain by following a specific policy from a given state. The value function is an essential component of many RL algorithms since it provides a way to evaluate the quality of different policies.

The value function can be defined in two ways:

State-value function (V): The state-value function estimates the expected cumulative reward that an agent can obtain by starting from a given state and following a particular policy.

Action-value function (Q): The action-value function estimates the expected cumulative reward that an agent can obtain by taking a particular action in a given state and thereafter following a particular policy. The sketch below shows how the two functions relate.
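Here is a minimal sketch of exact policy evaluation on a small, randomly generated Markov decision process. Everything in it (the number of states, the transition probabilities, the rewards, and the uniform policy) is invented for illustration; the point is simply that Q follows from V by a one-step lookahead, and V is the policy-weighted average of Q.

```python
import numpy as np

# Invented toy MDP: 3 states, 2 actions, random transitions and rewards.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = np.random.randn(n_states, n_actions)                                # R[s, a]
pi = np.full((n_states, n_actions), 0.5)                                 # uniform policy

# Policy evaluation: solve the Bellman expectation equation V = R_pi + gamma * P_pi V.
R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum('sa,sat->st', pi, P)
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Q(s, a) = R(s, a) + gamma * sum over s' of P(s'|s, a) * V(s'),
# and V(s) is the policy-weighted average of Q(s, a).
Q = R + gamma * P @ V
print(np.allclose(V, (pi * Q).sum(axis=1)))   # True
```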

The value function is a crucial component of many RL algorithms since it provides a way to compare different policies and select the optimal one.

The value function can be learned through various techniques such as Monte Carlo methods, temporal difference learning, and Q-learning.
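As an illustration of one of these techniques, here is a minimal sketch of tabular TD(0) state-value estimation. It reuses the hypothetical WalkEnv from the reward-signal sketch above and evaluates a fixed random-walk policy; the discount factor and step size are arbitrary choices for the example.

```python
import random

gamma, alpha = 0.9, 0.1                 # discount factor and step size (assumed)
V = {}                                   # tabular state-value estimates

for episode in range(500):
    env = WalkEnv()                      # hypothetical environment defined earlier
    state, done = env.pos, False
    while not done:
        action = random.choice([+1, -1])               # fixed random-walk policy
        next_state, reward, done = env.step(action)
        v_next = 0.0 if done else V.get(next_state, 0.0)
        V.setdefault(state, 0.0)
        V[state] += alpha * (reward + gamma * v_next - V[state])   # TD(0) update
        state = next_state

print({s: round(v, 2) for s, v in sorted(V.items())})   # values grow towards the goal
```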

The value function is used to guide the agent’s behaviour towards achieving the desired outcome. The agent learns to estimate the value function and adjusts its policy accordingly to maximize the expected cumulative reward. We can say rewards are, in a sense, primary, whereas values, as predictions of rewards, are secondary. Without rewards, there could be no values, and the only purpose of estimating values is to achieve more rewards.

Nevertheless, it is values with which we are most concerned when making and evaluating decisions. Action choices are made based on value judgments.

But it is much harder to determine values than it is to determine rewards. Rewards are given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime.

The fourth and final element of some reinforcement learning systems is a model of the environment.

A model of the environment is something that mimics the behaviour of the environment or, more generally, something that allows inferences to be made about how the environment will behave. The model can be used for planning and decision-making, allowing the agent to simulate possible future trajectories and select the optimal action.

It can be represented in different ways, depending on the nature of the problem and the available information about the environment. In some cases, the model may be explicit, meaning that the agent has full knowledge of the environment’s dynamics, including the transition probabilities and the rewards associated with each state-action pair. In other cases, the model may be implicit, meaning that the agent has limited or incomplete knowledge of the environment’s dynamics, and it needs to learn the model from experience.

Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners — viewed as almost the opposite of planning.

A model of the environment is not always necessary in RL, and many RL algorithms learn directly from experience without an explicit model. However, having a model can be useful in some scenarios, such as when the environment is complex and exploration is expensive or time-consuming.
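To show what a learned model can look like in the simplest possible form, here is a sketch that estimates transition and reward statistics from random experience and then does one-step lookahead planning with them. It again reuses the hypothetical WalkEnv from the earlier sketches; a real model-based method (for example Dyna-Q) would be more involved, so treat this purely as an illustration of the idea.

```python
from collections import defaultdict
import random

gamma = 0.9
transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visit count}
reward_model = {}                                          # (s, a, s') -> observed reward

# 1. Learn the model from random experience in the hypothetical WalkEnv.
for _ in range(200):
    env, done = WalkEnv(), False
    state = env.pos
    while not done:
        action = random.choice([+1, -1])
        next_state, reward, done = env.step(action)
        transition_counts[(state, action)][next_state] += 1
        reward_model[(state, action, next_state)] = reward
        state = next_state

# 2. Plan: pick the action whose predicted one-step outcome looks best under the model.
def plan(state, V=None):
    V = V or {}                          # optional value estimates used in the lookahead
    def expected_return(action):
        counts = transition_counts[(state, action)]
        total = sum(counts.values()) or 1
        return sum((c / total) * (reward_model[(state, action, s2)] + gamma * V.get(s2, 0.0))
                   for s2, c in counts.items())
    return max([+1, -1], key=expected_return)

print(plan(4))   # prefers +1, the step the learned model predicts will reach the goal
```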

For more detail, watch these videos:
Machine Learning — Reinforcement Learning — Key Elements — Part I
https://www.youtube.com/watch?v=PaHJ99ac4NU

Machine Learning — Reinforcement Learning — Key Elements — Part II
https://www.youtube.com/watch?v=A72ys30BXPQ
