Machine Learning — Reinforcement Learning — Some Background | by Devsena Mishra | Medium

Published Aug 01, 2024

அறத்தினூஉங்கு ஆக்கமும் இல்லை அதனை
மறத்தலின் ஊங்கில்லை கேடு. (௩௰௨ — 32)

There is no greater gain than virtue. No surer path to ruin than its neglect. [Kural Number 32, Thirukkural]

Whenever we talk about Reinforcement Learning in the context of machine learning, we tend to focus too heavily on the optimal control part: designing optimal solutions using value functions and dynamic programming, and exploring Markov decision processes. From there, we usually jump straight into Monte Carlo methods, temporal difference learning algorithms, and so on. This is also how some of the most popular and widely used books on Reinforcement Learning are structured.

Almost all popular definitions of Reinforcement Learning, from Wikipedia to popular tech journals, are framed the same way!

Some Examples:

“Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent ought to take actions in a dynamic environment to maximize the cumulative reward.” [Wikipedia]

“A family of algorithms that learn an optimal policy, whose goal is to maximize return when interacting with an environment.” [Google]

“Reinforcement learning is a learning paradigm that learns to optimize sequential decisions, which are decisions that are taken recurrently across time steps, for example, daily stock replenishment decisions taken in inventory control. At a high level, reinforcement learning mimics how we, as humans, learn.” [IBM]

But this is where the problem starts: without exploring the foundational ideas behind ‘Reinforcement Learning,’ our understanding of the mathematical models and computational algorithms remains incomplete.

While these concepts are important in the design and development of RL-based systems, we cannot ignore the fact that the foundational idea behind ‘Reinforcement Learning’ has its roots in behavioural psychology and its experiments.

Background

We know that in Reinforcement Learning, which is a reward-based model, the core components are an agent and the environment.

The agent observes the state of the environment, applies an action to it, and receives a reward in return. But how those rewards, and the schedule of those rewards and reinforcements, are designed is not a mathematical model alone. It is also a psychological concept, largely influenced by B.F. Skinner, one of the most radical behaviourists of the 20th century.
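To make the loop concrete, here is a minimal sketch of an agent observing a state, applying an action, and receiving a reward. Everything in it is an assumption made for illustration: the `LeverEnvironment` class, the `random_agent` policy, and the rule that every lever press pays a reward of 1.0 are hypothetical and do not come from any particular RL library.

```python
import random


class LeverEnvironment:
    """Hypothetical toy environment: one lever, and every press is rewarded."""

    def __init__(self):
        self.presses = 0

    def observe(self):
        # The state the agent sees: how many times the lever has been pressed.
        return self.presses

    def step(self, action):
        # Apply the agent's action to the environment and return the reward.
        if action == "press":
            self.presses += 1
            return 1.0
        return 0.0


def random_agent(state, rng):
    # Placeholder policy (an assumption): ignores the state and acts at random.
    return rng.choice(["press", "wait"])


def run_episode(steps=10, seed=0):
    """Run the observe -> act -> reward loop and return (total reward, presses)."""
    rng = random.Random(seed)
    env = LeverEnvironment()
    total_reward = 0.0
    for _ in range(steps):
        state = env.observe()
        action = random_agent(state, rng)
        total_reward += env.step(action)
    return total_reward, env.presses
```

Notice that the reward rule lives entirely inside `step`. Changing when and how often that method pays out is exactly the design question that the rest of this post traces back to Skinner.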

B.F. Skinner and his ‘Skinner Box’

“The real question is not whether machines think but whether men do. The mystery which surrounds a thinking machine already surrounds a thinking man.” — B. F. Skinner, Contingencies of Reinforcement: A Theoretical Analysis (1969).

During ‘Project Pigeon’, a proposal to use living organisms (pigeons, in this case) to guide missiles, for which he received a good amount of funding from the U.S. military, B.F. Skinner realised the power of ‘behaviour shaping’ methods. In simple words, this means rewarding an animal for a desired behaviour or response and punishing it for doing the opposite.

Encouraged by the response to his ideas, Skinner continued to test his ideas of ‘behaviour shaping’ on rats, pigeons, and other laboratory animals, and developed a doctrine called the ‘Technology of Behaviour,’ also known as a technology of behaviour modification. His best-known apparatus is the operant conditioning chamber, popularly called the Skinner Box, which he had first built in the 1930s: a soundproof enclosure with a food dispenser that a rat can operate by pressing a lever, or a pigeon by pecking a key.

One of the key discoveries from the Skinner Box experiments was the ‘schedule of reinforcement’ (or rewards). Based on that, he devised different ways of delivering rewards, according to the response rate of the animals (pigeons or rats).

He insisted that “a procedure in which behaviour is reinforced or rewarded after scheduled but unpredictable time durations yield the most stable rate of response”. In other words, if an animal knows it will be rewarded for a particular behaviour, but the timing of the reward is unpredictable rather than fixed, it can be tempted to keep responding for as long as whoever controls the box and the reward lever wants! This is how gambling and casino systems, and now social media platforms, work, and it is a large part of why they are so addictive.
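The difference between a fixed and an unpredictable schedule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the one-press-per-second subject, and the choice of drawing waiting times uniformly between 0.5 and 1.5 times the mean are all hypothetical, and the code models the reward schedule only, not how a real animal would respond to it.

```python
import random


def fixed_interval(interval):
    """Yield the same waiting time before each reward becomes available."""
    while True:
        yield interval


def variable_interval(mean_interval, rng):
    """Yield unpredictable waiting times centred on mean_interval (assumed uniform)."""
    while True:
        yield rng.uniform(0.5 * mean_interval, 1.5 * mean_interval)


def rewarded_presses(press_times, schedule):
    """Return the press times that earn a reward.

    A press is rewarded only if the current waiting time has elapsed;
    the next waiting time then starts from that rewarded press.
    """
    rewarded = []
    available_at = next(schedule)
    for t in press_times:
        if t >= available_at:
            rewarded.append(t)
            available_at = t + next(schedule)
    return rewarded


# A hypothetical subject that presses the lever once per second for a minute.
presses = list(range(1, 61))

fixed = rewarded_presses(presses, fixed_interval(10))
variable = rewarded_presses(presses, variable_interval(10, random.Random(0)))
```

Under the fixed schedule the rewarded presses fall on a perfectly regular grid, so a subject could in principle learn to wait and press only when a reward is due. Under the variable schedule the gaps between rewards change from one reward to the next, so there is no moment at which pressing is safely pointless, which is the property the quote above attributes to unpredictable timing.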

As we know, these operant conditioning chambers have become common in a variety of research disciplines, especially animal learning. Operant conditioning has a variety of applications, and Skinner’s studies of animal behaviour laid the framework for similar studies on human subjects.

In a nutshell, Skinner found that the environment influences behaviour and when that environment is manipulated, behaviour will change.

Skinner’s theory of operant conditioning played a key role in helping psychologists understand how behaviour is learned, which later contributed to the design of Machine Learning algorithms too.

So before jumping straight into the development of mathematical models, one must have some understanding of this background. Reinforcement Learning is not merely about making an agent learn through ‘trial and error’; it is also about ‘conditioning’ the agent’s behaviour through the schedule of reinforcement and rewards!

Very few are aware that B.F. Skinner also wrote a utopian novel, “Walden Two” (1948), in which he visualised a perfectly socially engineered society run on his principles of operant conditioning. His ideas are also credited with shaping the ‘User Experience’ of the PLATO system, which is often described as one of the earliest computer networks and online communities!
Link to YouTube video: https://www.youtube.com/watch?v=iXglOliGXHc&t=181s

