Operant Conditioning
Operant conditioning is the use of consequences to modify the occurrence and form of behavior. Operant conditioning is distinguished from classical conditioning (also called respondent conditioning, or Pavlovian conditioning) in that operant conditioning deals with the modification of "voluntary behavior"
or operant behavior. Operant behavior "operates" on the environment and
is maintained by its consequences, while classical conditioning deals
with the conditioning of respondent behaviors which are elicited by antecedent conditions. Behaviors conditioned via a classical conditioning procedure are not maintained by consequences.[1]
Reinforcement, punishment, and extinction
Reinforcement and punishment,
the core tools of operant conditioning, are either positive (delivered
following a response), or negative (withdrawn following a response).
This creates a total of four basic consequences, with the addition of a
fifth procedure known as extinction (i.e. no change in consequences following a response).
It's important to note that organisms are not spoken of as being
reinforced, punished, or extinguished; it is the response that is
reinforced, punished, or extinguished. Additionally, reinforcement,
punishment, and extinction are not terms whose use is restricted to
the laboratory. Naturally occurring consequences can also be said to
reinforce, punish, or extinguish behavior and are not always delivered
by people.
- Reinforcement is a consequence that causes a behavior to occur with greater frequency.
- Punishment is a consequence that causes a behavior to occur with less frequency.
- Extinction is the lack of any consequence following a response. When a response is
inconsequential, producing neither favorable nor unfavorable
consequences, it will occur with less frequency.
Four contexts of operant conditioning: Here the terms "positive" and "negative" are not used in their popular sense, but rather: "positive" refers to addition, and "negative" refers to subtraction. What is added or subtracted is a stimulus, and the procedure may be either reinforcement or punishment. Hence positive punishment
is sometimes a confusing term, as it denotes the addition of an aversive stimulus
(such as spanking or an electric shock), a context that may seem very
negative in the lay sense. The four procedures are:
- Positive reinforcement occurs when a behavior (response) is
followed by a favorable stimulus (commonly seen as pleasant) that
increases the frequency of that behavior. In the Skinner box
experiment, a stimulus such as food or sugar solution can be delivered
when the rat engages in a target behavior, such as pressing a lever.
- Negative reinforcement occurs when a behavior (response) is
followed by the removal of an aversive stimulus (commonly seen as
unpleasant) thereby increasing that behavior's frequency. In the
Skinner box experiment, negative reinforcement can be a loud noise
continuously sounding inside the rat's cage until it engages in the
target behavior, such as pressing a lever, upon which the loud noise is
removed.
- Positive punishment (also called "Punishment by contingent
stimulation") occurs when a behavior (response) is followed by an
aversive stimulus, such as introducing a shock or loud noise, resulting
in a decrease in that behavior.
- Negative punishment (also called "Punishment by contingent
withdrawal") occurs when a behavior (response) is followed by the
removal of a favorable stimulus, such as taking away a child's toy
following an undesired behavior, resulting in a decrease in that
behavior.
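The four procedures above form a two-by-two grid: whether a stimulus is added or removed after the response, and whether the behavior then becomes more or less frequent. The following is a minimal sketch of that grid; the function name and the string labels are illustrative assumptions, not standard terminology from any library.

```python
def classify_consequence(stimulus_change: str, behavior_change: str) -> str:
    """Name the operant procedure from the two observations.

    stimulus_change: "added" or "removed" (what followed the response)
    behavior_change: "increases" or "decreases" (effect on response frequency)
    Both labels are hypothetical names chosen for this sketch.
    """
    sign = {"added": "positive", "removed": "negative"}[stimulus_change]
    kind = {"increases": "reinforcement", "decreases": "punishment"}[behavior_change]
    return f"{sign} {kind}"


# Examples taken from the list above
print(classify_consequence("added", "increases"))    # food after a lever press
print(classify_consequence("removed", "increases"))  # loud noise switched off
print(classify_consequence("added", "decreases"))    # shock after a response
print(classify_consequence("removed", "decreases"))  # toy taken away
```

Note that the classification depends on the observed change in behavior, not on whether the stimulus seems pleasant or unpleasant to an observer.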
Also:
- Avoidance learning is a type of learning in which a certain
behavior results in the cessation or prevention of an aversive stimulus. For example,
shielding one's eyes when in the sunlight
(or going indoors) will help avoid the aversive stimulation of having light in
one's eyes.
- Extinction occurs when a behavior (response) that had
previously been reinforced is no longer effective. In the Skinner box
experiment, this is the rat pushing the lever and being rewarded with a
food pellet several times, and then pushing the lever again and never
receiving a food pellet again. Eventually the rat would cease pushing
the lever.
- Noncontingent reinforcement refers to response-independent
delivery of stimuli identified as serving as reinforcers for some behaviors
of that organism. However, this typically entails time-based delivery
of stimuli identified as maintaining aberrant behavior, which serves to
decrease the rate of the target behavior.[2]
As no measured behavior is identified as being strengthened, there is
controversy surrounding the use of the term noncontingent
"reinforcement".[3]
Thorndike's law of effect
Main article: Law of effect
Operant conditioning, sometimes called instrumental conditioning or instrumental learning, was first extensively studied by Edward L. Thorndike (1874-1949), who observed the behavior of cats trying to escape from home-made puzzle boxes.[4]
When first constrained in the boxes, the cats took a long time to
escape. With experience, ineffective responses occurred less frequently
and successful responses occurred more frequently, enabling the cats to
escape in less time over successive trials. In his Law of Effect, Thorndike theorized that successful responses, those producing satisfying consequences, were "stamped in" by the experience and thus occurred more frequently. Unsuccessful responses, those producing annoying consequences, were "stamped out" and subsequently occurred less frequently. In short, some consequences strengthened behavior and some consequences weakened behavior.
B.F. Skinner (1904-1990) formulated a more detailed analysis of operant conditioning
based on reinforcement, punishment, and extinction. Following the ideas
of Ernst Mach, Skinner rejected Thorndike's mediating structures required by
"satisfaction" and constructed a new conceptualization of behavior
without any such references. Moreover, Thorndike's work with puzzle
boxes produced no meaningful data to be studied other than a measure of
escape times. So, while experimenting with some homemade feeding
mechanisms, Skinner invented the operant conditioning chamber,
which allowed him to measure rate of response as a key dependent
variable using a cumulative record of lever presses or key pecks.[5]
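A cumulative record plots the running total of responses against time, so its slope over any window is the response rate. The sketch below computes both from a list of response timestamps; the function names, the sample timestamps, and the choice of seconds as the unit are assumptions made for illustration.

```python
from bisect import bisect_right


def cumulative_record(response_times, sample_times):
    """Cumulative number of responses made up to each sample time."""
    ordered = sorted(response_times)
    return [bisect_right(ordered, t) for t in sample_times]


def response_rate(response_times, start, end):
    """Responses per unit time in the window (start, end]."""
    ordered = sorted(response_times)
    count = bisect_right(ordered, end) - bisect_right(ordered, start)
    return count / (end - start)


# Hypothetical lever-press timestamps, in seconds from session start
presses = [1.0, 2.5, 3.0, 7.0, 7.5, 8.0, 8.2]

print(cumulative_record(presses, [0, 5, 10]))  # [0, 3, 7]
print(response_rate(presses, 5, 10))           # 0.8 presses per second
```

A steeper stretch of the record corresponds to a higher rate of responding; a flat stretch means no responses were made.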
Operant Conditioning vs Fixed Action Patterns
Skinner's construct of instrumental learning is contrasted with what Nobel Prize-winning biologist Konrad Lorenz
termed "fixed action patterns," or reflexive, impulsive, or instinctive
behaviors. These behaviors were said by Skinner and others to exist
outside the parameters of operant conditioning but were considered
essential to a comprehensive analysis of behavior.
In dog training, particularly in the training of working dogs and detection dogs,
trainers make use of the prey drive. The
stimulation of these fixed action patterns, which derive from the dog's
predatory instincts, is the key to producing very difficult yet
consistent behaviors, and in most cases does not involve operant, classical, or any other kind of conditioning.
While evolutionary processes shaped these fixed action patterns, the
patterns themselves remained stable long enough to be shaped over the
long time span necessary for evolution because of their survival
function (that is, selection by consequences, as in operant conditioning).
According to the laws of operant conditioning, a behavior that is
rewarded every single time will extinguish at a faster
rate once reinforcement stops, while intermittent reinforcement leads to more stable
rates of behavior that are relatively more resistant to extinction.
Thus, in detection dogs, any correct indication of a "find"
must always be rewarded with a tug toy or a ball throw early on, for
initial acquisition of the behavior. Thereafter, fading procedures, in
which the rate of reinforcement is "thinned" (not every response is
reinforced), are introduced, switching the dog to an intermittent
schedule of reinforcement, which is more resistant to instances of
non-reinforcement.
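The move from continuous to intermittent reinforcement can be sketched as a plan that marks which responses are reinforced. The example below uses fixed-ratio stages purely for simplicity; the stage sizes, ratios, and function name are assumptions for illustration and are not taken from any training protocol.

```python
def thinning_plan(stages):
    """Return one True/False per response: is this response reinforced?

    stages: list of (ratio, n_responses). Within a stage, every `ratio`-th
    response is reinforced; ratio 1 means continuous reinforcement.
    """
    plan = []
    for ratio, n_responses in stages:
        for i in range(1, n_responses + 1):
            plan.append(i % ratio == 0)
    return plan


# Hypothetical plan: reward every find, then every 2nd, then every 4th
plan = thinning_plan([(1, 4), (2, 4), (4, 4)])
print("".join("R" if reinforced else "." for reinforced in plan))  # RRRR.R.R...R
```

Each "R" marks a rewarded response and each "." an unrewarded one, so the string shows reinforcement becoming sparser across the three stages.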
Nevertheless, some trainers are now using the prey drive to train
pet dogs and find that they get far better results in the dogs'
responses to training than when they only use the principles of operant
conditioning which, according to Skinner and his students Keller and Marian Breland (who invented clicker training), break down when strong instincts are at play.[6]
Criticisms
Thorndike's law of effect specifically requires that a behavior be
followed by satisfying consequences for learning to occur. There are,
however, cases in which learning can be shown to occur without good or
bad effects following the behavior. For instance, a number of
experiments examining the phenomenon of latent learning[7][8][9][10]
showed that a rat needn't receive a satisfying reward (food, if hungry;
water, if thirsty) in order to learn a maze; learning that becomes
apparent immediately after the desired reward is introduced. However,
views claiming such research invalidates theories of operant
conditioning are molecular to a fault. If the rat has a history of
"searching behavior" being reinforced in novel environments, the
behavior will occur in new environments. This is especially plausible
in a species which scavenges for food and has thus likely inherited a
propensity for searching behavior to be sensitive to reinforcement.
Behaving during initial extinction trials as the organism had during
reinforcement trials is not proof of latent learning, as behavior is a
function of the history of the individual organism and its genetic
endowment and is never controlled by future consequences. That an
organism continues to respond during unreinforced trials has been
well-established when studying intermittent schedules of reinforcement[11].
A different experiment, in humans, showed that "punishing" the
correct behavior may actually cause it to be more frequently taken
(i.e. stamp it in)[12].
Subjects are given a number of pairs of holes on a large board and
required to learn which hole to poke a stylus through for each pair. If
the subjects receive an electric shock for punching the correct hole,
they learn which hole is correct more quickly than subjects who receive
an electric shock for punching the incorrect hole. This cannot,
however, be accurately described as punishment if it is increasing the
probability of the behavior.
Biological correlates of operant conditioning
The first scientific studies identifying neurons that responded in ways that suggested they encode for conditioned stimuli came from work by Rusty Richardson and Mahlon DeLong.[13][14] They showed that nucleus basalis neurons, which release acetylcholine broadly throughout the cerebral cortex,
are activated shortly after a conditioned stimulus, or after a primary
reward if no conditioned stimulus exists. These neurons are equally
active for positive and negative reinforcers, and have been
demonstrated to cause plasticity in many cortical regions.[15] Evidence also exists that dopamine
is activated at similar times. The dopamine pathways encode positive
reward only, not aversive reinforcement, and they project much more
densely onto frontal cortex regions. Cholinergic projections, in contrast, are dense even in the posterior cortical regions like the primary visual cortex. A study of patients with Parkinson's disease,
a condition attributed to the insufficient action of dopamine, further
illustrates the role of dopamine in positive reinforcement.[16]
It showed that while off their medication, patients learned more
readily with aversive consequences than with positive reinforcement.
Patients who were on their medication showed the opposite to be the
case, positive reinforcement proving to be the more effective form of
learning when the action of dopamine is high.
Factors that alter the effectiveness of consequences
When using consequences to modify a response, the effectiveness of a
consequence can be increased or decreased by various factors. These
factors can apply to either reinforcing or punishing consequences.
- Satiation: The effectiveness of a consequence will be
reduced if the individual's "appetite" for that source of stimulation
has been satisfied. Inversely, the effectiveness of a consequence will
increase as the individual becomes deprived of that stimulus. If
someone is not hungry, food will not be an effective reinforcer for
behavior.
- Immediacy: After a response, how immediately a consequence
is then felt determines the effectiveness of the consequence. More
immediate feedback will be more effective than less immediate feedback.
If someone's license plate is caught by a traffic camera for speeding
and they receive a speeding ticket in the mail a week later, this
consequence will not be very effective against speeding. But if someone
is speeding and is caught in the act by an officer who pulls them over,
then their speeding behavior is more likely to be affected.
- Contingency: If a consequence does not contingently
(reliably, or consistently) follow the target response, its
effectiveness upon the response is reduced. But if a consequence
follows the response reliably after successive instances, its ability
to modify the response is increased. If someone has a habit of getting
to work late, but is only occasionally reprimanded for their lateness,
the reprimand will not be a very effective punishment.
- Size: This is a "cost-benefit" determinant of whether a
consequence will be effective. If the size, or amount, of the
consequence is large enough to be worth the effort, the consequence
will be more effective upon the behavior. An unusually large lottery
jackpot, for example, might be enough to get someone to buy a
one-dollar lottery ticket (or even multiple tickets). But if a
lottery jackpot is small, the same person might not feel it to be worth
the effort of driving out and finding a place to buy a ticket. In this
example, it's also useful to note that "effort" is a punishing
consequence. How these opposing expected consequences (reinforcing and
punishing) balance out will determine whether the behavior is performed
or not.
Most of these factors exist for biological reasons. The biological
purpose of the Principle of Satiation is to maintain the organism's homeostasis.
When an organism has been deprived of sugar, for example, the
effectiveness of the taste of sugar as a reinforcer is high. However,
as the organism reaches or exceeds its optimum blood-sugar levels,
the taste of sugar becomes less effective, perhaps even aversive.
The principles of Immediacy and Contingency exist for neurochemical
reasons. When an organism experiences a reinforcing stimulus, dopamine pathways in the brain are activated. This network of pathways "releases a short pulse of dopamine onto many dendrites, thus broadcasting a rather global reinforcement signal to postsynaptic neurons."[17]
This makes recently activated synapses able to increase their
sensitivity to efferent signals, hence increasing the probability of
occurrence for the recent responses preceding the reinforcement. These
responses are, statistically, the most likely to have been the behavior
responsible for successfully achieving reinforcement. But when the
application of reinforcement is either less immediate or less
contingent (less consistent), the ability of dopamine to act upon the
appropriate synapses is reduced.
Operant variability
Operant variability is what allows a response to adapt to new
situations. Operant behavior is distinguished from reflexes in that its
response topography (the form of the response) is subject to
slight variations from one performance to another. These slight
variations can include small differences in the specific motions
involved, differences in the amount of force applied, and small changes
in the timing of the response. If a subject's history of reinforcement
is consistent, such variations will remain stable because the same
successful variations are more likely to be reinforced than less
successful variations. However, behavioral variability can also be
altered when subjected to certain controlling variables.[18]
An extinction burst will often occur when an extinction procedure has just begun. This consists of a sudden and temporary increase
in the response's frequency, followed by the eventual decline and
extinction of the behavior targeted for elimination. Take, as an
example, a pigeon that has been reinforced to peck an electronic
button. During its training history, every time the pigeon pecked the
button, it received a small amount of bird seed as a
reinforcer. So, whenever the bird is hungry, it will peck the button to
receive food. However, if the button is turned off, the hungry
pigeon will first try pecking the button just as it has in the past.
When no food is forthcoming, the bird will likely try again... and
again, and again. After a period of frantic activity, in which its
pecking behavior yields no result, the pigeon's pecking will decrease
in frequency.
The evolutionary advantage of this extinction burst is clear.
In a natural environment, an animal that persists in a learned
behavior, even when it does not result in immediate reinforcement, might still
have a chance of producing reinforcing consequences if it tries again.
This animal would be at an advantage over another animal that gives up
too easily.
Extinction-induced variability serves a similar adaptive
role. When extinction begins, and if the environment allows for it, an
initial increase in the response rate is not the only thing that can
happen. Imagine a bell curve.
The horizontal axis would represent the different variations possible
for a given behavior. The vertical axis would represent the response's
probability in a given situation. Response variants in the middle of
the bell curve, at its highest point, are the most likely because those
responses, according to the organism's experience, have been the most
effective at producing reinforcement. The more extreme forms of the
behavior would lie at the lower ends of the curve, to the left and to
the right of the peak, where their probability for expression is low.
A simple example would be a person inside a room opening a door to
exit. The response would be the opening of the door, and the reinforcer
would be the freedom to exit. That person does not open that
same door in exactly the same way every time.
Rather, each time they open the door a little differently: sometimes
with less force, sometimes with more force; sometimes with one hand,
sometimes with the other hand; sometimes more quickly, sometimes more
slowly. Because of the physical properties of the door and its handle,
there is a certain range of successful responses which are reinforced.
Now imagine in our example that the subject tries to open the door and it won't budge. This is when extinction-induced variability
occurs. The bell curve of probable responses will begin to broaden,
with more extreme forms of behavior becoming more likely. The person
might now try opening the door with extra force, repeatedly twist the
knob, try to hit the door with their shoulder, maybe even call for help
or climb out a window. This is how extinction causes variability in
behavior, in the hope that these new variations might be successful.
For this reason, extinction-induced variability is an important part of the operant procedure of shaping.
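The bell-curve picture above can be made concrete with a small calculation. The sketch below treats response variants as normally distributed around the usual form and asks how likely an "extreme" variant is before and after the curve broadens. The normal model, the spread values, and the cutoff of 2 units are all assumptions chosen for illustration; the text does not specify them.

```python
import math


def prob_extreme_variant(sigma: float, limit: float = 2.0) -> float:
    """P(|X| > limit) for X ~ Normal(0, sigma).

    X is a hypothetical one-dimensional measure of how far a response
    variant lies from the usual form (for example, force on the door).
    """
    return math.erfc(limit / (sigma * math.sqrt(2.0)))


baseline = prob_extreme_variant(sigma=1.0)           # stable reinforcement history
during_extinction = prob_extreme_variant(sigma=2.0)  # assumed broader curve

print(f"{baseline:.3f} -> {during_extinction:.3f}")  # 0.046 -> 0.317
```

Under these assumed numbers, doubling the spread of the curve makes variants beyond the cutoff roughly seven times more likely, which is the sense in which extreme forms of the behavior "become more likely" during extinction.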
Avoidance learning
Avoidance training belongs to negative reinforcement schedules. The
subject learns that a certain response will result in the termination
or prevention of an aversive stimulus. There are two kinds of commonly
used experimental settings: discriminated and free-operant avoidance
learning.
Discriminated avoidance learning
- In discriminated avoidance learning, a novel stimulus such as a
light or a tone is followed by an aversive stimulus such as a shock
(CS-US, similar to classical conditioning). During the first trials
(called escape trials) the animal usually experiences both the CS and
the US, showing the operant response to terminate the aversive US.
Over time, the animal learns to perform the response during
the presentation of the CS, before the US begins, thus preventing the aversive US from
occurring. Such trials are called avoidance trials.
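The difference between escape trials and avoidance trials comes down to whether the response occurs before or after the US would begin. A minimal sketch, assuming each trial is summarized by the latency from CS onset to the response; the function name, the "no response" label, and the example latencies are hypothetical.

```python
from typing import Optional


def classify_trial(response_latency: Optional[float], cs_us_interval: float) -> str:
    """Label a discriminated-avoidance trial.

    response_latency: seconds from CS onset to the response, or None if
        the animal never responded on this trial.
    cs_us_interval: seconds from CS onset to scheduled US onset.
    """
    if response_latency is None:
        return "no response"
    if response_latency < cs_us_interval:
        return "avoidance"  # response during the CS, so the US is prevented
    return "escape"         # response after US onset, so the US is terminated


# Hypothetical latencies over five trials with a 10-second CS-US interval
latencies = [14.0, 12.5, 11.0, 8.0, 4.5]
print([classify_trial(latency, 10.0) for latency in latencies])
# ['escape', 'escape', 'escape', 'avoidance', 'avoidance']
```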
Free-operant avoidance learning
- In this experimental setting, no discrete stimulus is used to
signal the occurrence of the aversive stimulus. Rather, the aversive
stimuli (mostly shocks) are presented without explicit warning stimuli.
- There are two crucial time intervals determining the rate of
avoidance learning. The first one is called the S-S-interval
(shock-shock-interval). This is the amount of time which passes between
successive presentations of the shock (unless the operant response is
performed). The other one is called the R-S-interval
(response-shock-interval), which specifies the length of the time
interval following an operant response during which no shocks will be
delivered. Note that each time the organism performs the operant
response, the R-S-interval without shocks starts anew.
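The interaction of the two intervals can be sketched as a small scheduling routine: shocks recur every S-S interval, and each response pushes the next shock back to one R-S interval after that response. The function name, the example values, and the choice to start the S-S clock at session onset are assumptions made for this illustration.

```python
def shock_times(responses, ss_interval, rs_interval, session_length):
    """Return the times at which shocks are delivered in one session.

    responses: times of operant responses (any order)
    ss_interval: time between successive shocks when no response occurs
    rs_interval: shock-free time guaranteed after each response
    """
    shocks = []
    pending = sorted(responses)
    i = 0
    next_shock = ss_interval  # assumption: S-S clock starts at session onset
    while True:
        # A response made before the scheduled shock restarts the R-S interval
        if i < len(pending) and pending[i] < next_shock:
            next_shock = pending[i] + rs_interval
            i += 1
            continue
        if next_shock > session_length:
            break
        shocks.append(next_shock)
        next_shock += ss_interval
    return shocks


# Hypothetical session: S-S = 5 s, R-S = 20 s, 60 s long
print(shock_times([], 5, 20, 60))               # a shock every 5 s
print(shock_times([3, 20, 38, 55], 5, 20, 60))  # [] (every shock avoided)
print(shock_times([3, 30], 5, 20, 60))          # [23, 28, 50, 55, 60]
```

In the second call each response arrives before the current R-S interval runs out, so no shock is ever delivered; in the third, the gap between responses is too long and shocks resume at the S-S rate until the next response.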
Two-process theory of avoidance
This theory was originally established to explain learning in
discriminated avoidance learning. It assumes that two processes take
place. a) Classical conditioning of fear. During the first
trials of the training, the organism experiences both the CS and the aversive
US (escape trials). The theory assumes that during those trials
classical conditioning takes place by pairing the CS with the US.
Because of the aversive nature of the US, the CS comes to elicit a
conditioned emotional reaction (CER): fear. In classical conditioning,
presenting a CS conditioned with an aversive US disrupts the organism's
ongoing behavior. b) Reinforcement of the operant response by fear reduction.
Because the CS signaling the aversive US has, during the first process,
itself become aversive by eliciting fear in the organism, reducing this
unpleasant emotional reaction serves to motivate the operant response.
The organism learns to make the response during the CS, thus
terminating the aversive internal reaction elicited by the CS. An
important aspect of this theory is that the term "avoidance" does not
really describe what the organism is doing. It does not "avoid" the
aversive US in the sense of anticipating it. Rather, the organism
escapes an aversive internal state caused by the CS.
- One of the practical aspects of operant conditioning with relation to animal training
is the use of shaping (reinforcing successive approximations and no longer
reinforcing earlier approximations once they have been surpassed), as well as chaining.
Verbal Behavior
In 1957 Skinner published Verbal Behavior,
a theoretical extension of the work he had pioneered since 1938. This
work extended the theory of operant conditioning to human behavior
previously assigned to the areas of language, linguistics and other
areas. Verbal Behavior is the logical extension of Skinner's ideas, in
which he introduced new functional relationship categories such as
intraverbals, autoclitics, mands, tacts and the controlling
relationship of the audience. All of these relationships were based on
operant conditioning and relied on no new mechanisms despite the
introduction of new functional categories.
Four term contingency
Modern behavior analysis, which is the name of the discipline
directly descended from Skinner's work, holds that behavior is
explained in four terms: an establishing operation (EO), a
discriminative stimulus (S^d), a response (R), and a reinforcing stimulus (S^rein or S^r for reinforcers, sometimes S^ave for aversive stimuli).[19]
Operant Hoarding
Operant Hoarding is a term referring to the choice made by a rat, on a compound schedule called a multiple schedule, that maximizes its rate of reinforcement in an operant conditioning
context. More specifically, rats were shown to have allowed food
pellets to accumulate in a food tray by continuing to press a lever on
a continuous reinforcement schedule instead of retrieving those pellets. Retrieval of the pellets always instituted a one-minute period of extinction
during which no additional food pellets were available but those that
had been accumulated earlier could be consumed. This finding appears to
contradict the usual finding that rats behave impulsively in situations
in which there is a choice between a smaller food object right away and
a larger food object after some delay. See schedules of reinforcement. [20]
References
- The Principles of Learning and Behavior, Fifth Edition, Ed. Michael Domjan
- Tucker, M., Sigafoos, J., & Bushell, H. (1998). Use of noncontingent reinforcement in the treatment of challenging behavior. Behavior Modification, 22, 529-547.
- Poling, A., & Normand, M. (1999). Noncontingent reinforcement: an inappropriate description of time-based schedules that reduce behavior. Journal of Applied Behavior Analysis, 32, 237-238.
- Thorndike, E. L. (1901). Animal intelligence: An experimental study of the associative processes in animals. Psychological Review Monograph Supplement, 2, 1-109.
- Mecca Chiesa (2004) Radical Behaviorism: the philosophy and the science
- Breland, Keller & Breland, Marian (1961), The Misbehavior of Organisms, American Psychologist.
- Williams KA. 1924. The reward value of a conditioned stimulus. Univ Calif Publ Psychol. 4:31-55.
- Blodgett HC. 1929. The effect of the introduction of reward upon the maze performance of rats. Univ Calif Publ Psychol. 4:113-134.
- Elliott MH. 1929. The effect of appropriateness of reward and of complex incentives on maze performance. Univ Calif Publ Psychol. 4:91-98.
- Tolman EC, Honzik CH. 1930. Introduction and removal of reward and maze performance in rats. Univ Calif Publ Psychol. 4:257-275.
- Ferster, C. B., & Skinner, B. F. (1957). Schedules of reinforcement. New York: Appleton-Century-Crofts.
- Tolman EC. 1932. Purposive Behavior in Animals and Men. Meredith Publishing Company.
- [J. Neurophysiol. 34:414-27, 1971]
- [Advances Exp. Medicine Biol. 295:233-53 1991]
- [PNAS 93:11219-24 1996, Science 279:1714-8 1998]
- Michael J. Frank, Lauren C. Seeberger, and Randall C. O'Reilly (2004) "By Carrot or by Stick: Cognitive Reinforcement Learning in Parkinsonism," Science, 4 November 2004
- Schultz, Wolfram (1998). Predictive Reward Signal of Dopamine Neurons. The Journal of Neurophysiology, 80(1), 1-27.
- Neuringer, A. (2002). Operant variability: evidence, functions, and theory. Psychonomic Bulletin & Review, Vol. 9, No. 4.
- Pierce & Cheney (2004) Behavior Analysis and Learning
- Cole, M.R. (1990). Operant hoarding: A new paradigm for the study of self-control. Journal of the Experimental Analysis of Behavior, 53, 247-262.
This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia Encyclopedia article "Operant Conditioning"