### About language, learning and tomatoes

[1] How it all works: Introducing the algorithm
Published: 28 May 2019

In various everyday situations humans make judgments about which of several possible causes is responsible for a particular effect. Is thunder the sound you hear when almighty Zeus calls the lesser Gods to order by banging his fist on the table, or is it caused by two electrically charged regions in the atmosphere equalizing? Is the wine sweet because the grapes grew in clay soil or because Bacchus was merciful?

In a series of three blog posts we will introduce a simple learning principle that can simulate human causal judgments rather well. Our example is an adapted and simplified version of the causal-judgment learning problem presented in Spellman (1996). She used a situation with two fertilising liquids (of two different colours: blue and red) and counts of outcomes (TOMATO, or NO_TOMATO, i.e., no plant whatsoever) when a growing pot was given no liquid, one of the two liquids, or both liquids. The pot itself is treated as a constant background (i.e., an ambient context). The participants’ task was to draw causal inferences from what is, in essence, a $$2 \times 2$$ table of possibilities (red liquid: yes/no; blue liquid: yes/no), each combination leading to one of two outcomes – tomatoes or no plant at all.

Table 1

| Outcome   | Blue Liquid | Red Liquid |
|-----------|-------------|------------|
| TOMATO    | yes / no    | yes / no   |
| NO_TOMATO | yes / no    | yes / no   |

This information can be expanded with counts of how frequently each situation occurred (Table 2).

Table 2

| Cues           | Outcomes  | Frequency |
|----------------|-----------|-----------|
| pot; red; blue | TOMATO    | 3 |
| pot; red; blue | NO_TOMATO | 0 |
| pot; red       | TOMATO    | 5 |
| pot; red       | NO_TOMATO | 2 |
| pot; blue      | TOMATO    | 3 |
| pot; blue      | NO_TOMATO | 5 |
| pot            | TOMATO    | 0 |
| pot            | NO_TOMATO | 2 |

If we expand Table 2 to present each learning trial separately (given the number of repetitions, as specified in the column Frequency), we get Table 3 below.

Table 3

| Trial | Cues           | Outcomes  |
|-------|----------------|-----------|
| 1     | pot; red; blue | TOMATO    |
| 2     | pot; red; blue | TOMATO    |
| 3     | pot; red; blue | TOMATO    |
| 4     | pot; red       | TOMATO    |
| 5     | pot; red       | TOMATO    |
| 6     | pot; red       | TOMATO    |
| 7     | pot; red       | TOMATO    |
| 8     | pot; red       | TOMATO    |
| 9     | pot; red       | NO_TOMATO |
| 10    | pot; red       | NO_TOMATO |
| 11    | pot; blue      | TOMATO    |
| 12    | pot; blue      | TOMATO    |
| 13    | pot; blue      | TOMATO    |
| 14    | pot; blue      | NO_TOMATO |
| 15    | pot; blue      | NO_TOMATO |
| 16    | pot; blue      | NO_TOMATO |
| 17    | pot; blue      | NO_TOMATO |
| 18    | pot; blue      | NO_TOMATO |
| 19    | pot            | NO_TOMATO |
| 20    | pot            | NO_TOMATO |

In Table 3, identical trials are grouped (i.e., each is repeated as many times as its Frequency indicates). We may want to shuffle them so that the trials appear in a random (i.e., unsystematic) order. If we do this, the information will look similar to Table 4.

Table 4

| Trial | Cues           | Outcomes  |
|-------|----------------|-----------|
| 1     | pot            | NO_TOMATO |
| 2     | pot; red; blue | TOMATO    |
| 3     | pot; red       | TOMATO    |
| 4     | pot; red       | NO_TOMATO |
| 5     | pot; blue      | TOMATO    |
| 6     | pot; blue      | NO_TOMATO |
| 7     | pot; red       | TOMATO    |
| 8     | pot; blue      | NO_TOMATO |
| 9     | pot; red       | TOMATO    |
| 10    | pot; red       | TOMATO    |
| 11    | pot; red; blue | TOMATO    |
| 12    | pot; blue      | TOMATO    |
| 13    | pot; red; blue | TOMATO    |
| 14    | pot; blue      | NO_TOMATO |
| 15    | pot; red       | TOMATO    |
| 16    | pot; blue      | NO_TOMATO |
| 17    | pot; red       | NO_TOMATO |
| 18    | pot; blue      | TOMATO    |
| 19    | pot            | NO_TOMATO |
| 20    | pot; blue      | NO_TOMATO |
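For readers who prefer code to tables, the expansion-and-shuffle procedure can be sketched in a few lines of Python (a sketch of our own; since the shuffle is random, the resulting order will generally differ from the particular one shown in Table 4):

```python
import random

# Frequency table (Table 2): (cues, outcome, frequency).
frequency_table = [
    (("pot", "red", "blue"), "TOMATO", 3),
    (("pot", "red", "blue"), "NO_TOMATO", 0),
    (("pot", "red"), "TOMATO", 5),
    (("pot", "red"), "NO_TOMATO", 2),
    (("pot", "blue"), "TOMATO", 3),
    (("pot", "blue"), "NO_TOMATO", 5),
    (("pot",), "TOMATO", 0),
    (("pot",), "NO_TOMATO", 2),
]

# Expand each row into individual learning trials (Table 3) ...
trials = [(cues, outcome)
          for cues, outcome, freq in frequency_table
          for _ in range(freq)]

# ... and shuffle them into an unsystematic order (Table 4).
random.shuffle(trials)
print(len(trials))  # prints 20: the total number of learning events
```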

We now have all the ingredients to simulate the process of iterative learning. That is, we have learning events, which consist of input cues and outcomes. The goal of such learning would be to (gradually) estimate how valuable cues are in discriminating (or predicting) each outcome. We could use pencil and paper to calculate this, and we will actually do this for the first three trials from Table 4.

But before we can get started, we need to have our learning principle handy. Our learning principle is the Rescorla-Wagner (1972) rule which learns about the relationship between a cue and an outcome. (An essentially identical learning rule was proposed by Widrow and Hoff in 1960, later known as the Widrow-Hoff rule or the Least Mean Square method in the field of Machine Learning. The Rescorla-Wagner/Widrow-Hoff rule also corresponds closely to the Delta rule which serves as the basis in many connectionist networks.)

The rule defines the state of learning in terms of a learning weight $$w$$ (i.e., strength of connection) between cue $$i$$ and outcome $$j$$ at any given time $$t$$ (thus, we write: $$w_{ij}^{t}$$). Here, time is not continuous as it appears in reality, but consists of discrete steps – these steps are the learning Trials from the previous tables. The Rescorla-Wagner rule states that the cue-outcome weight at the next time step ($$w_{ij}^{t+1}$$) is the weight at the current step ($$w_{ij}^{t}$$) plus some change that is due to the current experience, i.e., learning ($$\Delta w^{t}$$):

$$w_{ij}^{t+1} = w_{ij}^{t} + \Delta w^{t}$$

Learning differentiates only three possible situations and, consequently, three possible changes. These are:

| Situation | Interpretation | Change |
|-----------|----------------|--------|
| (1) The cue is absent | nothing happens | $$\Delta w^{t} = 0$$ |
| (2) The cue is present; the outcome is present | positive evidence that should strengthen the connection weight | $$\Delta w^{t} = \gamma(1 - \sum{w_{.j}})$$ |
| (3) The cue is present; the outcome is absent | negative evidence that should weaken the connection weight | $$\Delta w^{t} = \gamma(0 - \sum{w_{.j}})$$ |

Although the equations above might look a bit intimidating, especially (2) and (3), most of their terms are shared: the free parameter $$\gamma$$ – the learning rate (or speed of learning) – and the sum of weights for a given outcome $$j$$ over the cues present at the current time step ($$\sum{w_{.j}}$$) appear in both (2) and (3). The learning rate typically takes a very small value (e.g., 0.01 or even 0.001). This ensures that adjustments are subtle and learning is gradual, whereas large values would trigger large adjustments. Somewhat simplified, a large value would imply that a single event can dramatically change predictions regardless of the amount of prior experience. Animals and humans do not make such drastic changes to their predictions but (more or less) trust their experience: if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.
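A tiny illustration of this point (our own, not from Spellman's data): with a single cue, a single outcome, and a weight starting at zero, one positive learning event moves the weight exactly a fraction $$\gamma$$ of the way towards 1.

```python
# One positive learning event; w starts at 0, so delta = gamma * (1 - 0) = gamma.
for gamma in (0.001, 0.01, 0.5):
    w = 0.0
    w = w + gamma * (1 - w)  # update rule (2) with a single present cue
    print(f"gamma={gamma}: weight after one trial = {w}")
```

With $$\gamma = 0.5$$ a single trial already drives the weight halfway to its asymptote – a dramatic change no amount of later experience can easily justify.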

The difference between (2) and (3) lies in the target of the prediction. The sum of weights ($$\sum{w_{.j}}$$) is the current prediction of outcome $$j$$ from the cues that are present. If both cue and outcome are present, the evidence is positive and the prediction should be corrected towards 1, so you subtract the sum of weights from 1. If the cue is present but the outcome is not, the evidence is negative and the prediction should be corrected towards 0 (recall that this kind of learning is error-driven), so you subtract the sum of weights from 0.
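Cases (1)–(3) can be folded into a single update step, since the present/absent outcome simply sets a target of 1 or 0. A minimal Python sketch (the function and variable names are our own), run over the first three trials of Table 4 with $$\gamma = 0.01$$:

```python
def rescorla_wagner_update(weights, cues, outcome_present, gamma=0.01):
    """One Rescorla-Wagner step for a single outcome.

    weights:         dict mapping each cue to its weight w_ij for this outcome
    cues:            the cues present on this trial
    outcome_present: True if the outcome occurred (target 1), else False (target 0)
    """
    target = 1.0 if outcome_present else 0.0
    prediction = sum(weights[c] for c in cues)  # sum of w_.j over present cues
    delta = gamma * (target - prediction)
    for c in cues:          # absent cues are left untouched (case 1)
        weights[c] += delta
    return weights

# Weights for the outcome TOMATO, all starting at zero.
w_tomato = {"pot": 0.0, "red": 0.0, "blue": 0.0}

# The first three trials of Table 4.
first_trials = [
    (("pot",), False),                # trial 1: pot -> NO_TOMATO
    (("pot", "red", "blue"), True),   # trial 2: pot; red; blue -> TOMATO
    (("pot", "red"), True),           # trial 3: pot; red -> TOMATO
]
for cues, tomato_present in first_trials:
    rescorla_wagner_update(w_tomato, cues, tomato_present)

print(w_tomato)  # after three trials, pot and red are slightly ahead of blue
```

Trial 1 changes nothing (the prediction and the target are both 0); trial 2 bumps all three cues by $$0.01$$; trial 3 bumps pot and red by $$0.01 \times (1 - 0.02) = 0.0098$$.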

 Keep in mind that the machinery in the original Rescorla-Wagner rule is a bit more complicated. Instead of only one free parameter $$\gamma$$ there are two: $$\alpha_{i}$$ is the salience of cue $$i$$, while $$\beta_{1}$$ controls the importance of positive evidence and $$\beta_{2}$$ the importance of negative evidence. Alphas and betas form a multiplicative term, $$\alpha_{i} \times \beta_{1}$$ or $$\alpha_{i} \times \beta_{2}$$, depending on whether we have a case of positive (use $$\beta_{1}$$) or negative (use $$\beta_{2}$$) evidence. Additionally, in almost all (if not all) applications authors assume that the two betas are equal ($$\beta_{1} = \beta_{2}$$), which means that the product of the two free parameters ($$\alpha \times \beta$$) can be treated as a single free parameter – the one we introduced above as the learning rate $$\gamma$$. Finally, in the original Rescorla-Wagner model the subtraction in (2), $$1 - \sum{w_{.j}}$$, is expressed more generally as $$\lambda - \sum{w_{.j}}$$, which means that the maximal strength of association (i.e., the asymptote of the weight) can in fact take any value. For Rescorla and Wagner it was important to keep the learnability of an outcome flexible and, potentially, different from outcome to outcome. In most applications this is less of an issue and authors typically simplify to $$\lambda = 1$$, which also makes outcomes comparable because their asymptotes are normalised.
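To see how the original parameterisation collapses to our single $$\gamma$$, here is a small sketch (the function name and parameter values are our own illustrative choices):

```python
def delta_w(prediction, outcome_present, alpha=0.1, beta1=0.1, beta2=0.1, lam=1.0):
    """Original Rescorla-Wagner change for one present cue.

    prediction is the sum of current weights w_.j over the cues present.
    """
    beta = beta1 if outcome_present else beta2       # beta1 for positive evidence
    target = lam if outcome_present else 0.0         # lambda is the asymptote
    return alpha * beta * (target - prediction)

# With beta1 == beta2 and lam == 1, the product alpha * beta acts as a
# single learning rate gamma, exactly as in the simplified rule.
gamma = 0.1 * 0.1
assert abs(delta_w(0.3, True) - gamma * (1 - 0.3)) < 1e-12
assert abs(delta_w(0.3, False) - gamma * (0 - 0.3)) < 1e-12
```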

We are now ready to use the Rescorla-Wagner learning rule to calculate how we might learn to predict which liquid (or perhaps the combination of both liquids) is most likely to yield tomatoes.

Petar/Dagmar

References:

Spellman, B. A. (1996). Conditionalizing causality. In D. R. Shanks, K. J. Holyoak, & D. L. Medin (Eds.), Causal learning (pp. 167-206). San Diego, CA: Academic Press.

Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II (pp. 64-99). New York, NY: Appleton-Century-Crofts.

Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. In WESCON Convention Record Part IV (pp. 96-104). New York, NY: Institute of Radio Engineers. [Reprinted in 1988: J. A. Anderson & E. Rosenfeld (Eds.), Neurocomputing: Foundations of Research (pp. 126-134). Cambridge, MA: MIT Press.]