### About language, learning and tomatoes

[1] How it all works: Introducing the algorithm
Published: 28 May 2019

In various everyday situations humans make judgments about which of several possible causes is responsible for a particular effect. Is thunder the sound you hear when almighty Zeus calls the lesser Gods to order by banging his fist on the table, or is it caused by two electrically charged regions in the atmosphere equalizing? Is the wine sweet because the grapes grew in clay soil or because Bacchus was merciful?

In a series of three blog posts we will introduce a simple learning principle that can simulate human causal judgments rather well. Our example is an adapted and simplified version of the causal-judgment learning problem presented in Spellman (1996). She used a situation with two fertilising liquids (of two different colours: blue and red) and counts of outcomes (TOMATO, or NO_TOMATO, i.e., no plant whatsoever) when a growing pot was given no liquid, one of the two liquids, or both liquids. The pot itself is treated as a constant background (i.e., an ambient context). The participants’ task was to draw causal inferences from what is, in essence, a $$2 \times 2$$ table of possibilities (red liquid: yes/no; blue liquid: yes/no), each combination leading to one of two outcomes – tomatoes or no plant at all.

Table 1

| Outcome   | Blue Liquid | Red Liquid |
|-----------|-------------|------------|
| TOMATO    | yes / no    | yes / no   |
| NO_TOMATO | yes / no    | yes / no   |

This information can be expanded with counts of how frequently each situation occurred (Table 2).

Table 2

| Cues           | Outcomes  | Frequency |
|----------------|-----------|-----------|
| pot; red; blue | TOMATO    | 3 |
| pot; red; blue | NO_TOMATO | 0 |
| pot; red       | TOMATO    | 5 |
| pot; red       | NO_TOMATO | 2 |
| pot; blue      | TOMATO    | 3 |
| pot; blue      | NO_TOMATO | 5 |
| pot            | TOMATO    | 0 |
| pot            | NO_TOMATO | 2 |

If we expand Table 2 to present each learning trial separately (given the number of repetitions, as specified in the column Frequency), we get Table 3 below.

Table 3

| Trial | Cues           | Outcomes  |
|-------|----------------|-----------|
| 1     | pot; red; blue | TOMATO    |
| 2     | pot; red; blue | TOMATO    |
| 3     | pot; red; blue | TOMATO    |
| 4     | pot; red       | TOMATO    |
| 5     | pot; red       | TOMATO    |
| 6     | pot; red       | TOMATO    |
| 7     | pot; red       | TOMATO    |
| 8     | pot; red       | TOMATO    |
| 9     | pot; red       | NO_TOMATO |
| 10    | pot; red       | NO_TOMATO |
| 11    | pot; blue      | TOMATO    |
| 12    | pot; blue      | TOMATO    |
| 13    | pot; blue      | TOMATO    |
| 14    | pot; blue      | NO_TOMATO |
| 15    | pot; blue      | NO_TOMATO |
| 16    | pot; blue      | NO_TOMATO |
| 17    | pot; blue      | NO_TOMATO |
| 18    | pot; blue      | NO_TOMATO |
| 19    | pot            | NO_TOMATO |
| 20    | pot            | NO_TOMATO |

In Table 3, identical trials are grouped (i.e., each is repeated as many times as its Frequency indicates). We may want to shuffle them so that the trials appear in a random (i.e., unsystematic) order. If we do this, the information will look similar to Table 4.

Table 4

| Trial | Cues           | Outcomes  |
|-------|----------------|-----------|
| 1     | pot            | NO_TOMATO |
| 2     | pot; red; blue | TOMATO    |
| 3     | pot; red       | TOMATO    |
| 4     | pot; red       | NO_TOMATO |
| 5     | pot; blue      | TOMATO    |
| 6     | pot; blue      | NO_TOMATO |
| 7     | pot; red       | TOMATO    |
| 8     | pot; blue      | NO_TOMATO |
| 9     | pot; red       | TOMATO    |
| 10    | pot; red       | TOMATO    |
| 11    | pot; red; blue | TOMATO    |
| 12    | pot; blue      | TOMATO    |
| 13    | pot; red; blue | TOMATO    |
| 14    | pot; blue      | NO_TOMATO |
| 15    | pot; red       | TOMATO    |
| 16    | pot; blue      | NO_TOMATO |
| 17    | pot; red       | NO_TOMATO |
| 18    | pot; blue      | TOMATO    |
| 19    | pot            | NO_TOMATO |
| 20    | pot; blue      | NO_TOMATO |
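For readers who prefer code to tables, the expansion-and-shuffle procedure can be sketched in a few lines of Python (a sketch of our own; since the shuffle is random, the resulting order will generally differ from the particular one shown in Table 4):

```python
import random

# Frequency table (Table 2): (cues, outcome, frequency).
frequency_table = [
    (("pot", "red", "blue"), "TOMATO", 3),
    (("pot", "red", "blue"), "NO_TOMATO", 0),
    (("pot", "red"), "TOMATO", 5),
    (("pot", "red"), "NO_TOMATO", 2),
    (("pot", "blue"), "TOMATO", 3),
    (("pot", "blue"), "NO_TOMATO", 5),
    (("pot",), "TOMATO", 0),
    (("pot",), "NO_TOMATO", 2),
]

# Expand each row into individual learning trials (Table 3) ...
trials = [(cues, outcome)
          for cues, outcome, freq in frequency_table
          for _ in range(freq)]

# ... and shuffle them into an unsystematic order (Table 4).
random.shuffle(trials)
print(len(trials))  # prints 20: the total number of learning events
```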

We now have all the ingredients to simulate the process of iterative learning. That is, we have learning events, which consist of input cues and outcomes. The goal of such learning would be to (gradually) estimate how valuable cues are in discriminating (or predicting) each outcome. We could use pencil and paper to calculate this, and we will actually do this for the first three trials from Table 4.

But before we can get started, we need to have our learning principle handy. Our learning principle is the Rescorla-Wagner (1972) rule which learns about the relationship between a cue and an outcome. (An essentially identical learning rule was proposed by Widrow and Hoff in 1960, later known as the Widrow-Hoff rule or the Least Mean Square method in the field of Machine Learning. The Rescorla-Wagner/Widrow-Hoff rule also corresponds closely to the Delta rule which serves as the basis in many connectionist networks.)

The rule defines the state of learning in terms of a learning weight $$w$$ (i.e., strength of connection) between cue $$i$$ and outcome $$j$$ at any given time $$t$$ (thus, we write: $$w_{ij}^{t}$$). Here, time is not continuous as it appears in reality, but consists of discrete steps – these steps are the learning Trials from the previous tables. The Rescorla-Wagner rule states that the cue-outcome weight at the next time step ($$w_{ij}^{t+1}$$) is the weight at the current step ($$w_{ij}^{t}$$) plus some change that is due to the current experience, i.e., learning ($$\Delta w^{t}$$):

$$w_{ij}^{t+1} = w_{ij}^{t} + \Delta w^{t}$$

Learning differentiates only three possible situations and, consequently, three possible changes. These are:

| Situation | Interpretation | Change |
|-----------|----------------|--------|
| (1) The cue is absent | nothing happens | $$\Delta w^{t} = 0$$ |
| (2) The cue is present; the outcome is present | positive evidence that should strengthen the connection weight | $$\Delta w^{t} = \gamma(1 - \sum{w_{.j}})$$ |
| (3) The cue is present; the outcome is absent | negative evidence that should weaken the connection weight | $$\Delta w^{t} = \gamma(0 - \sum{w_{.j}})$$ |

Although the equations above might look a bit intimidating, especially (2) and (3), most of their terms are shared: the free parameter $$\gamma$$ – the learning rate (or speed of learning) – and the sum of weights for a given outcome $$j$$ over the cues present at the current time step ($$\sum{w_{.j}}$$) appear in both (2) and (3). The learning rate typically takes a very small value (e.g., 0.01 or even 0.001). This ensures that adjustments are subtle and learning is gradual, whereas large values would trigger large adjustments. Somewhat simplified, a large value would imply that a single event can dramatically change predictions regardless of the amount of prior experience. Animals and humans do not make such drastic changes to their predictions but (more or less) trust their experience: if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.
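A tiny illustration of this point (our own, not from Spellman's data): with a single cue, a single outcome, and a weight starting at zero, one positive learning event moves the weight exactly a fraction $$\gamma$$ of the way towards 1.

```python
# One positive learning event; w starts at 0, so delta = gamma * (1 - 0) = gamma.
for gamma in (0.001, 0.01, 0.5):
    w = 0.0
    w = w + gamma * (1 - w)  # update rule (2) with a single present cue
    print(f"gamma={gamma}: weight after one trial = {w}")
```

With $$\gamma = 0.5$$ a single trial already drives the weight halfway to its asymptote – a dramatic change no amount of later experience can easily justify.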

The difference between (2) and (3) lies in the target of the prediction. The sum of weights ($$\sum{w_{.j}}$$) is the current prediction of outcome $$j$$ from the cues that are present. If both cue and outcome are present, the evidence is positive and the prediction should be corrected towards 1, so you subtract the sum of weights from 1. If the cue is present but the outcome is not, the evidence is negative and the prediction should be corrected towards 0 (recall that this kind of learning is error-driven), so you subtract the sum of weights from 0.
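Cases (1)–(3) can be folded into a single update step, since the present/absent outcome simply sets a target of 1 or 0. A minimal Python sketch (the function and variable names are our own), run over the first three trials of Table 4 with $$\gamma = 0.01$$:

```python
def rescorla_wagner_update(weights, cues, outcome_present, gamma=0.01):
    """One Rescorla-Wagner step for a single outcome.

    weights:         dict mapping each cue to its weight w_ij for this outcome
    cues:            the cues present on this trial
    outcome_present: True if the outcome occurred (target 1), else False (target 0)
    """
    target = 1.0 if outcome_present else 0.0
    prediction = sum(weights[c] for c in cues)  # sum of w_.j over present cues
    delta = gamma * (target - prediction)
    for c in cues:          # absent cues are left untouched (case 1)
        weights[c] += delta
    return weights

# Weights for the outcome TOMATO, all starting at zero.
w_tomato = {"pot": 0.0, "red": 0.0, "blue": 0.0}

# The first three trials of Table 4.
first_trials = [
    (("pot",), False),                # trial 1: pot -> NO_TOMATO
    (("pot", "red", "blue"), True),   # trial 2: pot; red; blue -> TOMATO
    (("pot", "red"), True),           # trial 3: pot; red -> TOMATO
]
for cues, tomato_present in first_trials:
    rescorla_wagner_update(w_tomato, cues, tomato_present)

print(w_tomato)  # after three trials, pot and red are slightly ahead of blue
```

Trial 1 changes nothing (the prediction and the target are both 0); trial 2 bumps all three cues by $$0.01$$; trial 3 bumps pot and red by $$0.01 \times (1 - 0.02) = 0.0098$$.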

 Keep in mind that the machinery in the original Rescorla-Wagner rule is a bit more complicated. Instead of only one free parameter $$\gamma$$ there are two: $$\alpha_{i}$$ is the salience of cue $$i$$, while $$\beta_{1}$$ controls the importance of positive evidence and $$\beta_{2}$$ the importance of negative evidence. Alphas and betas form a multiplicative term, $$\alpha_{i} \times \beta_{1}$$ or $$\alpha_{i} \times \beta_{2}$$, depending on whether we have a case of positive (use $$\beta_{1}$$) or negative (use $$\beta_{2}$$) evidence. Additionally, in almost all (if not all) applications authors assume that the two betas are equal ($$\beta_{1} = \beta_{2}$$), which means that the product of the two free parameters ($$\alpha \times \beta$$) can be treated as a single free parameter – the one we introduced above as the learning rate $$\gamma$$. Finally, in the original Rescorla-Wagner model the subtraction in (2), $$1 - \sum{w_{.j}}$$, is expressed more generally as $$\lambda - \sum{w_{.j}}$$, which means that the maximal strength of association (i.e., the asymptote of the weight) can in fact take any value. For Rescorla and Wagner it was important to keep the learnability of an outcome flexible and, potentially, different from outcome to outcome. In most applications this is less of an issue and authors typically simplify to $$\lambda = 1$$, which also makes outcomes comparable because their asymptotes are normalised.
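To see how the original parameterisation collapses to our single $$\gamma$$, here is a small sketch (the function name and parameter values are our own illustrative choices):

```python
def delta_w(prediction, outcome_present, alpha=0.1, beta1=0.1, beta2=0.1, lam=1.0):
    """Original Rescorla-Wagner change for one present cue.

    prediction is the sum of current weights w_.j over the cues present.
    """
    beta = beta1 if outcome_present else beta2       # beta1 for positive evidence
    target = lam if outcome_present else 0.0         # lambda is the asymptote
    return alpha * beta * (target - prediction)

# With beta1 == beta2 and lam == 1, the product alpha * beta acts as a
# single learning rate gamma, exactly as in the simplified rule.
gamma = 0.1 * 0.1
assert abs(delta_w(0.3, True) - gamma * (1 - 0.3)) < 1e-12
assert abs(delta_w(0.3, False) - gamma * (0 - 0.3)) < 1e-12
```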

We are now ready to use the Rescorla-Wagner learning rule to calculate how we might learn to predict which liquid (or perhaps the combination of both liquids) is most likely to yield tomatoes.

Petar/Dagmar

References:

Spellman, B. A. (1996). Conditionalizing causality. In D. R. Shanks, K. J. Holyoak, & D. L. Medin (Eds.), Causal learning (pp. 167-206). San Diego, CA: Academic Press.

Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II (pp. 64-99). New York, NY: Appleton-Century-Crofts.

Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. In WESCON Convention Record Part IV (pp. 96-104). New York, NY: Institute of Radio Engineers. [Reprinted in 1988: J. A. Anderson & E. Rosenfeld (Eds.), Neurocomputing: Foundations of Research (pp. 126-134). Cambridge, MA: MIT Press.]