About language, learning and tomatoes
In various everyday situations humans make judgements about which of several possible causes is responsible for a particular effect. Is thunder the sound you hear when almighty Zeus calls the lesser gods to order by banging his fist on the table, or is it caused by two electrically charged regions in the atmosphere equalising? Is the wine sweet because the grapes grew in clay soil or because Bacchus was merciful?
In a series of three blog posts we will introduce a simple learning principle that can simulate human causal judgements rather well. Our example is an adapted and simplified version of the causal judgement problem presented in Spellman (1996). She used a situation with two fertilising liquids (of two different colours: blue and red) and counts of outcomes (TOMATO, or NO_TOMATO, i.e., no plant whatsoever) when plants were given no liquid, one of the two liquids, or both liquids, in a growing pot that is treated as a constant background (i.e., a context). The participants’ task was to draw causal inferences from what is, in essence, a \(2 \times 2\) table of possibilities (red liquid: yes/no; blue liquid: yes/no) with two outcomes – tomatoes or no plant at all (Table 1).
Table 1

Cues            Outcomes
pot; red; blue  TOMATO
pot; red; blue  NO_TOMATO
pot; red        TOMATO
pot; red        NO_TOMATO
pot; blue       TOMATO
pot; blue       NO_TOMATO
pot             TOMATO
pot             NO_TOMATO
This information can be expanded with counts of how frequently each situation occurred (Table 2).
Table 2

Cues            Outcomes   Frequency
pot; red; blue  TOMATO     3
pot; red; blue  NO_TOMATO  0
pot; red        TOMATO     5
pot; red        NO_TOMATO  2
pot; blue       TOMATO     3
pot; blue       NO_TOMATO  5
pot             TOMATO     0
pot             NO_TOMATO  2
If we expand Table 2 to present each learning trial separately (given the number of repetitions, as specified in the column Frequency), we get Table 3 below.
Table 3

Trial  Cues            Outcomes
1      pot; red; blue  TOMATO
2      pot; red; blue  TOMATO
3      pot; red; blue  TOMATO
4      pot; red        TOMATO
5      pot; red        TOMATO
6      pot; red        TOMATO
7      pot; red        TOMATO
8      pot; red        TOMATO
9      pot; red        NO_TOMATO
10     pot; red        NO_TOMATO
11     pot; blue       TOMATO
12     pot; blue       TOMATO
13     pot; blue       TOMATO
14     pot; blue       NO_TOMATO
15     pot; blue       NO_TOMATO
16     pot; blue       NO_TOMATO
17     pot; blue       NO_TOMATO
18     pot; blue       NO_TOMATO
19     pot             NO_TOMATO
20     pot             NO_TOMATO
In Table 3 identical trials are grouped (i.e., repeated as many times as Frequency indicates). We may want to shuffle them so that the trials appear in a random (i.e., unsystematic) order. If we do this, the information will look similar to Table 4.
Table 4

Trial  Cues            Outcomes
1      pot             NO_TOMATO
2      pot; red; blue  TOMATO
3      pot; red        TOMATO
4      pot; red        NO_TOMATO
5      pot; blue       TOMATO
6      pot; blue       NO_TOMATO
7      pot; red        TOMATO
8      pot; blue       NO_TOMATO
9      pot; red        TOMATO
10     pot; red        TOMATO
11     pot; red; blue  TOMATO
12     pot; blue       TOMATO
13     pot; red; blue  TOMATO
14     pot; blue       NO_TOMATO
15     pot; red        TOMATO
16     pot; blue       NO_TOMATO
17     pot; red        NO_TOMATO
18     pot; blue       TOMATO
19     pot             NO_TOMATO
20     pot; blue       NO_TOMATO
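For readers who want to follow along computationally, the expansion of Table 2's frequency counts into a shuffled trial list (Tables 3 and 4) can be sketched in a few lines of Python. This is only an illustration; the dictionary layout and variable names are our own choices, not part of Spellman's materials.

```python
import random

# Counts from Table 2: (cues, outcome) -> frequency.
frequencies = {
    (("pot", "red", "blue"), "TOMATO"): 3,
    (("pot", "red", "blue"), "NO_TOMATO"): 0,
    (("pot", "red"), "TOMATO"): 5,
    (("pot", "red"), "NO_TOMATO"): 2,
    (("pot", "blue"), "TOMATO"): 3,
    (("pot", "blue"), "NO_TOMATO"): 5,
    (("pot",), "TOMATO"): 0,
    (("pot",), "NO_TOMATO"): 2,
}

# Expand into one entry per trial (Table 3), then shuffle (Table 4).
trials = [(cues, outcome)
          for (cues, outcome), n in frequencies.items()
          for _ in range(n)]
random.shuffle(trials)

print(len(trials))  # 20 trials in total
```

Note that `random.shuffle` produces a different order on each run, so the exact sequence will differ from the one shown in Table 4.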
We now have all the ingredients to simulate the process of iterative learning. That is, we have learning events, which consist of input cues and outcomes. The goal of such learning is to (gradually) estimate how valuable the cues are for discriminating between (or predicting) the outcomes. We could use pencil and paper to calculate this, and we will do exactly that for the first three trials from Table 4.
But before we can get started, we need to have our learning principle handy. Our learning principle is the Rescorla-Wagner (1972) rule, which learns about the relationship between a cue and an outcome. (An essentially identical learning rule was proposed by Widrow and Hoff in 1960, later known as the Widrow-Hoff rule or the Least Mean Squares method in the field of Machine Learning. The Rescorla-Wagner/Widrow-Hoff rule also corresponds closely to the Delta rule which serves as the basis of many connectionist networks.)
The rule defines the state of learning in terms of a learning weight \(w\) (i.e., a strength of connection) between cue \(i\) and outcome \(j\) at any given time \(t\) (thus, we write \(w_{ij}^{t}\)). Here, time is not continuous, as it is in reality, but consists of discrete steps – these steps are the learning trials from the previous tables. The Rescorla-Wagner rule states that the cue-outcome weight at the next time step (\(w_{ij}^{t+1}\)) is the weight at the current step (\(w_{ij}^{t}\)) plus some change that is due to the current experience, i.e., learning (\(\Delta w^{t}\)):
\[ w_{ij}^{t+1} = w_{ij}^{t} + \Delta w^{t} \]
Learning differentiates only three possible situations and, consequently, three possible changes. These are:

(1) The cue is absent; nothing happens: \(\Delta w^{t} = 0\)

(2) The cue is present and the outcome is present; positive evidence: \(\Delta w^{t} = \gamma(1 - \sum{w_{.j}})\)

(3) The cue is present but the outcome is absent; negative evidence: \(\Delta w^{t} = \gamma(0 - \sum{w_{.j}})\)
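For the computationally inclined, the three cases can be captured in a small Python function. This is a hedged sketch: representing the weights as a dictionary and defaulting to \(\gamma = 0.01\) are our choices, not part of the original formulation.

```python
def delta(cue, cues_present, outcome_present, weights, gamma=0.01):
    """Rescorla-Wagner change for one cue-outcome weight on a single trial.

    weights maps each cue to its current weight for the outcome in question;
    only the cues present on the trial contribute to the summed prediction.
    """
    if cue not in cues_present:
        return 0.0                                  # case (1): nothing happens
    total = sum(weights[c] for c in cues_present)   # sum of weights for outcome j
    target = 1.0 if outcome_present else 0.0        # 1 in case (2), 0 in case (3)
    return gamma * (target - total)
```

For example, calling `delta("red", {"pot", "red", "blue"}, True, weights)` with all weights at zero returns \(0.01 \times (1 - 0) = 0.01\).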
Although the equations above might look a bit intimidating, especially (2) and (3), most of the terms are shared: the free parameter \(\gamma\) – the learning rate (or speed of learning) – and the sum of the weights for a given outcome \(j\) (\(\sum{w_{.j}}\)) at the current time step are part of both (2) and (3). The learning rate typically takes on a very small value (e.g., 0.01 or even 0.001). This ensures that adjustments are subtle and learning is gradual. Large values would trigger large adjustments. Somewhat simplified, a large value would imply that a single event can dramatically change predictions, regardless of the amount of prior experience. Animals and humans do not make drastic changes to their predictions but (more or less) trust their experience: if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.
The difference between (2) and (3) is that if both the cue and the outcome are present, the evidence is positive and you subtract the sum of weights (\(\sum{w_{.j}}\)) from 1. Otherwise, when the cue is present but the outcome is not, you obtain negative evidence; to correct for the error in predicting that outcome (recall that our learning is error-driven), you subtract the sum of weights (\(\sum{w_{.j}}\)) from 0.
Keep in mind that the machinery in the original Rescorla-Wagner rule is a bit more complicated. Instead of only one free parameter \(\gamma\) there are two: \(\alpha_{i}\) is the salience of cue \(i\), while \(\beta_{1}\) controls the importance of positive evidence and \(\beta_{2}\) the importance of negative evidence. Alphas and betas form a multiplicative term, \(\alpha_{i} \times \beta_{1}\) or \(\alpha_{i} \times \beta_{2}\), depending on whether we have a case of positive (use \(\beta_{1}\)) or negative (use \(\beta_{2}\)) evidence. Additionally, in almost all (if not all) applications authors assume that the two betas are equal (\(\beta_{1} = \beta_{2}\)), which means that the product of the two free parameters (\(\alpha \times \beta\)) can be treated as a single free parameter. This is the parameter we introduced above as the learning rate, \(\gamma\). Finally, in the original Rescorla-Wagner model the subtraction in (2), \(1 - \sum{w_{.j}}\), is expressed more generally as \(\lambda - \sum{w_{.j}}\), which means that the maximal strength of association (i.e., the weight) can in fact take any value. For Rescorla and Wagner it was important to keep the learnability of an outcome flexible and, potentially, different from outcome to outcome. In most applications this is less of an issue, and authors typically simplify to \(\lambda = 1\). This also makes all outcomes comparable, because they are normalised.
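To make the mapping between the original parameters and the simplified rule concrete, here is a sketch of the original form in Python. The parameter values are arbitrary illustrations, and the dictionary-based weight representation is our own choice.

```python
def delta_original(cue, cues_present, outcome_present, weights,
                   alpha=0.1, beta1=0.1, beta2=0.1, lam=1.0):
    """One-trial weight change in the original Rescorla-Wagner form.

    alpha: salience of the cue; beta1/beta2: importance of positive/negative
    evidence; lam: asymptote (maximal associative strength) of the outcome.
    All parameter values here are arbitrary illustrations.
    """
    if cue not in cues_present:
        return 0.0                                   # absent cue: no change
    total = sum(weights[c] for c in cues_present)    # summed prediction
    if outcome_present:
        return alpha * beta1 * (lam - total)         # positive evidence
    return alpha * beta2 * (0.0 - total)             # negative evidence

# With beta1 == beta2 and lam == 1, the product alpha * beta acts as a
# single learning rate gamma, recovering the simplified rule from the text.
w = {"pot": 0.0, "red": 0.0, "blue": 0.0}
print(delta_original("red", {"pot", "red", "blue"}, True, w))
```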
We are now ready to use the Rescorla-Wagner learning rule to calculate how we might learn to predict which liquid (or maybe the combination of both liquids) is most likely to yield tomatoes.
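As a preview of what such a calculation looks like when carried through all 20 trials of Table 4, here is a minimal simulation sketch in Python. We assume, purely for illustration, that each outcome (TOMATO and NO_TOMATO) has its own set of weights and that \(\gamma = 0.01\); these choices, and the code itself, are ours rather than Spellman's or Rescorla and Wagner's.

```python
from collections import defaultdict

# The 20 trials of Table 4, in order: (cues, outcome).
trials = [
    (("pot",), "NO_TOMATO"),
    (("pot", "red", "blue"), "TOMATO"),
    (("pot", "red"), "TOMATO"),
    (("pot", "red"), "NO_TOMATO"),
    (("pot", "blue"), "TOMATO"),
    (("pot", "blue"), "NO_TOMATO"),
    (("pot", "red"), "TOMATO"),
    (("pot", "blue"), "NO_TOMATO"),
    (("pot", "red"), "TOMATO"),
    (("pot", "red"), "TOMATO"),
    (("pot", "red", "blue"), "TOMATO"),
    (("pot", "blue"), "TOMATO"),
    (("pot", "red", "blue"), "TOMATO"),
    (("pot", "blue"), "NO_TOMATO"),
    (("pot", "red"), "TOMATO"),
    (("pot", "blue"), "NO_TOMATO"),
    (("pot", "red"), "NO_TOMATO"),
    (("pot", "blue"), "TOMATO"),
    (("pot",), "NO_TOMATO"),
    (("pot", "blue"), "NO_TOMATO"),
]

GAMMA = 0.01  # learning rate; an illustrative choice
outcomes = ("TOMATO", "NO_TOMATO")
weights = {o: defaultdict(float) for o in outcomes}  # all weights start at 0

for cues, outcome in trials:
    for o in outcomes:
        total = sum(weights[o][c] for c in cues)   # summed prediction for o
        target = 1.0 if o == outcome else 0.0      # positive vs negative evidence
        change = GAMMA * (target - total)          # same change for each present cue
        for c in cues:
            weights[o][c] += change

for o in outcomes:
    print(o, {c: round(weights[o][c], 4) for c in ("pot", "red", "blue")})
```

After these 20 trials, red ends up with a larger weight for TOMATO than blue, while blue carries the larger weight for NO_TOMATO – in line with the frequencies in Table 2.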
Petar/Dagmar
References:
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II (pp. 64–99). New York, NY: Appleton-Century-Crofts.
Spellman, B. A. (1996). Conditionalizing causality. In D. R. Shanks, K. Holyoak, & D. L. Medin (Eds.), Causal learning (pp. 167–206). San Diego, CA: Academic Press.
Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. In WESCON Convention Record, Part IV (pp. 96–104). New York, NY: Institute of Radio Engineers. [Reprinted in 1988 in J. A. Anderson & E. Rosenfeld (Eds.), Neurocomputing: Foundations of research (pp. 126–134). Cambridge, MA: MIT Press.]