I found out about John R. Anderson almost immediately upon discovering intelligent tutoring systems a few years ago; he and his research group at Carnegie Mellon have blazed the way forward with these technologies. Their Cognitive Tutor, for example, is currently #5 out of 39 interventions in mathematics education, as evaluated by the US Department of Education’s “What Works Clearinghouse”. I learned that, notwithstanding these educational pursuits, his life’s work had been more about developing a “cognitive architecture” – a model of how the structure of the mind and its components work together to achieve human cognition. I learned that he called it ACT-R (for “adaptive control of thought - rational”) and that it has been steadily undergoing refinements since it debuted in the early 70s. Anyway, given how amazed I was with his tutoring-systems research, I was naturally drawn to Anderson’s 2007 book that surveys his life’s work in attempting to answer the titular question via ACT-R.
I’m moved to blog this because I was extremely impressed by (1) the synthesis of seemingly disparate phenomena (ACT-R is very consistent with a wide range of findings in cognitive psychology), and (2) how well his theories map onto findings from neuroscience. This book contains the most convincing model of human cognition I know of, but it is spread out across several chapters and compartmentalized in such a way that I feel I can unbox everything and tie it all together here in a more readily intelligible, coarse-grained fashion. It really is amazing, but I understand if you don’t want to sit here and read a whole long synopsis. For this reason, I will now post verbatim a summary given by Anderson at the end of the book (though before he talks about consciousness), so that you can make an informed decision about whether to read further.
The Modular Nature of Mind and Brain
The function of a cognitive architecture, according to Anderson, is “to find a specification of the structure of the brain that explains how it achieves the function of the mind.” He argues that connectionist models of cognition will never be able to completely account for human cognition as a whole:
Though many cognitive phenomena are certainly connectionist in nature, there is also no question that the brain is more than a uniform network of individual neurons. Much in the way that a cell is functionally partitioned into organelles, or that an organism comprises interconnected organ systems that each carry out characteristic tasks, the brain too has modularized certain functions, as evidenced by unique regions of neural anatomy associated with the performance of different tasks. The brain isn’t just one huge undifferentiated mass! Neurons that perform related computations occur close together by reason of parsimony: the further apart they are, the longer it would take for them to communicate. Thus, computation in the brain is local and parallel; different regions perform different functions in the service of cognition, though at a lower level the functionality of any given brain region is connectionist in nature. Indeed, almost all systems whose design is meant to achieve a function show this kind of hierarchical organization (Simon, 1962).
If the brain devotes local regions to certain functions, this implies that we should be able to use brain-scanning procedures to find regions that reflect specific activities. The ACT-R cognitive architecture proposes 8 basic modules, and has mapped them onto specific brain regions through a series of fMRI experiments.
The eight modules (four peripheral and four central), plus their associated brain regions, are as follows: (1) Visual - processing of attended information in the fusiform gyrus; (2) Aural - secondary auditory cortex; (3) Manual - hand motor/sensory region of central sulcus; (4) Vocal - face/tongue motor/sensory region of central sulcus; (5) Imaginal - mental/spatial representation area in posterior parietal cortex; (6) Declarative - memory storage/retrieval operations in prefrontal cortical areas; (7) Goal - cognition directed by anterior cingulate cortex; and (8) Procedural - integration, selection of cognition actions through the basal ganglia. A single fMRI study (Anderson et al., 2007) demonstrated the exercise of all of these modules and their associated brain regions. For our purposes, two of these modules are worth considering in more detail.
While the many regions of the brain do their own separate processing, they must act in a coordinated manner to achieve cognition. Thus, many regions of localized functionality are interconnected by tracts of neural fibers; particularly important are the connections between the cortex (the outermost region of the brain) and subcortical structures. One subcortical area in particular, the basal ganglia, is innervated by most of the cortex and plays a major role in controlling behavior through its actions on the thalamus. It marks a point of convergence across brain regions, compressing widely distributed information into what is effectively a single decision point. Thus, the basal ganglia is believed to be the main brain structure involved in action selection, or choosing which of many possible behaviors to perform in a given instance. Like their associated brain regions, the ACT-R modules must be able to communicate among each other, and they do so by placing information in small-capacity buffers associated with each of them. The procedural module plays the role of the basal ganglia by responding to patterns of information in these buffers and producing action. Though all modules are capable of independent parallel processing, they have to communicate via the procedural module, which can only execute a single rule/action at a time, thus forming a serial “central bottleneck” in overall processing.
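To make the “central bottleneck” idea concrete, here is a minimal sketch in Python. It is not ACT-R itself: the buffer names, the two productions, and the “first matching rule wins” conflict resolution are all assumptions made up for illustration. The only point it shows is that several rules can match the buffers at once, yet only one fires per cycle.

```python
from typing import Callable, Dict, List, Optional, Tuple

Buffers = Dict[str, Optional[str]]
# A production is (name, condition on buffers, action on buffers).
Production = Tuple[str, Callable[[Buffers], bool], Callable[[Buffers], None]]


def run_cycle(buffers: Buffers, productions: List[Production]) -> Optional[str]:
    """Fire at most ONE matching production and return its name (or None).

    Picking the first match is a simplifying assumption for this sketch.
    """
    for name, condition, action in productions:
        if condition(buffers):
            action(buffers)
            return name
    return None


def encode_visual(b: Buffers) -> None:
    # Copy the attended visual chunk into the imaginal buffer.
    b["imaginal"] = b["visual"]


def press_key(b: Buffers) -> None:
    # Ask the manual module to respond.
    b["manual"] = "press"


# Hypothetical productions: both match the initial buffer contents.
productions: List[Production] = [
    ("encode-visual",
     lambda b: b["visual"] is not None and b["imaginal"] is None,
     encode_visual),
    ("press-key",
     lambda b: b["goal"] == "respond" and b["manual"] is None,
     press_key),
]

buffers: Buffers = {"goal": "respond", "visual": "red square",
                    "imaginal": None, "manual": None}

# Three cycles: the two matching rules fire one after another, never together.
trace = [run_cycle(buffers, productions) for _ in range(3)]
print(trace)
```

Even though both rules are eligible on the first cycle, they are executed serially, which is the bottleneck described above.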
So the basal ganglia plays the role of a “coordinating module”. Appropriately, this region is evolutionarily older than the cortex and it occurs to some extent in all vertebrates. The other module I wanted to consider is the Goal module, which enables means-ends analysis. This kind of analysis requires that one be able to disengage from what one wants (the goal, or “end”) in order to focus on something else (the “means”). Some researchers (Papineau, 2001) assert that this is a uniquely human capability.
So, where are we at? The human mind is thought to be partitioned into specific information-processing functions, and thankfully neuroanatomy appears to be cut along similar joints, with specific brain regions devoted to different functions and interconnections that provide for coordination among these functions. Having posited a cognitive architecture based on interacting modules, Anderson turns next to the nitty-gritty of learning and memory.
Learning and Memory in ACT-R
Above, I mentioned a “Declarative” module as being among the central modules posited by ACT-R. Anderson’s fundamental claim is that “declarative memory tries to give us, moment by moment, the most appropriate possible window into our past,” and “this window into our past gives us our identities.”
He assumes the well-documented distinction between declarative learning, or learning of “facts”, and procedural learning (skill acquisition). He doesn’t, however, make Tulving’s (1972) episodic/semantic distinction; instead he considers both to be explicitly learned in a given context, with the difference being that the “semantic” memory (such as “Lincoln was a U.S. president”) has been encountered in so many subsequent contexts that we no longer have access to the context in which it was originally learned. Declarative memories can be strengthened, or made more available, by mere exposure.
In addition to the formation and strengthening of declarative memories, there is also procedural learning and subsequent conditioning of these actions. An example he gives is typing: we all know how to type, but we would have a difficult time if asked to give the location of a certain key on a keyboard (without using our fingers as an aid or relying on a common mnemonic like “the home row” or “qwerty”). Conditioning is how all animals learn that certain actions are more effective in certain situations through experience; these can be procedural actions or innate tendencies. Procedural knowledge is associated with the basal ganglia and will be discussed in greater detail below; for now, we will stay with declarative learning.
Interestingly, there are two ways of acquiring declarative memories. This can be illustrated by anterograde amnesiacs like H.M., who, despite the loss of the hippocampus (and the ability thereby to form new memories), was able to learn about famous people such as John F. Kennedy and others who became famous after his surgery. Recent researchers have postulated two different learning systems: while the hippocampus is known to subserve most declarative learning, other brain structures can slowly acquire such memories through repetition (presumably how H.M. came to know about famous people). Furthermore, through rehearsal, memories can be slowly transferred from the hippocampus to neocortical regions, explaining why those with a damaged or missing hippocampus can still access older memories (which are presumed to have undergone such transfer). So, while the hippocampus limits the capacity of declarative memory, it does not limit all learning.
I’ve long been confused about the relative finitude of memory, but Anderson makes a strong case for there being definite limits on the size of declarative memory. Beyond physical limits of sheer size and metabolic costs, he makes the interesting claim that the very flexibility of our memory-search ability derives from it being strategically limited, “throwing out” memories that are unlikely to be needed: “declarative memory, faced with limited capacity, is in effect constantly discarding memories that have outlived their usefulness”.
Alongside Lael Schooler, Anderson (1991) researched the fundamental mechanisms of declarative memory. They found that if a memory has not been retrieved in a while, it becomes increasingly unlikely that it will be needed in the future. Indeed, there is a simple relationship between how likely a memory would be needed on a given day and how long it had been (t) since the memory was last used:
$$\text{Odds needed} = A\,t^{-d}$$
where $A$ is just a constant and $d$ is the decay rate. Each time a memory was accessed, it added an increment to the odds that it would be needed again, with these increments all decaying according to a power function. Thus, if an item occurred $n$ times, the odds of it appearing again are
$$\text{Odds} = \sum_{k=1}^{n} A\,t_k^{-d}$$
where $t_k$ is the time since the $k$th practice of an item. Thus, the past history of memory use predicts the odds that the memory will be needed. But the context of the current situation is involved as well. It turns out that memory availability is adjusted as a function of context; e.g., you will have an easier time remembering, say, your locker combination in the locker room than you would if someone were to randomly ask you for it elsewhere (Schooler and Anderson, 1997). Thus, human memory reflects the statistics of the environment and performs a triage on memories, devoting its limited resources to those that are most likely to be needed. How is this fact realized in ACT-R?
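As a quick numerical illustration of the power-law odds equations above, here is a small Python sketch. The values of A and d, and the example usage histories, are arbitrary assumptions chosen only to show the shape of the function; they are not taken from Anderson and Schooler’s data.

```python
def odds_needed(times_since_use, A=1.0, d=0.5):
    """Sum of power-law-decaying increments: sum over k of A * t_k ** (-d).

    times_since_use: elapsed time since each past use (all must be > 0).
    A and d are illustrative constants, not fitted values.
    """
    if any(t <= 0 for t in times_since_use):
        raise ValueError("each elapsed time must be positive")
    return sum(A * t ** (-d) for t in times_since_use)


# A memory used once 4 days ago vs. once 100 days ago.
recent = odds_needed([4.0])
old = odds_needed([100.0])

# A memory used on both occasions: the increments simply add up.
practiced = odds_needed([4.0, 100.0])

print(recent, old, practiced)
```

The recently used memory gets the higher odds, and each additional use adds its own decaying increment, so both recency and frequency matter.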
In ACT-R, the “past” that is available in the form of memories consists of the information that existed in the buffers of various modules. At any given moment, countless things are impinging on the human sensorium, of which we only remember a very small fraction. For instance, ambient sounds or things in the visual periphery certainly undergo processing in various brain regions, but they are seldom attended to and thus often never make it into buffers. The system is “aware” only of the chunks of information in the various buffers, and these chunks get stored in declarative memory. These chunks have activation values that govern the speed and success of their retrieval. Specifically, a given memory has an inherent, base-level activation, plus its strength of association to elements in the present context.
Since the odds of needing a memory can be considered the sum of a quantity that reflects the past history of that memory and the present context, we can represent this in Bayesian terms as
$$\log[\mathrm{prior}(i)] + \sum_{j \in C} \log[\mathrm{likelihood}(j \mid i)] = \log[\mathrm{posterior}(i \mid C)]$$
where _prior(i)_ is the base-level activation, or the prior odds that memory i would be needed based on factors such as recency/frequency of use, _likelihood(j|i)_ is the likelihood ratio that element j would be part of the context given that memory i is needed (reflecting strength of association to the current context), and _posterior(i|C)_ is the updated odds that memory i will be needed in context C.
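The log-odds bookkeeping can be checked with a few lines of Python. This is only a sketch of the arithmetic: the prior odds and likelihood ratios below are invented numbers for the locker-combination example, and the function name is hypothetical.

```python
import math


def log_posterior_odds(prior_odds, likelihood_ratios):
    """log[prior(i)] plus the sum of log[likelihood(j|i)] over context elements j."""
    return math.log(prior_odds) + sum(math.log(r) for r in likelihood_ratios)


# Made-up prior odds that the locker combination is needed right now.
prior = 0.05

# Made-up likelihood ratios for elements of two different contexts.
in_locker_room = [4.0, 2.0]  # elements that tend to co-occur with needing the memory
elsewhere = [0.5]            # an element that makes the memory less likely to be needed

posterior_in = math.exp(log_posterior_odds(prior, in_locker_room))
posterior_out = math.exp(log_posterior_odds(prior, elsewhere))

print(posterior_in, posterior_out)
```

Adding logs is the same as multiplying the prior odds by each likelihood ratio, so a supportive context raises the odds and an unsupportive one lowers them.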
I’ll give the basic ACT-R memory equations without going into them much further. The main point is that memory is responding to two statistical effects in the environment. First, the more often a memory is retrieved, the more likely it is to be retrieved in the future. This produces a practice effect and is reflected in ACT-R’s base-level activation. Second, the more memories associated with a particular element, the worse a predictor the element is of any particular memory. This is reflected in the strengths of association in ACT-R, and produces the “fan” effect. The “fan” refers to the number of connections to a given element; increasing the sheer number of connections will decrease the strength of association between the element and any one of its connections. This is because when an element is associated with more memories, its appearance becomes a poorer predictor of any specific fact.
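Here is a rough sketch of how the fan effect falls out of the numbers. It assumes the form commonly given for ACT-R’s associative strength, S_ji = S − ln(fan_j), with the context weight split evenly across elements; the text above does not spell this equation out, and the values of S and W are arbitrary, so treat this as an illustration and not as the book’s exact model.

```python
import math


def association_strength(fan, S=2.0):
    """Assumed form: strength falls with the log of the element's fan."""
    if fan < 1:
        raise ValueError("fan must be at least 1")
    return S - math.log(fan)


def activation(base_level, context_fans, S=2.0, W=1.0):
    """Base-level activation plus evenly weighted associative strengths.

    context_fans: the fan of each context element linked to this memory.
    S and W are illustrative constants, not fitted parameters.
    """
    if not context_fans:
        return base_level
    weight = W / len(context_fans)
    return base_level + sum(weight * association_strength(f, S) for f in context_fans)


# Same base level, but the context elements are tied to 1 fact vs. 4 facts each.
low_fan = activation(0.0, [1, 1])
high_fan = activation(0.0, [4, 4])

print(low_fan, high_fan)
```

With everything else equal, the memory whose cues are shared among more facts ends up with lower activation, which is the direction of the fan effect described above.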
These results have been shown to affect all of our memories. In experimental illustration of this, Peterson and Potts (1982) had participants study 1 or 4 true facts about famous historical figures that they did not previously know, such as that Beethoven never married. Two weeks later, participants were tested on memory for three kinds of facts: (1) new facts they had learned about historical figures as part of the experiment, (2) known facts that they knew about the historical figures before the experiment (e.g., Beethoven was a musician), and (3) false facts that they had not learned for the experiment and that should be recognizable as very unlikely (Beethoven was a famous athlete). Participants were shown these types of statements and had to rate them as true or false, and their speed in doing so was recorded. First, it was found that the facts they knew before the experiment were recognized much more quickly than those they learned for the experiment, reflecting the greater practice and base-level activation of the prior facts. More importantly, the number of facts they had learned for the experiment (1 vs. 4) affected BOTH new and prior facts: participants who learned 4 new facts made slower judgements for both well-known and newly-learned facts, while those who learned just 1 new fact were faster on both new and prior facts. Anderson writes:
This relationship is also borne out in fMRI research. The greater the activation of a memory, the less time/effort it will take to retrieve it; thus, higher activation should map onto weaker fMRI response. Using a fan-effect paradigm, it was found that greater fan (more connections to a single memory) resulted in decreased activation and therefore stronger fMRI responses (Sohn 2003, 2005).
Anderson goes on in this chapter to discuss how we often choose actions and make decisions based on our memories of similar past actions/decisions and the outcomes that they produced. Here, we rely on memories rather than reasoning on the basis of general principles. Sometimes we have general principles to reason from, while other times it’s far easier to recall and act. This kind of instance-based reasoning may be far more common than has been traditionally thought.
The Adaptive Control of Thought
Given all of the above, we know how important a flexible declarative memory is to our ability to adapt to a changing environment; but once the relevant information has been retrieved, we have to act on it, using it to make inferences or predictions. This often requires intensive, deliberative processing which is not appropriate when we have to act rapidly in stressful situations. Indeed, to the extent that one can anticipate how knowledge will be used, it makes sense to prepackage the application of that knowledge in a way that can be executed without planning. It turns out that there is a process by which frequently useful computations are identified and cached as cognitive reactions that can be elicited directly by the situation, bypassing laborious deliberation. Thus, a balance must be struck between immediate reaction and deliberative reflection, a sort of dual processing reminiscent of Kahneman’s “Thinking Fast and Slow.” This is the way Anderson conceptualizes learning: a process of moving from intentional thinking and remembering (hippocampal/cortical) to more automatic reactions (basal ganglia).
But such an equal embrace of thought and action has not always characterized cognitive science; in fact, this very distinction marked the transition in psychology from the “behaviorist” to the “cognitive” era. This shift is very visible in the debate between Tolman and Hull about the relative roles of mental reflection and mechanistic action in producing behavior. To illustrate the struggle between thought and action in the mind, Anderson has us consider the Stroop task, where you are instructed to quickly report the font color while reading a list like red yellow orange green blue black etc. This task always takes slightly longer than simply reporting the color of non-words; Anderson points out that “this conflict basically involves the battle between Hull’s stimulus-response associations (the urge to say the word) and Tolman’s goal-directed processing (the requirement to comply with instructions).”
Anderson argues that 3 brain systems are especially relevant in achieving a balance between thought and action: the basal ganglia are responsible for the acquisition and application of “procedures”, or Hull’s automatic reactions; the hippocampal and prefrontal regions are responsible for storage and retrieval of declarative information, or Tolman’s expectancies; and the anterior cingulate cortex (ACC) for exercising control in the selection of context-appropriate behavior. Note that these respectively correspond to the procedural module, the declarative module, and the goal module.
Declarative retrieval of information during decision-making is very time and resource intensive; it would be sensible if our brains had a way of “hard-coding” frequently-used behaviors/actions so that we could respond more automatically to familiar situations. Fortunately, it appears they do just that! For example, Hikosaka et al. (1999) showed monkeys a sequence of 4x4 grids in which two cells were lit up, and the monkeys had to select them in the correct order. The monkeys practiced such sets over the course of several months, and telling differences emerged between performance during the early months and later months. Early on, the monkeys performed the same regardless of what order the grids were shown in, or of which hand they used; however, after months of practice, they had become much faster at completing the task but could not go out of order and could only use their favored hand to input the answer. Thus, it seemed that the monkeys had switched from a flexible declarative representation of the task to a classic stimulus-response representation. Hikosaka et al. examined the brains of monkeys performing the task in order to compare activity in the early vs. later months. As expected, the task activated prefrontal regions early on, but after much practice the task primarily produced activity in basal ganglia structures, which are thought to display a variant of reinforcement learning. Furthermore, temporarily inactivating basal ganglia structures disrupted only the highly practiced sequences (not newly learned sequences).
The basal ganglia, then, is involved in producing automatic responses to stimuli. Indeed, it seems to display a variant of reinforcement learning, where a behavior followed by a “satisfying state of affairs” will increase in frequency (Thorndike’s law of effect). The hippocampus is associated with Hebbian learning, where repeated occurrences of stimuli and response together serve to strengthen the connection (Thorndike’s law of exercise); this is merely a function of temporal contiguity and does not depend on the consequences of the behavior. The basal ganglia is involved in a dopamine-mediated process that learns to recognize favorable patterns of activity in the cortex (Houk and Wise, 1995). That is, dopamine neurons provide information to the basal ganglia about how rewarding a behavior was, if it was more rewarding than expected, etc. Importantly, an element of time-travel is involved, because the rewards strengthen the salience of reward-producing contextual patterns. In humans, the basal ganglia (specifically the striatum) has been found to respond differentially to reward and punishment, the magnitude of the reward/punishment, and the difference between expected and received reward/punishment (Delgado et al. 2003). This was all very refreshing to me. Classical and operant conditioning are often presented in psychology classrooms as museum curiosities or animal training procedures, when in fact they apply equally well to human learning.
I wanted to share one final experimental demonstration of the difference between learning in the hippocampus versus the basal ganglia. This one involves a rat maze-learning paradigm; imagine a maze shaped like a plus sign (+); rats always enter on the same side, say the west side. Rats are trained to go to food housed in the south arm. What will rats do if they are put in the maze on the east side? Have they learned the spatial location of the food, or have they merely learned a right-turning behavior? If the former is true, they should turn down the correct arm of the maze to find the food; if the latter is true, their response will lead them down the wrong arm. Early results yielded no clear choice pattern (Restle, 1957). However, Packard and McGaugh (1996) trained all rats on the maze and then gave them injections that temporarily impaired either their hippocampus or their basal ganglia (specifically, the caudate). As you might expect, the rats with selective hippocampal impairment performed the right-turning response and ended up in the wrong arm of the maze, while rats with impediments to the basal ganglia chose the correct arm, presumably because their intact hippocampus contained the correct spatial “place-learning” representation. A convincing follow-up study by Packard (1999) produced the same pattern of results, but this time by using memory-enhancing agents applied selectively to the hippocampus or the caudate. This time, rats with hippocampal enhancements displayed behavior consistent with place-learning (they chose the correct arm), while rats with enhanced caudates relied on a right-turn response and chose the incorrect arm.
But where do these stimulus-response associations come from? In ACT-R, they are called “productions” or “production rules” – when a situation arises for which the system does not already have rules, information must be retrieved from declarative memory and must be processed using more basic production rules. This could entail retrieving a similar prior experience upon which to base present actions or retrieving general principles and reasoning from them. In such a situation,
However, this newly formed production requires multiple repetitions for it to acquire enough strength to be applicable in new situations. Such rules are learned slowly, consistent with the view that procedural memories are acquired gradually. This measure of strength is often called a rule’s “utility” since it is a measure of the value of the rule; when a situation arises where multiple rules apply, the rule with the highest utility is chosen; further, rewarding consequences following the use of a rule serve to increase that rule’s utility. When a new rule is first created, its utility is zero and thus it is extremely unlikely that it will “fire”. However, each time this rule is recreated its utility is increased. Anderson gives an excellent example using children’s learning of subtraction rules. In the interest of time I won’t go into it here, other than to say that it accounts for the most common bug in learning to subtract two multi-digit numbers: instead of always subtracting the bottom digit from the top digit, children often use a buggy rule that subtracts the smaller digit from the larger, regardless of which is on top. This rule is so persistent because half of the time, it produces the correct outcome and thus the same reward as the more limiting bottom-from-top rule. ACT-R is used to model the acquisition of the correct rule, and I found it very compelling.
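To get a feel for how a half-right rule can hang around but eventually lose out, here is a toy Python sketch of utility learning. The update rule (move the utility a fraction of the way toward the reward just received), the learning rate, the reward values, and the rule names are all assumptions for illustration; this is not Anderson’s actual subtraction model, and for simplicity both rules are updated on every trial regardless of which one would have fired.

```python
def update_utility(utility, reward, alpha=0.2):
    """Nudge utility toward the received reward (assumed difference-learning rule)."""
    return utility + alpha * (reward - utility)


def choose_rule(utilities):
    """Pick the rule with the highest utility (no noise in this sketch)."""
    return max(utilities, key=utilities.get)


# New rules start with zero utility, as described above.
utilities = {"correct_rule": 0.0, "buggy_rule": 0.0}

for trial in range(50):
    # The correct rule is always rewarded.
    utilities["correct_rule"] = update_utility(utilities["correct_rule"], 1.0)
    # The buggy rule happens to give the right answer on half of the trials.
    buggy_reward = 1.0 if trial % 2 == 0 else 0.0
    utilities["buggy_rule"] = update_utility(utilities["buggy_rule"], buggy_reward)

winner = choose_rule(utilities)
print(utilities, winner)
```

The buggy rule keeps a respectable utility because it is rewarded half the time, which is one way to see why it is so persistent, but the always-rewarded rule ends up with the higher utility and wins the competition.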
This general learning process is seen clearly in skill learning: as one becomes more skillful (say, in riding a bike), there will be a decrease in the involvement of the more “cognitive” cortical regions and an increase in the involvement of the more “stimulus-response” posterior regions. Here’s Anderson’s summary:
Thus, an important part of cognition is the accumulation of production rules in long-term memory. These rules can become activated by the contents of working memory and can be composed into more complex production-rule chains when a particular problem is solved; the result can be cached and, if used above some frequency threshold, will become a production rule in its own right.
Uniquely Human Learning
Anderson points out that his (and my) discussion up to this point has actually concerned primate learning; nothing so far has been unique to humans. In chapter 5, he discusses learning from verbal directions and worked-out examples. He also recognizes the role of individual discovery in the learning process, but criticizes the recent trend towards pure “discovery” learning in education:
Anderson goes on to discuss how human cognition can support a uniquely human skill: learning algebra from verbal directions and examples. He uses ACT-R to model algebra learning and to help point the way toward what is special about human cognition. He ends up describing three such features in detail: the potential for abstract control of cognition, the capacity for advanced pattern matching, and the metacognitive ability to reason about cognitive states.
The first is likely mediated by the anterior cingulate cortex (ACC), a structure involved in controlling behavior, which is especially active when people have to direct their behavior in ways that violate typical response tendencies. Interestingly, the ACC has undergone recent evolutionary changes found only in humans. Recall that this structure was the one associated with the goal buffer, which holds control elements. The idea is that the ACC allows us to maintain abstract control states which let us choose different actions when all the other buffers are in identical states. The second feature requires dynamic pattern matching, which allows for processing complex relational structures, as seen in analogical processing. It all gets pretty detailed and I won’t go into it here. Instead I’ll just quote the end of the chapter:
The Question of Consciousness
It isn’t really fair to talk about this here, because I have only given you a flavor of the main arguments presented in the book, and it is upon this foundation that his discussion of consciousness rests. It requires an intimate understanding of ACT-R, and I don’t think I’ve done a good enough job conveying that understanding in the present post. Still, I’ll leave you with his thoughts on the subject, which he gives only grudgingly (preferring to “leave the philosopher’s domain to the philosopher”):
He immediately notes that this is “not a particularly novel interpretation of consciousness” and that it is essentially “the ACT-R realization of the global workspace theory of consciousness” (Baars, 1988; Dehaene & Naccache, 2001).
Dehaene and Changeux (2004) summarize the view as follows:
He is totally on board with rejecting all “Cartesian theater” interpretations (the idea that there has to be something more to consciousness, some inner homunculus that watches our thoughts flit by), and he seems to agree pretty completely with Dennett (1993). He finishes with the following: