Ohio State University
School of Music

Klaus Scherer

Notes by Joshua Veltman

Music 829
May 11, 2001

Notes on selected articles by Klaus R. Scherer (and collaborators) on Vocal Affect Expression

Selected articles:

Theoretical foundations:

Scherer and Oshinsky (1977). Synthesized tone sequences, whose major acoustic parameters had been systematically manipulated, were rated on scales of pleasantness, activity, and potency as well as on various emotion scales. The table below (reproduced from Scherer and Oshinsky, 1977, p. 340) summarizes the results.

Acoustic Parameters of Tone Sequences Significantly Contributing to the Variance of Attributions of Emotional States

Rating scale: single acoustic parameters (main effects) and configurations (interaction effects), listed in order of predictive strength

Pleasantness: fast tempo, few harmonics, large pitch variation, sharp envelope, low pitch level, pitch contour down, small amplitude variation (salient configuration: large pitch variation plus pitch contour up)
Activity: fast tempo, high pitch level, many harmonics, large pitch variation, sharp envelope, small amplitude variation
Potency: many harmonics, fast tempo, high pitch level, round envelope, pitch contour up (salient configurations: large amplitude variation plus high pitch level, high pitch level plus many harmonics)
Anger: many harmonics, fast tempo, high pitch level, small pitch variation, pitch contour up (salient configuration: small pitch variation plus pitch contour up)
Boredom: slow tempo, low pitch level, few harmonics, pitch contour down, round envelope, small pitch variation
Disgust: many harmonics, small pitch variation, round envelope, slow tempo (salient configuration: small pitch variation plus pitch contour up)
Fear: pitch contour up, fast tempo, many harmonics, high pitch level, round envelope, small pitch variation (salient configurations: small pitch variation plus pitch contour up, fast tempo plus many harmonics)
Happiness: fast tempo, large pitch variation, sharp envelope, few harmonics, moderate amplitude variation (salient configurations: large pitch variation plus pitch contour up, fast tempo plus few harmonics)
Sadness: slow tempo, low pitch level, few harmonics, round envelope, pitch contour down (salient configuration: low pitch level plus slow tempo)
Surprise: fast tempo, high pitch level, pitch contour up, sharp envelope, many harmonics, large pitch variation (salient configuration: high pitch level plus fast tempo)
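
As a concrete illustration of this kind of stimulus construction, the sketch below (a rough approximation only, not Scherer and Oshinsky's actual stimulus generator; all parameter values, configurations, and file names are illustrative) synthesizes short tone sequences in which tempo, pitch level, pitch variation, number of harmonics, and envelope shape can each be manipulated independently:

# Minimal sketch: synthesize a tone sequence with manipulable acoustic parameters
# and write it to a WAV file. All values are illustrative.
import numpy as np
import wave

SR = 22050  # sample rate in Hz

def tone(freq, dur, n_harmonics=3, sharp_envelope=True):
    """One tone: a few harmonics with either a sharp (fast-decay) or round envelope."""
    t = np.linspace(0, dur, int(SR * dur), endpoint=False)
    sig = sum(np.sin(2 * np.pi * freq * k * t) / k for k in range(1, n_harmonics + 1))
    if sharp_envelope:
        env = np.exp(-5 * t / dur)      # sharp attack, quick decay
    else:
        env = np.sin(np.pi * t / dur)   # rounded rise and fall
    return sig * env

def sequence(base_pitch=220.0, pitch_variation=1.0, tempo=4.0,
             n_harmonics=3, sharp_envelope=True, n_tones=8):
    """A tone sequence; pitch_variation scales a fixed contour given in semitones."""
    contour = np.array([0, 2, 4, 5, 7, 5, 4, 2])[:n_tones] * pitch_variation
    dur = 1.0 / tempo                   # tempo = tones per second
    tones = [tone(base_pitch * 2 ** (st / 12), dur, n_harmonics, sharp_envelope)
             for st in contour]
    sig = np.concatenate(tones)
    return (sig / np.abs(sig).max() * 32767).astype(np.int16)

# Example: a "happiness-like" configuration (fast tempo, large pitch variation, sharp
# envelope, few harmonics) versus a "sadness-like" one (slow tempo, low pitch level,
# small pitch variation, round envelope).
for name, params in {"happy": dict(tempo=6.0, pitch_variation=1.5, sharp_envelope=True, n_harmonics=2),
                     "sad":   dict(tempo=1.5, pitch_variation=0.5, sharp_envelope=False, n_harmonics=2,
                                   base_pitch=160.0)}.items():
    data = sequence(**params)
    with wave.open(f"{name}.wav", "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(SR)
        f.writeframes(data.tobytes())

Rendering the two configurations side by side makes the tempo, envelope, and pitch-variation contrasts in the table above directly audible.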

Further observations:

Scherer's main interest appears to be speech and not music. This 1977 study used tone sequences as an approximation of speech, perhaps because systematic manipulation of voice samples was too difficult at that time. The remaining studies concern themselves primarily with speech.

Human beings are quite good at determining each other's emotional states. When asked to identify the emotional states of paid actors on the basis of vocal utterances alone (nonsense utterances constructed from common Indo-European phonemes), experimental participants achieve an average accuracy of about 50% across all emotion categories (some emotions, such as joy, are recognized more accurately; others, such as disgust, less so). This might not seem very accurate, but it is significantly better than the performance expected by chance (i.e., random guessing). When combined with other channels of emotional communication (such as facial expression and posture), vocal cues are clearly a powerful interpretive tool. (For experimental reports, see the 1991, 1996a, and 2001 papers.)
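
To make the "better than chance" comparison concrete: with k forced-choice emotion categories, guessing yields an expected accuracy of 1/k, so an observed rate near 50% lies far above that baseline. The snippet below (hypothetical category and trial counts, not the studies' actual data; scipy assumed available) runs the corresponding one-sided binomial test:

from scipy.stats import binomtest

k = 10           # hypothetical number of emotion categories offered to raters
n_trials = 200   # hypothetical number of judgments
n_correct = 100  # roughly 50% accuracy

result = binomtest(n_correct, n_trials, p=1.0 / k, alternative="greater")
print(f"chance = {1 / k:.0%}, observed = {n_correct / n_trials:.0%}, p = {result.pvalue:.2g}")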

Emotional communication involves both an encoding and a decoding process. On the encoding side, an expression may arise as a spontaneous reaction (as, for example, when one grunts "uggghh" upon tasting something horrible), as an intentional communication of one's emotional state, or as some combination of both. The fact that emotional expressions may arise from various mixtures of spontaneity and intentionality need not be a concern, since in all cases the mechanisms of expression have evolved to communicate adaptive information to other organisms. Thus Scherer et al. (2001) conclude that there is no essential difference between using "naturally occurring" emotions and using actors' portrayals of emotions in studies of this nature. On the decoding side, the fact that we are able to discriminate emotions on the basis of hearing alone suggests that, at least in theory, we should be able to identify specific configurations of acoustical parameters for each emotion. On the other hand, the 50% accuracy rate also suggests that the acoustic configurations are subtle and may overlap considerably, probably due to overlaps in the psychological and physiological processes underlying the emotions themselves.

In the 1986b article, Scherer laments the diverse methodologies and somewhat equivocal results of vocal emotion research up to that time. In order to provide a coherent foundation for future empirical research, he proposes a series of speculative but rigorously logical hypotheses on the basis of the following model of vocal affect expression:

cognitive appraisal --> psychophysiological changes --> changes to the vocal production apparatus --> acoustical changes

Along the way he proposes a new model of emotion, the component process model, which can be seen as an expansion of the appraisal theory first proposed by Magda Arnold in 1960. A key ingredient of this model is a sequence of stimulus evaluation checks, summarized in the table below:

[Table: sequence of stimulus evaluation checks (SECs), first proposed in Scherer (1986b), as summarized in Scherer (1989); not legibly reproducible here.]

The component process model "proposes specific changes in the various subsystems of the organism which are seen to subserve emotion (physiological responses, motor expression, motivational tendencies, subjective feeling states). Thus, the outcome of each check is seen to affect all the different emotion components in a 'value-added' function. Given that the organism constantly evaluates and reevaluates ongoing stimulation on the basis of these checks, one can expect constant modifications of the state of the various subsystems on the basis of the sequences of changes in the outcomes of the checks" (1989). In other words, emotional states are not static, but in constant flux as the appraisal process moves through its various components; nevertheless, the particular "pathway" through the SECs should leave a tell-tale trace on the final outcome of the organism's physiological state and hence on the acoustic parameters of the vocal utterance.
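
The sequential, cumulative ("value-added") character of the checks can be pictured with a toy sketch like the one below. This is an illustration of the idea only: the check names are simplified labels, and the fields and numeric effects are invented for the example, not Scherer's formalization.

# Toy sketch of the sequential "value-added" logic of the component process model.
from dataclasses import dataclass

@dataclass
class OrganismState:
    arousal: float = 0.0       # stand-in for physiological activation
    valence: float = 0.0       # stand-in for subjective feeling tone
    action_tendency: str = ""  # stand-in for motivational change

def novelty_check(state, stimulus):
    if stimulus.get("unexpected"):
        state.arousal += 0.3   # orienting response to novel input
    return state

def pleasantness_check(state, stimulus):
    state.valence += stimulus.get("pleasantness", 0.0)
    return state

def goal_significance_check(state, stimulus):
    if stimulus.get("blocks_goal"):
        state.valence -= 0.5
        state.arousal += 0.4
    return state

def coping_potential_check(state, stimulus):
    state.action_tendency = "approach" if stimulus.get("controllable") else "avoid"
    return state

# The checks run in a fixed sequence; each one modifies the running state, so the
# "emotion" is the cumulative outcome of the whole pathway rather than a static label.
SEC_SEQUENCE = [novelty_check, pleasantness_check, goal_significance_check, coping_potential_check]

state = OrganismState()
stimulus = {"unexpected": True, "pleasantness": -0.4, "blocks_goal": True, "controllable": False}
for check in SEC_SEQUENCE:
    state = check(state, stimulus)
print(state)  # cumulative result of the check sequence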

The table below presents the predicted acoustical changes for various emotions.

Changes predicted for selected acoustic parameters, by emotion

[Table not legibly reproducible here. Its rows are the acoustic parameters for which changes are predicted: F0 perturbation, mean, range, variability, contour, and shift regularity; F1 mean; F2 mean; F1 bandwidth; formant precision; intensity mean, range, and variability; frequency range; high-frequency energy; spectral noise; speech rate; and transition time. Its columns are the emotions (among them cold anger and hot anger), with < marking a predicted decrease, > a predicted increase, and doubled symbols a marked change.]

By way of example, Scherer hypothesizes that the speech of a person experiencing grief or desperation is characterized by increases in the perturbation, mean, range, variability, contour, and shift regularity of the fundamental frequency; the first formant mean as well as the formant precision should increase, while the second formant mean should decrease and the first formant bandwidth should decrease markedly. The mean intensity should increase, the frequency range and amount of high-frequency energy should increase markedly, and the rate of speech should increase (with a concomitant decrease in transition time, i.e., the time lag between utterances).
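
Encoded as a small lookup table (a convenient form for later checking measured values against the predictions), the grief/desperation column just described would look roughly like this; the parameter labels are informal paraphrases, not Scherer's exact variable names:

# Grief/desperation predictions from the paragraph above.
# ">" = predicted increase, ">>" = marked increase, "<" = decrease, "<<" = marked decrease.
GRIEF_DESPERATION_PREDICTIONS = {
    "F0 perturbation": ">",
    "F0 mean": ">",
    "F0 range": ">",
    "F0 variability": ">",
    "F0 contour": ">",
    "F0 shift regularity": ">",
    "F1 mean": ">",
    "F2 mean": "<",
    "F1 bandwidth": "<<",
    "formant precision": ">",
    "intensity mean": ">",
    "frequency range": ">>",
    "high-frequency energy": ">>",
    "speech rate": ">",
    "transition time": "<",
}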

The ambitious 1996a study was designed specifically to test the predictions made in (1986b), as summarized in the table above. Many of these predictions are supported, although some need to be revised in light of contradictory empirical results. (No such succinct revision has yet been offered; see Scherer (1996a) for experimental results.) As mentioned above, consistent error patterns are also instructive, since they may suggest similarities in the processes underlying different emotions.

The study of emotional cues in human speech can provide a solid framework for the exploration of emotional cues in music. The various acoustic parameters related to speech can be adapted to music: fundamental frequency relates to pitch and melody, intensity relates to dynamics, the formants relate to timbre, and speech rate / transition time relates to rhythm / duration.
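
A rough sketch of how these correspondences might be operationalized for a recording follows. It assumes the librosa library and a hypothetical file name ("excerpt.wav"); the particular feature choices (the YIN estimator for F0, RMS for intensity, onset rate as a stand-in for tempo) are one plausible option among many, not a method taken from Scherer's papers.

import numpy as np
import librosa

# Hypothetical input file; any short musical or spoken excerpt would do.
y, sr = librosa.load("excerpt.wav", sr=None)

f0 = librosa.yin(y, fmin=65, fmax=1000, sr=sr)          # F0 contour ~ pitch / melody
rms = librosa.feature.rms(y=y)[0]                        # intensity contour ~ dynamics
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
event_rate = len(onsets) / (len(y) / sr)                 # events per second ~ tempo / speech rate

# Note: YIN reports an F0 estimate even for unvoiced or silent frames, so these
# summary figures are crude and would need voicing-aware cleanup in real use.
print(f"F0 mean {f0.mean():.1f} Hz, F0 range {f0.max() - f0.min():.1f} Hz")
print(f"intensity mean {rms.mean():.4f}, intensity variability {rms.std():.4f}")
print(f"event rate {event_rate:.2f} onsets per second")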

Cross-cultural validity?

The 2001 paper reports the results of an emotion encoding/decoding experiment carried out simultaneously in Germany, Switzerland, Great Britain, the Netherlands, the United States, Italy, France, Spain, and Indonesia. The data show an overall accuracy of 66% across all emotions and countries, suggesting the existence of similar inference rules from vocal expression across cultures. However, accuracy generally decreased with increasing language dissimilarity from German, suggesting that culture- and language-specific paralinguistic patterns may influence the decoding process. (Portions excerpted from the abstract.)

Potential criticisms

Some of the predictions (in 1986b) about the underlying physiological mechanisms as determinants of acoustic parameters may have been susceptible to post hoc reasoning, i.e., proposing mechanisms after the fact to match acoustic correlates already reported in the literature reviewed. (Scherer says as much himself, but feels that this potential problem cannot be avoided.)

Scherer has deferred the study of suprasegmental factors (e.g. prosodic cues such as intonation, rhythm, and timing) until such time as (segmental) acoustic parameters have been exhaustively studied. Perhaps this bottom-up approach risks missing or downplaying essential features of emotional communication.
