Someone was recently listening to Brahms’ Violin Concerto and asked whether the soloist could be picked out against the much louder orchestra during tutti passages. I had recently seen some evidence presented by a speaker that might apply here, and I thought it would be fun to consider the two things together.

Essentially, the problem is the following. You have a very large contingent of instruments playing, with a single instrument playing something different. We would like to hear the soloist above the din, either in harmony with the other parts or distinctly. Now, there is clearly an absolute volume at which the soloist would be drowned out; we believe this from experience. However, assuming we’re not at that level yet, there is an interesting problem in distinguishing the soloist from a very loud section of instruments that often has the same timbre but a higher amplitude.

This resembles the cocktail party problem: in a crowded room full of people talking, one can often follow what a single speaker is saying by looking at her, despite the distracting surrounding noise. The concerto problem is similar in that regard.

A recent speaker presented some data that I think applies here, though not directly. In his experiments with monkeys, he would play a sound and tap the monkey’s wrist while recording from the primary auditory area of the brain (A1 for those keeping score). He noted the following. A sound alone elicited a certain response in the brain. A tap on the wrist alone elicited a response of smaller magnitude. However, when the sound and the tap were delivered simultaneously, the response was huge, actually larger than the sum of the individual responses. It’s my understanding that this so-called non-linear summation is due to multimodal sensory feedback that tells the brain that two events are related in time, and so are more meaningful. I seem to recall that similar evidence exists when visual stimuli coincide with auditory stimuli.
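To make "larger than the sum of the individual responses" concrete, here is a minimal sketch in Python. The response magnitudes are made-up illustrative numbers in arbitrary units, not data from the talk, and the function names are my own.

```python
def is_superadditive(sound_only, tap_only, combined):
    """True when the combined response exceeds the sum of the two unimodal responses."""
    return combined > sound_only + tap_only


def additivity_ratio(sound_only, tap_only, combined):
    """Combined response divided by the linear sum; values above 1 mean non-linear summation."""
    return combined / (sound_only + tap_only)


# Hypothetical response magnitudes (arbitrary units), chosen only for illustration.
sound_only = 10.0   # sound alone
tap_only = 4.0      # wrist tap alone, smaller than the sound response
combined = 20.0     # sound and tap together

print(is_superadditive(sound_only, tap_only, combined))
print(additivity_ratio(sound_only, tap_only, combined))
```

A purely linear system would give a ratio of exactly 1; the observation described above corresponds to a ratio greater than 1.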

Let’s say for a moment that any temporally correlated multimodal input will increase responses in this manner. Let’s also assume that increased neural responses translate into better perception or awareness. If both are true, then multimodal responses might help explain why many of us feel we can hear the words better when we read subtitles on a movie or watch someone’s lips move as they speak. In the case of the soloist, perhaps one way of “amplifying” one’s perception of the sound is to watch the soloist carefully for temporal cues that correlate with the sounds coming from the instrument. I have noticed that I use this technique any time I am trying to pick different voices out of the orchestra, so at some level I already believe it has pragmatic value. A behavioral experiment could verify or refute this story to some degree. It might well exist in the literature already.

Some interesting things occur with this strategy that are auxiliary to this story. For instance, when I watch the concertmaster (first violinist) very carefully, it often sounds as if all of the sound of the first violin section is emanating from his instrument! In a sense that is a remark on the coherence of the section, since there are no discernible temporal incongruities to make the illusion seem ridiculous (think of English dubbed over old Chinese martial arts films).

Similarly, I tried to listen to a couple of harps I noticed a few nights ago at a performance of a local university’s symphony. When listening to a recording or a multi-voice piece, it’s fun to pick out the parts and then put them all back together to hear it as one sound. When I pick parts apart, I consciously think about the timbre and pitch range of the instrument and concentrate on it (is this modulating attentional rhythms in cortex!?). I usually have no trouble picking out individual voices or sections of a group and listening to their part almost in isolation. These harps, however, presented a problem: I think I was unable to hear them at all because of the absolute volume of the rest of the group. The trouble here sits about two levels removed from the performer. Assume that I am not the only one who had this problem. The conductor should have recognized that the harps were not being perceived and either raised their volume or lowered that of the rest of the group. Beyond that, one imagines that the composer wrote the part for a distinct artistic reason and not for his second cousin, a harpist who couldn’t find a job. These are interesting issues for everyone connected with the music to consider.

With recorded audio, the problem is different, since there are no multimodal (visual, etc.) cues to help “amplify.” Though there are spatial cues that are undoubtedly helpful (violins on the left channel, violas on the right channel, etc.), separating the soloist from the orchestra may be harder. Enter recording engineers. They can often record multiple tracks simultaneously, so that the soloist is captured separately from the orchestra. Later, they can turn the gain on the soloist up or turn the orchestra down so that the soloist is clearly heard, if that’s the intended goal. But this does raise the interesting artistic question of whether the result is true to the original performance. It may not be, but that may not be the goal.
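As a rough illustration of what turning the soloist up and the orchestra down amounts to, here is a minimal Python sketch. It assumes two separately recorded tracks held as equal-length lists of float samples; the test tones, gain values, and function names are all hypothetical, and a real mixing session involves far more than this.

```python
import math


def db_to_linear(gain_db):
    """Convert a gain in decibels to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20)


def mix(solo, orchestra, solo_gain_db=0.0, orchestra_gain_db=0.0):
    """Scale each track by its gain and sum the two tracks sample by sample."""
    g_solo = db_to_linear(solo_gain_db)
    g_orch = db_to_linear(orchestra_gain_db)
    return [g_solo * s + g_orch * o for s, o in zip(solo, orchestra)]


# Hypothetical one-second test tones standing in for the two recorded tracks.
rate = 8000  # samples per second (assumed)
solo = [0.1 * math.sin(2 * math.pi * 660 * n / rate) for n in range(rate)]
orchestra = [0.5 * math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]

# Bring the soloist up by 6 dB and the orchestra down by 6 dB.
balanced = mix(solo, orchestra, solo_gain_db=6.0, orchestra_gain_db=-6.0)
print(len(balanced))
```

A gain of +6 dB roughly doubles a track's amplitude and -6 dB roughly halves it, so this particular choice shifts the balance between the two tracks by about a factor of four.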

It’s all very interesting to think about. It would not surprise me if evidence existed showing greater neural responses when an orchestra is both watched and heard. It might be worth watching intently next time you’re at the symphony.


  1. three years earlier (than you)

    While reading this, a couple of instances came instantly to mind. Almost all of them were of what you just described: auditory and visual stimuli coinciding in time. I do, however, offer something else to chew on.

    As you well know, I have to Google information in order to validate (if only to myself) how many strings are on a violin, viola, etc. I know very little about instruments and their personalities. It would therefore be factorially (didn’t think exponentially worked in this instance, and we KNOW infinitely is just hyperbole…I was going for accuracy) more difficult for me to do the task you describe.

    I do, however, know what it’s like to be in the middle of a cacophonous section of a restaurant, subway, etc. and be able to hear the lone pair of people having a conversation in English. Without turning to see who it is, I can isolate their conversation from all the other inputs that are trying to make their way into my ear. I have always wondered when this would be true with Korean. When will I have heard enough, and become conditioned enough, to be able to isolate a particular conversation of my choosing in Korean?

    I also want to say that I enjoy your posts thoroughly. I’m always watching (not in a creepy way). I can’t begin to express how proud I am of you. Keep it up, contact-wearing-man.

