Captioning Hypnotoad—A Quest That Helped Define a Field: An Interview with Sean Zdenek
Caption Studies
After nearly 10 years of researching closed captions, and nearly 20 years of watching them, Sean Zdenek was the first person to use the label "caption studies" to describe the field that researches the construction of meaning and the use of captions in live and prerecorded events. Nicole Snell (2012) mentioned "captioning studies" five times in her dissertation (pp. vii, 21, 23), but given the limited readership of dissertations, that label did not catch on. Zdenek is also one of the first researchers to look consistently at the construction of meaning in closed captions used in popular media. While other fields, such as literacy and second language instruction, have studied how captioned or subtitled films can support literacy and learning, they have rarely examined how meaning is created within the captions themselves. (Captions and subtitles are not the same, even though they are often conflated; see Bond, 2014, for a short explanation of the difference.) Zdenek has also devoted considerable effort to examining and documenting non-speech information (NSI)—sounds that occur within a video but are not speech—and how those sounds shape the meaning and understanding of the video. Further, Zdenek's (2015) Reading Sounds is becoming a common reference point in discussions about closed captioning among caption researchers, advocates, and practitioners.
Perhaps Zdenek's greatest contribution to caption studies in Reading Sounds is his four principles of closed captioning and seven transformations of meaning, which I list below for ease of reference. Teasing out their impact, significance, and meaning is largely beyond the scope of this interview. However, readers are encouraged to review Zdenek's presentation of them in Reading Sounds as well as to discuss them with Zdenek via email or Twitter. These principles and transformations are vital for caption studies because, as an emerging field, it has developed relatively little theory of its own; instead, the field has largely employed methods and theories from outside of captions and captioning. While there is little doubt that Zdenek's principles and transformations will be critiqued, modified, and adjusted over time, they provide an important theoretical and practical foundation for caption studies—a foundation built on over a decade of experience and analyses of 500 media clips.
Four Principles of Closed Captioning (Zdenek, 2015, pp. 2–6)
- Every sound cannot be closed captioned.
- Captioners must decide which sounds are significant.
- Captioners must rhetorically invent and negotiate the meaning of the text.
- Captions are interpretations.
Seven Transformations of Meaning (Zdenek, 2015, pp. 8–10)
- Captions contextualize.
- Captions clarify.
- Captions formalize.
- Captions equalize.
- Captions linearize.
- Captions time-shift.
- Captions distill.
Reading Sounds is equally important because it provides a touchstone for others working in or around the field of caption studies. It is a book, and a book carries a weight that an article or two does not. Fortunately, the book offers an overarching presentation of and engagement with captions, and thus multiple points of potential connection for readers. This work has already had an important impact. For instance, it has helped influence and support work by other captioning researchers, such as the complex, multi-site studies by 3Play Media and Katie Linder at Oregon State University on closed captioning's relationship to Universal Design for Learning and to measurable increases in student learning outcomes (see Linder, 2016). Additionally, it provides a readable and accessible text for non-academics, captioning advocates, captioning practitioners, and academic researchers. Thus Zdenek has authored what could become a commonplace or foundational reference point for caption studies. On a more personal level, Reading Sounds inspired me to organize the first-ever Caption Studies Conference in North America in 2016.
What Is Caption Studies?
In Reading Sounds, Zdenek (2015) defined the field of caption studies as "a research program that is deeply invested in questions of meaning at the interface of sound, writing, and accessibility" (p. 2). To better understand closed captions, captioning, and caption studies, I asked Zdenek to locate his work with Reading Sounds in the larger landscape of captions and captioning. Our interview begins here:
[Sean Zdenek] Captioning is complex. Perhaps this goes without saying today. But we’ve only recently paused to consider the various ways in which that complexity is manifested for both captioners and viewers, especially viewers who are deaf and hard-of-hearing. My research has been concerned with questions of meaning at the interface of sound, writing, and accessibility. When we stop to inquire after the meanings and experiences that captioning enables and forecloses, we move away from a traditional and narrow focus on transcription. While transcription can (and should) be theorized as a deeply rhetorical practice (see Macaulay, 1991), it usually boils down to a simplistic and transparent act of copying down what the captioner hears, with a premium placed on speech sounds over nonspeech sounds. Meaning is assumed to be self-evident, inscribed on the surfaces of the sounds themselves and dutifully recorded by the captioner—so easy, even a machine can do it, or at least that's the promise when we start from the premise that captioning is unreflective transcription.
Captioning is about choices, since not every sound can be captioned (given the constraints of space and time with which captioners work). I would go further and suggest, somewhat controversially, that not every sound should be captioned (given the tendency of writing to blur the distinctions between background and foreground sounds, figure and ground, among other effects). Captioning is also about what happens and what is possible when the soundscape is transformed, ideally, into an accessible form of writing for time-based reading. Captioners must make decisions about what to include in the caption track and how to include it. Nowhere is this more evident or intriguing than in the case of nonspeech sounds, which must be rhetorically described by the captioner. When a nonspeech sound is deemed salient and recurs across multiple episodes of a television program, for example, we can chart the myriad ways in which different captioners have accounted for that recurring sound in writing. I devote Chapter 3 of Reading Sounds to an analysis of how the same buzzing sound (the Hypnotoad's drone, from which this interview takes its title) is captioned over ten years on a popular animated television show. Put simply, I'm interested in how captioning is both deeply interpretive and subjective.
Captioning is contextual. Meaning doesn't exist outside of the contexts that activate it. What a sound effect means in some technical sense—where it came from, what produces it—may have little or no bearing on what it means in the specific, audiovisual context of a scene. Transcription makes sense for an audio file but not for the multimodal fusion of sounds and moving images on movie and television screens. The history of Foley provides numerous examples of how sound engineers pull film and television listeners away from the original causes of a sound (halved coconut shells) and towards new causes they want us to identify with (horse hooves clopping down the street). The "phenomenon of synchresis," according to Michel Chion (2012), leads us to identify not "the real initial causes of the sounds, but causes that the film makes us believe in" (p. 49). Synchresis is a way of linking sounds to images. In his translator's introduction to Chion's Sound, James Steintrager (2016) wrote, "In the film medium, sight and sound are nonetheless essentially linked, yet they can be decoupled and recoupled in ways that would be unusual and often simply unavailable in everyday conversation and life" (p. xxiii). For this reason, "the purity of sonic experience is impossible because of its interplay with the visual" (p. xxiii). Captioners operate on top of this interplay, interpreting it, distilling it, and redeploying it as a series of written texts.
Professional captioners are rarely members of the production team but rather third-party contractors who are sent the film or program after it has been completed. While they may have access to scripts, cast lists, and other guiding documents, they also exercise immense freedom over the caption track (as numerous professional captioners have told me), particularly when it comes to nonspeech sounds. In short, captioners are outsiders with immense authority to fashion meaning. In my book, I go so far as to suggest that the caption track embodies a new text that potentially shapes a different set of experiences of the program for caption readers.
[gregory zobel] As I've read, and reread, your book, it seems like some of the transformations overlap. For example, the seventh transformation is that captions distill. This is where the captioner pares down captions to the key elements, the most important sounds. I'm not sure how this differs so much from captions contextualizing or clarifying. It seems like distilling is very similar to the first two points.
Or is it, perhaps, that some of these same choices—whether or not to include a sound—could fall under one or multiple kinds of transformations simultaneously? That there is no either/or? So a single choice about what to caption can have an impact across multiple transformations.
[SZ] Reading Sounds attempts to make sense of the complexity and the rhetorical nature of captioning by accounting, in the broadest terms, for how captions work. I identify seven main text-effects (in the spirit of multimodal transduction): captions clarify, contextualize, formalize, equalize, linearize, time-shift, and distill. (For video clips highlighting each term, see the supplemental website for Chapter 1 of Reading Sounds.) Note that these seven transformations are not necessarily the result of conscious decisions on the part of the captioner. Thus, I wouldn't say that "the captioner pares down captions to the key elements" but rather that the caption track is a distillation of the sound track. Captions, not captioners per se, distill meaning.
Of these seven arguments, I would want to start with the popular idea that captions are most beneficial to users because they clarify meaning. Hearing and hard-of-hearing viewers report using captions to cut through the clutter of thick accents, indistinct and background speech, mumbled speech, whispered speech, and so on. According to this argument, printed words provide immediate access when the soundtrack is less helpful. When the problem is environmental, such as a noisy space or a space where quiet is mandated or desired, captions can provide immediate clarity as well.
So while I didn't put these arguments into a hierarchy, I would definitely want to start with the clarifying function. It's the go-to argument for captioning advocates. It needs little explanation, at least for hearing and hard-of-hearing viewers who regularly negotiate the twin demands of listening and reading. People seem to understand immediately (at least on Twitter!) that, in certain cases, reading provides access superior to listening alone.
Things get messier from here. The seven arguments are intended to be open-ended and even ambiguous, to function together or separately depending on the critic's focus. I invite other scholars and advocates to take them up as heuristics and to extend them with new examples, contexts, definitions, and critiques. We've never treated captioning in this way before, as a text to be interpreted within a multimodal matrix. The terrain is wide open and new. I hope this framework becomes part of our continuing conversations about accessibility in the humanities.
By design, the seven arguments overlap and cut into each other. Distillation and contextualization have similar emphases, for example. Likewise, equalization can clarify the meaning of background sounds, and time-shifts may be activated through the process of formalizing speech into writing. But each term also directs our attention in different ways. For example, contextualization is concerned with how captions situate meaning. Captions contextualize not by describing sounds in a vacuum but within specific contexts (similar to how synchresis depends on contextualization). Perhaps what is most interesting (and counter-intuitive): some captions don't describe sounds at all but rather explain actions being performed. For example, consider a nonspeech caption from an episode of A Young Doctor's Notebook (Chappell, Connor, Pye, & Hardcastle, 2012) starring Daniel Radcliffe and Jon Hamm. In this example, Radcliffe is turned towards a sink, washing his hands. Because we can't see him turn off the sink, the caption alerts us when it happens: (TURNS TAP OFF). This caption is more concerned with the action performed than the specific sonic qualities of the squeaky tap or the splashing sound when it is turned off. The caption reports not on the sound but on the action implied when the water sound is silenced. In another context, the splashing or squeaking sound might be significant and need to be captioned. In short, the meaning of film sound cannot be determined outside of a context.
Distillation can be a form of contextualization, but distillation zeroes in on the time and space constraints that lead captioners to highlight the most important sounds and actions. Reading speed guidelines and a very cramped space work together to reduce the soundtrack to elemental sounds. Conversely, we know that the soundtrack has not been properly distilled when the captioner puts too much emphasis on the (DOG BARKING IN DISTANCE). One effect of distillation is that ambient sounds tend to be reduced to single captions or not captioned at all. Music is distilled to a simple description and/or to captioned lyrics.
Captions reconstruct the narrative as a series of elemental sounds. This process also transforms sustained sounds—instrumental music, environmental noise, ambient sounds—into discrete, one-off captions. Consider a tense scene in Terminator 3 (Mostow, 2003) in which the evil terminator (Kristanna Loken) has broken into a veterinarian clinic looking to kill the vet, Kate Brewster (Claire Danes). As Kate confronts John Connor (Nick Stahl), whom she has trapped in a dog cage in one of the exam rooms, the commotion in other areas of the clinic is reduced to a series of elemental sounds/captions: [GLASS BREAKING], [DOGS BARKING], [DOGS BARKING], [WOMAN SCREAMS], [GUNSHOTS], [GASPING]. In this example, the captions construct a narrative out of key sounds: The Terminator breaks a window to gain entry to the clinic, the dogs react, a customer screams before being shot, and Kate gasps when she sees the customer’s body fall. These are the essential moments of the scene, each of which is mapped onto a corresponding caption. Sustained and complex sounds (e.g., dogs barking) are distilled down to discrete (one-off) descriptions. Other ambient sounds are ignored.
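The reading speed guidelines and cramped space that Zdenek describes can be made concrete with a small sketch. The following Python snippet is mine, not Zdenek's and not any captioning tool's actual logic: the 17-characters-per-second reading-speed limit, the 32-character line limit, and the timings assigned to the Terminator 3 captions are all illustrative assumptions, since real style guides and real timings vary.

```python
# A minimal sketch of the time and space constraints behind distillation.
# The thresholds and timings below are illustrative assumptions, not the
# values of any official captioning standard.

from dataclasses import dataclass

MAX_CHARS_PER_SECOND = 17  # assumed reading-speed limit
MAX_LINE_LENGTH = 32       # assumed line-length limit

@dataclass
class Caption:
    start: float  # display start, in seconds
    end: float    # display end, in seconds
    text: str

    def chars_per_second(self) -> float:
        duration = self.end - self.start
        return len(self.text) / duration if duration > 0 else float("inf")

def fits_constraints(cap: Caption) -> bool:
    """True if the caption can be read in the time and space available;
    False means a captioner would have to distill it further."""
    return (cap.chars_per_second() <= MAX_CHARS_PER_SECOND
            and all(len(line) <= MAX_LINE_LENGTH
                    for line in cap.text.splitlines()))

# Hypothetical timings for the Terminator 3 sequence discussed above,
# plus one deliberately undistilled description for contrast.
scene = [
    Caption(10.0, 11.2, "[GLASS BREAKING]"),
    Caption(11.5, 13.0, "[DOGS BARKING]"),
    Caption(11.5, 13.0, "[SEVERAL DOGS BARKING FRANTICALLY IN KENNELS DOWN THE HALL]"),
    Caption(14.0, 15.0, "[WOMAN SCREAMS]"),
    Caption(15.2, 16.0, "[GUNSHOTS]"),
    Caption(16.5, 17.5, "[GASPING]"),
]

for cap in scene:
    print(f"{cap.text}: {cap.chars_per_second():.1f} cps, "
          f"fits={fits_constraints(cap)}")
```

Run under these assumptions, the distilled captions all pass, while the longer, undistilled description of the barking fails both checks, which is one way to see why time and space push captioners toward elemental sounds.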