Rhetoric's Outliers in Second Language Writing: A Corpus-Based Study

Jay Jordan

Methods

My approach is an example of what Douglas Biber (2015) broadly called a "corpus-based" one, in which analysis of data from a given corpus assumes the a priori presence and validity of already familiar linguistic features and identifies and makes sense of their appearance and variation. For instance, a study of the use of modal verbs in a collection of research articles might focus on the extent to which such verbs are used to hedge claims, or it might focus on variations in the verbs' appearance across specific fields—both objectives predicated on the conceptual stability and given-ness of "modal verbs." A "corpus-driven" approach, by contrast, would not assume the prior definition of linguistic forms as a basis for analysis/comparison, going so far perhaps as considering groups of words to be "lexical bundles" that may appear in consistently close proximity to one another irrespective of grammatical relationships instead of considering them to be exemplary phrases, clauses, or sentences. For instance, topic modeling has emerged as a popular corpus-driven example of what literary scholar Franco Moretti (2005) memorably termed "distant reading." In topic modeling, algorithms identify statistically significant co-occurrences of words in a given corpus—collections of words that may not otherwise cohere according to traditional notions of "aboutness" (Archer, 2016) but that may reveal semantic connections hidden to traditional textual/thematic analysis. (See, e.g., Blei, 2012; Goldstone & Underwood, 2012.)

I can definitely see promising applications of topic modeling to this corpus, but for my purposes in this first pass through it, I am interested in following the methodological model that Jason Palmeri and Ben McCorkle (2017) employed. To historicize the evolution of new-media-related studies in the long-running English Journal, the authors balanced quantitatively derived visualizations of instances of media-related vocabulary over a 100-year span with closer reading of articles and related materials. As they argued, distant reading productively "raise[s] more questions than answers": questions about why topical patterns emerge and circulate that require human analytical judgments. I am also guided by Matthew L. Jockers and Ted Underwood's (2015) definition of text mining: the exploration of hidden or occluded patterns in a dataset that may help generate a map for the kind of focused qualitative analysis Palmeri and McCorkle (2017) deployed, lead to larger-scale quantitative work, or both. Thus a further stage of my project might, for instance, expand my focus on two key journals by comparing textual patterns I find in them to a far broader selection of publications in the field. Given the likely difference in size of corpora, such broader searches for textual patterns would make digitally enabled algorithmic methods not only useful but indispensable.

In this webtext, however, I am most interested in using digital and visual tools to explore patterns of rhetoric's circulation in second language writing scholarship that may have escaped a traditional, exclusively close reading of the relevant literature. I am also interested in using these tools to display those patterns, which show singular eruptions of rhetorical concepts at the same time that they provide evidence of some consistent themes in the field's definition of rhetoric. Given the prominence of the term/phrase "contrastive rhetoric," I was interested in searching for other explicit appearances of "rhetoric" that follow a similar a priori grammatical pattern, in which the term "rhetoric" functions as either an adjective or noun in a bigrammatic "meaningful sequence" (Weisser, 2016, p. 210).

So, to attempt to capture how "rhetoric" has circulated in the field, I performed a keyword-based search on the term in the Linguistics and Language Behavior Abstracts database, limiting the search to the Journal of Second Language Writing and TESOL Quarterly, the two most prominent journals in second language writing. The search returned 55 results: research articles and reviews dating between 1972 and 2014. I converted digital (.pdf format) copies of all files to simple text (.txt) format and then created a corpus using AntConc 3.4.3 for Linux, a freely available concordancing software package. The corpus totaled 448,000 words. "Rhetoric" appeared 1,341 times, and variations on "rhetoric" (using the wildcard character "*" at the end of the term) appeared 2,648 times. I then performed a search for two-word clusters with the term "rhetoric" in right position. Unsurprisingly, given the breadth of scholarship following Robert B. Kaplan's late 1960s work, "contrastive rhetoric" was in fact the most common cluster in the corpus, yielding 723 results. "New rhetoric" was second most common, with 77 results. "Intercultural rhetoric" was third most common, with 68 results, all appearing in 2005 or later, reflecting a terminological shift from "contrastive rhetoric" among certain scholars since Ulla Connor's (2004) first published use of the term. I also performed a two-word cluster search with the adjective "rhetorical" in left position, which yielded 1,230 results.
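
The cluster and wildcard searches described above can be approximated in a few lines of Python. This is a minimal sketch and not AntConc's implementation: the tokenizing rule, the function names (`tokenize`, `clusters`, `wildcard_count`), and the sample sentence are all assumptions made for illustration, and the sample is invented text rather than material from the corpus.

```python
import re
from collections import Counter

# Assumed tokenizing rule: runs of letters, allowing internal hyphens and apostrophes.
TOKEN = re.compile(r"[a-z]+(?:['-][a-z]+)*")

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN.findall(text.lower())

def clusters(tokens, term, position):
    """Count two-word clusters with `term` in 'left' or 'right' position."""
    counts = Counter()
    for first, second in zip(tokens, tokens[1:]):
        if position == "right" and second == term:
            counts[(first, second)] += 1
        elif position == "left" and first == term:
            counts[(first, second)] += 1
    return counts

def wildcard_count(tokens, stem):
    """Count tokens beginning with `stem`, like a search for stem + '*'."""
    return sum(1 for token in tokens if token.startswith(stem))

# Invented sample sentence, for illustration only.
sample = ("Contrastive rhetoric shaped early work; later, intercultural rhetoric "
          "revised contrastive rhetoric and its rhetorical patterns.")
tokens = tokenize(sample)

right = clusters(tokens, "rhetoric", "right")   # "X rhetoric" clusters
left = clusters(tokens, "rhetorical", "left")   # "rhetorical X" clusters
variants = wildcard_count(tokens, "rhetoric")   # "rhetoric" plus "rhetorical"

print(right.most_common())
print(left.most_common())
print(variants)
```

Because this sketch pairs every adjacent token, clusters can span punctuation and sentence boundaries; whether a given concordancer does the same depends on its token settings.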

Given the large number of results and my initial interest in using the corpus tool to direct my own closer reading, I re-ran each search, limiting the minimum frequency of each bigram per file/article to 10, though I retained the parameter of minimum range at 1: in other words, I was interested in discovering bigrams that might appear a noticeable number of times but only in a small number of articles, which would demonstrate maximum contrast with the wide spread of "contrastive rhetoric." I excluded results that violated my target pattern of "adjective + 'rhetoric'" or "'rhetorical' + noun" (that is, bigrams including articles, conjunctions, and prepositions), and I examined text around bigrams with "and" to capture possible compound adjectives.
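
The filtering step can be sketched as follows. The sketch rests on stated assumptions: it reads the frequency threshold as applying within a single article, as the description above suggests, and it makes no claim about how the concordancer applies its own settings. The stop list, the function name `filter_bigrams`, and the toy counts are hypothetical.

```python
from collections import Counter

# Assumed, illustrative stop list standing in for articles and prepositions.
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "with", "or", "but"}

def filter_bigrams(per_file_counts, min_freq=10, min_range=1):
    """Split bigrams into those kept and those set aside for manual review.

    per_file_counts maps a file name to a Counter of (word, word) bigrams.
    A bigram passes when at least `min_range` files each contain it at
    least `min_freq` times. Bigrams containing "and" that pass are not
    dropped but returned separately, so the surrounding text can be read.
    """
    kept, review = {}, {}
    all_bigrams = set().union(*per_file_counts.values())
    for bigram in all_bigrams:
        if FUNCTION_WORDS & set(bigram):
            continue  # violates the adjective + noun target pattern
        files = sorted(name for name, counts in per_file_counts.items()
                       if counts[bigram] >= min_freq)
        if len(files) < min_range:
            continue
        target = review if "and" in bigram else kept
        target[bigram] = files
    return kept, review

# Hypothetical toy counts, not figures from the corpus.
per_file = {
    "article_a.txt": Counter({("new", "rhetoric"): 12,
                              ("the", "rhetoric"): 30,
                              ("and", "rhetoric"): 11}),
    "article_b.txt": Counter({("new", "rhetoric"): 2,
                              ("contrastive", "rhetoric"): 15}),
}
kept, review = filter_bigrams(per_file)
print(kept)
print(review)
```

Setting `min_range` to 1 is what lets a bigram concentrated in a single article survive the filter; raising it would favor widely distributed terms instead.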

As an example of what resulted from this method, I immediately noticed that "new rhetoric," again, the second most common "adjective + 'rhetoric'" bigram in the corpus, had a range of 8; that is, all occurrences of the term in the entire corpus appeared in only 8 articles. AntConc's concordance plot tool permitted me to see that 55 occurrences were in a single article, by Sunny Hyon (1996). I then viewed the concordance plot for each bigram result, which allowed me to identify 15 terms for which a majority of occurrences in the corpus fall in a single article. Four of the 15 occur in one article exclusively.
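
The majority-in-one-article test amounts to simple arithmetic: with the figures reported above, 55 of the 77 occurrences of "new rhetoric" sit in one article, a share of roughly 71 percent. A minimal sketch of that check follows; the function name `concentration` and the toy counts are hypothetical, and the per-article breakdown below is invented rather than taken from the corpus.

```python
def concentration(per_file_counts):
    """Return (range, top_file, share) for one bigram.

    per_file_counts maps a file name to the bigram's count in that file.
    Range is the number of files containing the bigram at least once;
    share is the fraction of all occurrences found in the top file.
    """
    present = {name: n for name, n in per_file_counts.items() if n > 0}
    total = sum(present.values())
    if total == 0:
        return 0, None, 0.0
    top_file = max(present, key=present.get)
    return len(present), top_file, present[top_file] / total

# Hypothetical toy counts for a single bigram.
toy = {"a.txt": 6, "b.txt": 1, "c.txt": 1, "d.txt": 0}
rng, top, share = concentration(toy)

is_majority_in_one = share > 0.5   # more than half of occurrences in one file
is_exclusive = rng == 1            # every occurrence in a single file
print(rng, top, share, is_majority_in_one, is_exclusive)
```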

To enhance my ability to visualize the appearance of select bigrams in the corpus, I also used Voyant Tools (voyant-tools.org), a web-based text analysis platform that includes a range of tools to represent concordances and other features of corpora. In particular, I employed the "Trends" tool to plot bigram occurrences over my corpus' time span.
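
Plotting a bigram over time reduces to counting its occurrences per publication year. The following sketch shows one way to produce those counts; it is not Voyant's implementation, and the function name `bigram_by_year`, the matching rule, and the sample documents are assumptions for illustration. The sample texts are invented.

```python
import re
from collections import Counter

def bigram_by_year(docs, bigram):
    """Sum occurrences of a two-word phrase per publication year.

    docs is an iterable of (year, text) pairs. Matching ignores case and
    requires whole words, so "rhetoric" does not match "rhetorical".
    """
    first, second = bigram
    pattern = re.compile(
        r"\b" + re.escape(first) + r"\s+" + re.escape(second) + r"\b",
        re.IGNORECASE,
    )
    totals = Counter()
    for year, text in docs:
        totals[year] += len(pattern.findall(text))
    return dict(sorted(totals.items()))

# Invented sample documents, for illustration only.
docs = [
    (1996, "New rhetoric treats genre as social action."),
    (2005, "Intercultural rhetoric revises contrastive rhetoric."),
    (2005, "Some prefer intercultural rhetoric; rhetorical patterns vary."),
]
trend = bigram_by_year(docs, ("intercultural", "rhetoric"))
print(trend)
```

Raw counts favor years with more or longer articles, so a comparison across a long time span may be clearer if each year's count is divided by that year's total word count.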