Ordinary Meaning, Extraordinary Methods: Judge Newsom Digs Deeper Into AI In The Courtroom
Earlier this year, we wrote about how Judge Kevin Newsom of the Eleventh Circuit Court of Appeals had explored one particularly narrow way ChatGPT might actually be useful in court: judging whether the “ordinary meaning” of a phrase matched what a party before the court argued that ordinary meaning was.
Newsom was quite thoughtful and careful in his analysis, highlighting the potential risks and limitations. But he noted that because ChatGPT is trained on a broad swath of content, it might be genuinely useful to see whether it agreed that the “ordinary meaning” of a particular term (in that case, “landscaping”) covered how the term was used in the case at hand.
Judge Newsom has continued to think about this and has now released another concurring opinion, which he describes as a sort of “sequel” to the original. Again, it explores an area where an LLM might actually help a judge puzzle through something. This case also involves a question of “ordinary meaning,” this time about what “physically restrained” means.
Judge Newsom notes that having yet another ordinary-meaning case in front of him allowed him to tackle a criticism that many people raised about the original concurrence: LLMs may give different results each time you ask the same question.
The LLM research that I conducted in connection with this case, though, raised a question that I hadn’t squarely confronted in Snell, one that seemed worth flagging and exploring: What should we make of the fact that the models sometimes provide subtly different answers to the exact same question? I’ll confess that the variation initially spooked me, but it now (for reasons I’ll get into) seems not only unremarkable but perhaps even expected. And significantly from an ordinary-meaning perspective, it accurately reflects real people’s everyday speech patterns.
First, he notes that “physically restrained” may be even tougher to pin down than “landscaping” because it’s a two-word phrase, and thus not something you can simply look up in a dictionary. The traditional response (and how the court actually ruled in this case) is to break the phrase into its component parts, sort through the meaning of each word, and put it all back together.
But perhaps ChatGPT could provide a different approach, given that it is designed to “capture and reflect how real people ordinarily use both words and multi-word phrases.” Judge Newsom notes that he couldn’t help but experiment, and shares his query and response.
Me: “What is the ordinary meaning of ‘physically restrained’?”
ChatGPT: The ordinary meaning of “physically restrained” refers to the act of limiting or preventing someone’s movement by using physical force or some kind of device. This can involve holding someone down, using handcuffs, straps, or other mechanisms to restrict their ability to move freely. The term implies an active and physical intervention to control or confine a person’s movements.
Not wanting to trust it just because it agreed with his priors, he decided to also ask Anthropic’s Claude (which remains my favorite LLM):
For reasons I’ve explained before, though, I didn’t want to seize on ChatGPT’s explanation just because it matched my priors. See Snell, 102 F.4th at 1225 (Newsom, J., concurring). So, I asked Claude the same question: “What is the ordinary meaning of ‘physically restrained’?” Reassuringly, Claude’s response largely mirrored ChatGPT’s:
The ordinary meaning of “physically restrained” generally refers to:
1. The act of using physical force or methods to limit or control someone’s movements or actions.
2. A state in which a person is prevented from moving freely or acting as they wish due to physical barriers or interventions. This can involve various forms of restraint, such as:
- Holding someone back with one’s hands or body
- Using devices like handcuffs, straps, or restraining jackets
- Confining someone to a specific space, like a locked room
The term is often used in contexts related to law enforcement, medical settings, or situations where someone’s freedom of movement needs to be restricted for safety or security reasons. However, it can also be used in more general situations where physical limitation is applied.
But then he tried again and noticed it gave a different response:
Then, for reasons I don’t specifically recall—but that can presumably be chalked up to a “better safe than sorry” instinct—I asked Claude the exact same question again. To my surprise, its answer the second time around was ever-so-slightly different:
The ordinary meaning of “physically restrained” typically refers to:
Using physical means to limit or prevent someone’s movement or actions.
This usually involves:
- Holding someone back
- Using restraining devices (e.g. handcuffs)
- Confining someone to a limited space
It implies direct physical intervention to restrict a person’s freedom of movement.
As he points out, while the underlying point is the same, the actual answers are slightly different:
Hmm. Claude’s responses were basically the same—but not exactly, not verbatim. The substance seemed pretty much identical, but the style and structure were just a smidgen off. Most obviously, the first was longer and more detailed than the second. For instance, unlike the first response, the second didn’t include a “state in which” explanation. In the “holding someone back” bullet, the first specified a means—“with one’s hands or body”—while the second didn’t. Similarly, in the “devices” bullet, the first referred to handcuffs, straps, and restraining jackets as examples, while the second referred only to handcuffs. Finally, the first response concluded with a short paragraph about the various “contexts” in which the phrase “physically restrained” is often used—law enforcement, medicine, etc.—while the second closed with a more generic statement that the phrase “implies direct physical intervention to restrict a person’s freedom of movement.”
Judge Newsom wondered if this created problems for his idea of using LLMs in this manner. Specifically, he worried that this stochastic output might mean the LLM wasn’t “accurately communicating” what its underlying corpus suggested the “ordinary meaning” was.
So he did more experimentation. He ran the same query ten times on each of ChatGPT, Claude, and Gemini (using the freely available version of each). With 30 results across three different engines, he figured he might be able to learn something, including whether he could trust these kinds of answers if they all resolved to a similar underlying meaning.
Again reassuringly, the 30 results I received—10 apiece from each of the three leading LLMs—largely echoed the initial response that I got from ChatGPT. If you’re interested in the nitty gritty, all the responses are available in the Appendix. But here’s the gist: When defining “physically restrained,” the models all tended to emphasize “physical force,” “physical means,” or “physical barriers.” ChatGPT and Claude specifically used one (or more) of those phrases in every one of their responses. For whatever reason, Gemini was a little different. It didn’t invariably employ one of those terms explicitly, but even when it didn’t, the concept of what I’ll call corporeality (via either human touch or a tangible object) pervaded and tied together its example-laden answers.
To be sure, the models’ responses exhibited some minor variations in structure and phrasing. ChatGPT’s answers, for example, tended to fluctuate in length by a sentence or two. For its part, Claude altered the number of examples it provided from one response to the next. But for reasons I’ll explain in the next part, these subtle, marginal divergences were probably (and should have been) expected. Far more importantly, I think, the responses did coalesce, substantively, around a common core—there was an objectively verifiable throughline. For our purposes, what matters is that the LLMs consistently defined the phrase “physically restrained” to require the application of tangible force, either through direct bodily contact or some other device or instrument. And that, again, squares comfortably with the results obtained through the traditional, dictionary-driven breaking-and-repiecing method.
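If you’re curious what this sort of repeated-query sanity check looks like in practice, here’s a minimal sketch in Python. To be clear, this is our illustration, not how Newsom actually ran his queries: the `ask(model, prompt)` helper is a hypothetical stand-in for whichever chat interface or API you use, and the “common core” phrases are lifted from his own summary above.

```python
# Minimal sketch of a repeated-query "common core" check (illustrative only).
# `ask(model, prompt)` is a hypothetical stand-in for whichever chat API you
# use (OpenAI, Anthropic, Google, etc.); it should return the model's text reply.

PROMPT = "What is the ordinary meaning of 'physically restrained'?"
MODELS = ["chatgpt", "claude", "gemini"]  # illustrative labels only
CORE_PHRASES = ["physical force", "physical means", "physical barrier"]
RUNS_PER_MODEL = 10

def mentions_core(text: str) -> bool:
    """Does a response invoke at least one of the 'common core' phrases?"""
    lowered = text.lower()
    return any(phrase in lowered for phrase in CORE_PHRASES)

def run_experiment(ask) -> dict[str, float]:
    """Ask each model the same question repeatedly and report how often
    the answers converge on the common core."""
    results = {}
    for model in MODELS:
        responses = [ask(model, PROMPT) for _ in range(RUNS_PER_MODEL)]
        hits = sum(mentions_core(r) for r in responses)
        results[model] = hits / RUNS_PER_MODEL
    return results
```

The interesting output isn’t any single response but the convergence rate: if 30 stylistically different answers all keep circling the same substantive core, that core is the signal.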
Newsom concludes that all of this makes him less worried about the lack of exact repeatability, because, if anything, the variation makes the models seem almost more human.
So, what to make of the slight variations among the answers that the models returned in response to my query? For present purposes, I think there are two important points. First, there’s a technical explanation for the variation, which, upon reflection, doesn’t much concern me—or, upon further reflection, even much surprise me. Second, there is, upon even further reflection, a sense in which the substantively-identical-and-yet-marginally-different answers (perhaps ironically) underscore the models’ utility in the ordinary-meaning analysis—namely, in that they pretty closely mimic what we would expect to see, and in fact do see, in everyday speech patterns.
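That technical explanation, in broad strokes, is how these models generate text: an LLM produces its answer one token at a time, sampling each next token from a probability distribution (usually modulated by a “temperature” setting), so two runs of the identical prompt can wander down slightly different paths while staying anchored to the same high-probability substance. Here’s a toy illustration of that sampling step, with invented numbers and no real model behind it:

```python
import math
import random

# Toy next-token distribution. The logits are invented purely to illustrate
# sampling; a real LLM computes them from the prompt and its training.
logits = {"force": 2.0, "means": 1.6, "barriers": 1.1, "devices": 0.7}

def sample_next_token(logits: dict[str, float], temperature: float = 1.0) -> str:
    """Softmax the temperature-scaled logits and sample one token."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    total = sum(math.exp(v) for v in scaled.values())
    probs = {tok: math.exp(v) / total for tok, v in scaled.items()}
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# "Asking the same question" five times: the high-probability candidates
# dominate, but the exact pick can differ run to run, which is why the
# wording shifts even as the substance stays put.
print([sample_next_token(logits) for _ in range(5)])
```

Turn the temperature down toward zero and the outputs get more uniform; turn it up and they get more varied. Either way, the underlying probabilities, which are what reflect how people actually use the phrase, stay the same.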
As he explains later in the concurrence, you would expect the same variations if you just asked a bunch of people:
Remember, our aim is to discern “ordinary meaning.” Presumably, the ideal gauge of a word’s or phrase’s ordinary meaning would be a broad-based survey of every living speaker of American English—totally unrealistic, but great if you could pull it off. Imagine how that experiment would go: If you walked out onto the street and asked all umpteen million subjects, “What is the ordinary meaning of ‘physically restrained’?”, I think I can confidently guarantee that you would not get the exact same answer spit back at you verbatim over and over and over. Instead, you’d likely get a variety of responses that differed around the margins but that, when considered en masse, revealed a common core. And that common core, to my way of thinking, is the ordinary meaning.
Thus, the “problem” of variability in answers might not really be a problem at all.
So, as it turns out, the very thing that had initially given me pause—namely, that the LLMs were returning subtly different responses to the same question—has instead given me (more) hope that the models have something significant to offer the interpretive enterprise. The fact is, language is an organic thing, and like most organic things, it can be a little messy. So too, unsurprisingly, are our efforts to capture its ordinary meaning. Because LLMs are trained on actual individuals’ uses of language in the real world, it makes sense that their outputs would likewise be less than perfectly determinate—in my experience, a little (but just a little) fuzzy around the edges. What’s important, though—and I think encouraging—is that amidst the peripheral uncertainty, the LLMs’ responses to my repeated queries reliably revealed what I’ve called a common core.
Before people freak out, he’s quite clear that he’s not suggesting this replace human judgment, or that it’s the be-all and end-all of any “ordinary meaning” determination:
A final coda: No one should mistake my missives for a suggestion that AI can bring scientific certainty to the interpretive enterprise. As I’ve been at pains to emphasize, I’m not advocating that we give up on traditional interpretive tools—dictionaries, semantic canons, etc. But I do think—and increasingly so—that LLMs may well serve a valuable auxiliary role as we aim to triangulate ordinary meaning.
And he leaves himself open to the most human of responses:
Again, just my two cents. I remain happy to be shouted down.