Issue #67

“Certainly!” Here’s why we’re so bad at evaluating AI-generated content

Image description: Generative AI render of a robot wearing glasses, a fake mustache, and a baseball cap trying (and failing) to look like a human. It sits across a table from a person dressed in business casual clothing. It looks like they're having a conversation.

This Hot Take was written by Kristin Van Dorn

If you follow academic publishing news, you already know that the industry is under intense scrutiny after publishing content that is obviously AI-generated.

Articles containing sentences such as “Certainly, here is a possible introduction for your topic:…” and “I’m very sorry, but I don’t have access to real-time information or patient-specific data, as I am an AI language model” are popping up in headlines and memes alike. These mistakes are not just obvious red flags that the content was written by AI. They showcase just how bad we are at reviewing AI-generated content. After all, these academic articles supposedly went through the most rigorous peer review and editing in all of publishing.

This controversy has led some amateur academic publishing watchdogs to search Google Scholar for common AI phrases. So I ran the same search myself and got 802 hits for academic articles containing the phrase “Certainly, here is…” Sure, many of the first hits were about AI and large language models (LLMs) generally. But it didn’t take me long to find a published paper that contained the phrase yet never mentioned AI at all. I downloaded the paper and checked the context. While I can’t absolutely guarantee it was an artifact of AI-generated content, it definitely seemed suspicious.

Oh, are you just like me? Did you just run a Google Advanced Search for “Certainly, here is…” on .edu pages?

Again, you’ll start off with a list of hits about AI and LLMs. But after the top page or two of results, you’ll come across instances where colleges and universities aren’t even taking 30 seconds to skim their AI-generated output before dumping it into their CMS.

I am not here to shame anyone. So, I won’t name names. But this is not how living, breathing content strategists talk to students:

  • “Certainly! Here’s a sample layout for a Sustainability News & Events page” (this text preceded a live news and events feed).
  • “Certainly, here’s a comprehensive breakdown of optimizing your LinkedIn profile for maximum effectiveness” (this text was part of a live career services page).

Okay, so I’ll just do a global search for “Certainly.” Problem solved?

While these common LLM copy/paste mistakes are the easiest to identify and resolve, they’re not quite the heart of the issue. When evaluating content from LLMs, the trick to determining the content’s effectiveness is not just to eliminate the conversational AI artifacts. You have to really READ the content.
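That said, if you do want an automated first pass over your own pages, a short script can flag the obvious copy/paste artifacts before a human does the real reading. Here’s a minimal sketch in Python, assuming you’ve exported your pages as plain-text files into a folder; the folder name, file format, and phrase list are all placeholders to adapt to your own CMS:

```python
# flag_llm_artifacts.py - first-pass scan of exported page text for common
# LLM boilerplate phrases. This only catches the obvious copy/paste
# artifacts; it is no substitute for actually reading the content.
from pathlib import Path

# Hypothetical phrase list; extend it with whatever artifacts you keep seeing.
ARTIFACT_PHRASES = [
    "certainly, here is",
    "certainly! here's",
    "as an ai language model",
    "i don't have access to real-time information",
    "here is a possible introduction for your topic",
]


def scan_pages(export_dir):
    """Return (filename, phrase) pairs for every page containing an artifact."""
    hits = []
    # Assumes a recursive plain-text export; swap the pattern for HTML, etc.
    for page in Path(export_dir).glob("**/*.txt"):
        text = page.read_text(encoding="utf-8", errors="ignore").lower()
        for phrase in ARTIFACT_PHRASES:
            if phrase in text:
                hits.append((str(page), phrase))
    return hits


if __name__ == "__main__":
    for filename, phrase in scan_pages("cms_export"):  # assumed export folder
        print(f"{filename}: found '{phrase}'")
```

Treat a clean result as the start of the review, not the end of it: a script like this only catches the conversational boilerplate, not the disjointed meaning underneath.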

Before we get into why we’re so bad at evaluating AI-generated content, I want to explore three ideas with you:

  • How our brains are trained to skim
  • How editing human-generated content is different from AI-generated content
  • The curse of knowledge bias

We are all trained to skim.

As a desk worker in the 21st century, you are inundated with information temptation. You might have dozens of tabs open and several applications running at once that you flit back and forth between like a website butterfly. All of that information represents more written words than you can possibly get through in a day.

According to Maryanne Wolf, author of “Reader, Come Home: The Reading Brain in a Digital World,” humans are developing digital brains that prefer skimming content over deep reading. Skimming helps us lock onto important information without spending the expensive cognitive resources that support our working memory. This seems like a reasonable adaptation. The trouble is that skimming is habit-forming, and the habit erodes our capacity for deep reading.

The editing process has changed.

When we edit content for ourselves or a colleague – or any human being, really – we’re mostly in copyediting mode. Copyediting is a complex process. It catches errors in grammar, spelling, punctuation, and syntax. Really good copyediting also addresses issues of repetition, rhythm, and flow. A copy editor might even call out hyperbole or statements that require more research, a source citation, or a serious fact check.

When humans are writing, we make all of these kinds of mistakes at once. We’re screwing up grammar. Our flow is bad. We’re repeating ourselves while drafting paragraph-long sentences like Virginia Woolf.

The thing is, LLMs work by prediction: they generate the most statistically likely next words. So there are certain skills they are really good at. They can mimic our flow. They don’t make common grammatical errors or spelling mistakes. Their syntax feels mostly smooth, even if it’s dry and basic.

They struggle with crafting meaning. Humans craft meaning through similes and metaphors. We intuit when to write about big-picture abstract ideas and when a specific, concrete example will drive the point home. LLMs don’t know the meaning behind the phrases they generate, so they can’t appropriately match metaphors to the context or borrow apt abstractions from other domains. Their sentences and paragraphs often lack meaningful transitions and bridges between ideas because there is no thinker there, only predictions.

We all face the curse of knowledge.

The curse of knowledge is a cognitive bias that happens when we assume that others have the same level of understanding as we do on a subject. This bias is discussed a lot in education because as people develop their expertise, they forget what it was like when they didn’t have the basic building blocks of knowledge in their field. So, they have difficulty deconstructing complex ideas and explaining them simply to novices.

It doesn’t even feel like we’re developing expert knowledge about our institutions, but we are. We have fully mapped mental models of our programs, our processes, and our policies. Just as it’s difficult to explain all of those intersecting programs and processes to novices new to higher education or our particular institution, it’s equally difficult to identify when our content lays out facts in a disjointed and illogical way. Our minds just fill in the gaps or reorder the meaning. We can get the “gist” because our gist is built on years of accumulated patterns.

Put skimming, editing, and the curse of knowledge together.

When we get the gist of something, our brain signals, “I’ve gotten the important information I need. Skimming is all that’s necessary from this point on.” Once you’re in skim mode, you lose your editing focus. Maybe you’ll spot-check a sentence here or there for grammar or syntax, but those are exactly the things AI is really good at. So you’re likely to see all green flags. This is how we often approve AI content without ever really reading it closely.

You might not think this is such a big deal. But it’s actually a really big problem. If prospective students read AI-generated gobbledegook, they’re not getting what they need out of it. They’re not learning how to apply, what experiences a degree will give them, or how those experiences translate into a meaningful livelihood.

If the content doesn’t make sense to them, it won’t be compelling. In fact, it will probably break your brand promise, which means all the hard work you’ve done to build salience and reputation will be for nothing.

Worst of all, it could fundamentally reduce trust. Our institutions are playing defense right now. We have to convince many stakeholders that our colleges and universities are worth investing in. And if our arguments are not abundantly clear and effective—or worse, if they are inaccurate or confusing—we’ve only given our stakeholders evidence to the contrary.

Good writing is clear, easy to follow, and easy to act on. AI is good at imitating, which means that the best writing it produces is unspecific at best and possibly harmful at worst.

What are the solutions?

Look, I am not here to tell you to stop using generative AI. However, I do want to encourage you to be incredibly cautious when you do. You should have a generative AI policy. You should grapple with the ethics and expenses of cheap AI in terms of environmental impact, intellectual property theft, and data security risks.

But, if you’ve done all that, and you’re convinced generative AI is necessary or more efficient for creating clear content and a more effective user experience, here are things you can do to get better at AI-generated content evaluation:

  1. Hire for editing instead of writing. Writing and editing will always be entangled, but that doesn’t mean they need to be bundled into the same job description. Editing and writing are different skills. There’s no sense insisting you need a generalist writer/editor when you’re planning to outsource the writing to AI, or are at least comfortable doing so. If that’s the direction you choose, what you really need is someone to constantly, vigilantly evaluate and edit AI content.
  2. Get deliberate. Editing AI-generated content requires thoughtful, deliberate effort. So commit to that effort, offer professional development in deliberate editing practices, and staff for it. AI may be a shortcut, but it’s not a complete “no human needed” efficiency. It demands that we rethink our skill mixes and efforts.
  3. Test your content. You should be testing your content regularly, no matter who writes it. But this just gives me another reason to say it: test your content with your actual audiences. Collect and act on their feedback. What is clear to you is not necessarily clear to them because you know too much. You can’t be your own audience.