Adapting content for AI: Preparing and repairing our content for RAG
When you create a retrieval-augmented generation (RAG) solution, you must analyze the format and structure of the content to determine what preprocessing is needed. However, the content for many RAG solutions is a doc set that is actively being updated, in the process of being created, or still in the planning stage. If the content for your RAG solution is dynamic, it can be adapted to be more accessible to generative AI models.
You can improve results from RAG solutions by preparing your content for AI and repairing content that produces bad answers from AI.
Adapting content for generative AI can compensate for tooling limitations or even replace the need for some types of processing. Content teams at IBM tested their documentation with LLM question-and-answer prompts. We started by compiling a list of common user questions for each area. We found the content that answered each question and put both the content and the question into a prompt. If the generated answer was incorrect or inadequate, we updated the content until we got better answers. Then we compiled what we learned into a set of guidelines. We also examined all the answers generated by our RAG solution, and if the answers were incorrect or inadequate, we opened issues to update our content and created more guidelines.
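To make that test loop concrete, the following is a minimal sketch of the kind of question-and-answer prompt we describe. The template wording, the file path, and the send_to_model call are illustrative placeholders, not our production tooling.

```python
# Minimal sketch of a question-and-answer test prompt. Assumes you supply your
# own LLM call where send_to_model() is referenced below.

PROMPT_TEMPLATE = """Answer the question using only the documentation below.
If the documentation does not contain the answer, say so.

Documentation:
{content}

Question: {question}

Answer:"""


def build_test_prompt(content: str, question: str) -> str:
    """Combine a documentation passage and a user question into one test prompt."""
    return PROMPT_TEMPLATE.format(content=content, question=question)


# Example usage (file path and question are illustrative):
# with open("docs/foundation-models.html", encoding="utf-8") as f:
#     prompt = build_test_prompt(f.read(), "Which foundation models can I tune?")
# answer = send_to_model(prompt)  # call your LLM of choice here
```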
The guidelines we created fell into two main categories:
- Preparing content for AI by changing how we write
- Repairing the content that produces bad answers from AI
Preparing content for AI
Some of the ways that we updated our writing style to accommodate AI include:
- Simplify or change tables
- Explain conceptual graphics in text
- Add summaries to tutorials or long procedures
- Clearly introduce lists
- Consolidate or eliminate very short topics
Notice that many of these changes to how we write benefit our human readers as well as LLMs.
Tables
Even with a preprocessing script that converted tables to sets of bulleted lists (sketched after these guidelines), we found cases where LLMs could not understand the content in tables. Our guidelines:
- Simplify complex table structure. Restructure the table so that it does not have spanned cells, add headings to all columns, and add values to every cell in the first column.
- Add text to blank cells. For example, if the LLM does not understand a table that contains cells with checkmarks to indicate “yes” and empty cells to indicate “no”, then add the word “No” to the empty cells.
- Change symbols to text. For example, if the LLM does not understand checkmarks, replace them with a word like “Yes” or “Supported”.
- Rewrite footnote text to add context so that it can stand on its own.
- Move complex information in table cells out of the table.
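For reference, here is a rough sketch of the kind of table-to-list preprocessing mentioned above. It assumes a plain HTML table with a heading row and no spanned cells; the Markdown output format and the "Yes"/"No" substitutions are illustrative choices, not our production script.

```python
# Rough sketch: flatten a simple HTML table into labeled bulleted lists.
# Assumes a heading row and no spanned cells; uses BeautifulSoup for parsing.

from bs4 import BeautifulSoup


def table_to_bullets(table_html: str) -> str:
    """Convert a simple HTML table into Markdown-style bulleted lists, one per row."""
    table = BeautifulSoup(table_html, "html.parser").find("table")
    rows = table.find_all("tr")
    headings = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]

    lines = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        # Apply the guidelines above: spell out checkmarks and fill empty cells.
        cells = ["Yes" if c == "✓" else (c or "No") for c in cells]
        # The first column value introduces the list; remaining cells become bullets.
        lines.append(f"{cells[0]} has the following values:")
        for heading, value in zip(headings[1:], cells[1:]):
            lines.append(f"- {heading}: {value}")
        lines.append("")  # blank line between rows
    return "\n".join(lines)
```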
Conceptual graphics
By clearly explaining conceptual graphics in text, we clarified ambiguities in the graphics and avoided the expense of an image-to-text model. We noticed that a graphic can oversimplify a concept because it omits information or does not clearly designate which items are optional. It’s very easy to create a graphic that depicts a process or concept that you understand. It’s much harder to create a graphic that adequately explains the process or concept to someone who doesn’t already know it. By explaining a process or concept in text as if we don’t have the graphic, we prevent confusion for our readers and the LLM.
Long tutorials and procedures
A long procedure or multi-page tutorial can be too long to return in LLM output. When we added a summary of the main steps before the lengthy procedure or tutorial, LLMs returned the summary instead of a partial set of steps. The summaries also helped set expectations for readers.
Lists
Without a clear introductory sentence before a list of items, LLMs can have trouble understanding the purpose of the list. Introduce a set of bulleted items with a sentence that references the list with a phrase like “the following items”. For example, “You can choose from the following supported foundation models”. Introduce steps with an imperative phrase that references the task. For example, “To create a prompt”.
Very short topics
Very short topics are HTML files that are smaller than 3 KB, which is about 5 sentences of text. A topic that small probably does not have enough information to produce adequate answers. Google stopped indexing tiny topics because they don’t provide enough information, and our RAG solution omits these topics from the vector database (a sketch of that filter follows the guidelines below). How to address a tiny topic depends on whether it is necessary to the structure of your documentation set or has content that must be returned by search.
A topic is necessary in the table of contents for one of these reasons:
- Consistent structure: The topic makes the TOC structure consistent. However, if your structure results in many tiny topics, consider rearchitecting.
- Ease of navigation: The topic has many child topics that are easier to find or skip with a parent topic.
A topic has content that must be searchable for one of these reasons:
- Useful content: The topic has content that can answer questions.
- Unique content: The topic has content that does not appear in any other topic.
Based on those criteria, our guidelines say:
- If a topic is both needed in the table of contents and in search results, either add useful content to the topic or consolidate it with other topics.
- If a topic is needed in the table of contents but not in search results, then do nothing.
- If a topic is not needed in the table of contents but is needed in search, then consolidate it with other topics.
- If a topic is not needed in the table of contents or in search, then delete the topic.
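As an illustration, here is a minimal sketch of how the 3 KB filter could fit into an ingestion pipeline. The directory layout, the threshold constant, and the index_into_vector_db placeholder are assumptions for the example, not our actual pipeline.

```python
# Minimal sketch: skip tiny HTML topics before embedding them.
# The docs directory, threshold, and ingestion call are illustrative.

from pathlib import Path

SIZE_THRESHOLD = 3 * 1024  # about five sentences of text, per the heuristic above


def topics_to_index(docs_dir: str) -> list[Path]:
    """Return the HTML topics that are large enough to be worth embedding."""
    keep, skipped = [], []
    for path in Path(docs_dir).rglob("*.html"):
        (keep if path.stat().st_size >= SIZE_THRESHOLD else skipped).append(path)
    print(f"Indexing {len(keep)} topics, skipping {len(skipped)} tiny topics")
    return keep


# for topic in topics_to_index("build/docs"):
#     index_into_vector_db(topic)  # your embedding and ingestion step goes here
```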
Repairing content for AI
After we implemented our question-and-answer solution, we monitored the questions that people asked, evaluated the answers produced by the LLM, and determined whether to update the content to produce better answers.
Based on user questions that did not produce adequate answers, we adapted content to solve the following types of problems:
- Missing content
- Mismatched terminology
- Unclear concepts
- Incomplete steps
Missing content
We added content to answer questions that our documentation did not previously cover or sufficiently explain. Sometimes people ask about functionality that’s close to what our product supports, but not exactly the same. By adding a sentence that explains both what we do and do not support, the LLM not only answers the question correctly (that the functionality is not supported), but also provides additional information that might be helpful.
Mismatched terminology
Although we can’t control what questions people ask or the terminology that they use, we have mentioned common alternative terms in our content so that the LLM can provide informative answers.
Unclear concepts
We have clarified definitions and concepts to improve LLM answers. Some writing style issues, such as wordiness, passive construction, and pronouns with unclear referents, can make text harder to understand. We solve these problems by eliminating unnecessary words, writing in second-person active voice, and replacing the pronoun at the beginning of the sentence with the appropriate noun.
Incomplete steps
An LLM can return a partial set of steps from a procedure if the steps are disjointed or inconsistent. For example, if two steps are separated by a lot of reference information that describes the possible options for the first step, the LLM might not identify the second step as being part of the procedure. The same thing can happen if the steps do not have parallel construction and a consistent format. By removing excessive reference information between steps and making the sentence structure of each step parallel, we’ve improved the completeness of LLM answers.
Conclusion
Examining our content through the lens of AI consumption has given us the chance to reevaluate our style guidelines and our underlying assumptions about how our content is used. We haven’t finished preparing our content for AI, so we don’t have metrics on how much those changes improve the quality of LLM answers. However, we know that we’re improving the quality of our content for all readers. And we’ve definitely improved LLM answers with the repairs to content that produced bad answers.
If your content is dynamic, involve your content creators in your RAG solution. At the very least, implement a feedback loop from LLM answers to content creators.
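One lightweight way to start that feedback loop, sketched under assumed names below, is to log each question, the generated answer, and the retrieved source topics so that content creators can triage the answers flagged as inadequate. The CSV format and the log_answer helper are illustrative, not part of any specific RAG framework.

```python
# Minimal sketch of a feedback log for content creators. The file name, columns,
# and helper name are assumptions for illustration.

import csv
from datetime import datetime, timezone

FEEDBACK_LOG = "rag_feedback.csv"  # illustrative location


def log_answer(question: str, answer: str, sources: list[str], helpful: bool) -> None:
    """Append one question-and-answer interaction to a CSV file that writers can review."""
    with open(FEEDBACK_LOG, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            question,
            answer,
            ";".join(sources),
            "helpful" if helpful else "needs content repair",
        ])


# Example usage (values are illustrative):
# log_answer(
#     "Can I tune every foundation model?",
#     "No. Tuning is supported only for ...",
#     ["docs/tuning-studio.html"],
#     helpful=False,
# )
```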
Previous post on this subject: Adapting content for AI: Improving accuracy of RAG solutions.