AI is only useful if people trust it, and to win that trust, outputs need to be accurate. While new tools like DALL-E and ChatGPT have entered the cultural zeitgeist, so have concerns that AI tools could let misinformation guide decision-making. For example, ChatGPT often answers prompts confidently with false information and does not flag it as wrong. If ChatGPT is integrated into a search engine (e.g., Microsoft Bing) and generates false information, trust in the search engine will evaporate, and ChatGPT will cease to be useful.
Summarization is a very different use case from the more mainstream generative AI applications that focus on creativity. At a minimum, accuracy is required to make a summarization product useful; without accuracy, a summary has no value.

However, current large language models generate text probabilistically: each block of characters (commonly referred to as a token) is predicted based on a probability score. These models are trained on massive datasets that give them a signal as to which text string should come next in the sequence. Generally speaking, the larger the model, the more detailed its connections between concepts and words, and the more likely it is to produce a correct prediction. But because each prediction is fundamentally rooted in the model’s training data, that data heavily influences the output. Stringing tokens together by probability makes large language models exceptionally good at creativity and at constructing well-structured, correct-sounding text, which also makes errors very hard to detect. In addition, these models are so large that training them requires enormous datasets, costs millions of dollars, and takes many months. Updating them frequently is therefore challenging, and the information they contain is not always accurate or up to date.
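To make that token-by-token prediction concrete, here is a minimal sketch in Python. The candidate tokens and scores are invented for illustration; the point is simply that the model converts raw scores into a probability distribution and then picks the next token from it.

```python
import math
import random

def softmax(logits):
    """Convert raw model scores (logits) into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented scores a model might assign to candidate next tokens
# after the prompt "The President of the United States is".
candidates = ["the", "Donald", "a", "elected", "Barack"]
logits = [4.1, 1.2, 0.9, 0.4, 0.1]

probs = softmax(logits)
for token, p in sorted(zip(candidates, probs), key=lambda pair: -pair[1]):
    print(f"{token:>8}: {p:.2%}")

# Sampling-based decoding: the most probable token usually wins, but
# lower-probability (and possibly wrong) continuations remain reachable.
next_token = random.choices(candidates, weights=probs, k=1)[0]
print("sampled next token:", next_token)
```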
It is worth noting that this article was written right in the middle of an explosion of AI technology breakthroughs. Microsoft, Google, and others have started experimenting with retrieval-augmented generation techniques designed to improve factual accuracy. In some ways, this is an admission that current model architectures need supplementary guardrails to stay accurate, but realistically, only time will tell whether this solves the problem.
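As a rough illustration of the idea, and not any particular vendor's implementation, retrieval-augmented generation fetches relevant, up-to-date passages and places them in the prompt so the model can ground its answer. In the sketch below, `retrieve` is a naive keyword matcher and `generate` is a hypothetical placeholder for a call to a large language model.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# `retrieve` and `generate` are hypothetical placeholders, not a real API.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval over an up-to-date document store."""
    terms = query.lower().split()
    scored = [(sum(term in doc.lower() for term in terms), doc) for doc in corpus]
    return [doc for score, doc in sorted(scored, reverse=True)[:k] if score > 0]

def generate(prompt: str) -> str:
    """Stand-in for a call to a large language model."""
    return f"<model answer grounded in: {prompt[:60]}...>"

corpus = [
    "Joe Biden was inaugurated as President of the United States in January 2021.",
    "The President is elected to a four-year term.",
]

question = "Who is the President of the United States?"
context = "\n".join(retrieve(question, corpus))
answer = generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
print(answer)
```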
Here’s a rudimentary example of the problem. Prompting a best-in-class model with “Who is the President of the United States?” produces the following result: “The President of the United States is the head of state and head of government of the United States. The President is elected to a four-year term by the people.”
Aside from the fact that the answer doesn’t address what the prompt was asking, the interesting part is the prediction probabilities for the first word.
You’ll notice from the graphic that “The” is the highest-probability prediction for the first word at 93.88%, while “Donald” sits at 2.10% and “Joe” is not in the top 5. Once the model has predicted “The President of the United States is”, “Donald” has a 7% probability as the next word and “Joe” still isn’t in the top 5. However, Presidents “Barack” and “George” are.
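You can reproduce this kind of inspection yourself. The sketch below uses GPT-2, an older publicly available model, via the Hugging Face transformers library; the exact probabilities will differ from the figures quoted above, but the pattern is the same.

```python
# Inspect a model's top next-token predictions for a prompt.
# GPT-2 is used here because it is small, public, and trained on older data.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The President of the United States is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the whole vocabulary for the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>10}  {prob.item():.2%}")
```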
As of this writing in early 2023, Joe Biden is the current president. Admittedly, for the purpose of this example we used a model trained on older data, but until very recently, if you asked the same question of some of the best publicly available models, they would answer “Donald Trump”: a fact that was once true but no longer is. Now extrapolate this problem to news or sports content, where facts are changing all the time.
Models predict things based on the data they’ve been trained on. That data is hard to update in real time, models are expensive to re-train, and retraining takes a long time (up to several months) to complete. At Summari, we believe that to maintain accuracy, you must assume errors will occur, no matter how convincing the output reads. Starting from this first principle, it’s imperative to design systems around the core AI models that address the problem: systems that can detect and fix errors, or flag them when a fix can’t be made with certainty.
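As one simplified illustration of what “detect, fix, or flag” can mean in practice (a sketch, not Summari’s actual pipeline), the check below verifies that every number quoted in a summary also appears in the source text, and flags the summary for review when it cannot confirm one.

```python
import re

def check_numbers(source: str, summary: str) -> dict:
    """Detect numeric claims in the summary that do not appear in the source:
    a toy stand-in for a broader error-detection layer around the model."""
    source_numbers = set(re.findall(r"\d+(?:[.,]\d+)*", source))
    summary_numbers = re.findall(r"\d+(?:[.,]\d+)*", summary)
    unsupported = [n for n in summary_numbers if n not in source_numbers]
    return {
        "ok": not unsupported,
        "unsupported_numbers": unsupported,
        # If a number can't be verified, flag the summary instead of shipping it.
        "action": "publish" if not unsupported else "flag_for_review",
    }

source = "The company reported revenue of $42 million in 2022, up 18% year over year."
summary = "Revenue grew 18% to $45 million in 2022."
print(check_numbers(source, summary))  # flags "45": the source says 42, not 45
```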
The illusion of simplicity surrounding current products is dangerous because it leads users to trust results that aren’t accurate. This is not to say these products aren’t useful in some domains (quite the opposite, in fact), but for domains requiring accuracy, there is great risk.
On the surface, summarization is a subjective field. Still, it is generally accepted that a good summary is both concise and informative: it should be written in a way that maximizes the speed at which core ideas are communicated and allows a reader to grasp enough to decide whether to go further. There are two types of summarization: (1) extractive and (2) abstractive.
Extractive summarization is when a summary consists of text copied directly from the source. This is great for accuracy, but it is also highly susceptible to plagiarism, and it’s challenging to convey key themes by simply extracting limited text from the document. Further, the more you extract, the less concise the summary; the less you extract, the less comprehensive and useful the summary. A summary has no value if it is not both concise and informative, and with extractive summarization these two goals work against each other.
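For illustration, here is a bare-bones extractive summarizer, a sketch rather than a production approach: it scores each sentence by word frequency and copies the top-scoring sentences verbatim, which keeps the wording accurate but runs straight into the concision problem described above.

```python
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 2) -> str:
    """Score sentences by the frequency of the words they contain and
    return the top-scoring sentences verbatim, in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
        reverse=True,
    )
    keep = sorted(scored[:n_sentences])
    return " ".join(sentences[i] for i in keep)

article = (
    "The city council approved a new transit plan on Tuesday. "
    "The plan adds three bus routes and extends light rail service. "
    "Opponents argued the plan is too expensive. "
    "Supporters said the plan will reduce traffic congestion."
)
print(extractive_summary(article))
```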
Abstractive summarization is when a summary is made of newly generated text designed to synthesize the source. Fortunately, language gives us the tools to convey more information in fewer words, and this type of summarization fulfills the criteria of a good summary by being (1) concise and (2) informative.
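By contrast, an abstractive summarizer generates new sentences. A minimal sketch using an off-the-shelf model (BART fine-tuned on CNN/DailyMail, via the Hugging Face transformers pipeline) looks like this; the output paraphrases the source, which is exactly where unsupported statements can creep in.

```python
# Abstractive summarization with an off-the-shelf model: the output is newly
# generated text, so it still has to be checked against the source.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "The city council approved a new transit plan on Tuesday. "
    "The plan adds three bus routes and extends light rail service. "
    "Opponents argued the plan is too expensive, while supporters said "
    "it will reduce traffic congestion over the next decade."
)

result = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```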
But by asking a model to produce new text, you increase the risk of incorrect statements appearing in the summary. There is a balance to strike between usefulness (concise and comprehensive) and accuracy (copied and pasted, but missing information). This is an incredibly difficult problem to solve at scale, and it’s what we focus on every day at Summari.
Once a summary is produced and accurate, it needs to be in the right format. A big block of text is not easy to consume and does not maximize the speed of communication. Once a useful format is established (one that differs by content type), a model needs to consistently output that style. This requires training a model with as much high-quality data as possible.
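As one example of what consistent formatting can mean (the specific house style below is hypothetical, not Summari’s), a downstream check can verify that every generated summary follows the agreed structure, say a short headline followed by a handful of brief bullet points, before it reaches a reader.

```python
def follows_format(summary: str, max_bullets: int = 5, max_words: int = 20) -> bool:
    """Check a summary against a hypothetical house style:
    one headline line, then 1-5 bullet points of at most 20 words each."""
    lines = [line.strip() for line in summary.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        return False
    headline, bullets = lines[0], lines[1:]
    if headline.startswith("-") or len(bullets) > max_bullets:
        return False
    return all(b.startswith("- ") and len(b.split()) <= max_words for b in bullets)

good = "Council approves transit plan\n- Three new bus routes added\n- Light rail service extended"
bad = "The council met on Tuesday and discussed many topics at great length without structure."
print(follows_format(good))  # True
print(follows_format(bad))   # False
```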
There is another balance to navigate here: (1) more high-quality training data is better, versus (2) the difficulty and expense of creating that data. Creating consistent, high-quality data requires meticulous attention to detail, because errors introduced into the dataset become signals for the model to replicate those errors at scale. This is a potential area of significant differentiation, and it’s where we spend a lot of time and energy at Summari.
Luckily, it is possible to fine-tune models by introducing unique datasets to augment the original training. This is very important for summarization, in particular to ensure the style and format of outputs are consistent. It also trains the model to reduce errors as much as possible, in line with the data it’s trained on, which is why it’s imperative to maintain a consistently high-quality dataset. But this is a big investment, and it is not something that can be outsourced at low cost like other labeling tasks in AI. There are no shortcuts.
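For a sense of what such a dataset can look like, here is a minimal sketch: each training example pairs a source article with a summary written in the exact target style, serialized as JSONL, the format most fine-tuning APIs accept. The field names and file name are illustrative.

```python
import json

# Illustrative fine-tuning examples: each pair maps a source article to a
# summary written in the exact target style. Consistency matters, because
# any formatting or factual slip in this data gets replicated at scale.
examples = [
    {
        "prompt": "Summarize the following article:\n<article text 1>\n\nSummary:",
        "completion": "Headline for article 1\n- First key point\n- Second key point",
    },
    {
        "prompt": "Summarize the following article:\n<article text 2>\n\nSummary:",
        "completion": "Headline for article 2\n- First key point\n- Second key point",
    },
]

# Serialize as JSONL, one training example per line.
with open("summaries.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```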
Systems like ChatGPT are exceptional at generating text in the context of their training. However, they have no awareness, or even concept, of truth. The data they’re trained on governs their source of truth, and if that data contains bias, the model will make similarly biased predictions. This has its own set of problems, and the ethics debates will continue. For practical use cases of this technology, however, having a way to cross-check at least generally accepted facts is an important feature, and not one that current architectures can accommodate without additional systems augmenting them.
When summarizing content, there is a source of truth available to reference: the text being summarized, which should be represented as accurately as possible. A summary should retain facts and not alter context, yet with abstractive summarization, context can shift easily. For example, many English words carry different meanings depending on context, like “season”, “pool”, or “letter”. Sometimes a proper noun is spelled the same as an everyday object (e.g., Rose and rose). These trivial examples illustrate the problem in simple terms, but there are many more that are much harder to find and fix.
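One way to catch some of these shifts, as a toy sketch rather than a complete solution, is to cross-check the summary against the source as the reference, for example by verifying that every capitalized name in the summary also appears capitalized in the source.

```python
import re

def capitalized_words(text: str) -> set[str]:
    """Collect capitalized words, ignoring the first word of each sentence
    (which is capitalized whether or not it is a proper noun)."""
    found = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z]+", sentence)
        found.update(w for w in words[1:] if w[0].isupper())
    return found

def unknown_proper_nouns(source: str, summary: str) -> list[str]:
    """Capitalized names in the summary that never appear capitalized in the
    source: a crude signal that a name or its context may have shifted."""
    return sorted(capitalized_words(summary) - capitalized_words(source))

source = "Last spring, Rose planted a rose bush outside the town library."
summary = "A rose was planted by Rose outside the Library."
print(unknown_proper_nouns(source, summary))  # ['Library']
```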
In addition, bias should be eliminated: a summary should be objective and an accurate representation of the underlying content, and neither the author nor the AI should introduce any foreign bias into the writing. But if the data a model uses contains bias, its output will reflect, and sometimes accentuate, that bias, damaging the quality of the output.
Large language models are amazing technological breakthroughs, but they aim to solve many different use cases. It’s important to adapt and augment them to solve a very particular, high-value use case.
With the vast amount of content available online and a finite amount of attention to consume it, summarization becomes more important every day. This problem is especially acute for content publishers, who compete for attention and need novel ways to capture it. Human summarization is too time-consuming and expensive to be viable at scale, and AI has its challenges. But with the growing demand for short-form content in an attention-deficient world, delivering an at-scale solution will help people consume more with less.
There are no shortcuts to quality. Creating massive high-quality datasets takes time, patience, and resources. Identifying common errors is easy, but it’s only after millions of examples that you find the more obscure, yet equally important, errors that need fixing. Some errors are easy to fix and others aren’t, so the challenge is both a quantity problem and a difficulty problem. These models will improve over time, and our expectation is that the next few years will see huge advances. Still, errors can be catastrophic to the usability of any output, and without a robust system that improves over time, current models will fall short. For creative tasks, current models work well. For non-creative tasks, a solution must be found.
At Summari, we are blown away by AI’s ability to be creative, and we’re excited about what’s to come. The work done by pioneers like OpenAI is accelerating the development of future-defining technology, but the usability of this technology is still in its infancy. Summari builds on the work done by large language model providers, combining it with extensive processing systems of checks and guardrails that detect, fix, or flag errors, with the ultimate goal of building trust with the end user and providing content that is useful.
Contact us to learn more; we’d love to chat.