[S2-E9] LaMDA: Language Models for Dialog Applications

SeaVoice Stories

LaMDA: Language Models for Dialog Applications

We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text.

While model scaling alone can improve quality, it yields smaller improvements in safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements on the two key challenges of safety and factual grounding.

The first challenge, safety, involves ensuring that the model’s responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of human values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety.

The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible.

Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.

Figure 1: Impact of model pre-training alone vs. with fine-tuning in LaMDA on dialog quality (left), and safety and factual grounding (right). The quality metric (SSI) corresponds to sensibleness, specificity, and interestingness. See Section 4 for more details on these metrics.
1 Introduction
Language model pre-training is an increasingly promising research approach in NLP [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. As pre-training uses unlabeled text, it can be combined with scaling model and dataset sizes to achieve better performance or new capabilities [13]. For example, GPT-3 [12], a 175B parameter model trained on a large corpus of unlabeled text, shows an impressive ability in few-shot learning thanks to scaling. 
Dialog models [14, 15, 16], one of the most interesting applications of large language models, successfully take advantage of Transformers’ ability to represent long-term dependencies in text [17, 18]. Similar to general language models [13], Adiwardana et al. [17] show that dialog models are also well suited to model scaling. There is a strong correlation between model size and dialog quality. Inspired by these successes, we train LaMDA, a family of Transformer-based neural language models designed for dialog.
These models’ sizes range from 2B to 137B parameters, and they are pre-trained on a dataset of 1.56T words from public dialog data and other public web documents (Section 3). LaMDA makes use of a single model to perform multiple tasks: it generates potential responses, which are then filtered for safety, grounded on an external knowledge source, and re-ranked to find the highest-quality response. We study the benefits of model scaling with LaMDA on our three key metrics: quality, safety, and groundedness (Section 4).
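The generate, filter, and re-rank flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Candidate` type, the `select_response` helper, the score ranges, and the `safety_threshold` value are all assumptions made for the example, and the scores are supplied by hand rather than predicted by a model. The grounding step is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A candidate response with hypothetical, precomputed scores."""
    text: str
    safety: float   # assumed to lie in [0, 1]; higher means safer
    quality: float  # assumed scalar quality score; higher means better


def select_response(
    candidates: List[Candidate],
    safety_threshold: float = 0.5,  # illustrative value, not from the paper
) -> Optional[Candidate]:
    """Drop candidates below the safety threshold, then return the
    highest-quality survivor, or None if nothing passes the filter."""
    safe = [c for c in candidates if c.safety >= safety_threshold]
    if not safe:
        return None
    return max(safe, key=lambda c: c.quality)
```

Filtering before ranking means that a high-quality but unsafe candidate can never be selected, whatever its quality score.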
We observe that: (a) model scaling alone improves quality, but its improvements on safety and groundedness are far behind human performance, and (b) combining scaling and fine-tuning improves LaMDA significantly on all metrics, and although the model’s performance remains below human levels in safety and groundedness, the gap in quality to measured crowdworker levels (labeled ‘Human’ in Figure 1) can be narrowed.
The first metric, quality, is based on three components: sensibleness, specificity, and interestingness (Section 4). We collect annotated data that describes how sensible, specific, and interesting a response is for a multiturn context. We then use these annotations to fine-tune a discriminator to re-rank candidate responses.
The second metric, safety, is introduced to reduce the number of unsafe responses that the model generates. To achieve this, we define an illustrative set of safety objectives that attempt to capture the behavior that the model should exhibit in a dialog (Appendix A.1), and we use a demographically diverse set of crowdworkers to label responses in multiturn dialogs for these objectives (Appendix A.2, A.3). We then use these labels to fine-tune a discriminator to detect and remove unsafe responses (Section 6.1). Our work on safety for LaMDA can be understood, at a high level, as a process for AI value alignment.
The third metric, groundedness, is introduced so that the model produces responses that are grounded in known sources wherever they contain verifiable external world information. Because neural language models such as LaMDA generalize rather than just memorize, they tend to generate responses that seem plausible but actually contradict factual statements made in established sources. We use this metric to steer the model away from this tendency. While grounding in known sources does not guarantee factual accuracy, it allows users or external systems to judge the validity of a response based on the reliability of its source and its faithful reproduction. We find that augmenting model outputs with the ability to use external tools, such as an information retrieval system, is a promising approach to achieving this goal. Therefore, we collect data from a setting where crowdworkers can use external tools to research factual claims, and train the model to mimic their behavior.
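To make the idea of consulting external tools concrete, here is a small sketch in which a query is passed to a set of tools and the returned snippets are collected as evidence. It is only an illustration under stated assumptions: the `calculator`, `make_retriever`, and `consult_tools` helpers, the in-memory corpus, and the `example.org` source label are all hypothetical, and nothing here reflects how LaMDA's actual toolset is built.

```python
import ast
import operator
from typing import Callable, Dict, List

Tool = Callable[[str], List[str]]

# Arithmetic operators the toy calculator accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(query: str) -> List[str]:
    """Evaluate a plain arithmetic expression; return [] for anything else."""
    def ev(node: ast.AST) -> float:
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")

    try:
        return [str(ev(ast.parse(query, mode="eval").body))]
    except (ValueError, SyntaxError, ZeroDivisionError):
        return []


def make_retriever(corpus: Dict[str, str]) -> Tool:
    """Build a toy retriever over a {source: snippet} corpus (hypothetical)."""
    def retrieve(query: str) -> List[str]:
        q = query.lower()
        return [
            f"{snippet} (source: {source})"
            for source, snippet in corpus.items()
            if q in snippet.lower()
        ]
    return retrieve


def consult_tools(query: str, tools: List[Tool]) -> List[str]:
    """Send the query to every tool and concatenate whatever they return."""
    evidence: List[str] = []
    for tool in tools:
        evidence.extend(tool(query))
    return evidence
```

Keeping the source label attached to each snippet reflects the point made above: a reader can judge a grounded response by the reliability of the source it cites.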
Finally, we explore the use of LaMDA in the domains of education and content recommendations to investigate its potential and shortcomings. Similar to the concept of prompts in GPT-3 [12], we precondition LaMDA on a few turns of application-specific dialog to adapt LaMDA to the target applications. We perform experiments to compare the application-specific helpfulness (i.e., useful and correct responses) and role consistency (i.e., agent utterances match agent role) of pre-training-only and fine-tuned LaMDA models subject to application-specific preconditioning.
We find that both types of models can adapt to their expected application roles fairly well, but fine-tuned LaMDA models are significantly more helpful.
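Preconditioning on a few turns of application-specific dialog can be pictured as prepending those turns to the conversation before asking the model for its next utterance. The sketch below assumes a simple "speaker: utterance" text format; the `build_context` helper, the speaker labels, and the example turns are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


def build_context(
    precondition: List[Turn],
    history: List[Turn],
    agent: str = "Agent",  # hypothetical speaker label
) -> str:
    """Concatenate the preconditioning turns and the live dialog, then
    end with the agent's label so a model would continue in that role."""
    lines = [f"{speaker}: {utterance}" for speaker, utterance in precondition + history]
    lines.append(f"{agent}:")
    return "\n".join(lines)


# Example: adapt the agent to a music recommendation role.
context = build_context(
    precondition=[
        ("User", "Hi, who are you?"),
        ("Agent", "I am a music recommendation assistant."),
    ],
    history=[("User", "Can you suggest something calm?")],
)
```

Because the role is set only through the preceding turns, the same underlying model can be adapted to different applications without further training.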
2 Related work
Language models and dialog models: Language models have attracted much attention recently thanks to their successes in NLP applications (e.g., [19, 20, 21, 2, 1, 22, 23, 5, 12, 24]). Our study of scaling laws with respect to model sizes is inspired by recent work on the scaling laws of neural language models [12, 13]. Similar to their findings, our results show that model scaling improves our quality (sensibleness, specificity, and interestingness), safety and groundedness metrics to some extent. However, fine-tuning combined with scaling significantly improves performance on all metrics.
Our work is also closely related to recent successes in applying language models to dialog modeling (e.g., [25, 26, 17, 18]), which built on earlier research in neural dialog modeling (e.g., [14, 15, 16, 27, 28]). One of our fine-tuning stages requires training on dialog-only data, which is related to Wolf et al. [29], Dinan et al. [25] and Zhang et al. [30]. Our use of fine-tuning on crowdworker-annotated data to improve interestingness is comparable to Roller et al. [18]. However, we aim to maximize the interestingness of the model’s output distinctly from its ability to engage the user in further interaction.
Our finding that pure scaling has a limited effect on key measures of open-domain dialog model performance echoes that of Shuster et al. [31], who also focus on the problem of groundedness. Recent studies on scaling have found that performance on question-answering tasks improves with model size [32, 33], similar to our findings on pre-trained LaMDA prior to fine-tuning. Our approach to improving model groundedness is broadly consistent with a growing literature on augmenting neural language models with retrieval systems.
Most of the existing literature focuses on the problem of open-domain question-answering rather than dialog generation, and the models themselves are used to index and rank knowledge sources, rather than trained to use an intermediate tool. Given these differences, we note that the range of existing approaches to this problem includes the RNNLM [34], RAG [35], REALM [36], and FiD [37] architectures. Zhu et al. [38] provide a survey of further recent work. See Karpukhin et al. [39] for details on the ‘dense passage retriever’ used in RAG. Recent work in this direction has expanded and elaborated on neural models’ ability to retrieve and rank passages [40].
The RETRO architecture demonstrates that language models can be primed with results retrieved from a database as large as two trillion tokens [41]. At a broad level, our approach is also comparable to that of Byrne et al. [42], which fine-tunes the model to use external APIs for movie ticketing dialog. Parts of our findings are similar to recent studies on dialog groundedness. Granting access to external knowledge bases has been shown to reduce the rate at which models hallucinate unsourced statements in dialog across a variety of retrieval systems and model architectures [31].
Another study finds that a question-answering system’s accuracy is improved by separating it into a reasoning unit and a response generator, analogous to our separation of ‘Base’ and ‘Research’ models in our study [43]. Meanwhile, the WebGPT framework includes a language system that ca