Improving Multimodal Large Language Models Using Continual Learning

1 University of Rochester 2 Rochester Institute of Technology
*Corresponding author: shikhar.srivastava@rochester.edu
CoLLAs 2025 Conference
SCFM, NeurIPS 2024 Workshop
Overview diagram: using continual learning to improve multimodal large language models by mitigating catastrophic forgetting of linguistic abilities.

Aim

General multimodal learning systems should retain both their linguistic (text-only) and previously acquired multimodal capabilities while continuously learning new multimodal skills.

Problem

Multimodal LLMs undergo substantial linguistic forgetting, i.e., a loss of linguistic (text-only) abilities, during multimodal training.

Our Goal

Can we mitigate linguistic forgetting efficiently?

Our Approach

  • Treat multimodal training as a continual learning (CL) task
  • Employ CL methods to reduce linguistic forgetting efficiently during training

Summary Results: Our best CL method compared to LLaVA-1.5 and the base unimodal LLM, at the 2.8B scale.

#1: Investigating Linguistic Forgetting in MLLMs 🔍

Study: We study LLaVA-1.5 with 9 choices of base LLM, varying in scale and instruction tuning: 2 LLaMA-2 (7B) models, 6 Pythia models (160M–2.8B), and Phi-2 (3B).
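As a rough illustration of the measurement, linguistic forgetting can be quantified as the drop on text-only benchmarks relative to the corresponding base LLM. The benchmark names and scores below are hypothetical placeholders, not results from the study.

```python
# Hypothetical scores (not real results): linguistic forgetting measured as the
# drop on text-only benchmarks after multimodal (LLaVA) training.
base_llm_scores = {"nlu_benchmark": 61.2, "nlg_benchmark": 48.5}
mllm_scores = {"nlu_benchmark": 58.9, "nlg_benchmark": 35.1}

forgetting = {k: base_llm_scores[k] - mllm_scores[k] for k in base_llm_scores}
avg_forgetting = sum(forgetting.values()) / len(forgetting)
print(forgetting)                      # per-benchmark drop
print(f"avg linguistic forgetting: {avg_forgetting:.1f} points")
```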

Takeaway

Linguistic forgetting, especially forgetting of natural language generation (NLG) abilities, is significant and varies with the scale and type of the base LLM.

#2: Mitigating Linguistic Forgetting in MLLMs 🛡️

Approach

We adapt ideas from continual learning to train MLLMs (LLaVA): rehearsal, LoRA (Low-Rank Adaptation), soft targets, and stability gap minimization.
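As a concrete illustration, here is a minimal sketch (not the authors' exact recipe) of how two of these ideas, soft targets and rehearsal, could be combined in a single training loss: next-token loss on the new vision-language data plus a distillation term that pulls the MLLM back toward the frozen base LLM on replayed text-only batches. The model and batch objects (`mllm`, `base_llm`, `vl_batch`, `text_batch`) are hypothetical placeholders.

```python
# Sketch only: combines "soft targets" (distillation toward the frozen base LLM)
# with "rehearsal" (replaying text-only batches). Model/batch names are
# hypothetical placeholders; image inputs are omitted for brevity.
import torch
import torch.nn.functional as F

def mitigation_loss(mllm, base_llm, vl_batch, text_batch, alpha=0.5, tau=2.0):
    # Standard next-token cross-entropy on the new vision-language batch.
    vl_logits = mllm(vl_batch["input_ids"]).logits
    vl_loss = F.cross_entropy(
        vl_logits.view(-1, vl_logits.size(-1)), vl_batch["labels"].view(-1)
    )

    # Soft targets on rehearsed text-only data: KL toward the frozen base LLM's
    # output distribution at temperature tau.
    with torch.no_grad():
        teacher_logits = base_llm(text_batch["input_ids"]).logits
    student_logits = mllm(text_batch["input_ids"]).logits
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau**2

    # Weighted sum: alpha trades off new-task learning vs. linguistic retention.
    return (1 - alpha) * vl_loss + alpha * kd_loss
```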

Exp. Setup

MLLMs undergo alignment pre-training per the LLaVA-1.5 protocol. Mitigation methods are applied during multimodal LLaVA fine-tuning.

Comparisons with LLaVA

We measure vision-language (VL) performance and linguistic forgetting across model scales from 0.16B to 2.8B parameters.

Takeaway

LLaVA with mitigation methods shows negligible linguistic forgetting and matches full VL performance as model scale increases.

#3: Mitigating Linguistic Forgetting in MLLMs while fine-tuning continually 🔄

Approach

We split the LLaVA-1.5 multimodal fine-tuning data into 4 groups of vision-language (VL) tasks to create a continual LLaVA setup. CL methods are then applied during continual training.
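A minimal sketch of what this continual setup could look like in code, assuming the 4 VL task groups arrive as lists of batches and a bounded rehearsal buffer is carried between tasks (all names below are hypothetical):

```python
# Sketch only: sequential fine-tuning over 4 VL task groups with simple
# rehearsal. `train_step` performs one optimizer update; CL regularizers
# (e.g., soft targets, mSGM) would be applied inside it.
import random

def train_continually(model, task_groups, train_step,
                      buffer_size=1000, rehearsal_ratio=0.2):
    replay_buffer = []
    for task_id, batches in enumerate(task_groups):
        for batch in batches:
            # Occasionally replay a batch from an earlier task group.
            if replay_buffer and random.random() < rehearsal_ratio:
                batch = random.choice(replay_buffer)
            train_step(model, batch)
        # Keep a bounded sample of this task's batches for later rehearsal.
        sample = random.sample(batches, min(buffer_size, len(batches)))
        replay_buffer = (replay_buffer + sample)[-buffer_size:]
    return model
```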

Continual LLaVA Setup: Sequence of tasks from LLaVA Instruct

Comparisons with LLaVA

We measure vision-language (VL) performance and linguistic forgetting, averaged over all CL tasks, across model scales from 0.16B to 1.4B parameters.

Takeaway

mSGM + Rehearsal matches full VL performance, with negligible (or even negative) linguistic forgetting (i.e., positive BWT) in continual LLaVA training.
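For reference, positive backward transfer (BWT) here follows the standard continual-learning definition (assumed, not spelled out on this page), where R_{j,i} is performance on task i after training on task j over T tasks:

```latex
% Standard backward-transfer metric from the CL literature; BWT > 0 means
% later training improved performance on earlier tasks (negative forgetting).
\[
\mathrm{BWT} \;=\; \frac{1}{T-1} \sum_{i=1}^{T-1} \bigl( R_{T,i} - R_{i,i} \bigr)
\]
```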