General multimodal learning systems should retain both linguistic (text-only) and existing multimodal capabilities while continuously acquiring new multimodal capabilities.
Multimodal LLMs → undergo large linguistic forgetting, i.e., loss of linguistic abilities, during multimodal training.
Can we mitigate linguistic forgetting efficiently?
Summary Results: Our best CL method compared to LLaVA 1.5 and the base unimodal LLM, at 2.8B scale.
Study: We study LLaVA1.5 with 9 choices of base LLM, varying in scale and instruction tuning: 2 LLaMA2 (7B) models, 6 Pythia models (160M–2.8B), and Phi2 (3B).
Linguistic forgetting, especially NLG forgetting, is significant and varies with the scale and type of base LLM.
We adapt ideas from Continual Learning (CL) to train MLLMs (LLaVA): Rehearsal, LoRA (Low-Rank Adaptation), Soft Targets, and Stability Gap Minimization.
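As an illustration, the sketch below shows one way the Soft Targets idea can be instantiated: a distillation-style loss that mixes the usual next-token cross-entropy with a KL term toward the frozen base LLM's (teacher) output distribution. The function name, temperature, and mixing weight are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, labels,
                     alpha=0.5, temperature=2.0, ignore_index=-100):
    """Blend hard-label cross-entropy with KL toward the frozen base LLM.

    student_logits, teacher_logits: (batch, seq, vocab)
    labels: (batch, seq) token ids, with ignore_index on padded positions.
    alpha, temperature: illustrative hyperparameters (assumptions).
    Masking of padded positions in the KL term is omitted for brevity.
    """
    vocab = student_logits.size(-1)

    # Standard next-token cross-entropy on the new multimodal data.
    ce = F.cross_entropy(student_logits.reshape(-1, vocab),
                         labels.reshape(-1), ignore_index=ignore_index)

    # Soft-target term: match the teacher's (base LLM's) token distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    return (1.0 - alpha) * ce + alpha * kl
```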
MLLMs undergo alignment pre-training per the LLaVA1.5 protocol; mitigation methods are applied during multimodal LLaVA fine-tuning.
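For example, LoRA can be applied at this stage by freezing the base LLM and training only low-rank adapters during LLaVA fine-tuning. A minimal sketch using the Hugging Face peft library follows; the checkpoint name, rank, and target modules are assumptions chosen for illustration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a base (unimodal) LLM; the checkpoint name is illustrative.
base_llm = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-2.8b")

# Low-rank adapters on the attention projections; hyperparameters are assumptions.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # GPT-NeoX/Pythia attention projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_llm, lora_cfg)
model.print_trainable_parameters()  # only the adapters are trainable; base weights stay frozen
```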
Measuring VL Perf. and Linguistic Forgetting across 0.16B to 2.8B param scales.
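One natural way to quantify linguistic forgetting at a given scale (a plausible reading of the metric, not necessarily the paper's exact definition) is the average drop on the language benchmarks relative to the base unimodal LLM:

```latex
\text{Linguistic Forgetting} \;=\; \frac{1}{K}\sum_{k=1}^{K}\Big(\mathrm{Score}_k(\text{base LLM}) \;-\; \mathrm{Score}_k(\text{MLLM})\Big)
```

where the sum runs over $K$ NLU/NLG benchmarks; larger values mean more forgetting, and negative values mean the MLLM improved over its base LLM.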
LLaVA with mitigation methods shows negligible linguistic forgetting and matches full VL performance as model scale increases.
We split the LLaVA1.5 multimodal fine-tuning task into 4 groups of Vision-Language (VL) tasks to create a continual-learning LLaVA setup. CL methods were then applied during continual training (sketched below).
Continual LLaVA Setup: Sequence of tasks from LLaVA Instruct
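A minimal sketch of the resulting continual training loop, under our assumptions; the task-group names and the `cl_method.finetune` / `suite.evaluate` helpers are hypothetical placeholders, not the paper's code.

```python
# Hypothetical sketch of the Continual LLaVA loop; helpers are placeholders.
VL_TASK_GROUPS = ["group_1", "group_2", "group_3", "group_4"]  # splits of LLaVA Instruct

def continual_llava_training(mllm, task_groups, cl_method, eval_suites):
    """Train on the VL task groups sequentially, applying a CL mitigation method."""
    results = []
    for task_data in task_groups:
        # CL mitigation (e.g., rehearsal, soft targets, mSGM) wraps the fine-tuning step.
        mllm = cl_method.finetune(mllm, task_data)

        # After each task, evaluate VL and linguistic (NLU/NLG) benchmarks so that
        # forgetting / backward transfer can be computed over the whole sequence.
        results.append({name: suite.evaluate(mllm) for name, suite in eval_suites.items()})
    return mllm, results
```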
Measuring VL Perf. and Linguistic Forgetting, averaged over all CL tasks, across 0.16B to 1.4B param scales.
mSGM + Rehearsal matches full VL performance, with negligible (or even negative) linguistic forgetting (positive BWT) in continual LLaVA training.
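For reference, backward transfer (BWT) follows the standard continual-learning definition; how the linguistic benchmarks enter the task sequence here is our assumption:

```latex
\mathrm{BWT} \;=\; \frac{1}{T-1}\sum_{i=1}^{T-1}\big(R_{T,i} - R_{i,i}\big)
```

where $R_{j,i}$ is performance on task $i$ after training through task $j$ and $T$ is the number of tasks in the sequence. Negative BWT indicates forgetting of earlier tasks, while positive BWT means later training actually improved earlier (e.g., linguistic) performance.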