NExT-GPT: Any-to-Any Multimodal LLM for Unified Text, Image, Audio, and Video Understanding

Harsha Sai Potluri

doi:10.22178/acta.27.2.1

DOI:

https://doi.org/10.22178/acta.27.2.1

Keywords:

Multimodal Learning, Large Language Models, Any-to-Any Transformation, Cross-Modal Understanding, Neural Architecture, Deep Learning, Computer Vision

Abstract

Artificial intelligence has made remarkable progress on systems that process a single modality such as text, images, or audio. Real-world tasks, however, require understanding several modalities at once. This paper introduces NExT-GPT, an any-to-any multimodal large language model that bridges the gap between unimodal and truly multimodal AI systems. Unlike existing models that handle limited modality combinations, NExT-GPT both processes and generates text, image, audio, and video within a unified framework. The architecture combines pre-trained encoders for each input modality, a central language model that serves as the cognitive core, and specialized decoders for multimodal output generation. We employ modality-switching instruction tuning to enable seamless transitions between input-output combinations. Experimental evaluation on diverse benchmarks shows that NExT-GPT achieves competitive performance on standard tasks while uniquely supporting 16 distinct any-to-any modality transformation scenarios. The model reaches 87.3% accuracy on multimodal understanding tasks and generates high-quality outputs, with an average CLIP score of 0.82 for image generation and a mean opinion score (MOS) of 4.1 for audio synthesis. This work is a significant step toward versatile AI systems capable of human-like multimodal perception and expression.
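
The count of 16 scenarios is consistent with pairing each of the four input modalities with each of the four output modalities (4 × 4). The sketch below enumerates those pairs and names the encoder, core, and decoder stage for a given pair. It is only an illustration: the abstract does not state how the 16 scenarios are counted, so the 4 × 4 reading is an assumption, and the function and component names (`modality_pairs`, `route`, `llm_core`, and the `*_encoder` / `*_decoder` labels) are hypothetical and not taken from the paper.

```python
from itertools import product

# The four modalities named in the abstract.
MODALITIES = ("text", "image", "audio", "video")


def modality_pairs(modalities=MODALITIES):
    """Return every ordered (input, output) pair, same-modality pairs included.

    Assumption: the 16 scenarios are the 4 x 4 ordered pairs.
    """
    return list(product(modalities, repeat=2))


def route(input_modality, output_modality):
    """Name the three stages a request would pass through.

    The labels are illustrative placeholders, not the paper's components.
    """
    for m in (input_modality, output_modality):
        if m not in MODALITIES:
            raise ValueError(f"unsupported modality: {m!r}")
    return (f"{input_modality}_encoder", "llm_core", f"{output_modality}_decoder")


if __name__ == "__main__":
    pairs = modality_pairs()
    print(len(pairs))  # 16
    print(route("audio", "image"))
```

Under this reading, same-modality pairs such as text-to-text count toward the total, and every request passes through the same central language model regardless of which encoder and decoder are selected.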