The quadratic complexity of Multimodal Large Language Models (MLLMs) with respect to context length poses significant computational and memory challenges, hindering their real-world deployment. In this paper, we devise a "filter-correlate-compress" framework to accelerate MLLMs by systematically reducing the multimodal context length during prefilling. The framework first instantiates FiCoCo-V, a training-free method that operates within the vision encoder. It employs a redundancy-based token discard mechanism that uses a novel integrated metric to accurately filter out redundant visual tokens. To mitigate information loss, the framework introduces a correlation-based information recycling mechanism: preserved tokens selectively recycle information from correlated discarded tokens through a self-preserving compression, preventing the dilution of their own core content. The FiCoCo-L variant further leverages task-aware textual priors to perform token reduction directly within the LLM decoder. Extensive experiments demonstrate that the FiCoCo series effectively accelerates a range of MLLMs, achieving up to a 14.7x FLOPs reduction while retaining 93.6% of the original performance. Our methods consistently outperform state-of-the-art training-free approaches, demonstrating effectiveness and generalizability across model architectures, sizes, and tasks without retraining.
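To make the three stages concrete, here is a minimal PyTorch sketch of the "filter-correlate-compress" pipeline for one reduction step. The integrated redundancy metric, correlation rule, and self-preserving weight used by FiCoCo are defined in the paper; the concrete choices below (mean cosine similarity as the redundancy score, top-k softmax correlation, a fixed `alpha`) are illustrative assumptions, not our released implementation.

```python
import torch
import torch.nn.functional as F

def filter_correlate_compress(x: torch.Tensor, num_discard: int,
                              alpha: float = 0.8, top_k: int = 3) -> torch.Tensor:
    """x: (N, d) visual token features; returns (N - num_discard, d)."""
    # Pairwise cosine similarity between tokens (diagonal zeroed out).
    sim = F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1)
    sim.fill_diagonal_(0.0)

    # Filter: score each token's redundancy (here, its mean similarity to
    # all other tokens) and discard the most redundant ones.
    redundancy = sim.mean(dim=-1)
    discard_idx = redundancy.topk(num_discard).indices
    keep_mask = torch.ones(x.size(0), dtype=torch.bool, device=x.device)
    keep_mask[discard_idx] = False
    kept, dropped = x[keep_mask], x[discard_idx]

    # Correlate: link each discarded token to its most similar preserved tokens.
    cross = sim[discard_idx][:, keep_mask]                      # (D, K)
    top_val, top_idx = cross.topk(min(top_k, kept.size(0)), dim=-1)
    weights = top_val.softmax(dim=-1)                           # (D, k)

    # Compress: preserved tokens recycle information from their correlated
    # discarded tokens, keeping a self-preserving weight alpha on their own
    # content so core features are not diluted.
    recycled = torch.zeros_like(kept)
    counts = torch.zeros(kept.size(0), 1, dtype=x.dtype, device=x.device)
    recycled.index_add_(0, top_idx.reshape(-1),
                        (weights.unsqueeze(-1) * dropped.unsqueeze(1)).reshape(-1, x.size(-1)))
    counts.index_add_(0, top_idx.reshape(-1), weights.reshape(-1, 1))
    merged = counts.squeeze(-1) > 0
    out = kept.clone()
    out[merged] = alpha * kept[merged] + (1 - alpha) * recycled[merged] / counts[merged]
    return out
```

Applying such a step at selected layers compounds into the overall context-length reduction during prefilling.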
Based on this paradigm, we develop a series of methods named FiCoCo that efficiently reduce the number of visual tokens without retraining. FiCoCo-V reduces tokens in the vision encoder, while FiCoCo-L reduces tokens in the LLM decoder.
In addition to the figure above, we provide an algorithm illustration for FiCoCo-V and FiCoCo-L to clarify their distinct solutions across the three stages.
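To sketch how the LLM-side variant differs, the snippet below scores visual tokens inside the decoder with a task-aware prior derived from the decoder's own attention map: visual tokens that receive little attention from the text (task) tokens are treated as more discardable. The attention extraction, the blending coefficient `beta`, and the fusion rule here are illustrative assumptions; the paper gives FiCoCo-L's actual formulation.

```python
import torch

def task_aware_redundancy(attn: torch.Tensor, visual_idx: torch.Tensor,
                          text_idx: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    """attn: (heads, L, L) self-attention map of one decoder layer.
    Returns one redundancy score per visual token (higher = more discardable)."""
    attn = attn.mean(dim=0)  # average attention over heads: (L, L)

    # Task-aware textual prior: visual tokens that the text tokens rarely
    # attend to are assumed less relevant to the current task.
    text_to_vis = attn[text_idx][:, visual_idx].mean(dim=0)    # (V,)
    text_prior = 1.0 - text_to_vis / (text_to_vis.max() + 1e-6)

    # Visual self-redundancy: how strongly each visual token is already
    # covered by the other visual tokens' attention.
    vis_to_vis = attn[visual_idx][:, visual_idx].mean(dim=0)   # (V,)

    return beta * text_prior + (1 - beta) * vis_to_vis
```

The resulting scores can drive the same filter-correlate-compress step shown earlier, now informed by the task described in the text prompt.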
We illustrate the comparison with existing token reduction methods below, where our FiCoCo-V and FiCoCo-L achieve state-of-the-art results with five popular MLLMs across benchmarks:
Please refer to our paper for detailed experimental results.
@inproceedings{FiCoCo2025,
title={Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration},
author={Yuhang Han and Xuyang Liu and Zihan Zhang and Pengxiang Ding and Junjie Chen and Donglin Wang and Honggang Chen and Qingsen Yan and Siteng Huang},
booktitle={Proceedings of the 40th AAAI Conference on Artificial Intelligence},
year={2025}
}