OmniAlign-V: A High-Quality Dataset by Shanghai Jiao Tong University and Shanghai AI Lab
AI Product Observation


By Tina

March 27, 2025

What is OmniAlign-V?

OmniAlign-V is a high-quality dataset jointly developed by Shanghai Jiao Tong University, Shanghai AI Lab, Nanjing University, Fudan University, and Zhejiang University. It is specifically designed to enhance the alignment of multimodal large language models (MLLMs) with human preferences.

OmniAlign-V contains approximately 200,000 multimodal training samples, covering natural images and infographics, combined with open-ended, knowledge-rich Q&A pairs. The dataset emphasizes task diversity, including knowledge-based Q&A, reasoning tasks, and creative tasks, to improve model alignment through complex problems and diverse response formats. OmniAlign-V introduces an image selection strategy to ensure that semantically rich and complex images are used for data generation.

Key Features of OmniAlign-V

High-Quality Multimodal Training Data: Contains approximately 200,000 multimodal training samples, including natural images and infographics (such as posters and charts). The dataset integrates complex questions and diverse response formats to help models better understand human preferences.

Enhanced Open-Ended Q&A Capabilities: Pairs open-ended questions with cross-disciplinary knowledge and comprehensive responses, enabling models to generate answers that align more closely with human expectations.

Improved Reasoning and Creativity: Trains models to perform more complex thinking and creative tasks, enhancing their performance in multimodal interactions.

Optimized Multimodal Instruction Tuning: Uses high-quality instruction tuning data to help models better follow human instructions while retaining fundamental capabilities such as object recognition and OCR.

Support for Continuous Optimization of Multimodal Models: OmniAlign-V supports supervised fine-tuning (SFT), and model alignment can be further enhanced through Direct Preference Optimization (DPO).

Technical Principles of OmniAlign-V

Image Filtering & Classification: Uses Image Complexity (IC) scoring and Object Category (OC) filtering to select semantically rich and complex images. Images are categorized into natural images and infographics, with distinct task designs for each type.
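The filtering step above can be sketched as a simple two-stage gate. The scoring fields, thresholds, and function names below are illustrative assumptions, not the authors' actual implementation:

```python
# Hypothetical sketch of the IC/OC filtering step: keep only images that
# are visually complex (IC score) and semantically rich (object categories),
# then split survivors into natural images vs. infographics.

def filter_images(images, ic_threshold=0.5, min_categories=3):
    """Return (natural_ids, infographic_ids) for images passing both gates."""
    natural, infographic = [], []
    for img in images:
        if img["ic_score"] < ic_threshold:
            continue  # too visually simple to yield interesting Q&A
        if len(img["object_categories"]) < min_categories:
            continue  # too few distinct object types
        (infographic if img["is_infographic"] else natural).append(img["id"])
    return natural, infographic
```

In this sketch, the IC gate and the OC gate are independent, so an image must clear both before it is routed to the category-specific task designs described next.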

Natural Image Tasks: Include knowledge-based Q&A, reasoning tasks, and creative tasks to improve model understanding and generation capabilities in real-world scenarios.

Infographic Tasks: Designed specifically for charts, posters, and other complex visuals, requiring models to understand and interpret intricate information.

Q&A Generation: High-quality Q&A pairs are generated using advanced models like GPT-4o, with post-processing to optimize data quality.
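A generation pipeline like this typically assembles a task-specific prompt per image before calling the generator model. The templates and function below are an assumed sketch for illustration; they are not taken from the paper:

```python
# Hypothetical per-task prompt templates covering the three natural-image
# task types mentioned above (knowledge, reasoning, creative).
TASK_TEMPLATES = {
    "knowledge": "Ask a knowledge-based question that requires background facts about the scene.",
    "reasoning": "Ask a question whose answer requires multi-step reasoning over the image.",
    "creative": "Ask for a creative piece (e.g., a story or ad copy) grounded in the image.",
}

def build_generation_prompt(image_caption, task_type):
    """Assemble the text prompt that would be sent to a generator model
    such as GPT-4o, given an image caption and a task type."""
    if task_type not in TASK_TEMPLATES:
        raise ValueError(f"unknown task type: {task_type}")
    return (
        f"Image description: {image_caption}\n"
        f"Task: {TASK_TEMPLATES[task_type]}\n"
        "Return an open-ended question and a comprehensive answer."
    )
```

The actual model call is omitted here; the point is that task diversity comes from varying the template while keeping the image context fixed.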

Post-Processing Optimization: Enhances the generated Q&A pairs through instruction augmentation, reasoning improvements, and refined answer processing for infographics, ensuring diversity and high quality.

Multimodal Training & Optimization: Uses supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to improve model alignment. The dataset prioritizes diversity and complexity, enabling models to better understand human preferences in multimodal interactions.
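For the DPO stage, the standard per-pair loss compares how much more the policy prefers the chosen response over the rejected one, relative to a frozen reference model. A minimal sketch of that loss (simplified to scalar sequence log-probabilities; not the project's training code):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen/rejected responses
    under the policy (pi_*) and the frozen reference model (ref_*).
    """
    # Margin by which the policy shifts toward the chosen response,
    # measured relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Loss is -log(sigmoid(beta * margin)); it shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At zero margin the loss equals log 2, and it decreases monotonically as the policy puts more relative probability on the chosen response, which is what drives alignment during this stage.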

Benchmarking & Evaluation: Introduces the MM-AlignBench benchmark to assess how well MLLMs align with human preferences in realistic, open-ended scenarios.

Project Links

Official Website: https://phoenixz810.github.io/OmniAlign-V

GitHub Repository: https://github.com/PhoenixZ810/OmniAlign-V

Hugging Face Model Hub: https://huggingface.co/collections/PhoenixZ/omnialign-v

Technical Paper: https://arxiv.org/pdf/2502.18411

Application Scenarios

Multimodal Conversational Systems: Improves the quality of interactions between intelligent assistants and users, generating responses that better align with human preferences.

Image-Assisted Q&A: Combines image information to provide more comprehensive and accurate Q&A services, applicable to fields like education and tourism.

Creative Content Generation: Helps users quickly generate high-quality creative content, such as advertisements and storytelling.

Education & Learning Assistance: Provides students with richer learning materials, assisting in the understanding of complex charts and illustrations.

Infographic Interpretation: Helps users analyze complex charts, offering background knowledge and reasoning insights to enhance data comprehension.




© Copyright 2025 All Rights Reserved By Neurokit AI.