xAR - An Autoregressive Visual Generation Framework Jointly Launched by ByteDance and Johns Hopkins University

  • xAR
  • Next-X Prediction
  • Noisy Context Learning
  • Flexible Prediction Units
  • Inference Strategy

By Tina

March 26, 2025

What is xAR?

xAR is a novel autoregressive visual generation framework proposed jointly by ByteDance and Johns Hopkins University. It addresses two weaknesses of traditional autoregressive visual generation, the low information density of predicting one token at a time and the accumulation of errors across steps, through two techniques: "Next-X Prediction" and "Noisy Context Learning".

Main Functions of xAR

Next-X Prediction: Generalizes traditional next-token prediction so that each autoregressive step predicts a richer entity, such as a cell (a group of adjacent image patches), a subsample, or even the entire image, capturing more semantic information per step.
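
For example, grouping a grid of patch tokens into k×k "cell" units, one of the prediction granularities mentioned above, can be sketched as follows (a hypothetical helper in NumPy; the reshape convention is illustrative and not taken from xAR's codebase):

```python
import numpy as np

def tokens_to_cells(tokens, k):
    """Group an (H, W, D) grid of patch tokens into flat k*k cell units.

    Predicting one cell per step (next-cell prediction) carries far more
    information than predicting a single token per step.
    """
    H, W, D = tokens.shape
    assert H % k == 0 and W % k == 0, "grid must divide evenly into cells"
    # Split rows and columns into (cell index, within-cell index), then
    # bring the two cell indices to the front.
    cells = tokens.reshape(H // k, k, W // k, k, D).transpose(0, 2, 1, 3, 4)
    return cells.reshape((H // k) * (W // k), k * k * D)  # raster order
```

Each row of the result is one prediction unit; the model then regresses one such unit per autoregressive step instead of one token.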

Noisy Context Learning: Improves the model's robustness to errors by introducing noise during training, alleviating the problem of cumulative errors.
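
Conceptually, the idea is to corrupt the conditioning context during training so the model learns to tolerate imperfect context, which is what it will actually see at inference time. A minimal sketch (the Gaussian noise schedule here is illustrative, not the paper's exact recipe):

```python
import numpy as np

def noisy_context(context, noise_std=0.1, rng=None):
    """Add Gaussian noise to previously generated units used as context.

    Training only on clean context lets small prediction errors compound
    step by step at inference; noising the context during training
    narrows that train/test gap.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    return context + noise_std * rng.standard_normal(context.shape)
```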

High-Performance Generation: On ImageNet benchmarks, xAR models surpass diffusion transformers such as DiT in both inference speed and generation quality.

Flexible Prediction Units: Supports a range of prediction-unit designs (cells, subsamples, multi-scale prediction, whole images, etc.), making the framework adaptable to different visual generation tasks.

Technical Principles of xAR

Flow Matching: xAR transforms the discrete token classification problem into a continuous entity regression problem based on the flow matching method. Specifically:

The model generates noisy inputs through interpolation and noise injection.

In each autoregressive step, the model predicts the velocity (the direction of flow) from the noise distribution toward the target distribution, progressively refining the generated result.
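
The interpolation and velocity target described above can be sketched in a few lines (framework-agnostic NumPy; the variable names are illustrative, not from xAR's implementation):

```python
import numpy as np

def flow_matching_pair(x1, t, rng=None):
    """Build a noisy input x_t and its velocity regression target.

    x1 : clean target entity (e.g. the latent of one patch cell).
    t  : interpolation time in [0, 1]; t=0 is pure noise, t=1 is data.
    The model is trained to regress velocity = x1 - x0 from x_t,
    turning discrete token classification into continuous regression.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x0 = rng.standard_normal(x1.shape)   # Gaussian noise endpoint
    xt = (1 - t) * x0 + t * x1           # interpolation / noise injection
    velocity = x1 - x0                   # constant "direction flow" target
    return xt, velocity
```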

Inference Strategy: During the inference phase, xAR generates images step by step in an autoregressive manner:

First, it predicts the initial unit (such as an 8x8 image patch) from Gaussian noise.

Based on the already generated units, the model gradually generates the next unit until the entire image is completed.
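The steps above can be combined into a minimal sketch of the autoregressive inference loop (NumPy; `model` is a hypothetical stand-in for the trained velocity network, not xAR's real API):

```python
import numpy as np

def generate(model, num_units, unit_shape, steps=50, rng=None):
    """Generate units one by one, refining each from noise via Euler steps.

    model(context, x, t) -> predicted velocity at time t, conditioned on
    the already-generated units. Each finished unit joins the context
    used to generate the next one, until the whole image is complete.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    units = []
    for _ in range(num_units):
        x = rng.standard_normal(unit_shape)   # start the unit from Gaussian noise
        for s in range(steps):
            v = model(units, x, s / steps)    # velocity given prior units
            x = x + v / steps                 # Euler step from noise toward data
        units.append(x)
    return units
```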

Experimental Results: xAR has achieved significant performance improvements in the ImageNet-256 and ImageNet-512 benchmarks:

The xAR-B model (172 million parameters) achieves 20× faster inference than DiT-XL (675 million parameters) while reaching a Fréchet Inception Distance (FID) of 1.72, outperforming existing diffusion and autoregressive models.

The xAR-H model (1.1 billion parameters) achieved an FID of 1.24 on ImageNet-256, setting a new state of the art, without relying on vision foundation models (such as DINOv2) or advanced guidance interval sampling.

Project Address of xAR

Project Website: https://oliverrensu.github.io/project/xAR/

arXiv Technical Paper: https://arxiv.org/pdf/2502.20388

Application Scenarios of xAR

Artistic Creation: Artists can use xAR to generate creative images as inspiration for artworks or directly for creation. xAR can generate images with rich details and diverse styles, supporting different resolutions and style creation needs.

Virtual Scene Generation: In game development and virtual reality (VR), xAR can quickly generate realistic virtual scenes, including natural landscapes, urban environments, and virtual characters, enhancing user experience.

Old Photo Restoration: By generating high-quality image content, xAR can restore damaged parts of old photos, recovering their original details and colors.

Video Content Generation: xAR can generate specific scenes or objects in videos for video effects production, animation generation, and video editing.

Data Augmentation: By generating diverse images, xAR can expand training datasets, improving the generalization ability and robustness of models.




© Copyright 2025 All Rights Reserved By Neurokit AI.