AI Pulse - Wednesday, March 20th 2024
AnimateDiff-Lightning: Cross-Model Diffusion Distillation
Our first paper presents AnimateDiff-Lightning, a lightning-fast model for video generation. The key ideas are:

1. Applying progressive adversarial diffusion distillation to the video modality for the first time. This allows generating high-quality videos in just a few inference steps, much faster than previous methods.
2. Proposing cross-model diffusion distillation to simultaneously distill the motion module on multiple stylized base image models (realistic Stable Diffusion checkpoints, anime models, etc.), improving the distilled module's compatibility across different base models.

The progressive adversarial distillation technique uses discriminators to ensure the distilled student's outputs match the distribution of the teacher model, while allowing some drift for better quality. Cross-model distillation assigns different base models to different GPUs during training to enable efficient multi-model distillation; a rough sketch of one such training step appears below.

The results show AnimateDiff-Lightning generates sharper videos in fewer steps (e.g. 1-4) than previous distillation methods like AnimateLCM. It better preserves the original style of the base models, even for unseen models not used during distillation, and the distilled module remains compatible with fine-control techniques like Motion LoRAs and ControlNet.

The practical payoff is lightning-fast, high-quality video generation and editing for creative uses, considerably speeding up a computationally intensive task. The cross-model distillation approach could also be applied to improve universal pluggable modules in other generative AI domains.
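Below is a minimal, hypothetical PyTorch sketch of one adversarial distillation step under the assumptions above; the tiny stand-in modules, loss terms, and names are illustrative placeholders rather than the authors' implementation. In the cross-model setting, each GPU would hold a different frozen base model while the shared motion module is wrapped in DistributedDataParallel, so gradient averaging across ranks blends the signals from all bases.

```python
# Illustrative sketch only: tiny stand-in modules, not real diffusion networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionModule(nn.Module):
    """Stand-in for the shared motion module (teacher or student)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Judges whether a sample looks like it came from the teacher's distribution."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 1))
    def forward(self, x):
        return self.net(x)

def distill_step(student, teacher, disc, base_model, latents, opt_s, opt_d):
    """One progressive adversarial distillation step on this GPU's base model."""
    with torch.no_grad():
        feats = base_model(latents)   # frozen stylized base image model assigned to this rank
        target = teacher(feats)       # many-step teacher prediction (frozen)
    pred = student(feats)             # few-step student prediction (shared motion module)

    # Discriminator: teacher outputs are "real", student outputs are "fake".
    d_loss = F.softplus(disc(pred.detach())).mean() + F.softplus(-disc(target)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Student: match the teacher, but the adversarial term allows some drift,
    # which is what the paper credits for sharper few-step outputs.
    g_loss = F.mse_loss(pred, target) + F.softplus(-disc(pred)).mean()
    opt_s.zero_grad(); g_loss.backward(); opt_s.step()
    return g_loss.item()

student, teacher, disc = MotionModule(), MotionModule(), Discriminator()
base_model = nn.Linear(64, 64)        # stands in for one frozen stylized base model
for p in list(teacher.parameters()) + list(base_model.parameters()):
    p.requires_grad_(False)
opt_s = torch.optim.Adam(student.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
print(distill_step(student, teacher, disc, base_model, torch.randn(8, 64), opt_s, opt_d))
```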
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Here are the key points about the paper "mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding":

Introduction:
- Proposes unified structure learning across 5 domains (documents, webpages, tables, charts, natural images) for multimodal large language models to enhance OCR-free document understanding.
- Designs a vision-to-text module called H-Reducer that maintains layout information while reducing sequence length for efficient processing of high-resolution images.

Model Architecture:
- Uses a frozen ViT/L-14 visual encoder and a modality-adaptive LLM decoder.
- H-Reducer applies a 1x4 convolution to merge horizontally neighboring visual features, maintaining left-to-right text order while shortening the sequence (a minimal sketch of this merge step follows this summary).

Unified Structure Learning:
- Structure-aware parsing tasks organize text according to layout using special tokens.
- Multi-grained text localization tasks relate text to spatial positions.
- Builds the DocStruct4M dataset with 4M examples across the 5 domains to support this training.

Multi-Task Tuning:
- After structure learning, the model is fine-tuned on downstream document understanding tasks such as VQA, information extraction, NLI, and captioning.

DocOwl 1.5-Chat:
- Constructs DocReason25K, 25K examples with detailed explanations for reasoning.
- DocOwl 1.5-Chat is trained on this plus the downstream datasets to produce better explanations.

Evaluation:
- Achieves new state-of-the-art results on 10 document understanding benchmarks across the 5 domains.
- Improves over similar-sized models by more than 10 points on 5 of the 10 benchmarks.

Key Novelties:
- Unified structure learning for text-rich images across multiple domains.
- H-Reducer vision-to-text module for efficient high-resolution encoding that preserves layout.
- Large DocStruct4M dataset to support structure understanding training.
- DocOwl 1.5-Chat with detailed reasoning ability.

In summary, the paper presents a unified multimodal approach to enhance the document understanding capabilities of large language models through structure-aware pretraining and specialized components like H-Reducer. The qualitative examples demonstrate its abilities on various document understanding tasks.
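For intuition on the H-Reducer merge, here is a hypothetical PyTorch sketch of a 1x4 horizontal-merging module; the feature dimensions, the final projection, and the class name are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of an H-Reducer-style vision-to-text module, assuming a ViT
# feature grid of shape (batch, H, W, C). The 1x4 convolution with stride
# (1, 4) fuses four horizontally adjacent patch features, preserving
# left-to-right reading order while cutting sequence length by 4x.
import torch
import torch.nn as nn

class HReducer(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.merge = nn.Conv2d(vis_dim, vis_dim, kernel_size=(1, 4), stride=(1, 4))
        self.proj = nn.Linear(vis_dim, llm_dim)    # map merged features into LLM embedding space

    def forward(self, grid):                       # grid: (B, H, W, C) ViT patch features
        x = grid.permute(0, 3, 1, 2)               # (B, C, H, W) layout expected by Conv2d
        x = self.merge(x)                          # (B, C, H, W/4): horizontal neighbors fused
        x = x.flatten(2).transpose(1, 2)           # (B, H*W/4, C): row-major, reading order kept
        return self.proj(x)                        # (B, H*W/4, llm_dim) tokens for the decoder

feats = torch.randn(1, 32, 32, 1024)               # e.g. a 32x32 patch grid from a high-res crop
print(HReducer()(feats).shape)                     # torch.Size([1, 256, 4096])
```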
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Here are the key points about the proposed LLMLingua-2 method for task-agnostic prompt compression:

- It formulates prompt compression as a token classification task (preserve or discard each token) to guarantee faithfulness to the original content.
- It uses a Transformer encoder to capture bidirectional context for better compression performance.
- The compressor is trained on a new extractive text compression dataset constructed from MeetingBank by distilling compression knowledge from GPT-4.
- It achieves significant performance gains over strong baselines like Selective-Context and LLMLingua across multiple benchmarks.
- Despite being much smaller than the LLaMA models used in the baselines, it generalizes robustly across different LLMs such as GPT-3.5 and Mistral-7B.
- It compresses prompts 3x-6x faster than existing methods, accelerating end-to-end latency by 1.6x-2.9x at 2x-5x compression ratios.
- The compressed prompts retain enough information for GPT-4 to accurately reconstruct the originals.

The key novelties are the token classification formulation, the new distilled dataset, and the smaller bidirectional encoder, which together deliver high performance and efficiency for task-agnostic prompt compression. A minimal sketch of the token classification compressor follows.
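As a rough illustration of the token classification formulation, here is a hypothetical sketch using Hugging Face transformers. The xlm-roberta-large checkpoint (untuned here), the two-label head, and the 0.5 keep ratio are assumptions for the sketch, not the released compressor.

```python
# Hedged sketch of token-classification prompt compression in the spirit of
# LLMLingua-2: a bidirectional encoder scores each token as preserve/discard,
# and the highest-scoring tokens are kept in their original order.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")            # assumed encoder backbone
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=2)                                    # 0 = discard, 1 = preserve

def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    enc = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits                                      # (1, seq_len, 2)
    keep_prob = logits.softmax(-1)[0, :, 1]                               # P(preserve) per token
    k = max(1, int(keep_ratio * keep_prob.numel()))
    keep_idx = keep_prob.topk(k).indices.sort().values                    # top-k tokens, original order
    kept_ids = enc["input_ids"][0, keep_idx]
    return tokenizer.decode(kept_ids, skip_special_tokens=True)

print(compress("Item 1 of the meeting agenda concerns the approval of last month's minutes."))
```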
TnT-LLM: Text Mining at Scale with Large Language Models
Here are the key points from the paper on the TnT-LLM framework for text mining at scale with large language models:

Introduction:
- Text mining involves extracting useful insights from large text collections by generating taxonomies of labels and classifying text with those labels.
- Existing methods either rely heavily on human experts (expensive and time-consuming) or on unsupervised clustering (hard to interpret).
- TnT-LLM aims to combine the interpretability of manual approaches with the scale of automated clustering by using the strengths of large language models (LLMs).

Method: TnT-LLM has two phases.
1. Taxonomy Generation:
   - Use LLM prompts to iteratively generate and refine a label taxonomy from a corpus sample.
   - The iterative refinement is inspired by stochastic gradient descent optimization.
2. Text Classification:
   - Use an LLM to label a larger corpus sample with the generated taxonomy.
   - Train lightweight classifiers on these LLM-generated "pseudo-labels" (see the sketch after this summary).

Evaluation:
- Uses deterministic metrics, human evaluation, and LLM-based evaluation.
- Evaluates taxonomy quality (coverage, accuracy, use-case relevance).
- Evaluates text classification performance against human-annotated data.

Experiments:
- Applied to user intent detection and conversational domain labeling for Bing's open-domain chat system.
- TnT-LLM generated more accurate and relevant taxonomies than clustering baselines.
- Lightweight classifiers trained on LLM pseudo-labels performed competitively with full LLM classifiers.

Key Findings:
- LLMs can automate interpretable taxonomy generation with minimal human effort.
- LLMs can approximate human judgments for certain evaluation tasks.
- Using LLMs as data annotators enables distilling their knowledge into efficient classifiers.

The paper proposes TnT-LLM as a novel framework that leverages the reasoning and generation capabilities of large language models to perform text mining at scale while maintaining interpretability.
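As a toy illustration of the second phase, here is a hypothetical sketch of distilling LLM pseudo-labels into a lightweight classifier; the example texts, labels, and the TF-IDF plus logistic regression pipeline are assumptions for the sketch, not the paper's exact classifiers.

```python
# Hedged sketch: an LLM has already assigned taxonomy labels ("pseudo-labels")
# to a corpus sample; a cheap classifier is trained on them so the labeling
# can scale to the full corpus without further LLM calls.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (text, LLM-assigned intent label) pairs -- placeholders for real pseudo-labels.
pseudo_labeled = [
    ("how do I export a chart from excel to powerpoint", "how-to"),
    ("compare the battery life of the iphone 15 and pixel 8", "comparison"),
    ("write a short poem about autumn", "creative_writing"),
    ("what year did the berlin wall fall", "fact_lookup"),
]
texts, labels = zip(*pseudo_labeled)

# The lightweight classifier distills the LLM's labeling behavior.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

print(clf.predict(["which laptop has a better screen, the macbook air or the xps 13"]))
```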
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
Next up is a paper proposing a technique to transfer reasoning abilities from large language models (LLMs) to smaller vision-language models (VLMs) like PaLI for chart question answering. The main ideas are:

1. Improve the chart representation in the VLM by continued pre-training on a chart-to-table translation task.
2. Construct a 20x larger dataset by generating synthetic question-answer pairs and rationales, prompting a large LLM like PaLM on the table data and programmatically generating arithmetic examples.
3. Fine-tune the VLM with a multi-task loss to predict both the answer and the rationale, transferring reasoning capabilities from the LLM (a minimal sketch of such a loss appears after this summary).
4. The resulting ChartPaLI-5B model outperforms even 10x larger models on the ChartQA benchmark through this transfer technique.
5. For the remaining arithmetic errors, a simple online refinement uses the LLM with program-of-thought prompting to recompute numeric values in the predicted rationales.

The practical application is stronger reasoning for chart understanding in smaller VLMs suitable for uses like data analysis tools. Novel aspects include the multi-task transfer approach and the simple yet effective arithmetic refinement method.
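To make the multi-task objective concrete, here is a hypothetical sketch mixing an answer loss and a rationale loss; the shapes, the 0.5 mixing weight, and the function name are assumptions for illustration and not the paper's exact formulation.

```python
# Hedged sketch of a multi-task fine-tuning objective: one cross-entropy term
# for the answer tokens and one for the rationale tokens, combined with a weight.
import torch
import torch.nn.functional as F

def multitask_loss(answer_logits, answer_ids, rationale_logits, rationale_ids, alpha=0.5):
    # answer_logits: (B, T_a, V), rationale_logits: (B, T_r, V); targets are token ids.
    ans_loss = F.cross_entropy(answer_logits.flatten(0, 1), answer_ids.flatten())
    rat_loss = F.cross_entropy(rationale_logits.flatten(0, 1), rationale_ids.flatten())
    return alpha * ans_loss + (1 - alpha) * rat_loss

# Toy shapes: batch of 2, 4 answer tokens, 16 rationale tokens, vocab of 100.
B, Ta, Tr, V = 2, 4, 16, 100
loss = multitask_loss(torch.randn(B, Ta, V), torch.randint(V, (B, Ta)),
                      torch.randn(B, Tr, V), torch.randint(V, (B, Tr)))
print(loss.item())
```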