AI Pulse - Thursday, March 14th 2024
Gemma: Open Models Based on Gemini Research and Technology
Our first paper discusses Gemma, a family of open language models from Google DeepMind built on the research and technology behind their Gemini models. Gemma comes in two sizes: a 7-billion-parameter model suited for GPU/TPU deployment, and a 2-billion-parameter model for CPU and on-device applications. Both demonstrate strong performance across benchmarks for language understanding, reasoning, and safety. The novel aspect is the open release of both pretrained and instruction-tuned checkpoints, along with code for inference and model serving. Gemma outperforms similarly sized open models on 11 of the 18 language tasks evaluated, and the authors conducted extensive safety evaluations covering toxicity, bias, and factuality.

The key results: Gemma 7B achieves 64.3% on the challenging MMLU benchmark, outperforming LLaMA-2 13B. On the MBPP coding benchmark it scores 44.4%, beating even some code-specialized models, and on the GSM8K mathematics benchmark it reaches 46.4%, which is very strong compared to alternatives. The authors argue that responsible release of such powerful language models is critical for improving safety, enabling auditing, and furthering innovation in the field, and they discuss the potential risks of open release along with the mitigation strategies employed. Overall, Gemma represents a state-of-the-art open offering aimed at benefiting the wider AI research community.
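Since both checkpoints are openly released, trying Gemma locally is straightforward. Below is a minimal sketch using the Hugging Face transformers library; the checkpoint id `google/gemma-7b-it` and the loading flags are our assumptions (and require accepting the model license), not details from the paper.

```python
# Minimal sketch: load an instruction-tuned Gemma checkpoint and generate text.
# The checkpoint id and loading flags are assumptions, not from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"  # assumed Hugging Face checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs. float32 on supported hardware
    device_map="auto",           # place weights on available GPU(s)/CPU
)

prompt = "Explain why open model releases help with safety auditing."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```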
Simple and Scalable Strategies to Continually Pre-train Large Language Models
Here are the key takeaways and rules of thumb for continually pre-training large language models, based on the findings in this paper.

Main Findings:
- Re-warming and re-decaying the learning rate is necessary for efficiently adapting to new data during continual pre-training. Higher maximum learning rates lead to better adaptation but more forgetting.
- Using a small percentage of replay data from previous datasets (e.g. 5% for a weak shift, 25% for a strong shift) significantly reduces forgetting with little cost to adaptation on the new data.
- The combination of learning rate re-warming, re-decaying, and replay allows continually pre-trained models to match the performance of models trained from scratch on all available data, while using substantially less compute.
- Infinite learning rate schedules, which transition to a high constant rate and then rapidly decay, show promise as an alternative to cosine schedules that avoids re-warming issues.

Rules of Thumb:

Learning Rate Schedule:
- If pre-training used a cosine decay schedule that ended at a small learning rate, re-warm and re-decay the learning rate to improve adaptation to new data.
- Decreasing the maximum learning rate reduces forgetting; increasing it improves adaptation.
- Infinite learning rate schedules transition to a high constant rate to avoid re-warming issues, then rapidly decay at the end to train to convergence. They also avoid committing to a fixed token budget. (Both schedule shapes are sketched below.)

Replay:
- Use around 5% replay as a default value.
- Use more replay (e.g. 25%) for stronger distribution shifts.
- As little as 1% replay can work for very weak distribution shifts.

The paper provides extensive empirical evidence across model scales (405M and 10B parameters) and distribution shifts (a weak English-to-English shift and a stronger English-to-German shift) to support these findings and recommendations for efficiently continually pre-training large language models on new data as it becomes available.
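To make the recipe concrete, here is a minimal Python sketch of the two schedule shapes plus the replay mixing. The function names, phase boundaries, and the exact decay form are illustrative assumptions, not code from the paper.

```python
import math
import random

def cosine_rewarm_schedule(step, total_steps, warmup_steps, max_lr, min_lr):
    """Re-warm linearly to max_lr, then cosine re-decay to min_lr.
    Used when continuing from a checkpoint whose original cosine schedule
    already decayed to a small learning rate."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps  # linear re-warming
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def infinite_schedule(step, warmup_steps, decay_start, decay_steps, max_lr, min_lr):
    """Warm up once, hold a high constant rate, then rapidly decay at the end.
    The constant phase avoids re-warming on each new dataset and commits to
    no fixed token budget; the final decay trains to convergence."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step < decay_start:
        return max_lr  # constant phase, resumable for each new dataset
    progress = min((step - decay_start) / decay_steps, 1.0)
    return max_lr * (min_lr / max_lr) ** progress  # rapid final decay

def sample_example(new_data, old_data, replay_fraction=0.05):
    """Mix ~5% replay examples from previous data into training (assumed helper)."""
    source = old_data if random.random() < replay_fraction else new_data
    return random.choice(source)
```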
VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
Next up is an overview of how this paper can be applied practically and what is novel about it. The key practical application of VLOGGER is generating photorealistic videos of humans talking and moving, including gestures, from just an audio or text input plus a single image of the person. This has numerous potential use cases:
- Content creation for films, games, virtual assistants, or online communication where realistic human animation is needed but motion capture is impractical.
- Educational videos or presentations where an animated instructor avatar narrates the material.
- Low-bandwidth videoconferencing, by sending just audio and generating the video on the receiving end.
- Creative video editing, by modifying speech, expressions, or movements in existing footage.

What makes VLOGGER novel is that, unlike previous work, it does not just animate the face region: it generates video of the full upper body with natural head motions, hand gestures, and body language learned from data. It generalizes to any new person from a single image without needing per-person training examples, and the generated video is temporally coherent and high resolution. VLOGGER introduces several technical innovations compared to prior work (a pipeline sketch follows below):
1) A stochastic motion generation module that maps audio to realistic 3D body/face movement sequences.
2) A diffusion-based video generation architecture with novel spatial and temporal conditioning to render those 3D movements into pixels.
3) Warped image views used as an additional conditioning signal to help preserve the person's identity across frames.
4) A large and diverse dataset called MENTOR, with 3D body and expression labels, used to train the motion and video generation modules on a wide range of people and gestures.

In summary, VLOGGER takes audio-driven human animation well beyond previous face-centric approaches, opening up exciting new applications by generating full-body talking video from just a single image as input.
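To show how the two-stage design fits together, here is a hedged Python sketch of the pipeline. Every class, method, and helper name here is a hypothetical placeholder for the paper's modules; no public API or exact implementation is implied.

```python
import numpy as np

class MotionGenerator:
    """Stage 1 (hypothetical): stochastic audio-to-3D-motion model."""
    def sample_motion(self, audio_features: np.ndarray) -> np.ndarray:
        # Maps audio frames to 3D face/body pose parameters; sampling is
        # stochastic, so repeated calls yield different plausible gestures.
        raise NotImplementedError

class VideoDiffusion:
    """Stage 2 (hypothetical): temporally conditioned diffusion renderer."""
    def render(self, reference_image, motion_params, warped_views):
        # Conditions on the predicted 3D motion (spatial + temporal control)
        # and on warped views of the reference image to preserve identity.
        raise NotImplementedError

def warp_reference(image, motion_params):
    """Assumed helper: warp the single reference image toward each pose,
    giving the renderer a per-frame identity-preserving cue."""
    return image  # placeholder for the actual dense warping

def vlogger_pipeline(reference_image, audio_features,
                     motion_model: MotionGenerator, renderer: VideoDiffusion):
    """End-to-end sketch: single image + audio -> full-body talking video."""
    motion = motion_model.sample_motion(audio_features)      # stage 1
    warped = warp_reference(reference_image, motion)         # identity signal
    return renderer.render(reference_image, motion, warped)  # stage 2
```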
Scaling Up Dynamic Human-Scene Interaction Modeling
Next up is a summary of how this paper's research can be applied practically, what is novel about it, and its most interesting results. This work makes several important practical contributions to modeling human-scene interactions (HSIs).

Novel Aspects:
- It introduces TRUMANS, the largest and most comprehensive motion capture dataset of HSIs to date, spanning over 15 hours of data across 100 indoor scenes. The dataset captures precise whole-body motions along with part-level object dynamics, with a focus on realistic contact interactions.
- It proposes a novel diffusion-based autoregressive model that generates HSI motions of arbitrary length, conditioned on the 3D scene context and frame-wise action labels. This allows real-time, controllable synthesis of human motions that adhere to the scene geometry and the specified actions (a sketch of the generation loop follows below).

Practical Applications:
- TRUMANS provides an extensive, high-fidelity dataset that can drive advances across tasks such as human pose estimation, motion synthesis, scene understanding, and human-object interaction modeling.
- The proposed motion synthesis method can generate realistic human animations of any duration for virtual characters in 3D environments, with intuitive control over their navigation and fine-grained interactions. This has applications in areas like computer animation, robotics, and virtual/augmented reality.

Interesting Results:
- Experiments demonstrate that the generated motions closely approximate the quality of the original motion capture data, outperforming existing baselines.
- The method exhibits remarkable zero-shot generalization, producing plausible motions even in novel 3D scenes unseen during training.
- The supplementary video showcases the controllability and physical plausibility of the synthesized motions, with characters seamlessly navigating cluttered environments, interacting with objects, and responding to changing action instructions on the fly.

In summary, this work provides a valuable dataset and computational model that significantly advance the state of the art in synthesizing realistic, controllable human-scene interactions for deployment in practical applications.
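As a rough illustration of how an autoregressive diffusion model can produce motion of arbitrary length, here is a minimal Python sketch. The window and overlap sizes, the pose dimensionality, and the `denoise` call are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def generate_hsi_motion(diffusion_model, scene_voxels, action_labels,
                        total_frames, window=32, overlap=8):
    """Autoregressively stitch fixed-size diffusion windows into a motion of
    arbitrary length (assumed scheme; details differ from the paper).

    diffusion_model.denoise(noise, scene, actions, prefix) is a hypothetical
    call that denoises one window of poses conditioned on the scene geometry,
    per-frame action labels, and overlapping frames from the previous window.
    """
    motion = []    # accumulated per-frame pose vectors
    prefix = None  # overlap frames carried between windows for continuity
    while len(motion) < total_frames:
        start = len(motion) - (0 if prefix is None else overlap)
        actions = action_labels[start:start + window]  # frame-wise control
        noise = np.random.randn(window, 63)            # e.g. 21 joints x 3 (assumed)
        segment = diffusion_model.denoise(noise, scene_voxels, actions, prefix)
        # Keep only the new frames; the first `overlap` frames repeat the prefix.
        new = segment if prefix is None else segment[overlap:]
        motion.extend(new)
        prefix = segment[-overlap:]  # seed the next window for smooth transitions
    return np.array(motion[:total_frames])
```

The overlap is what lets each window attend to the tail of the previous one, so the stitched motion stays temporally coherent and can also react to action labels that change mid-sequence.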
SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents
Here are the key points about the paper SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents.

Introduction:
- The paper proposes SOTOPIA-π, an interactive learning method that improves the social intelligence of language agents through social interactions.
- It combines behavior cloning from an expert model (GPT-4) with self-reinforcement learning on the agent's own positive behaviors, as rated by GPT-4.

Method:
- SOTOPIA-π has three steps: 1) generating diverse social tasks using GPT-4, 2) collecting multi-turn conversation data between agent pairs, and 3) updating the agent policy via behavior cloning and/or self-reinforcement on GPT-4-rated positive examples.
- For self-reinforcement, it filters the training data based on GPT-4 ratings of the goal-completion dimension (a filtering sketch follows below).

Experiments:
- SOTOPIA-π improves the social goal-completion ability of a 7B LLM to approach GPT-4 level, according to GPT-4-based evaluation.
- However, the gap between GPT-4 and human evaluation widens after training, indicating limitations of using LLMs as evaluators.
- Training also improves safety, reduces toxicity, and preserves general question-answering ability.

Limitations:
- Using LLMs like GPT-4 for evaluation may introduce biases.
- Safety is studied as only one aspect of social alignment.
- The agents may inherit social biases from the GPT-4-powered environment.

In summary, SOTOPIA-π demonstrates an interactive learning approach to improving an LLM's social intelligence while highlighting the need for more robust evaluation beyond LLM ratings alone.
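The self-reinforcement step reduces to a filter-then-fine-tune loop. Below is a minimal Python sketch of the data-filtering logic; the rating helper, the 0-10 score scale, and the cutoff value are assumptions for illustration, not the paper's exact settings.

```python
# Sketch of SOTOPIA-pi-style self-reinforcement data filtering. The rating
# helper, score scale, and cutoff are illustrative assumptions.
from typing import Callable

def rate_episode(conversation: list[str]) -> float:
    """Assumed stand-in for a GPT-4 call that scores how well the agent
    completed its social goal in this conversation (higher is better)."""
    raise NotImplementedError  # would call the GPT-4 API in practice

def build_self_reinforcement_set(
    episodes: list[list[str]],
    rate: Callable[[list[str]], float] = rate_episode,
    cutoff: float = 7.0,  # assumed threshold on an assumed 0-10 scale
) -> list[list[str]]:
    """Keep only the agent's own conversations with high goal-completion
    ratings; these become the fine-tuning data for the next policy update."""
    scored = [(rate(ep), ep) for ep in episodes]
    return [ep for score, ep in scored if score >= cutoff]

# The kept episodes are then formatted as supervised fine-tuning examples
# (context -> agent turn) and used to update the 7B policy, alongside or
# instead of behavior cloning on expert (GPT-4) conversations.
```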