Machine Learning Research

469 Posts

Chart showing failure modes in multi-agent systems, grouped into failures of specification, coordination, and task verification.

More Robust Multi-Agent Systems: Researchers improve multi-agent systems by studying how they tend to fail

Researchers addressed weaknesses in existing multi-agent frameworks. Their systems achieved scientific and technical breakthroughs.

Grok 4 achieves high benchmark scores in reasoning, coding, and science, outperforming Gemini, Claude, and OpenAI models.

Machine Learning Research

Grok 4 Shows Impressive Smarts, Questionable Behavior: Grok 4 launches with benchmark records and idiosyncratic behavior

xAI updated its Grok vision-language model and published impressive benchmark results. But, like earlier versions, Grok 4 showed questionable behavior right out of the gate.

An automated data-generation pipeline for producing web-agent training data. LLMs generate browser-based tasks, then attempt to execute them, and evaluate the results.

Machine Learning Research

Generated Data for Training Web Agents: Researchers scale up production of training data for web agents

Developing an agent that navigates the web can involve a lot of human effort spent annotating training examples to fine-tune the agent’s LLM component. Scientists automated the production of data that fine-tuned LLMs effectively for web tasks.

Email from an LLM blackmailing a coworker, generated during an experiment that tested LLM behavior under pressure.

Machine Learning Research

Good Models, Bad Choices: Anthropic made LLMs choose between failing and misbehaving, and they blackmailed executives

Top large language models, under experimental conditions that pressed them to choose between abandoning their prompted mission and misbehaving, resorted to harmful behavior, researchers found.

Meta Aria Gen 2 smart glasses for AI research, equipped with cameras, microphones, and other sensors for real-time data capture.

Machine Learning Research

Meta’s Smart Glasses Come Into Focus: Meta reveals further details of Aria Gen 2 smart glasses for multisensory AI research

Meta revealed new details about its latest Aria eyeglasses, which aim to give AI models a streaming, multisensory, human perspective.

Diagram comparing LLM answers with and without hints. Hints may influence LLM output without being mentioned in reasoning traces.

Machine Learning Research

Reasoning for No Reason: Anthropic finds chain-of-thought reasoning traces may omit key influences

Does a reasoning model’s chain of thought explain how it arrived at its output? Researchers found that often it doesn’t.

AI model animation predicting Cyclone Alfred’s path. An ensemble of graph neural networks produces more-accurate 15-day forecasts.

Machine Learning Research

AI Weather Prediction Gains Traction: U.S. working with Google Weather Lab AI to improve storm forecasts

The U.S. government is using AI to predict the paths of hurricanes.

BitNet b1.58 matrix multiplication shows ternary weights enabling faster neural network computation.

Machine Learning Research

Low Precision, High Performance: Researchers at Microsoft and Tsinghua propose a 1.58-bit AI model that rivals full-precision competitors

Reducing the number of bits used to represent each parameter in a neural network from, say, 16 bits to 8 bits shrinks the network’s size and boosts its speed. Researchers took this approach to an extreme: They built a competitive large language model whose weights are limited to three values.
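
As a rough illustration of what limiting weights to three values can look like, here is a minimal Python sketch. It assumes an absmean-style scheme (scale by the mean absolute weight, round, clip to -1, 0, or +1); the blurb above does not specify the method, and the function name ternary_quantize and the sample weights are hypothetical. It also shows where the 1.58 figure comes from: three possible values carry log2(3) bits per weight.

```python
import math

def ternary_quantize(weights):
    """Map each weight to -1, 0, or +1 and return (quantized, scale)."""
    # Assumed scheme: scale by the mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0] * len(weights), 0.0
    # Round the scaled weight, then clip it to the range [-1, 1].
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale

# Three possible values carry log2(3) bits of information per weight.
bits_per_weight = math.log2(3)

# Hypothetical sample weights, for illustration only.
weights = [0.9, -0.05, -1.4, 0.3, 0.02]
quantized, scale = ternary_quantize(weights)

print(quantized)                  # [1, 0, -1, 1, 0]
print(round(bits_per_weight, 2))  # 1.58
```

Multiplying by weights of -1, 0, or +1 reduces to adding, skipping, or subtracting an input, which is one reason ternary weights can speed up computation.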

Biomni AI agent analyzes oncogenic pathways using genomics tools like Scanpy and CellxGene.

Machine Learning Research

A Research Agent for All Biology: Biomni, an AI agent for multidisciplinary biology research

An agent designed for broad biological research could accelerate the work of scientists in specialties from anatomy to zoology.

Apple AI models outperform rivals in instruction accuracy and human text evaluations across devices and servers.

Machine Learning Research

Apple Sharpens Its GenAI Profile: Apple updates its on-device and cloud AI models, introduces a new developer API

Apple revamped two vision-language models in a bid to catch up with fast-moving competitors.

Diagram showing AI pipeline using OCR and LLMs to detect racist clauses in historic California property deeds.

Machine Learning Research

LLM Rights Historical Wrongs: Stanford and Princeton researchers fine-tune a language model to identify racial discrimination in property deeds

In Northern California, old property deeds may still include racial clauses: language, made illegal decades ago, that was designed to ban people of color from owning or living in certain homes.

OpenAI o3-pro outperforms o3 and o1-pro on math, science, and coding benchmarks, but responds much more slowly.

Machine Learning Research

Better Video, Fewer Tokens: STORM processes fewer tokens and still beats GPT-4o on video understanding benchmarks

Researchers reduced the number of tokens needed to represent video frames to be fed to a transformer.

The FLUX.1 Kontext family of image generators from Black Forest Labs edits images to remove or add objects, apply art styles, and extract details.

Machine Learning Research

More Consistent Characters and Styles: Black Forest Labs launches FLUX.1 Kontext for generating and altering images with consistent details

Same character, new background, new action. That’s the focus of the latest text-to-image models from Germany’s Black Forest Labs.

Diagram showing how a language model agent gets misled by malicious posts and sites when searching for Nike shoes online.

Machine Learning Research

Phishing for Agents: Columbia University researchers show how to trick trusting AI agents with poisoned links

Researchers identified a simple way to mislead autonomous agents based on large language models.

Machine Learning Research

More Reasoning for Harder Problems: OpenAI debuts o3-pro, an updated reasoning model that applies more tokens at inference
