The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.
Rich Sutton, The Bitter Lesson
Much of the progress we’ve seen in models is derived from the principle that capability stems from methods that continually improve as compute and data scale. When we can pour more FLOPs and more tokens into the same recipe and see predictable power‑law improvements in loss, we’re on a scalable path to intelligence. This phenomenon, captured by the scaling laws we’re all familiar with, is what makes the bitter lesson legible.
It’s true that these scaling laws aren’t an exact science or universal constant, but they do reveal stable relationships between core factors. The Chinchilla paper, for example, showed that compute-optimal training typically requires roughly 20x as many training tokens as model parameters (a guiding principle we’ll use later on). Today, nearly every discussion about scaling laws centers on three variables: dataset size, model size, and compute.
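As a quick illustration of that rule of thumb (a sketch only; the 20x coefficient is an approximation and varies with tokenizer and data mix):

```python
# Chinchilla rule of thumb: compute-optimal tokens T ≈ 20 * P (model parameters).
# The 20x ratio is an approximation, not a universal constant.
def chinchilla_optimal_tokens(params: float, ratio: float = 20.0) -> float:
    return ratio * params

for p in (7e9, 70e9):
    print(f"{p / 1e9:.0f}B params -> ~{chinchilla_optimal_tokens(p) / 1e12:.2f}T tokens")
# 7B params -> ~0.14T tokens
# 70B params -> ~1.40T tokens
```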
A few years ago, the main bottleneck in the world of language models was the latter of those variables: compute. You may remember hyperscalers signing multi-year contracts with infra providers in 2023 to secure access to H100 GPUs at $4+/hr because compute was so scarce. Today, the problem of accessing compute is largely solved – prices for H100s have dropped over 40% and companies like Together AI make it easy to rent GPUs on demand at ~$2/hr.
The bottleneck has now shifted to the second variable: model size – not in the sense of setting a higher parameter count, but in scaling model size effectively so that additional parameters and more FLOPs produce consistent capability gains. This depends heavily on talent: today, there is a talent war for researchers who can design architectures and training methods that make larger models efficient, stable, and worth the compute cost.
While compute and model size have both acted as bottlenecks at different times, the third variable – the number of training tokens (dataset size) – has never really been a constraint for language models. The total effective stock of human-generated public text data is on the order of 300 trillion tokens, and while some argue that we’re reaching that limit, data was certainly not a bottleneck when initially bootstrapping these systems.
But, in the world of robotics, the opposite is true: the training tokens needed to bootstrap to a GPT for robotics simply don’t exist at the scale needed. Since the tokens for robotics (action‑conditioned perception sequences) are fundamentally different from those used in language models, they are expensive to collect, difficult to standardize and curate, and heavily skewed away from the long-tail scenarios we actually need to generalize. In other words, robotics isn’t constrained by FLOPs, but rather by real experience. In the physical world, data is the core bottleneck.
The good news is that once robots can actually be deployed, the scale and diversity of data collected could help us push the frontier of intelligence. As Sergey Levine calculated, if every McDonald’s in the US had one robot working for 2 hours a day, you could generate 10,000,000 hours of experience in a single year. Now, multiply that across all the different environments, embodiments, and tasks, and the data bottleneck quickly becomes a thing of the past.
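To sanity-check that figure, here’s a rough sketch (the ~13,500 US locations count is my own assumption for illustration, not part of Sergey’s framing):

```python
# Back-of-the-envelope version of the McDonald's fleet estimate.
us_locations = 13_500     # approximate US McDonald's count; an assumption for illustration
hours_per_day = 2
days_per_year = 365

hours_per_year = us_locations * hours_per_day * days_per_year
print(f"~{hours_per_year / 1e6:.1f}M robot-hours per year")   # ~9.9M, i.e., roughly 10,000,000
```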
So, the most intellectually stimulating question right now is how we get to deployment, and, given how the scaling laws work, that means answering how we solve the data bottleneck. I don’t have a strong opinion on the exact path (though I believe that co-training will be key), but I think it’s an interesting problem to frame and reason through. In this blog, we’ll walk through the scale of real-experience data we need to build a GPT for robotics, the cost of collecting that data, why that cost is prohibitive, how surrogate data can alleviate some of that cost, the drawbacks of surrogate data, and how I’d think about reaching 1.1 trillion tokens.
The Cost of 1.1 Trillion Tokens
Estimating how many tokens we need to build a single, cross-embodiment generalizable robot model involves many assumptions. Using the Chinchilla scaling rule (T ≈ 20P), a 55B-parameter model would be compute-optimal at roughly 1.1T predictive tokens. If we only collect real-world robot experience, use a conservative 8Hz control rate, and budget ~3 predictive tokens per step, then 2 hours/day of useful interaction over 300 days/year yields about 51.84M tokens per robot-year. At that rate, hitting 1.1T tokens would require ~21,200 robot-years (~21 years with 1,000 robots or 4.2 years with 5,000 robots).
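Here’s the arithmetic behind those numbers as a minimal sketch (every rate below is one of the assumptions stated above, not a measurement):

```python
# Back-of-the-envelope token budget for a compute-optimal robotics model.
PARAMS = 55e9                       # assumed model size
TOKENS_NEEDED = 20 * PARAMS         # Chinchilla rule of thumb: T ≈ 20P -> 1.1T tokens

CONTROL_HZ = 8                      # conservative control rate
TOKENS_PER_STEP = 3                 # predictive tokens budgeted per control step
HOURS_PER_DAY = 2                   # useful interaction per robot per day
DAYS_PER_YEAR = 300

tokens_per_robot_year = CONTROL_HZ * TOKENS_PER_STEP * 3600 * HOURS_PER_DAY * DAYS_PER_YEAR
robot_years = TOKENS_NEEDED / tokens_per_robot_year

print(f"{tokens_per_robot_year / 1e6:.2f}M tokens per robot-year")   # 51.84M
print(f"{robot_years:,.0f} robot-years needed")                      # ~21,200
print(f"{robot_years / 1_000:.1f} years with 1,000 robots")          # ~21.2
print(f"{robot_years / 5_000:.1f} years with 5,000 robots")          # ~4.2
```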
So, how much would it actually cost to collect this data in ~4 years with 5,000 robots (accounting for hardware, backups, human supervision, safety, storage, etc)?
Let’s assume a $100K per-unit CAPEX ($550M for 5,000 units plus a 10% backup fleet), annual maintenance at 20% of CAPEX ($110M/yr), human supervision at 1 FTE per 10 robots plus engineering and safety staff (~$150M/yr), facilities and safety/compliance (~$30M/yr), on-robot compute (~$8M/yr), and data storage for 2 h/day per robot (~$25M/yr). Note: these assumptions exclude training-compute costs and are intentionally conservative (e.g., the $100K per-unit CAPEX).
Using these rough assumptions, the cost is somewhere in the region of $2B. That number can scale up and down significantly depending on unit cost, duty-cycle/uptime, operator ratios, etc. But, for the purpose of this article, let’s just assume our $2B number is a somewhat reasonable upper-bound prediction.
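A sketch of that back-of-the-envelope cost model (I’m assuming the ~4.2-year collection window with 5,000 robots from above; every figure is one of the stated assumptions):

```python
# Rough cost model for collecting 1.1T real tokens with a 5,000-robot fleet.
FLEET = 5_000
YEARS = 4.2                                # time to 1.1T tokens at 51.84M tokens/robot-year

capex = 100_000 * FLEET * 1.10             # $100K/unit plus 10% backup units -> $550M

annual_opex = (
    0.20 * capex     # maintenance at 20% of CAPEX -> ~$110M/yr
    + 150e6          # supervision ops + engineering/safety staff
    + 30e6           # facilities, safety, compliance
    + 8e6            # on-robot compute
    + 25e6           # data storage (2 h/day per robot)
)

total = capex + annual_opex * YEARS
print(f"~${total / 1e9:.1f}B")             # ~$1.9B; excludes training compute
```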
Spending that amount isn’t infeasible, but in practice, the challenge isn’t just the CAPEX of the robots. Robust generalization also requires diverse data: many embodiments, object types, environments, and tasks. So, even if someone funded the project, they would likely be left with a large, but narrow, corpus of data. And empirical imitation‑learning scaling results suggest diminishing returns kick in hard when you keep revisiting the same scenes; performance scales as a rough power law in unique environments, not raw episode count. You also have to factor in the complexity of cross-embodiment training – manufacturing, deploying, and maintaining 5,000 robots becomes a lot more demanding if the fleet comprises 20 different embodiments.
All of this is to say that while real-world data collection at this scale may be plausible, once you factor in the cost, time, and uncertainty of achieving robust generalization (even if the model trains successfully), it is clear we need an alternative approach. Put simply, relying solely on real-world data at the scale and diversity needed is impractical.
As an aside, Chris Paxton has a great blog on the cost and time to reach 2T tokens. Our assumptions differ, specifically in the range of predictive tokens per step and operational logistics, but he arrived at 70,000 robot-years for 2T tokens. At 1T tokens and a fleet of 5,000 robots, his estimate comes to 7 years (70,000 / 2 / 5,000), versus my 4.2 years.
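For completeness, here’s that comparison as a sketch (using his 70,000 robot-years per 2T tokens and my 51.84M tokens per robot-year):

```python
# Comparing the two back-of-the-envelope estimates with a 5,000-robot fleet.
FLEET = 5_000

paxton_robot_years = 70_000 * (1e12 / 2e12)   # his rate, scaled to a 1T-token target
my_robot_years = 1.1e12 / 51.84e6             # my rate, at the full 1.1T-token target

print(f"Paxton: ~{paxton_robot_years / FLEET:.1f} years")   # ~7.0
print(f"Mine:   ~{my_robot_years / FLEET:.1f} years")       # ~4.2
```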
Surrogate Data
The impracticality of collecting real-world data is where surrogate data comes in. At its core, surrogate data is cheaper to collect and scalable, but it is only as good as its ability to approximate the target domain. There are a few leading sources of surrogate data today:
Teleoperation: A human controls the robot to complete tasks (like flipping pancakes), creating clean, labeled data as a result. There are startups working to reduce the cost of teleoperation (including a YC-backed company called Sensei).
Learning from video: Robots watch video demonstrations (e.g., from AR devices) and try to extract meaningful patterns.
Simulation: Instead of relying on data collected in the real world, this approach simulates the real world and trains robots in the resulting simulation.
Each type of surrogate data enables us to inflate the token count without increasing CAPEX – you don’t need an additional robot to generate more training tokens. This brings down the time and cost. For instance, if we used simulation and real-world co-training, our 1.1T token cost equation changes to the following:
Let’s assume we budget 1:4 real-to-sim tokens to reflect the lower per-token transfer value of simulation for real-world performance. This means our 1.1T token goal would require 220B real tokens + 880B sim tokens.
At ~51.84M real tokens per robot per year (8 Hz, 3 tokens/step, 2 h/day, 300 days), 220B real tokens require ~4,244 robot-years of collection. For the remaining 880B simulated tokens, we can reasonably assume they can be generated ~10x faster than real data (518.4M tokens per robot-year equivalent), which adds ~1,700 robot-years. This means that under our 1:4 real-to-sim token split, we need to collect ~6,000 robot-years’ worth of data.
With 5,000 robots at that utilization, that takes ~1.2 years, which is ~3.5x quicker than collecting 1.1T real tokens alone (~4.2 years at the same rate). Alternatively, keeping the original ~4.2-year timeline would only require ~1,010 robots (220B / 4.2 / 51.84M ≈ 1,010) instead of 5,000. Using our $100K/unit CAPEX assumption, that saves roughly $400M. I’m excluding sim-compute costs, as they are negligible compared to CAPEX.
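The co-training arithmetic, as a sketch under the 1:4 real-to-sim split and the 10x sim-speed assumption above:

```python
# Co-training budget: 1:4 real-to-sim split of the 1.1T-token target.
TOKENS_NEEDED = 1.1e12
REAL_TOKENS = 0.2 * TOKENS_NEEDED          # 220B
SIM_TOKENS = 0.8 * TOKENS_NEEDED           # 880B

REAL_RATE = 51.84e6                        # real tokens per robot-year (from above)
SIM_RATE = 10 * REAL_RATE                  # assumed: sim generates tokens ~10x faster

real_robot_years = REAL_TOKENS / REAL_RATE     # ~4,244
sim_robot_years = SIM_TOKENS / SIM_RATE        # ~1,700
total_robot_years = real_robot_years + sim_robot_years

print(f"~{total_robot_years:,.0f} robot-years total")               # ~5,900
print(f"~{total_robot_years / 5_000:.1f} years with 5,000 robots")  # ~1.2

# Or keep the ~4.2-year timeline and shrink the real-world fleet instead:
robots_needed = REAL_TOKENS / (4.2 * REAL_RATE)
capex_saved = (5_000 - robots_needed) * 100_000
print(f"~{robots_needed:,.0f} robots needed over 4.2 years")   # ~1,010
print(f"CAPEX saved: ~${capex_saved / 1e6:,.0f}M")             # ≈$400M
```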
Ultimately, this shows how co-training using surrogate data can potentially bend the iso-performance curve: you need far fewer real tokens for the same robustness (in a hopeful scenario).
But, surrogate data isn’t the same as real-world data. We made the assumption above that 1.1T effective tokens of real + sim will be sufficient, but the truth is it may not – in practice, simulation and other surrogate sources often produce weaker generalization results, especially under large domain shifts. As Sergey Levine wrote about recently, surrogate data is simply “The Next Best Thing”. I highly recommend reading his full blog, but to quickly distill the core points:
By searching for an alternative to real data, we create a constraint: “With each domain gap (simulation, videos, etc.), we are constrained to solutions that lie in the intersection of behaviors that actually work with our system, that can be done with our method of choice (e.g., in simulation, or with hand-held grippers) and, crucially, that do not exacerbate the difference between the domains (e.g., revealing that there is no robot holding the gripper, or surfacing a particularly severe simulation/real-world discrepancy).”
But, as models improve, they get better at understanding the difference between the surrogate data domain and the target real-world domain. The higher the model capacity, the smaller the safe overlap between the sim/human domain and the real deployment domain. Researchers try to avoid this by hiding information from the robot, but that underscores the true power of models – “their ability to synthesize complex information sources and extract subtle patterns that are hard for humans to identify manually”.
The result: “What we get is a spork: it can do the job of both a fork and a spoon in a few cases that match our assumptions, but usually it ends up just being a lousy spoon with holes or an ineffective blunt fork. What consistently works best in machine learning is to ensure that the training data matches the test conditions …When we substitute out real data for surrogate data, we are doing the Next Best Thing: a surrogate that matches the real deal under a few specific conditions”.
Sergey makes a lot of great points. But even if we accept that nothing will match the performance of real-world data, the cost of collecting it across diverse environments at scale is also prohibitive. Also, as Sergey mentions in his blog, it’s not that surrogate data won’t be useful; it just isn’t good enough in isolation. Watching videos alone won’t teach you how to play golf; you ultimately learn by taking the shot yourself. That’s not to say that watching videos doesn’t help at all – in fact, there is a strong case to be made that what separates the good from the great in any domain is the ability to effectively learn from observation. Surrogate data can help us do that by transferring useful structure, but we will still need real data to calibrate, close the sim-to-real gap, and validate robustness. This is why surrogate data will play a critical role on the path to 1.1T tokens, but also why it won’t be sufficient on its own.
As a result, the question we posed at the beginning (“how do we get to deployment?”) can now be framed as a more nuanced question: what combination and scale of surrogate and real-world data enables us to bootstrap to deployment-grade policies at the lowest possible cost and time?
I view that as the most intellectually exciting question today. It involves several sub-questions, including factors such as surrogate data fidelity, domain gap, data ratios, and how aggressively you weight and fine-tune with real interaction.
I don’t have an answer to those questions, but if we look at the different research labs, we can infer what avenues they appear excited about. Skild AI just began its public launch, and its blog gave a brief overview of how they think about some of the questions we’ve considered:
“How does one obtain the scale of action data to build a true robotics foundation model? Over the past decade, our team has tackled this challenge head-on. In their previous work, our team members have not only pioneered scalable real-world data collection strategies such as self-supervised robots and imitation learning, but also tried to explore alternatives such as using internet videos and large-scale simulation. Over this past decade, one thing has become crystal clear: scale does not mean million or billion examples, achieving scale requires collecting trillions of examples - and there is no way just real world data can achieve this scale in near future. At Skild AI, we tackle this challenge through large-scale simulation and internet video data to pre-train our omni-bodied brain. We post-train this foundation model using targeted real-world data to deliver working solutions to our customers.”
While Skild AI seems to find value in the large-scale simulation and internet video data, the team at Physical Intelligence appears to value other types of surrogate sources instead. Specifically, their π0.5 model co-trains on sources including web data (image captioning, question answering, and object localization) alongside data of related tasks collected under laboratory conditions (teleop).
Ultimately, the answer to all the questions we’ve discussed is still very much an area of active research. I don’t have any strong views on the best path, but here are a few observations:
Simulation will be used as a coverage engine to cost-effectively inject variation (lighting/gloss, partial failures, sensor/control noise, rare edge cases) and to stress-test policies. It will likely be valuable for locomotion (where dynamics are easier to model) and helpful but more limited for object manipulation (given the complications of modeling the physics of every object).
There’s also room for a human-data pipeline company. First-person, annotated video of real tasks will likely help for pretraining perception; it won’t fundamentally teach robots how to control an object, but can reduce the scale of real data needed.
Having a real-world fleet collecting data will be necessary. I don’t think there’s a way to get around the need for real data, and imitation‑learning scaling results already establish the principle that you need to train in real environments, not just labs. 10 robots across 50 distinct sites is better than 50 robots in a lab setting.
Curation is also just as important as deployment – diverse experiences matter, and large amounts of data in a single domain will not lead to general intelligence. This results in an interesting strategy/execution question: how can robots be deployed in as many different environments as possible, and which environments are most valuable? There’s no winning approach for now. For instance, Physical Intelligence collected 400 hours of data directly with mobile manipulators in a variety of real homes, while Skild AI appears to be exploring security and inspection contexts as potential domains.
In the end, the main goal is to make data a byproduct of work. That’s when the flywheel effect will kick in and data will become yesterday’s problem.