Reinforcement learning agent controls DIII-D plasma shape through sensor failures

May 20, 2026

Interior of the DIII-D tokamak vacuum vessel, with a technician in orange standing inside for scale.

Inside DIII-D – the tokamak where a reinforcement learning agent took direct command of the coils

(Image courtesy of General Atomics)

A reinforcement learning agent has taken direct command of the magnetic coils on the DIII-D tokamak, tracking changing plasma shape targets while tolerating random sensor failures, without falling back on a backup controller. The work was published as a preprint by Next Step Fusion, a Luxembourg-based company specialising in AI-driven fusion plasma control, and UC San Diego, and has not yet been peer reviewed. It replaces the conventional two-stage pipeline with a single learned policy deployed live at 4 kHz on the DIII-D plasma control system (PCS).

The two-stage pipeline and where it breaks

Conventional plasma shape control on DIII-D runs in two stages. Real-Time EFIT (RTEFIT) reconstructs the plasma equilibrium from magnetic diagnostics in real time, inferring the boundary from coil currents, magnetic probes, and flux loops. A linear multi-input multi-output controller then drives the poloidal field coils against that reconstruction. Next Step Fusion had previously demonstrated that an RL agent can replace this pipeline for static shape targets, in peer-reviewed work published in Nuclear Fusion in January 2026. The new paper tackles the two problems the earlier work left open: shape targets that change during a shot, and sensors that fail without warning.

Magnetic probes and flux loops degrade or fail between shots through hardware faults, calibration drift, and deliberate exclusion. Classical control pipelines were designed for a full sensor set and require manual weight updates to handle each new failure pattern. A controller that operates across arbitrary sensor subsets without that retuning step removes a real operational burden.

The pipeline assumes every sensor in the set is live and behaving. A diagnostic that drops out forces operators onto backup logic, and shape fidelity degrades in the gap. That matters because plasma boundary geometry affects energy confinement, how heat loads distribute on plasma-facing components, and plasma stability.

The Next Step Fusion and UC San Diego team collapses both stages into a single policy. It maps raw diagnostics directly to chopper commands for the DIII-D power supplies, using the same actuator representation as real operations, and is trained to track a shape goal that changes on the fly. Training runs in NSFsim, a free boundary Grad-Shafranov solver with coupled transport equations, configured for DIII-D and modelling the full chopper power system. The agent ingests a 146-dimensional input vector covering 71 magnetic probes, 43 flux loops, 20 poloidal field coil currents, plasma current, and an 11-parameter shape goal.
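The input vector's size follows directly from the channel counts quoted above. As a minimal sketch, assuming the groups are simply concatenated in the order listed (the ordering, the group names, and the helper `build_observation` are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Channel counts quoted in the article. The ordering below is an assumption
# made for illustration; the paper's actual layout may differ.
LAYOUT = {
    "magnetic_probes": 71,
    "flux_loops": 43,
    "pf_coil_currents": 20,
    "plasma_current": 1,
    "shape_goal": 11,
}
OBS_DIM = sum(LAYOUT.values())  # 71 + 43 + 20 + 1 + 11 = 146


def build_observation(signals):
    """Concatenate named signal groups into one flat observation vector."""
    parts = []
    for name, size in LAYOUT.items():
        part = np.atleast_1d(np.asarray(signals[name], dtype=np.float32))
        if part.shape != (size,):
            raise ValueError(f"{name}: expected shape ({size},), got {part.shape}")
        parts.append(part)
    return np.concatenate(parts)
```

The shape check is there because a silently mis-sized diagnostic group would shift every later channel, which is exactly the kind of input corruption a learned policy cannot flag on its own.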

Training the policy to expect sensor loss

The dataset is built around the conditions that break the classical pipeline: shifting targets and missing diagnostics. The team curated 120 Lower Single Null (LSN) shapes from over 329,000 EFIT equilibria covering DIII-D shots between 2014 and 2020. A greedy diversity rule added each new shape only when its mean pivot point distance to the most recently added shape exceeded 8 cm. Shape targets are then resampled every 0.25 seconds as random step changes, which exposes the agent to transitions across the full envelope rather than trajectories near a few reference equilibria. Under a one-million-step training budget, the agent encountered roughly 4,000 unique start-target pairs out of 14,400 (120 × 120) possible combinations.
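The selection rule and the step-change schedule can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: each shape is taken to be an array of pivot points in metres, the comparison is made against the last accepted shape as the article's wording suggests (a stricter variant would compare against every accepted shape), and the function names and the uniform target sampling are hypothetical.

```python
import numpy as np


def mean_pivot_distance(shape_a, shape_b):
    """Mean Euclidean distance between matching pivot points, in metres."""
    return float(np.mean(np.linalg.norm(shape_a - shape_b, axis=1)))


def greedy_select(shapes, threshold_m=0.08):
    """Keep a shape only if it is more than threshold_m from the last kept one."""
    selected = []
    for i, shape in enumerate(shapes):
        if not selected or mean_pivot_distance(shape, shapes[selected[-1]]) > threshold_m:
            selected.append(i)
    return selected


def step_target_schedule(n_shapes, episode_s, hold_s=0.25, seed=0):
    """Draw one target shape index per hold interval (uniform sampling assumed)."""
    rng = np.random.default_rng(seed)
    n_segments = int(np.ceil(episode_s / hold_s))
    return rng.integers(0, n_shapes, size=n_segments)
```

With 120 curated shapes, every ordered start-target pair is one of 120 × 120 = 14,400 combinations, which is the denominator quoted above.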

Diagnostic dropout is baked in. At the start of each episode, each of the 114 maskable channels (the 71 magnetic probes and 43 flux loops) is independently zeroed with probability 0.3, so roughly 30% of them are missing in any given episode. The agent gets no flag telling it which channels are gone. A single policy emerges that operates under arbitrary sensor availability, generalising past the fixed-mask assumption that pushes the classical pipeline onto backup logic.
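A minimal sketch of that per-episode masking, assuming the maskable channels occupy the first 114 entries of the observation vector (an illustrative layout, not confirmed by the paper) and that the mask is drawn once and held for the whole episode; the function names are hypothetical:

```python
import numpy as np

N_MASKABLE = 114  # 71 magnetic probes + 43 flux loops


def sample_dropout_mask(rng, p_drop=0.3, n_channels=N_MASKABLE):
    """Return a boolean array: True where a channel stays live this episode."""
    return rng.random(n_channels) >= p_drop


def apply_mask(obs, keep):
    """Zero the dropped channels; no mask flag is appended to the observation."""
    out = np.array(obs, dtype=np.float32, copy=True)
    out[: keep.size][~keep] = 0.0  # assumes maskable channels come first
    return out


rng = np.random.default_rng(0)
keep = sample_dropout_mask(rng)        # drawn once at episode start
masked = apply_mask(np.ones(146), keep)
```

Because each channel is dropped independently, the number of missing sensors varies from episode to episode around 30% rather than being fixed at exactly 34 of 114.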

The dropout rate is a tunable hyperparameter, and the team swept it. Evaluated in NSFsim against the fixed DIII-D disabled-sensor mask, the p=0.3 agent hits a mean shape error of 4.1 cm and an x-point error of 2.6 cm with low per-shape variance. An oracle trained on a fixed mask and evaluated on the same mask reaches 3.4 cm mean shape error. The 0.7 cm gap is the cost of generalising to arbitrary sensor subsets rather than one specific subset. A p=0.1 agent, trained on a lower failure rate than real conditions, reaches 5.4 cm and degrades when faced with the real mask. On a held-out base shape evaluated with no dropout, the agent records 2.01 cm mean shape error and 1.69 cm x-point error.

Two architectural choices carry the performance. An asymmetric actor-critic gives the critic privileged equilibrium information during training, which sharpens value estimates under partial observation. An auxiliary shape reconstruction head on the actor recovers the boundary from raw diagnostics end to end. Removing that auxiliary loss in ablation raises mean shape error from 4.0 to 4.8 cm and lifts episode length standard deviation from 0.7 to 21.0 steps, evidence of frequent early terminations. In the same ablation set, TQC beats SAC on reward, 0.415 versus 0.362, with x-point error standard deviation of 1.8 cm against 16.7 cm for SAC.
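The auxiliary head can be pictured as a second output on a shared actor trunk, with its reconstruction error added to the policy loss. The sketch below is a toy forward pass only, not the paper's network: the layer sizes, the action and reconstruction dimensions, the weight `aux_weight`, and the class and function names are all assumptions for illustration, and the privileged critic is omitted.

```python
import numpy as np


class TwoHeadActor:
    """Toy actor with a shared trunk, an action head, and a reconstruction head."""

    def __init__(self, obs_dim, act_dim, recon_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w_trunk = rng.normal(0.0, 0.1, (obs_dim, hidden))
        self.w_act = rng.normal(0.0, 0.1, (hidden, act_dim))
        self.w_recon = rng.normal(0.0, 0.1, (hidden, recon_dim))

    def forward(self, obs):
        h = np.tanh(obs @ self.w_trunk)   # shared features from raw diagnostics
        action = np.tanh(h @ self.w_act)  # bounded actuator command
        recon = h @ self.w_recon          # predicted boundary quantities
        return action, recon


def actor_loss(policy_loss, recon, recon_target, aux_weight=1.0):
    """Add a weighted mean-squared reconstruction error to the policy loss."""
    aux = float(np.mean((recon - recon_target) ** 2))
    return policy_loss + aux_weight * aux, aux
```

The design intuition is that the reconstruction target forces the shared trunk to encode where the boundary is, even when some diagnostics are missing, which is consistent with the ablation result reported above.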

Four-panel figure showing two DIII-D plasma shape maneuvers. Top row: poloidal cross-sections and tracking plot for Discharge 205580, with the x-point radial position stepping from 1.36 m to 1.31 m and the RL agent following closely. Bottom row: cross-sections and centroid tracking for Discharges 205576 and 205580 overlaid, showing a maintained 2.5 cm separation between the two plasma centroids throughout.

The RL agent moves plasma on DIII-D: two maneuvers, two discharges, Lower Single Null held throughout
(Image courtesy of Next Step Fusion &

Two maneuvers, two discharges: what the hardware tests show

The hardware tests are limited in scope but directly probe the two core claims, dynamic tracking and robustness under imperfect sensing. The policy was deployed at 4 kHz on the DIII-D PCS and ran two maneuvers. In Discharge 205580 it executed an x-point radial sweep, with Rx commanded from 1.36 m to 1.31 m. Across Discharges 205580 and 205576 it shifted the plasma centroid, Rc 1.685 m versus 1.660 m, a 2.5 cm difference. Both discharges held LSN throughout.

The auxiliary reconstruction head, evaluated on the physical shots rather than simulation, delivered 1.21 cm mean pivot point error on 205580 and 1.43 cm on 205576. Both sit within typical EFIT reconstruction uncertainty. The same policy transferred to the independent GSevolve simulator without retraining.

The paper reports an elevated x-point error discrepancy in one discharge. The authors attribute this to a possible systematic offset in raw DIII-D magnetic readings that EFIT has been calibrated to absorb, which would shift the RL policy’s inputs out of the training distribution. They note the gap did not affect overall boundary control, and both discharges remained in LSN throughout.

The paper is direct on the headline comparison. The classical isoflux controller achieves lower steady-state shape error than the RL agent in GSevolve at the tested operating points. The isoflux baseline was tuned specifically for those conditions, while the RL policy was trained to span the full shape envelope and survive sensor loss. The trade-off is stated openly. A further hardware factor shaped performance at boundary configurations: the DIII-D patch panel routes multiple coils through shared supply circuits, reducing actuator degrees of freedom and making the rightmost x-point the hardest target to reach. A spatial analysis of sensor importance shows the highest-weighted channels cluster near the 8 target pivot points and the inner limiter wall, with Spearman rank correlation exceeding 0.96 across all four dropout-rate policies. The structure reflects task geometry rather than training noise.
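Spearman rank correlation compares the ordering of two lists rather than their raw values, so a score above 0.96 means the policies rank the sensors in nearly the same order of importance. A minimal NumPy sketch of the statistic follows; the importance scores themselves are not reproduced here, and the function names are illustrative:

```python
import numpy as np


def rank_average(x):
    """Rank values from 1..n, giving tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tie = x == value
        ranks[tie] = ranks[tie].mean()
    return ranks


def spearman(a, b):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(rank_average(a), rank_average(b))[0, 1])
```

Swapping only the two top-ranked items in a five-item list, for example, gives a correlation of 0.9, which gives a sense of how closely the orderings must agree to exceed 0.96 across 114 channels.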

Two maneuvers across two discharges is a constrained demonstration. The authors name two limitations of the present work: reliance on a single tokamak geometry, and mean substitution for absent sensors at inference time.

Where this goes next

Stated future work covers multi-machine transfer, adaptive dropout scheduling, and dynamic mid-shot masking that would model sensor failures occurring during a discharge rather than at episode start. Multi-machine transfer is the one to watch. Next-step devices will not always be able to engineer the diagnostic redundancy that DIII-D enjoys, and a policy that already handles sensor loss in training is a more natural fit for those machines than a classical pipeline designed around a fixed sensor set. Whether the approach generalises across geometries is the question Next Step Fusion has now set itself up to answer.