Engineering Manager, Accelerator Platform

San Francisco, CA | New York City, NY | Seattle, WA | Full-Time | Manager | Software Engineering


Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

Every time someone talks to Claude -- through the API, claude.ai, our cloud partners, or any of our expanding surfaces -- the request lands on an AI accelerator. Not one kind, many kinds: TPUs, Trainium chips, GPUs. Each arrives with its own software stack, performance characteristics, failure modes, and operational quirks. Someone has to take raw silicon and turn it into a platform that the rest of Anthropic can build on without thinking about which chip is underneath. That's us.
The Accelerator Platform team owns the bring-up and normalization of new hardware platforms for Anthropic's first-party inference fleet. We sit between the low-level systems teams and the serving infrastructure that runs production inference -- bridging the gap so that every new accelerator generation ships as a first-class production platform. It's deeply technical work at the intersection of hardware enablement, distributed systems, and ML infrastructure, and it is directly on the critical path for Anthropic's compute strategy.
We're hiring an Engineering Manager to build and lead this team. You'll inherit a small nucleus of experienced engineers and grow it into a standalone platform organization. You'll set technical direction, hire a strong team, and partner closely with hardware vendors, cloud providers, and teams across Inference to bring new accelerator generations online quickly and reliably.

Build and lead the Accelerator Platform team -- hiring, developing, and retaining engineers who thrive at the hardware/software boundary
Own the end-to-end bring-up lifecycle for new accelerator platforms (multiple generations of Trainium, TPUs, and GPUs), from initial silicon availability through production-ready inference
Define and drive the platform normalization layer -- ensuring new hardware integrates cleanly with Anthropic's inference serving stack to provide a consistent abstraction
Partner with cloud providers (AWS, GCP, Microsoft Azure) and chip vendors on hardware roadmaps, capacity planning, and platform-specific technical challenges
Collaborate closely with teams across Inference and Infrastructure to ensure new platforms meet production reliability and latency requirements from day one
Contribute to Anthropic's multi-cloud compute strategy -- helping the organization maintain optionality across accelerator families and avoid lock-in to any single vendor
Manage the team's priorities across competing demands: new platform bring-up, ongoing production support for existing platforms, and longer-term investments in tooling and automation

Have significant experience managing infrastructure or platform engineering teams (3+ years in engineering management)
Have deep technical fluency in systems programming, distributed systems, or hardware/software co-design -- you need to understand the stack deeply enough to make sound technical and hiring decisions
Have experience bringing up or operating heterogeneous compute infrastructure at scale -- whether that's GPU clusters, TPU pods, custom ASICs, or FPGA deployments
Are comfortable with ambiguity and can build structure where none exists. This team is being carved out as a new entity; you'll be defining its charter, processes, and culture from scratch
Think strategically about hardware roadmaps and can translate vendor capabilities into engineering plans
Build strong cross-functional relationships -- this role requires tight collaboration with hardware vendors, cloud partners, and half a dozen internal teams
Care deeply about both technical excellence and the people doing the work

Have direct experience with ML accelerator architectures (GPU/CUDA, TPU/XLA, Trainium/Neuron, or similar)
Have worked on ML inference serving infrastructure at scale (1000+ accelerators)
Have experience with Kubernetes-based ML workload orchestration
Understand ML-specific networking (RDMA, InfiniBand, NVLink, ICI) and how interconnect topology affects serving performance
Have experience managing vendor relationships and influencing hardware/software roadmaps
Have led teams through rapid growth phases (hiring 5+ engineers in a short timeframe)

We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We're an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills.
The easiest way to understand our research directions is to read our recent research. This research continues many of the directions our team worked on prior to Anthropic, including: GPT-3, Circuit-Based Interpretability, Multimodal Neurons, Scaling Laws, AI & Compute, Concrete Problems in AI Safety, and Learning from Human Preferences.

Company: Anthropic

Location: San Francisco, CA | New York City, NY | Seattle, WA

Type: Full-Time

Level: Manager

Domain: Software Engineering
