Anthropic
Building safe, reliable AI for humanity’s future.
Job Description
Anthropic is seeking talented Reliability Engineers (Software or Systems) to define and achieve reliability metrics across its internal and external AI systems. This team will improve service reliability, build observability, and reengineer workflows using AI to enhance how Anthropic operates.
Responsibilities:
- Define Service Level Objectives (SLOs) for LLM serving and training, balancing latency, availability, and development velocity.
- Design and implement monitoring systems for latency, availability, and other metrics.
- Build high-availability model serving infrastructure across regions and cloud providers.
- Manage failover and recovery mechanisms for AI deployments.
- Lead incident response and drive continuous improvement.
- Optimize infrastructure costs and GPU/TPU/accelerator utilization across workloads.
- Collaborate with ML engineers and infrastructure teams on reliability, performance, and deployment strategies.
Qualifications / You May Be a Good Fit If You:
- Have extensive experience with distributed systems observability and monitoring at scale.
- Have a deep understanding of AI infrastructure challenges (model serving, training pipelines).
- Have experience implementing SLO/SLA frameworks for mission-critical services.
- Are comfortable working with both classical metrics (availability, latency) and AI metrics (model reliability, convergence).
- Have experience in chaos engineering, resilience testing, and bridging ML and infrastructure teams.
- Preferred: experience with clusters of more than 1,000 GPUs, accelerators (GPUs, TPUs, Trainium), ML networking (RDMA / InfiniBand), and contributions to open-source ML tooling.
- Have strong communication and collaboration skills.
Salary Range: €235,000 – €355,000 per year
Other Details:
- Education: Bachelor’s degree or equivalent experience required.
- Hybrid policy: ~25% in office; more as the role demands.
- Visa Sponsorship: Anthropic sponsors visas depending on the role.
