- Full Time
- France
Closing Date: Feb 10, 2026
Mistral AI
Frontier AI — empowering developers and enterprises with open and powerful generative AI
Job Description
This is a full-time, hybrid role embedded directly within the Product team, focused on building, measuring, and safely operating production-grade AI systems. The role centers on improving the quality, reliability, safety, and performance of large language model–powered products through strong evaluation, experimentation, and observability practices.
You will be responsible for designing and maintaining a comprehensive LLM evaluation framework, including reference tests, heuristic checks, and model-graded evaluations. A core part of the role is defining and tracking meaningful metrics such as task success, helpfulness, hallucination proxies, safety signals, latency, and cost. These metrics will guide decision-making around model releases and product improvements.
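For illustration only, a minimal sketch of what such an evaluation harness could look like in Python. Every name here (EvalCase, run_eval, judge_fn) is hypothetical, not an existing Mistral API; it simply shows the three layers the posting names: reference tests, heuristic checks, and model-graded evaluations.

```python
# Minimal sketch of an LLM evaluation harness: reference tests,
# heuristic checks, and a hook for a model-graded judge.
# All names (EvalCase, run_eval, judge_fn) are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class EvalCase:
    prompt: str
    reference: Optional[str] = None  # expected answer, if one exists
    must_include: list[str] = field(default_factory=list)  # heuristic check

def run_eval(cases: list[EvalCase],
             model_fn: Callable[[str], str],
             judge_fn: Optional[Callable[[str, str], float]] = None) -> dict:
    """Run every case through the model and score it three ways."""
    exact, heuristic, judged = 0, 0, []
    for case in cases:
        output = model_fn(case.prompt)
        if case.reference is not None and output.strip() == case.reference.strip():
            exact += 1  # reference (exact-match) test
        if all(s.lower() in output.lower() for s in case.must_include):
            heuristic += 1  # heuristic substring check
        if judge_fn is not None:  # model-graded evaluation, e.g. a 0-1 rubric
            judged.append(judge_fn(case.prompt, output))
    n = len(cases)
    return {
        "exact_match_rate": exact / n,
        "heuristic_pass_rate": heuristic / n,
        "judge_mean": sum(judged) / len(judged) if judged else None,
    }
```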
You will run A/B tests across prompts, models, and system configurations, analyze results, and make data-driven recommendations on rollout, iteration, or rollback. You will also build end-to-end observability for LLM usage, implementing structured logging, tracing, dashboards, and alerts to ensure visibility into system behavior in production.
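As a sketch of the experimentation and observability side, deterministic hash-based bucketing is one common way to assign users to prompt or model variants so assignments stay stable across sessions. The event schema and field names below are assumptions for illustration, not a description of Mistral's internal tooling.

```python
# Sketch: deterministic A/B variant assignment plus one structured log
# event per request for dashboards and alerts. Field names are illustrative.
import hashlib
import json
import time

def assign_variant(user_id: str, experiment: str, variants: list[str]) -> str:
    """Hash (experiment, user) so a user's assignment never flips mid-test."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def log_llm_event(user_id: str, experiment: str, variant: str,
                  latency_ms: float, cost_usd: float, success: bool) -> None:
    """Emit one structured JSON line per request for tracing and alerting."""
    print(json.dumps({
        "ts": time.time(),
        "user_id": user_id,
        "experiment": experiment,
        "variant": variant,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "success": success,
    }))

variant = assign_variant("user-123", "system-prompt-v2", ["control", "treatment"])
log_llm_event("user-123", "system-prompt-v2", variant, 412.0, 0.0031, True)
```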
Operating the model release process is a key responsibility. This includes managing canary and shadow deployments, defining sign-off criteria, enforcing service-level objectives, detecting regressions, and leading safe rollbacks when needed. You will work closely with research and science teams to diagnose issues, run post-mortems, and ship measurable improvements across quality, latency, safety, and reliability.
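A simplified sketch of what an automated canary sign-off check in this spirit might look like. The metric names and thresholds are assumptions chosen for the example, not stated release policy.

```python
# Sketch: compare canary metrics against the stable baseline and decide
# whether to promote or roll back. Thresholds here are illustrative only.
from dataclasses import dataclass

@dataclass
class ReleaseMetrics:
    error_rate: float        # fraction of failed requests
    p95_latency_ms: float
    safety_flag_rate: float  # fraction of responses flagged by safety filters

def canary_verdict(baseline: ReleaseMetrics, canary: ReleaseMetrics,
                   max_error_delta: float = 0.005,
                   max_latency_ratio: float = 1.10,
                   max_safety_delta: float = 0.002) -> str:
    """Return 'promote' only if the canary stays within every SLO guardrail."""
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback: error-rate regression"
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback: latency regression"
    if canary.safety_flag_rate - baseline.safety_flag_rate > max_safety_delta:
        return "rollback: safety regression"
    return "promote"

print(canary_verdict(ReleaseMetrics(0.010, 800, 0.001),
                     ReleaseMetrics(0.011, 820, 0.001)))  # -> promote
```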
Beyond infrastructure and metrics, you will help improve core AI behaviors such as memory write and retrieval policies, intent classification, follow-up handling, routing logic, and tool-call reliability. You will document best practices and create reusable templates so other teams can confidently author evaluations and ship changes safely.
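On the tool-call reliability theme, one widely used pattern is validating a model's tool-call arguments against a schema and retrying with error feedback when the output is malformed. In this sketch, call_model is a hypothetical stand-in for a real model client, and the tool schema is invented for the example.

```python
# Sketch: validate tool-call arguments emitted by a model and retry on
# malformed output. call_model is a hypothetical stand-in for a real client.
import json

REQUIRED_ARGS = {"get_weather": {"city"}}  # illustrative tool schema

def parse_tool_call(raw: str) -> dict:
    """Raise unless raw is a well-formed, schema-valid tool call."""
    call = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    name, args = call["name"], call["arguments"]
    missing = REQUIRED_ARGS[name] - set(args)
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return call

def robust_tool_call(call_model, prompt: str, max_retries: int = 2) -> dict:
    """Retry the model with error feedback until the tool call validates."""
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            return parse_tool_call(raw)
        except (ValueError, KeyError) as err:
            # Feed the failure reason back so the model can self-correct.
            prompt += f"\nYour previous tool call was invalid ({err}). Try again."
    raise RuntimeError("model never produced a valid tool call")
```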
The role requires strong programming skills in TypeScript or Python and hands-on experience with production LLM systems, including prompt design, tool or function calling, and system prompts. You should be comfortable designing experiments, interpreting data, and iterating quickly with a product-focused mindset. Clear communication, autonomy, and the ability to collaborate effectively across product, engineering, and research teams are essential.
Experience with AI safety systems such as moderation, guardrails, or sensitive data handling is a strong plus, as is familiarity with release operations like canarying, shadow traffic, automated rollbacks, and experimentation platforms.
The position is primarily based in Paris, France, with a strong preference for in-person collaboration. Candidates located in other eligible European countries may be considered in specific cases, with regular visits to the Paris office required. This is a full-time role offering competitive compensation, equity, benefits, and the opportunity to work at the forefront of AI product development in a fast-moving, research-driven environment.
To apply for this job, please visit jobs.lever.co.
