Job Description

This is a remote position.

We are looking for a Senior Site Reliability Engineer to support advanced AI platforms responsible for production-grade applications and pipelines. The role focuses on building and maintaining reliability, scalability, and operational excellence across multiple AI-driven systems.

The engineer will work on a central operational layer for monitoring and managing AI workloads, improving system stability, and reducing incidents. This is a hands-on role requiring direct involvement in diagnosing production issues, implementing fixes, and optimising monitoring, alerting, and CI/CD processes.

The position requires close collaboration with engineering teams to improve release quality, standardise telemetry, and ensure stable and predictable system behaviour in a distributed cloud environment.

Responsibilities

Build and maintain a central monitoring and alerting layer for AI applications and pipelines
Define and implement SLIs, alerts, and operational dashboards
Manage incidents including triage, coordination, root cause analysis, and prevention
Standardise telemetry across systems including latency, throughput, and failures
Optimise CI/CD pipelines and introduce quality gates for reliability
Work closely with engineering teams to reduce recurring issues and improve stability

Requirements

Minimum 5 years of experience in SRE, Platform, or Production Engineering
Strong hands-on experience with Kubernetes and production environments
Experience with Azure and Azure DevOps
Experience with monitoring tools such as Datadog
Strong understanding of incident management and root cause analysis
Ability to build practical monitoring and alerting systems

Nice to have

Experience with AI or LLM pipelines
Experience building monitoring platforms across multiple systems
Experience with Grafana
Experience working in large-scale or distributed environments

Expectations

Strong ownership mindset and accountability for system stability
Proactive approach to identifying risks and improvements
Hands-on engineer who works directly with systems rather than only coordinating
Comfortable working in dynamic and evolving environments

Benefits

Competitive salary
Work in a multinational environment on international projects
Comprehensive healthcare
Long-term B2B contract with a stable project pipeline
Work model: fully remote

Site Reliability Engineer (AI)

Madiff