Principal Site Reliability Engineer (AI Infrastructure)

Pavati Solutions LLC

Posted today

Job Requirements

Saint Louis, MO; Springfield, VA
Top Secret/SCI; Polygraph not specified
Senior Level Career (10+ yrs experience)
$265,000 - $345,000

Job Description

The Mission

Pavati Solutions bridges the gap between commercial AI innovation and national security. While the core AI platform is already in flight, you will lead the engineering of the Reliability and Observability Ecosystem required to sustain it. We aren't looking for a passive maintainer; we need an architect to own the cloud-native infrastructure that ensures this critical mission remains resilient and performant.
We are looking for a Principal SRE who wants to own the "how" of production excellence. You aren't just joining a team; you are helping lead a movement to bring world-class engineering standards to the mission.

Why This is the "Engineer’s Choice":

Peer Excellence: You won't be an island. You will be teamed up with experienced SRE mentors and architects from the commercial tech industry who bring best practices from the world’s most scalable platforms.
Architectural Sovereignty: While the AI platform is built, you own the SRE Tech Stack. You choose the observability tools, define the automation patterns, and set the "Production Ready" standards.
Coding > Toil: We believe that every manual task is a bug. You are empowered to spend your time engineering software-defined solutions to infrastructure challenges.

What You’ll Lead

The Reliability Standard: Select and implement the tech stack for monitoring and alerting (e.g., Prometheus, Grafana, OpenTelemetry). You define what "healthy" looks like for an AI model in production.
Automated Governance: Build custom tooling and automated remediation in Python, Java, or C++ to ensure the platform scales without increasing headcount.
Cloud Architecture: Optimize high-compute cloud environments to support massive AI inference and training workloads.
Collaborative Leadership: Partner with our AI Developers to ensure the platform is "architected for ops," bridging the gap between a model working in a lab and a model working in the field.

Who You Are

A Systems Architect (12+ years): You have a background in Software Engineering and have successfully scaled cloud-native production environments.
A Lifelong Learner: You value the opportunity to collaborate with SRE veterans from the commercial sector to stay at the bleeding edge of the discipline.
The "Zero-to-One" Thinker: You enjoy the challenge of walking into an environment and building the reliability standards, tooling, and culture from the ground up.

Technical Pillars

AWS Mastery: Deep expertise in AWS, focusing on resilience, performance, and security-first architecture.
OpenShift/Kubernetes Expert: Mastery of OpenShift and the Kubernetes ecosystem, including service meshes, operators, and ingress management.
Software Development: High proficiency in a major backend language (Python, Java, or C++) for building internal platform tooling and automation.
Modern Observability: Expert experience with the Prometheus/Grafana ecosystem and distributed tracing.
Infrastructure as Code: Expert command of Terraform and familiarity with modern cloud automation frameworks.

About Pavati

Pavati is focused on closing the gap between commercial innovation and government adoption. We don’t do "business as usual." We help advanced technologies move quickly and securely into operational use by cutting through the noise and focusing on practical, high-impact outcomes. We work alongside government and academic partners to deliver AI capabilities through iterative, mission-focused development. If you want to spend more time architecting and less time navigating bureaucracy, we want to talk.