AI Cloud Platform Site Reliability Engineer

Booz Allen Hamilton

Posted today

Job Requirements

Remote
Secret Polygraph Unspecified
Career Level not specified
$99,000 - $225,000

Job Description

Job Number: R0238292

AI Cloud Platform Site Reliability Engineer

The Opportunity:

Mission users are increasingly relying on agentic AI systems to support complex workflows, accelerate analysis, and improve decision advantage. Unlike traditional software systems, agentic AI platforms introduce operational complexity across model invocations, workflow orchestration, tool integrations, retrieval and knowledge layers, safety controls, and probabilistic outputs. As an AI Platform Site Reliability Engineer (SRE), you'll help ensure the availability, resiliency, observability, and operational integrity of an AWS GovCloud-based agentic AI platform supporting national defense missions.

In this role, you'll serve as the reliability owner for production AI operations. You'll work cross-functionally with stakeholders across cloud engineering, platform engineering, AI agent development, MLOps, data science, and customer knowledge teams to operationalize their work in production through monitoring, alerting, Service Level Indicator (SLI) and Service Level Objective (SLO) management, incident response, ticket triage, change control, and automation. You won't be duplicating model development, data science, or cloud platform build responsibilities. Instead, you'll ensure that the system, its agents, and their supporting services remain healthy, traceable, performant, and supportable in mission environments.

You'll define and monitor operational health signals across agent workflows, model latency, session and task success, knowledge-base ingestion health, tool and API dependencies, guardrail or safety interventions, throttling, token usage, drift indicators, and service degradation patterns. You'll help reduce operational toil by building dashboards, alarms, runbooks, and automated remediation workflows, while driving post-incident learning and continuous improvement.

How You'll Contribute:

  • Define, implement, and maintain service level indicators, service level objectives, error budgets, dashboards, alarms, and escalation paths for an agentic AI platform operating in AWS GovCloud.
  • Monitor end-to-end health and performance of agent workflows, model invocations, retrieval or knowledge integrations, orchestration steps, tool calls, and dependent services.
  • Triage incidents, alerts, and operational tickets. Lead root-cause analysis, coordinate recovery actions, and drive post-incident corrective actions that reduce mean time to recovery and prevent recurrence.
  • Build and maintain observability pipelines across metrics, logs, traces, audit telemetry, and operational events using AWS-native tooling and approved enterprise observability tooling.
  • Establish and tune operational thresholds for latency, availability, error rates, token and cost consumption, workflow success rates, tool failure rates, guardrail interventions, and drift-related signals.
  • Partner with platform engineers, cloud engineers, AI agent developers, MLOps engineers, data scientists, and customer SMEs to define ownership boundaries, handoffs, rollback criteria, release readiness gates, and operational support models.
  • Coordinate with MLOps and data science teams when model or data quality degradation, drift, or unexpected behavior requires rollback, retraining, prompt changes, knowledge-base updates, or other corrective actions.
  • Automate remediation and routine operational tasks using Python, shell scripting, infrastructure as code, and event-driven workflows to reduce manual toil.
  • Support secure and compliant operations in regulated national defense environments, including auditability, least-privilege access, controlled logging, and disciplined change management.
  • Work with limited direction, mentor junior team members, and help mature AI operations practices across the program.


Grow your skills at the leading edge of innovation.

Join us. The world can't wait.

You Have:

  • 5+ years of experience supporting production distributed systems in a role such as SRE, Platform Engineering, Cloud Operations, or DevOps
  • Experience operating workloads on AWS including monitoring, alerting, logging, incident response, troubleshooting, IAM, networking, or secure operations
  • Experience supporting production AI/ML, generative AI, RAG, agentic AI, model-serving, or data-driven decision systems
  • Experience defining and operating SLIs, SLOs, error budgets, alert thresholds, runbooks, or operational readiness criteria
  • Experience with observability tooling across metrics, logs, traces, dashboards, or log analytics, including CloudWatch, OpenTelemetry, Prometheus, Grafana, OpenSearch, or ELK
  • Experience diagnosing issues across containers, orchestration platforms, or cloud runtimes, such as EKS, ECS, Lambda, or EC2
  • Experience with Python, Bash, or scripting languages to automate operational tasks, health checks, or remediation workflows
  • Experience participating in on-call rotations, triaging ticket queues, and leading incident response or post-incident review activities
  • Secret clearance
  • Bachelor's degree


Nice If You Have:

  • Experience with Amazon Bedrock, Bedrock Agents, Guardrails, Knowledge Bases, model invocation logging, EventBridge, CloudTrail, and CloudWatch-based monitoring for AI workloads or equivalent tooling for production agentic AI systems
  • Experience supporting AWS workloads in GovCloud, FedRAMP High, DoD SRG IL4/5, or other regulated or high-assurance environments
  • Experience with automation and infrastructure as code using Terraform, CloudFormation, or AWS CDK
  • Experience with CI/CD release engineering, canary strategies, rollback controls, and change management for cloud services and AI-enabled applications
  • Experience with Prometheus-compatible monitoring, Grafana, OpenSearch/ELK, or other enterprise observability stacks in containerized environments
  • Experience supporting GPU-backed inference, self-hosted model serving, or hybrid AI deployments if the platform evolves beyond managed services
  • Ability to distinguish infrastructure issues from AI-specific failure modes including workflow breakdowns, degraded retrieval, safety interventions, regressions, stale knowledge sources, and model or service throttling
  • Experience working in Agile and cross-functional environments and collaborating with engineers, operators, mission stakeholders, and technical leadership
  • AWS Certified CloudOps Engineer - Associate, AWS Certified DevOps Engineer - Professional, AWS Certified Machine Learning Engineer - Associate, AWS Certified Generative AI Developer - Professional, AWS Certified Security - Specialty, or cloud and AI operations Certifications
  • CompTIA Security+ or DoD 8570/8140 baseline Certification


Clearance:

Applicants selected will be subject to a security investigation and may need to meet eligibility requirements for access to classified information; Secret clearance is required.

Compensation

At Booz Allen, we celebrate your contributions, provide you with opportunities and choices, and support your total well-being. Our offerings include health, life, disability, financial, and retirement benefits, as well as paid leave, professional development, tuition assistance, work-life programs, and dependent care. Our recognition awards program acknowledges employees for exceptional performance and superior demonstration of our values. Full-time and part-time employees working at least 20 hours a week on a regular basis are eligible to participate in Booz Allen's benefit programs. Individuals who do not meet the threshold are only eligible for select offerings, not inclusive of health benefits. We encourage you to learn more about our total benefits by visiting the Resource page on our Careers site and reviewing Our Employee Benefits page.

Salary at Booz Allen is determined by various factors, including but not limited to location, the individual's particular combination of education, knowledge, skills, competencies, and experience, as well as contract-specific affordability and organizational requirements. The projected compensation range for this position is $99,000.00 to $225,000.00 (annualized USD). The estimate displayed represents the typical salary range for this position and is just one component of Booz Allen's total compensation package for employees. This posting will close within 90 days from the Posting Date.

Identity Statement

As part of the hiring process, we will ask you to complete an identity verification process that leverages advanced biometrics and artificial intelligence to ensure authenticity and protect against identity fraud. You are expected to be on camera during interviews and assessments. We reserve the right to take your picture to verify your identity and prevent fraud.

Candidate AI Usage Policy

AI is a part of our daily work at Booz Allen, and we are committed to the responsible and ethical use of AI tools. However, we want to ensure a fair candidate process based on your own skills and knowledge. As part of this commitment, the use of artificial intelligence (AI) or other tools to assist with responses during interviews (whether in-person or virtual) is prohibited unless permission is explicitly provided.

Work Model

Our people-first culture prioritizes the benefits of collaboration, whether it occurs in person or virtually. To support engagement and effective communication, employees working virtually are generally expected to have their cameras on during meetings.

  • Remote: If this position is listed as remote, there may still be occasions when you are required to work in person at a Booz Allen or customer facility.
  • Hybrid: If this position is listed as hybrid, you will be expected to work from a Booz Allen facility frequently, in alignment with leadership expectations and the needs of the role. You may also be required to work from or visit a customer facility.
  • Onsite: If this position is listed as onsite, work will primarily be performed at a Booz Allen office or customer facility, where employees will collaborate directly with colleagues and customers as required by the role.


Commitment to Non-Discrimination

All qualified applicants will receive consideration for employment without regard to disability, status as a protected veteran or any other status protected by applicable federal, state, local, or international law.

At Booz Allen, you’ll work at the forefront of advanced technology to uncover and solve the emerging challenges of our time. Change is within reach—and it starts with you.

About Us
Booz Allen is an advanced technology company delivering outcomes with speed for America’s most critical defense, civil, and national security priorities. We build technology solutions using AI, cyber, and other cutting-edge technologies to advance and protect the nation and its citizens. By focusing on outcomes, we enable our people, clients, and their missions to succeed—accelerating the nation to realize our purpose: Empower People to Change the World®.

Job Category
IT - Hardware
Clearance Level
Secret