Job Requirements
Springfield, VA
Top Secret/SCI CI Polygraph
Mid Level Career (5+ yrs experience)
$155,000 - $195,000
Job Description
On behalf of our Federal Contracting client, the ClearanceJobs Talent Solutions Team is seeking a DevSecOps & Site Reliability Engineer to keep our client’s production-deployed applications reliable, observable, and continuously improving in the field.
You will own the full lifecycle of platform operations: building resilient deployment pipelines, instrumenting systems with deep telemetry, responding to incidents, and shepherding new releases from developer commits through regression testing and into production. You will sit alongside GEOINT analysts and engineers, and your work will directly determine whether warfighters and intelligence professionals get the answers they need, when they need them.
An active Top Secret security clearance with SCI eligibility is required (active TS/SCI with CI Poly is preferred). Work is performed 100% on-site in Springfield, VA. There is a 24x7 on-call rotation associated with this position and potential for CONUS/OCONUS travel for deployments, exercises, and customer engagements.
Responsibilities:
• Build and operate resilient platforms. Design, deploy, and maintain containerized services on OpenShift and Kubernetes across AWS and on-premise edge hardware, across multiple classification environments, including air-gapped environments.
• Run distributed deployments. Execute and continuously improve the distributed deployment, configuration, and lifecycle management of applications across enterprise data centers and forward edge nodes.
• Observe and troubleshoot. Build, tune, and operate the Prometheus and Grafana observability stack (metrics, dashboards, alerts, and SLOs), and apply AIOps techniques to detect anomalies, correlate signals, and shorten time-to-resolution.
• Deliver releases under pressure. Rapidly install new software releases into development and test environments, execute formal regression test protocols, and coordinate scheduled deployments to production with zero or minimal user impact.
• Test, deploy, and scale services. Containerize, deploy, and horizontally scale services as part of an enterprise- or edge-ready architecture using Docker and Kubernetes across cloud and on-premise edge deployable hardware.
• Integrate identity and access. Implement and maintain integrations with enterprise identity providers including GEOAxIS, Microsoft Entra ID, and other PKI/SAML/OIDC infrastructure.
• Run the help desk loop. Triage, document, and resolve user-reported incidents using ticketing systems; maintain runbooks, postmortems, and knowledge base articles that raise the floor for the whole support team.
• Collaborate across the mission. Work shoulder-to-shoulder with GEOINT analysts, engineers, and researchers to translate operational needs into deployable capabilities, and to translate field issues back into engineering fixes.
• Harden and secure. Apply DevSecOps practices end-to-end: vulnerability scanning, hardened base images, secrets management, STIG compliance, and continuous accreditation maintenance for ATO-sustained environments.
• Support 24/7 worldwide users. Participate in an on-call rotation supporting users across multiple time zones and combatant commands; respond decisively to outages, degradations, and high-priority operational events.
Required Education / Experience:
• 5+ years of professional experience in DevOps, SRE, DevSecOps, or production O&M roles supporting distributed software systems.
• Active TS/SCI clearance.
• Container platforms: Strong hands-on experience with OpenShift and Kubernetes, including workloads, operators, ingress, networking, storage, and upgrades.
• Cloud: Practical AWS experience (EC2, EKS, S3, IAM, VPC, CloudWatch); experience working in GovCloud-style restricted environments.
• Distributed deployment & O&M: Demonstrated ownership of production deployments, configuration management, patching, and lifecycle operations for distributed systems.
• Observability: Production-grade Prometheus and Grafana experience, including exporters, recording rules, alerting, dashboards, and SLO-driven operations.
• AIOps: Familiarity with anomaly detection, log/metric correlation, and automated remediation patterns to reduce MTTR.
• Containers & CI/CD: Fluency with Docker, image hardening, and CI/CD pipelines (GitLab CI, Jenkins, GitHub Actions, or equivalent).
• Identity management: Working familiarity with GEOAxIS, Microsoft Entra ID, PKI/CAC, SAML, and OIDC integration patterns.
• Ticketing & incident management: Experience operating a help desk / incident workflow.
• Linux & scripting: Strong Linux administration skills and proficiency in Bash plus at least one of Python, Go, or equivalent.
Preferred Qualifications
• Direct experience supporting GEOINT, IMINT, or all-source intelligence production environments.
• Experience with edge or tactical deployments operating under DIL/DDIL constraints.
• Knowledge of ICD 503, RMF, and ATO sustainment processes.
• Hands-on with Kafka, OpenSearch, Elasticsearch, or similar distributed data platforms.
• Experience deploying or operating ML/AI inference workloads, including GPU-accelerated services.
• Infrastructure-as-Code: Terraform, Ansible, or Helm at production scale.
Work Environment
• On-site work at a customer facility is required.
• Participation in a 24/7 on-call rotation.
• Occasional travel to additional CONUS and OCONUS sites for deployments, exercises, and customer engagements.
• Work is performed on classified networks; standard security and handling protocols apply.