
Cloud Engineer - Senior (Observability)

Leidos

Posted today

Job Requirements

Location: Remote
Clearance: Public Trust
Polygraph: Unspecified
Career Level: Not specified
Salary: $87,100 - $157,450

Job Description

R-00183575

Description

The Cloud Engineer - Senior (Observability) supports the SEC ISS contract by engineering, operating, and continuously improving the enterprise observability platform across hybrid cloud and containerized environments. This role is hands-on: the engineer instruments services with distributed tracing, code-level profiling, and custom metrics; builds and tunes Datadog (or comparable) dashboards, alerts, APM, log pipelines, RUM, and synthetic monitors; and then uses that telemetry to solve production performance, reliability, and capacity problems. The engineer partners with cloud, platform, and application teams to embed observability into Azure, AWS, and container platforms (OpenShift/Kubernetes), and drives reduction of alert noise, mean time to detect (MTTD), and mean time to resolve (MTTR). This position provides senior technical leadership for APM/distributed tracing strategy, SLO/SLI engineering, and data-driven operational decision-making in a 24x7x365 operating environment.

 

PRIMARY RESPONSIBILITIES  

 

Observability Platform Engineering  

- Engineer and operate the enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring.

- Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise.

- Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C TraceContext propagation, and unified service tagging across the estate.

- Develop and maintain integrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on-call/paging workflows.

- Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled.

 

Cloud and Container Monitoring Engineering  

- Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services.  

- Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health -- correlating database spans with upstream APM traces.

- Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM.  

- Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD.  

- Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry.  

 

Performance Engineering and Problem Solving  

- Lead data-driven investigation and resolution of complex performance, latency, saturation, and reliability issues across the estate.  

- Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and downstream dependencies.  

- Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence.  

- Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes.  

- Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps.  

 

Capacity, Reliability, and Continuous Improvement  

- Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency.

- Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders.

- Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation.  

- Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations.  

 

REQUIRED QUALIFICATIONS  

 

Citizenship/Work Authorization: Must meet contract requirements.

Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required).

 

EXPERIENCE

- Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5+ years focused on observability, performance engineering, or site reliability engineering.  

- Demonstrated experience engineering and operating an enterprise observability platform (Datadog strongly preferred; equivalent experience with Dynatrace, New Relic, Splunk Observability, or Grafana/Prometheus stacks considered).

- Proven experience building APM and distributed tracing coverage for production multi-tier applications -- including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, continuous profiling, and RUM-to-backend trace correlation -- across cloud and containerized workloads.  

- Proven experience leading complex production performance and reliability problem-solving from telemetry to remediation.  

- Hands-on experience monitoring Kubernetes or OpenShift clusters and containerized workloads in production.  

 

TECHNICAL SKILLS

- Enterprise observability platforms (Datadog or comparable): metrics, logs, traces, APM, RUM, synthetic, NPM  

- Instrumentation with OpenTelemetry, Datadog agents/SDKs, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) including custom spans, trace sampling strategies, W3C TraceContext propagation, and continuous profiling

- Microsoft Azure and AWS monitoring services and integrations (Azure Monitor, Log Analytics, CloudWatch, AWS X-Ray)  

- Container and Kubernetes/OpenShift observability, including cluster, workload, and service mesh telemetry  

- Cloud database monitoring: AWS RDS/Aurora (including Performance Insights), Azure SQL/PostgreSQL/MySQL (Query Performance Insight), and NoSQL/cache (DynamoDB, Cosmos DB, ElastiCache/Redis); query-level performance tuning, execution-plan analysis, and Datadog DBM or equivalent deep database APM

- Infrastructure-as-code for monitoring (Terraform, Bicep, ARM) and CI/CD-driven monitor/dashboard deployment  

- APM and distributed tracing: service/dependency maps, trace analytics, RUM-to-backend correlation, exception/error tracking, deployment tracking, and trace-based SLOs  

- Unified tagging strategy and cardinality governance across metrics/logs/traces (environment, service, version, ownership, data classification, cost center), including custom tag enrichment and tag-driven access/cost controls  

- Alert engineering, SLO/SLI design, error budget management, and alert-noise reduction  

- Performance engineering, capacity analysis, and telemetry-driven root-cause analysis  

- Integration of observability with ITSM (ServiceNow) and on-call/paging workflows  

 

PREFERRED QUALIFICATIONS  

- Experience supporting federal agency IT environments under FISMA/FedRAMP/NIST-aligned security and compliance requirements.  

- Datadog certification (Fundamentals and/or Administrator) or comparable enterprise observability certification.  

- Hands-on experience with Red Hat OpenShift Virtualization (CNV/KubeVirt) or other KubeVirt-based container virtualization observability.

- Experience with eBPF-based observability tooling and service mesh telemetry (Istio, Linkerd).

- Experience implementing SLOs and error budgets at enterprise scale and integrating them into operational governance.  

- Experience with cost-aware observability practices, including telemetry volume optimization and retention tuning.  

- Experience integrating observability outputs with executive reporting, SLA/KPI dashboards, and capacity forecasting.

- ITIL 4 Foundation  

- AWS Certified Solutions Architect - Associate (or higher)  

- Microsoft Certified: Azure Administrator Associate (or higher)  

- Red Hat Certified Specialist in OpenShift Administration (or equivalent)  

- HashiCorp Terraform Associate

 

WORK ENVIRONMENT / OTHER  

 

Operational Support: Supports a 24x7x365 operating environment; participates in a defined on-call rotation and may require surge support based on operational needs.

Location: Telework    

Travel: As required per contract direction.

EDUCATION & EXPERIENCE

BS and 4 – 8 years of prior relevant experience or Master's with 2 – 6 years of prior relevant experience. Preferred degree in a relevant field (e.g., Information Technology, Computer Science, Engineering).

If you're looking for comfort, keep scrolling. At Leidos, we outthink, outbuild, and outpace the status quo — because the mission demands it. We're not hiring followers. We're recruiting the ones who disrupt, provoke, and refuse to fail. Step 10 is ancient history. We're already at step 30 — and moving faster than anyone else dares.
Original Posting: May 19, 2026

For U.S. Positions: While subject to change based on business needs, Leidos reasonably anticipates that this job requisition will remain open for at least 3 days with an anticipated close date of no earlier than 3 days after the original posting date as listed above.

Pay Range: $87,100.00 - $157,450.00

The Leidos pay range for this job level is a general guideline only and not a guarantee of compensation or salary. Additional factors considered in extending an offer include (but are not limited to) responsibilities of the job, education, experience, knowledge, skills, and abilities, as well as internal equity, alignment with market data, applicable bargaining agreement (if any), or other law.

About Leidos

Leidos is an industry and technology leader serving government and commercial customers with smarter, more efficient digital and mission innovations. Headquartered in Reston, Virginia, with 47,000 global employees, Leidos reported annual revenues of approximately $16.7 billion for the fiscal year ended January 3, 2025. For more information, visit www.Leidos.com.

Pay and Benefits

Pay and benefits are fundamental to any career decision. That's why we craft compensation packages that reflect the importance of the work we do for our customers. Employment benefits include competitive compensation, Health and Wellness programs, Income Protection, Paid Leave and Retirement. More details are available at www.leidos.com/careers/pay-benefits.

Securing Your Data

Beware of fake employment opportunities using Leidos’ name. Leidos will never ask you to provide payment-related information during any part of the employment application process (i.e., ask you for money), nor will Leidos ever advance money as part of the hiring process (i.e., send you a check or money order before doing any work). Further, Leidos will only communicate with you through emails that are generated by the Leidos.com automated system – never from free commercial services (e.g., Gmail, Yahoo, Hotmail) or via WhatsApp, Telegram, etc. If you received an email purporting to be from Leidos that asks for payment-related information or any other personal information (e.g., about you or your previous employer), and you are concerned about its legitimacy, please make us aware immediately by emailing us at LeidosCareersFraud@leidos.com.

If you believe you are the victim of a scam, contact your local law enforcement and report the incident to the U.S. Federal Trade Commission.

Commitment to Non-Discrimination

All qualified applicants will receive consideration for employment without regard to sex, race, ethnicity, age, national origin, citizenship, religion, physical or mental disability, medical condition, genetic information, pregnancy, family structure, marital status, ancestry, domestic partner status, sexual orientation, gender identity or expression, veteran or military status, or any other basis prohibited by law. Leidos will also consider for employment qualified applicants with criminal histories consistent with relevant laws.

#Remote
Job Category: IT - Hardware
Clearance Level: Public Trust
Employer: Leidos