Job Requirements
Washington, DC
Public Trust Polygraph Unspecified
Job Description
Job title: Senior Cloud Observability Engineer - Datadog
Location: Washington, D.C., 20549 (100% Onsite)
Duration: 6 Months
Salary Range: $58.00 - $60.00/Hour on W2 (Without Benefits).
Applicants must be willing to work on W2.
Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required).
Primary Responsibilities:
Observability Platform Engineering:
- Engineer and operate the enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring.
- Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise.
- Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C TraceContext propagation, and unified service tagging across the estate.
- Develop and maintain integrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on-call/paging workflows.
- Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled.
- Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services.
- Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health -- correlating database spans with upstream APM traces.
- Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM.
- Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD.
- Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry.
- Lead data-driven investigation and resolution of complex performance, latency, saturation, and reliability issues across the estate.
- Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and downstream dependencies.
- Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence.
- Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes.
- Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps.
- Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency.
- Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders.
- Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation.
- Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations.
Education and Experience:
- Bachelor's degree in a relevant field (e.g., Information Technology, Computer Science, Engineering).
- Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5+ years focused on observability, performance engineering, or site reliability engineering.
- Demonstrated experience engineering and operating an enterprise observability platform (Datadog strongly preferred; equivalent experience with Dynatrace, New Relic, Splunk Observability, or Grafana/Prometheus stacks considered).
- Proven experience building APM and distributed tracing coverage for production multi-tier applications -- including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, continuous profiling, and RUM-to-backend trace correlation -- across cloud and containerized workloads.
- Proven experience leading complex production performance and reliability problem-solving from telemetry to remediation.