Senior Site Reliability Engineer - CTJ - Poly

Microsoft Corporation

Today
Top Secret
Unspecified
Polygraph
Engineering - Mechanical
Redmond, WA (On-Site/Office)
Atlanta, GA (On-Site/Office)

Are you interested in shaping the future of Microsoft 365 products that empower our customers to seamlessly create, collaborate, and share within government cloud environments? In this role, you will leverage your expertise in software development, online services, and AI to envision, design, and improve upon next-generation Microsoft 365 government cloud service offerings.

The Site Reliability Engineering (SRE) team provides leadership, direction, and accountability for application architecture, system design, and end-to-end implementation. As a Senior Site Reliability Engineer, you will identify and deliver software improvements using your expertise in software development, AI, complexity analysis, and scalable system design.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Technical Knowledge and Domain-Specific Expertise
  • Demonstrates end-to-end expertise in distributed systems design, interactions between cloud technology layers and components, functions of physical network devices, and dependencies at scale. Drives efforts within an organization to identify and recommend optimal configurations of cloud technology solutions and develops or modifies the code base that defines infrastructures to improve the reliability and operability of supported products.
  • Develops end-to-end technical expertise in the architecture, code, features, and operations of specific products as required to implement improvements in product availability, reliability, efficiency, observability, and/or performance. Drives code/design reviews with the engineering teams that develop and/or manage those products and shares learnings and recommendations across engineering teams working on related products within their organization.
  • Researches and maintains deep knowledge of industry trends as well as advances in large-scale distributed systems and cloud technologies; identifies opportunities to create, implement, and/or optimally utilize new tools, technologies, and/or processes to solve ambiguous problems and improve product availability, reliability, efficiency, observability, and/or performance. Drives the adoption of new solutions across engineering teams working with related products within an organization and provides guidance and coaching to others on relevant topics.

Contributions to Development and Design
  • Leverages technical expertise in the infrastructure of large-scale distributed systems and specific products, as well as objective insights drawn from analyses of production telemetry data, to advocate for, or directly contribute to, changes to the code base to improve the availability, reliability, efficiency, observability, and performance of related sets of products developed and supported by teams within an organization.
  • Develops, tests, and implements changes to optimize code and improve the observability, reliability and operability of platforms, systems, and products at scale. Reviews the effect of these changes to document and share development insights within their team.
  • Engages with product engineering teams within an organization by driving code/design reviews, hosting regular meetings, and participating in on-call rotations and incident responses throughout product development and operations cycles; leverages end-to-end technical expertise on underlying systems/platforms and insights from engagements with product engineering teams and telemetry analyses to propose scalable improvements in code and designs with attention to customer/business objectives and incident prevention.

Driving Operational Excellence
  • Develops code, scripts, systems, or platforms that automate moderately complex but repetitive operations processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale; reviews existing automation code and scripts to evaluate reusability, extendibility, and scalability within an organization.
  • Leverages end-to-end technical expertise and telemetry analysis to identify patterns and opportunities to implement configuration and data changes for related sets of platforms, systems, or products in production using code, tooling, and automation; identifies cases where teams lack the tools and/or capability to manage platforms, systems, or products using code and drives efforts within an organization to expand capabilities and/or tooling accordingly.
  • Leverages existing tools and automation to enable product engineering teams within their organization to increase the velocity at which they can reliably and safely implement changes in production; monitors the effects of changes across platforms or systems.
  • Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of systems, platforms, or products operating at scale. Contributes to the development of new tooling and/or predictive models to identify and test potential improvements in product development and/or operations, and monitors the impact of changes on operations metrics (e.g., Time-to-X) within an organization.
  • Identifies optimal uses for existing tools and/or models to identify contributing factors or points of failure that are affecting the availability, reliability, performance, and/or efficiency of systems, platforms, or products; proposes and implements solutions that resolve root cause(s) and prevent issues from occurring in related products by working with product engineering teams within an organization to test and deploy them to production.
  • Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting complex issues, and deploying appropriate fixes to resolve root cause(s); alerts product teams, owners, and leadership to issues with major customer/business impact and escalates resolution of the highly complex, ambiguous, and impactful issues to include other engineering teams and/or subject matter experts as needed. Shares details related to incidents and their resolution through post-mortem reports and during regular review meetings.
  • Develops, maintains, and leverages capacity planning models and monitoring tools to forecast product capacity and resource demands; models the predicted effect of changes to capacity plans to optimize code bases to better manage resources in response to dynamic capacity demands. May contribute to the development of automated resource utilization tools or processes that can dynamically scale compute resources up or down to adjust to capacity demands.
  • Draws insights from performance and resource monitoring across products within their organization to identify whether there is a need to optimize code, infrastructure, or architecture - or if changes to compute resources are required; uses advanced models to forecast and verify the efficacy of changes at scale and proposes solutions that are aligned with customer/business needs.
  • Shares insights and best practices that can be applied to improve development and operations across related sets of systems, platforms, and/or products. Continues to develop their understanding of insights and best practices through interactions with more experienced SREs and members of product engineering teams. Mentors and coaches other engineers to help them identify and propose relevant solutions.

Additional Responsibilities
  • Design, develop, and deliver engineering solutions that serve and protect M365 government clouds.
  • Own deployment, availability, reliability, performance and customer escalation targets for sovereign environments.
  • Proactively identify and reduce issues through design, testing, and implementation of software-based solutions.
  • Collaborate with Engineering and Program Management partners to translate customer, business, and technical requirements into architectural designs and feature releases.
  • Drive efficiencies through software improvement and root cause analysis resulting in service delivery, maturity, and scalability.
  • Develop, test, and implement changes to optimize code and improve platforms. You leverage end-to-end technical expertise and telemetry analysis to identify patterns and opportunities to implement configuration and data changes. You review the effect of changes to document and share development insights within your team. You drive code/design reviews, host regular meetings, and participate in on-call rotations and incident responses throughout product development and operations cycles.
  • In addition, you respond to incidents during regular on-call rotations and share details related to incidents and their resolution through post-mortem reports and regular review meetings.

Other
  • Embody our culture and values


Qualifications

Required/Minimum Qualifications:
  • 6+ years technical experience in software engineering, network engineering, or systems administration
    • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
    • OR Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.

Other Requirements:
  • Security Clearance Requirements: Candidates must be able to meet the Microsoft, customer, and/or government security screening requirements for this role. These requirements include, but are not limited to, the following specialized security screenings:
    • Candidates must have an active TS and be willing to upgrade to TS/SCI (with polygraph) or have an active TS/SCI and be willing to upgrade to TS/SCI (with polygraph). This role will require candidates to maintain the TS/SCI (with polygraph) clearance. Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. Failure to maintain or obtain the appropriate clearance and/or customer screening requirements may result in employment action up to and including termination.
    • Clearance Verification: This position requires successful verification of the stated security clearance to meet federal government customer requirements. You will be asked to provide clearance verification information prior to an offer of employment.
    • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
  • Citizenship & Citizenship Verification: This position requires verification of U.S. citizenship due to citizenship-based legal restrictions. Specifically, this position supports United States federal, state, and/or local government agency customers and is subject to certain citizenship-based restrictions where required or permitted by applicable law. To meet this legal requirement, citizenship will be verified via a valid passport, other approved documents, or a verified U.S. government clearance.

Preferred/Additional Qualifications:
  • 7+ years technical experience in software engineering, network engineering, or systems administration
    • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration
    • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
    • OR Doctorate Degree in Computer Science, Information Technology, or related field.

Site Reliability Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $158,400 - $258,000 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

Microsoft will accept applications for the role until June 26, 2025.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.

Jobs at Microsoft: Pursuing a Career with Global Impact

About Us
At Microsoft, we're motivated and inspired every day by how our customers use our software to find creative solutions to business problems, develop breakthrough ideas, and stay connected to what's most important to them. Our mission is to empower every person and every organization on the planet to achieve more. We will only achieve our mission if we live our culture. We start by becoming learners in all things – having a growth mindset. Then we apply that mindset to learning about our customers, being diverse and inclusive and working together as one.