AI Operations & Infrastructure Engineer

Invictus

Posted today

Job Requirements

Fort Meade, MD
Top Secret/SCI CI Polygraph
Mid Level Career (5+ yrs experience)
Salary not specified

Job Description

Title: AI Operations & Infrastructure Engineer


Location: Fort Meade, MD


Clearance: TS/SCI with a CI Polygraph

Job Details:


  • Manage and maintain AI computing platforms, including GPUs and other specialized hardware

  • Install and configure GPU drivers and software

  • Oversee the AI software stack and tools

  • Implement and manage containerization technologies like Docker and Kubernetes

  • Configure and optimize networking infrastructure for AI workloads, including InfiniBand and Ethernet

  • Manage storage solutions for AI data, considering performance and capacity requirements

  • Deploy and manage data processing units (DPUs) to accelerate data center workloads

  • Monitor and manage AI cluster health and resource utilization

  • Implement workload management and scheduling tools like Slurm and Kubernetes

  • Ensure efficient power and cooling for AI infrastructure to maintain optimal operating conditions

  • Configure high-performance networking solutions for AI and machine learning workloads

  • Optimize network performance to ensure maximum throughput and minimal latency for AI computations

  • Implement and fine-tune network protocols to enhance data transfer speeds and efficiency

  • Integrate NVIDIA networking products with existing AI infrastructure, including servers, GPUs, and storage systems

  • Deploy networking solutions in data centers to ensure seamless connectivity between AI components

  • Diagnose and resolve networking issues impacting AI workloads to maintain optimal system performance

  • Provide technical support and guidance to teams managing AI infrastructure

  • Collaborate with data scientists, researchers, and IT professionals to understand networking requirements and challenges

  • Lead deployment and validation of servers and systems for AI-enabled platforms

  • Configure and manage network topologies, BMC, OOB, TPM, power, and cooling

  • Install, upgrade, and validate GPU-based servers, BlueField DPUs, cables, and transceivers

  • Perform firmware upgrades, hardware validation, and storage setup

  • Configure and administer physical and logical resources, including MIG partitioning and BlueField platforms

  • Install and configure operating systems, cluster software, drivers, containers (Docker), and NGC CLI

  • Manage and orchestrate clusters using NVIDIA Base Command Manager, Slurm, Pyxis, Enroot, and Run:ai

  • Perform stress, benchmarking, and burn-in tests using HPL, NCCL, NVIDIA NeMo, and ClusterKit

  • Verify cabling, firmware/software versions, and network signal quality

  • Troubleshoot and resolve hardware, software, storage, and performance faults

  • Replace faulty components and optimize systems for AMD/Intel platforms

  • Monitor, document, and report on cluster health, resource usage, and job performance

  • Ensure secure, efficient, and scalable operation of NVIDIA AI infrastructure, including user access and workload management




Requirements:


  • Qualified candidates must hold an active NVIDIA Professional Certification in AI Networking, AI Infrastructure, or AI Operations

  • Prior direct, hands-on professional experience administering NVIDIA GPU and data processing unit (DPU) technologies, AI software stacks, and data center environments for high-performance AI workloads

  • Comprehensive expertise in deploying and maintaining AI compute platforms, including proficiency in containerization and workload orchestration using Docker, Kubernetes, Slurm, NVIDIA Base Command Manager, and Run:ai

  • Must be capable of configuring physical and logical resources, including Multi-Instance GPU (MIG) partitioning and BlueField platforms, while overseeing critical facility elements such as power, cooling, and storage solutions

  • Demonstrated advanced skills in AI networking, specifically configuring and optimizing high-performance InfiniBand and Ethernet fabrics to ensure maximum throughput and minimal latency

  • Current active TS/SCI clearance with a CI Polygraph



Equal Opportunity Employer/Veterans/Disabled