Job Requirements
Fort Meade, MD
Top Secret/SCI CI Polygraph
Mid Level Career (5+ yrs experience)
Salary not specified
Job Description
Title: AI Operations & Infrastructure Engineer
Location: Fort Meade, MD
Clearance: TS/SCI with a CI Polygraph
Job Details:
- Manage and maintain AI computing platforms, including GPUs and other specialized hardware
- Install and configure GPU drivers and software
- Oversee the AI software stack and tools
- Implement and manage containerization and orchestration technologies such as Docker and Kubernetes
- Configure and optimize networking infrastructure for AI workloads, including InfiniBand and Ethernet
- Manage storage solutions for AI data, considering performance and capacity requirements
- Deploy and manage data processing units (DPUs) to accelerate data center workloads
- Monitor and manage AI cluster health and resource utilization
- Implement workload management and scheduling tools like Slurm and Kubernetes
- Ensure efficient power and cooling for AI infrastructure to maintain optimal operating conditions
- Configure high-performance networking solutions for AI and machine learning workloads
- Optimize network performance to ensure maximum throughput and minimal latency for AI computations
- Implement and fine-tune network protocols to enhance data transfer speeds and efficiency
- Integrate NVIDIA networking products with existing AI infrastructure, including servers, GPUs, and storage systems
- Deploy networking solutions in data centers to ensure seamless connectivity between AI components
- Diagnose and resolve networking issues impacting AI workloads to maintain optimal system performance
- Provide technical support and guidance to teams managing AI infrastructure
- Collaborate with data scientists, researchers, and IT professionals to understand networking requirements and challenges
- Lead deployment and validation of servers and systems for AI-enabled platforms
- Configure and manage network topologies, BMC, OOB, TPM, power, and cooling
- Install, upgrade, and validate GPU-based servers, BlueField DPUs, cables, and transceivers
- Perform firmware upgrades, hardware validation, and storage setup
- Configure and administer physical and logical resources, including MIG partitioning and BlueField platforms
- Install and configure operating systems, cluster software, drivers, containers (Docker), and NGC CLI
- Manage and orchestrate clusters using NVIDIA Base Command Manager, Slurm, Pyxis, Enroot, and Run:ai
- Perform stress, benchmarking, and burn-in tests using HPL, NCCL, NVIDIA NeMo, and ClusterKit
- Verify cabling, firmware/software versions, and network signal quality
- Troubleshoot and resolve hardware, software, storage, and performance faults
- Replace faulty components and optimize systems for AMD/Intel platforms
- Monitor, document, and report on cluster health, resource usage, and job performance
- Ensure secure, efficient, and scalable operation of NVIDIA AI infrastructure, including user access and workload management
Requirements:
- Active NVIDIA Professional Certification in AI Networking, AI Infrastructure, or AI Operations
- Prior direct, hands-on professional experience administering NVIDIA GPU and data processing unit (DPU) technologies, AI software stacks, and data center environments for high-performance AI workloads
- Comprehensive expertise in deploying and maintaining AI compute platforms, including proficiency in containerization and workload orchestration using Docker, Kubernetes, Slurm, NVIDIA Base Command Manager, and Run:ai
- Ability to configure physical and logical resources, including Multi-Instance GPU (MIG) partitioning and BlueField platforms, while overseeing critical facility elements such as power, cooling, and storage solutions
- Advanced skills in AI networking, specifically configuring and optimizing high-performance InfiniBand and Ethernet fabrics to ensure maximum throughput and minimal latency
- Current active TS/SCI clearance with a CI Polygraph
Equal Opportunity Employer/Veterans/Disabled