Job Requirements

Fort Meade, MD

Top Secret/SCI CI Polygraph

Mid Level Career (5+ yrs experience)

Salary not specified

Job Description

Title: AI Operations & Infrastructure Engineer

Location: Fort Meade, MD

Clearance: TS/SCI with a CI Polygraph

Job Details:

Manage and maintain AI computing platforms, including GPUs and other specialized hardware

Install and configure GPU drivers and software

Oversee the AI software stack and tools

Implement and manage containerization technologies like Docker and Kubernetes

Configure and optimize networking infrastructure for AI workloads, including InfiniBand and Ethernet

Manage storage solutions for AI data, considering performance and capacity requirements

Deploy and manage data processing units (DPUs) to accelerate data center workloads

Monitor and manage AI cluster health and resource utilization

Implement workload management and scheduling tools like Slurm and Kubernetes

Ensure efficient power and cooling for AI infrastructure to maintain optimal operating conditions

Configure high-performance networking solutions for AI and machine learning workloads

Optimize network performance to ensure maximum throughput and minimal latency for AI computations

Implement and fine-tune network protocols to enhance data transfer speeds and efficiency

Integrate NVIDIA networking products with existing AI infrastructure, including servers, GPUs, and storage systems

Deploy networking solutions in data centers to ensure seamless connectivity between AI components

Diagnose and resolve networking issues impacting AI workloads to maintain optimal system performance

Provide technical support and guidance to teams managing AI infrastructure

Collaborate with data scientists, researchers, and IT professionals to understand networking requirements and challenges

Lead deployment and validation of servers and systems for AI-enabled platforms

Configure and manage network topologies, BMC, OOB, TPM, power, and cooling

Install, upgrade, and validate GPU-based servers, BlueField DPUs, cables, and transceivers

Perform firmware upgrades, hardware validation, and storage setup

Configure and administer physical and logical resources, including MIG partitioning and BlueField platforms

Install and configure operating systems, cluster software, drivers, containers (Docker), and NGC CLI

Manage and orchestrate clusters using NVIDIA Base Command Manager, Slurm, Pyxis, Enroot, and Run:ai

Perform stress, benchmarking, and burn-in tests using HPL, NCCL, NVIDIA NeMo, and ClusterKit

Verify cabling, firmware/software versions, and network signal quality

Troubleshoot and resolve hardware, software, storage, and performance faults

Replace faulty components and optimize systems for AMD/Intel platforms

Monitor, document, and report on cluster health, resource usage, and job performance

Ensure secure, efficient, and scalable operation of NVIDIA AI infrastructure, including user access and workload management

Requirements:

Qualified candidates must hold an active NVIDIA Professional Certification in AI Networking, AI Infrastructure, or AI Operations

Prior direct, hands-on professional experience administering NVIDIA GPU and data processing unit (DPU) technologies, AI software stacks, and data center environments for high-performance AI workloads

Comprehensive expertise in deploying and maintaining AI compute platforms, requiring proficiency in containerization and workload orchestration using Docker, Kubernetes, Slurm, NVIDIA Base Command Manager, and Run:ai

Must be capable of configuring physical and logical resources, including Multi-Instance GPU (MIG) partitioning and BlueField platforms, while overseeing critical facility elements such as power, cooling, and storage solutions

Demonstrated advanced skills in AI networking, specifically configuring and optimizing high-performance InfiniBand and Ethernet fabrics to ensure maximum throughput and minimal latency

Current active TS/SCI clearance with a CI Polygraph

Equal Opportunity Employer/Veterans/Disabled

AI Operations & Infrastructure Engineer

Invictus
