Senior AI Infrastructure Engineer

Technical Intelligence Solutions

Today
Top Secret/SCI
$150,000 and above
IT - Data Science
Reston, VA (On-Site/Office)

Overview
This role requires 10+ years of infrastructure experience with a strong focus on AI/ML environments. The ideal candidate will have deep expertise in Linux system administration, GPU-based computing (especially with NVIDIA DGX systems), and distributed data platforms such as Cloudera and Hadoop. Key responsibilities include leading the setup and maintenance of AI infrastructure, supporting scalable ML pipelines, deploying and monitoring LLMs, and troubleshooting across the AI/ML stack. Familiarity with MLOps tools (e.g., MLflow, Kubeflow, Airflow) and an understanding of networking and storage optimizations for ML workloads are essential. The position also involves close collaboration with data science teams and requires clear documentation and knowledge transfer. Onsite Required.

Security Clearance: TS/SCI Required

Minimum Requirements:
• 10+ years of experience in infrastructure, with deep expertise in AI/ML environments.
• Strong background in Linux system administration, hardware architecture, and GPU-based computing.
• Proficiency in Cloudera, Hadoop, and distributed data processing systems.
• Experience setting up and maintaining NVIDIA DGX systems or similar GPU platforms.
• Knowledge of networking and storage optimizations for ML workloads.
• Familiarity with LLM deployment workflows.
• Understanding of modern MLOps tools and practices (e.g., MLflow, Kubeflow, Airflow).

Key Responsibilities:
• Lead the setup, configuration, and maintenance of AI infrastructure, including NVIDIA DGX systems and GPU clusters.
• Manage and optimize big data environments to support scalable ML pipelines.
• Collaborate with data scientists to deploy, monitor, and scale LLMs and generative AI workloads.
• Troubleshoot complex issues across the AI/ML hardware and software stack.
• Provide documentation and training to support knowledge transfer and sustainability.
• Serve as a liaison between ML development teams and infrastructure operations to ensure reliable production performance.

Skills and Proficiencies:
• Linux system administration.
• GPU-based computing and NVIDIA DGX systems.
• Distributed computing platforms (Cloudera, Hadoop).
• ML workload infrastructure design and optimization.
• Troubleshooting complex AI/ML systems.
• Networking and storage for high-performance computing.
• MLOps tools and best practices.

Additional Information:
• Candidates should have experience supporting production-grade AI systems and collaborating closely with data science teams.

About us:
Technical Intelligence Solutions (TIS) is dedicated to delivering top-notch solutions to our customers by building a team of highly qualified professionals who thrive in a collaborative, idea-driven, innovative environment.


Founded and operated by experienced engineers, TIS understands customer goals, strategies, and the expertise required to craft innovative solutions. We specialize in supporting critical DoD missions with reliable, efficient systems, networks, and applications that excel in real-time operations.

Benefits:
We offer a comprehensive benefits package, including bi-weekly pay, 20 days of PTO, a 5% safe harbor 401(k), professional development reimbursement, and a variety of healthcare options for eligible employees.

To Apply: Interested candidates should submit their resume for consideration.