Top Secret
Rockville, MD (On-Site/Office)
AI/ML Engineer - Local LLM & RAG Systems
PTFS is seeking an experienced AI/ML Engineer with strong expertise in deploying and managing locally hosted Large Language Models (LLMs) and building Retrieval-Augmented Generation (RAG) pipelines. The ideal candidate has hands-on experience with frameworks such as Ollama, LangChain, LlamaIndex, or VLLM, and is highly skilled in Python-based orchestration, vector search, and scalable data storage systems such as Vector Databases or Apache Solr. This role will be responsible for designing, optimizing, and maintaining our on-premise or air-gapped GenAI infrastructure, integrating new models, and keeping our architecture modular and future-proof.
LLM Deployment & Orchestration
Deploy, run, and optimize locally hosted LLMs using frameworks such as Ollama, VLLM, GPT4All, or HuggingFace Transformers.
- Build and maintain model-serving pipelines with Python, including GPU optimization, quantization, batching, and model switching.
- Implement flexible architecture allowing rapid integration of new open-source or proprietary models.
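As a minimal sketch of what serving a locally hosted model can look like, the snippet below calls Ollama's default local endpoint (`http://localhost:11434/api/generate`); the model name `llama3` in the usage note is illustrative, and a production pipeline would add retries, streaming, and model-switching logic on top:

```python
import json
import urllib.request

# Ollama's default local generate endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> dict:
    """Build the JSON payload for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def query_local_llm(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send a prompt to a locally hosted model and return the completion text."""
    payload = json.dumps(build_generate_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Usage would be along the lines of `query_local_llm("llama3", "Summarize RAG in one sentence.")` against a running Ollama instance.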
RAG Pipeline Development
Architect end-to-end Retrieval-Augmented Generation (RAG) systems.
- Design and implement vector embedding, indexing, and retrieval layers, including chunking, metadata management, and routing logic.
- Integrate RAG flows using LangChain or LlamaIndex, ensuring low latency and high retrieval accuracy.
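The core retrieval step can be sketched in a few functions: chunk documents, embed each chunk, and rank chunks by similarity to the query. The bag-of-words `embed` below is a toy stand-in; a real pipeline would call a sentence-embedding model and store vectors in one of the databases listed later:

```python
import math
from collections import Counter

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (a common RAG chunking baseline)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would use an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by similarity to the query and return the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

The retrieved chunks would then be stuffed into the model's context window along with the user's question.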
Data Storage and Retrieval
- Develop and maintain Vector Databases such as:
- Pinecone
- Weaviate
- Chroma
- Milvus
- FAISS
- Or architect a schema and search strategy for a Solr-based alternative using traditional indexing/search if vectors are not used.
- Manage ingestion pipelines, embedding generation, and update workflows for newly added data sources.
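One common pattern for the update workflow is content hashing: fingerprint each document and re-embed only the ones whose hash has changed since the last ingestion run. A stdlib-only sketch, with hypothetical doc-id/hash maps standing in for real index state:

```python
import hashlib

def content_hash(doc: str) -> str:
    """Stable fingerprint used to detect changed or newly added documents."""
    return hashlib.sha256(doc.encode("utf-8")).hexdigest()

def plan_updates(incoming: dict[str, str], index_state: dict[str, str]) -> list[str]:
    """Return the doc ids whose embeddings must be (re)generated.

    `incoming` maps doc id -> current text; `index_state` maps doc id -> the
    content hash already recorded alongside the vector index.
    """
    return [
        doc_id
        for doc_id, text in incoming.items()
        if index_state.get(doc_id) != content_hash(text)
    ]
```

Unchanged documents are skipped entirely, which keeps embedding costs proportional to what actually changed.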
Application & API Development
Build backend services and APIs that interact with LLMs, embedding pipelines, and retrieval layers.
- Integrate agents, tools, and orchestration flows using:
- LangChain
- OpenAI function-calling equivalents in local models
- Custom Python toolchains
- Deploy services using Docker, Kubernetes, or local orchestrators when needed.
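A sketch of the "function-calling equivalent" pattern for local models: the model is prompted to emit a JSON tool call, and a thin dispatch layer parses it and runs the matching tool. The `lookup_document` tool here is hypothetical; real registries would expose search, retrieval, and other toolchain functions:

```python
import json
from typing import Callable

# Hypothetical tool registry; real deployments register retrieval, search, etc.
TOOLS: dict[str, Callable[..., str]] = {
    "lookup_document": lambda doc_id: f"contents of {doc_id}",
}

def dispatch_tool_call(model_output: str) -> str:
    """Parse a JSON tool call emitted by a local model and run the matching tool.

    Expects output shaped like {"tool": "...", "arguments": {...}} -- the
    convention many local models are prompted to follow in lieu of native
    function calling.
    """
    call = json.loads(model_output)
    tool = TOOLS.get(call["tool"])
    if tool is None:
        raise ValueError(f"unknown tool: {call['tool']}")
    return tool(**call["arguments"])
```

Frameworks like LangChain wrap this loop (plus retries and schema validation), but the underlying mechanism is this simple parse-and-dispatch step.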
System Performance, Optimization & Monitoring
- Optimize model performance, including:
- quantization (GGUF, GPTQ, AWQ)
- tensor parallelization
- caching strategies
- Monitor system resources for memory, GPU/CPU utilization, and throughput.
- Implement automated pipelines to update models, refresh embedding stores, and version datasets.
Collaboration & Architecture
- Work with cross-functional teams to align the LLM capabilities with business needs.
- Provide guidance on GenAI trends, limitations, and best practices.
- Contribute to documentation and provide internal training when needed.
Required Skills & Experience
Technical Skills
- 3-7+ years of experience in Machine Learning, MLOps, Backend Engineering, or AI Infrastructure.
- Expert-level proficiency in Python and relevant libraries (FastAPI, Pydantic, PyTorch, HuggingFace Ecosystem).
- Hands-on experience with LLM deployment via:
- Ollama
- VLLM
- GPT4All
- HuggingFace Transformers
- LM Studio
- Strong experience with RAG frameworks:
- LangChain
- LlamaIndex
- Proficiency with vector databases (Pinecone, Chroma, Weaviate, FAISS, Milvus).
- Experience with Solr, Elasticsearch, or OpenSearch (schema design, analyzers, indexing).
- Experience developing embeddings pipelines, chunking strategies, and metadata retrieval.
- Familiarity with containerization and orchestration (Docker; Kubernetes optional).
- Strong experience with model inference optimization: quantization, batching, GPU acceleration.
ML/AI Knowledge
- Understanding of foundational LLM mechanics: transformers, tokenization, context windows, prompt engineering.
- Experience with model fine-tuning, LoRA adapters, or supervised fine-tuning (a plus).
- Knowledge of GenAI architectural patterns, agents, routing, tool use, and document indexing strategies.
Preferred Qualifications
- Experience working in air-gapped or on-premise environments.
- Experience with CI/CD for ML systems.
- Familiarity with:
- Nvidia GPU stack (CUDA, cuBLAS, TensorRT)
- DevOps tools (Terraform, Ansible, Helm charts)
- Exposure to hybrid search systems combining vector + keyword retrieval (BM25 + embeddings).
- Experience integrating LLMs into enterprise systems.
Education
- Bachelor's degree in Computer Science, Data Science, Engineering, Mathematics, or related field.
- Master's or higher preferred, but equivalent experience accepted.
Soft Skills
- Strong problem-solving ability and comfort working with ambiguous or evolving requirements.
- Excellent communication and ability to translate technical concepts for non-technical teams.
- Self-driven with a passion for exploring new GenAI technologies and keeping current with evolving LLM tools.
Summary
This role is ideal for someone who enjoys building practical, production-ready AI systems around local LLMs and wants to work at the cutting edge of the GenAI landscape: integrating models, designing robust retrieval systems, and ensuring future scalability.
group id: RTL253009