Kraken

Senior AI Compute Infrastructure Engineer

6.0/10

Not specified
Remote
senior
5 days ago
ai, crypto, tech, GPU compute, ML infrastructure, distributed systems, high-performance computing, Linux, Python, Kubernetes, containers, observability

AI Summary

The vacancy is strong in task clarity and requirements but lacks compensation details.

Description

What you'll do

  • Own and operate GPU and accelerator clusters used for training, inference, evaluation, and experimentation.
  • Design infrastructure that enables Kraken teams to run models locally on GPUs.
  • Build and improve scheduling, orchestration, placement, quota management, and utilization systems.
  • Optimize inference pipelines for latency, throughput, reliability, memory efficiency, and cost.
  • Partner with ML engineers and researchers to remove bottlenecks in workflows.
  • Build observability for GPU utilization, memory pressure, and other metrics.
  • Drive reliability, incident response, alerting, and post-incident improvements.
  • Evaluate and integrate new hardware and cloud instance families.
  • Build tooling that makes GPU usage visible and makes GPUs easier for internal teams to use.

Requirements

  • 5+ years of infrastructure engineering experience, with significant time spent on GPU compute and ML infrastructure.
  • Hands-on experience operating GPU clusters or accelerator-backed infrastructure.
  • Strong systems engineering fundamentals across Linux, networking, storage, and containers.
  • Experience with ML serving frameworks such as vLLM, Triton Inference Server, or equivalent.
  • Proficiency in Python for infrastructure automation and operational workflows.
  • Practical understanding of performance tradeoffs across metrics such as latency, throughput, memory efficiency, and cost.
  • Track record of optimizing compute costs while maintaining performance and reliability.
  • Experience building observable systems with useful metrics and incident workflows.
  • Comfortable working in high-stakes, always-on environments.