Senior AI Compute Infrastructure Engineer

Kraken · remote · senior · full-time

ai · crypto · tech · GPU compute · ML infrastructure · distributed systems · high-performance computing · Linux · Python · Kubernetes · containers · observability · AWS Trainium · Triton Inference Server · TensorRT


6.5

AI Score

The vacancy is strong in task clarity and requirements but lacks compensation details.

no salary info

Job description

Our Krakenites are a world-class team with crypto conviction, united by our desire to discover and unlock the potential of crypto and blockchain technology. Kraken is a mission-focused company rooted in crypto values. As a Krakenite, you’ll join us on our mission to accelerate the global adoption of crypto, so that everyone can achieve financial freedom and inclusion. For over a decade, Kraken’s focus on our mission and crypto ethos has attracted many of the most talented crypto experts in the world. As a fully remote company, we have Krakenites in 70+ countries who speak over 50 languages. Krakenites are industry pioneers who develop premium crypto products for experienced traders, institutions, and newcomers to the space. Kraken is committed to industry-leading security, crypto education, and world-class client support through our products like Kraken Pro, Desktop, Wallet, and Kraken Futures. Become a Krakenite and build the future of crypto!

Kraken is building a dedicated AI Compute and Infrastructure team to power the next generation of model training, inference, evaluation, and experimentation across the exchange. This team sits within engineering leadership and owns the infrastructure layer that lets Kraken run AI workloads with control, speed, reliability, and cost discipline.

Responsibilities

- Own and operate GPU and accelerator clusters used for training, inference, evaluation, and experimentation.
- Design infrastructure that enables Kraken teams to run models locally on GPUs.
- Build and improve scheduling, orchestration, placement, quota management, and utilization systems.
- Optimize inference pipelines for latency, throughput, reliability, memory efficiency, and cost.
- Partner with ML engineers and researchers to remove bottlenecks in workflows.
- Build observability for GPU utilization, memory pressure, and other metrics.
- Drive reliability, incident response, alerting, and post-incident improvements.
- Evaluate and integrate new hardware and cloud instance families.
- Build tooling that makes GPU usage visible and easier for internal teams.

Requirements

- 5+ years of infrastructure engineering experience, with significant time spent on GPU compute and ML infrastructure.
- Hands-on experience operating GPU clusters or accelerator-backed infrastructure.
- Strong systems engineering fundamentals across Linux, networking, storage, and containers.
- Experience with ML serving frameworks such as vLLM, Triton Inference Server, or equivalent.
- Proficiency in Python for infrastructure automation and operational workflows.
- Practical understanding of performance tradeoffs across various metrics.
- Track record of optimizing compute costs while maintaining performance and reliability.
- Experience building observable systems with useful metrics and incident workflows.
- Comfortable working in high-stakes, always-on environments.

About Kraken

Kraken (legally Payward, Inc.) is a US-based cryptocurrency exchange that facilitates trading of cryptocurrencies, stocks, futures, and ETFs in most US states. It serves over 10 million clients worldwide with $207 billion in quarterly trading volume and has expanded to tokenized equities for non-US customers.

Crypto · 1000+ · San Francisco, United States · Founded 2011 · https://kraken.com
