Job description
**Building the Future of Crypto**

Our Krakenites are a world-class team with crypto conviction, united by our desire to discover and unlock the potential of crypto and blockchain technology.

**What makes us different?**

Kraken is a mission-focused company rooted in crypto values. As a Krakenite, you’ll join us on our mission to accelerate the global adoption of crypto, so that everyone can achieve financial freedom and inclusion. For over a decade, Kraken’s focus on our mission and crypto ethos has attracted many of the most talented crypto experts in the world.

Before you apply, please read the [Kraken Culture](https://www.kraken.com/culture) page to learn more about our internal culture, values, and mission. We also expect candidates to familiarize themselves with the Kraken app. Learn how to create a Kraken account [here](https://support.kraken.com/hc/en-us/articles/226090548-How-to-create-an-account-on-Kraken).

As a fully remote company, we have Krakenites in 70+ countries who speak over 50 languages. Krakenites are industry pioneers who develop premium crypto products for experienced traders, institutions, and newcomers to the space. Kraken is committed to [industry-leading security](https://blog.kraken.com/crypto-education/security-at-kraken), [crypto education](https://blog.kraken.com/category/crypto-education), and [world-class client support](https://blog.kraken.com/crypto-education/support-at-kraken) through our products like [Kraken Pro](https://pro.kraken.com/), [Desktop](https://www.kraken.com/desktop), [Wallet](https://www.kraken.com/wallet), and [Kraken Futures](https://www.kraken.com/features/futures).

**Become a Krakenite and build the future of crypto!**
Responsibilities
## The opportunity
- Own and operate GPU and accelerator clusters used for training, inference, evaluation, and experimentation, including drivers, runtimes, kernels, device plugins, node configuration, scheduling primitives, and workload isolation.
- Design infrastructure that enables Kraken teams to run models locally on GPUs where it is strategically and economically preferable, reducing unnecessary dependency on external providers and containing compute costs.
- Build and improve scheduling, orchestration, placement, quota management, and utilization systems across heterogeneous accelerator environments.
- Optimize inference pipelines for latency, throughput, reliability, memory efficiency, and cost using frameworks such as vLLM, Triton Inference Server, TensorRT, or equivalent serving stacks.
- Partner with ML engineers and researchers to remove bottlenecks in training, evaluation, batch inference, online inference, deployment, and production debugging workflows.
- Build observability for GPU utilization, memory pressure, queue depth, saturation, token throughput, request latency, failed workloads, capacity pressure, and spend.
- Drive reliability, incident response, alerting, runbooks, and post-incident improvements for always-on AI compute infrastructure.
- Evaluate and integrate new hardware, cloud instance families, specialized accelerators, runtimes, schedulers, and serving frameworks as the AI infrastructure landscape evolves.
- Build tooling that makes GPU usage visible, accountable, and easier for internal teams to consume without needing to become infrastructure experts.
- Contribute to long-term architecture decisions that balance performance, cost efficiency, scalability, operational simplicity, and production safety.
Requirements
## Skills you should HODL
- 5+ years of infrastructure engineering experience, with significant time spent on GPU compute, ML infrastructure, distributed systems, high-performance computing, or large-scale production platforms.
- Hands-on experience operating GPU clusters or accelerator-backed infrastructure in production or production-like environments, including scheduling, orchestration, utilization monitoring, and cost optimization.
- Strong systems engineering fundamentals across Linux, networking, storage, containers, Kubernetes, distributed runtimes, and production debugging.
- Experience with ML serving frameworks such as vLLM, Triton Inference Server, TensorRT, TorchServe, KServe, Ray Serve, or equivalent systems.
- Proficiency in Python for infrastructure automation, tooling, debugging, integration, and operational workflows.
- Practical understanding of performance tradeoffs across batching, concurrency, memory usage, GPU utilization, model size, latency, throughput, availability, and cost.
- Track record of optimizing compute costs while maintaining clear performance, reliability, and availability expectations.
- Experience building observable systems with useful metrics, logs, traces, dashboards, alerts, and incident workflows.
- Comfortable working in high-stakes, always-on environments where uptime, throughput, correctness, and operational discipline are critical.
- Clear communicator who can translate infrastructure tradeoffs for researchers, product teams, platform engineers, security stakeholders, and engineering leadership.
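The cost-versus-performance reasoning these requirements describe often starts from simple unit economics. A minimal sketch, with made-up instance prices and throughput figures (nothing here reflects real benchmarks or actual spend):

```python
def cost_per_million_tokens(instance_price_per_hour: float,
                            tokens_per_second: float) -> float:
    """Convert an hourly instance price and sustained token throughput
    into a cost per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return instance_price_per_hour / tokens_per_hour * 1_000_000

# Illustrative only: an $8/hr multi-GPU instance sustaining 4,000 tokens/s
print(round(cost_per_million_tokens(8.0, 4000), 3))  # 0.556
```

Framing serving changes this way (e.g. does larger batching raise tokens/s enough to offset its latency cost?) keeps utilization and spend discussions concrete across teams.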
About Kraken
Kraken (legally Payward, Inc.) is a US-based cryptocurrency exchange that facilitates trading of cryptocurrencies, stocks, futures, and ETFs in most US states. It serves over 10 million clients worldwide with $207 billion in quarterly trading volume and has expanded to tokenized equities for non-US customers.