Job description
**Building the Future of Crypto**

Our Krakenites are a world-class team with crypto conviction, united by our desire to discover and unlock the potential of crypto and blockchain technology.

**What makes us different?**

Kraken is a mission-focused company rooted in crypto values. As a Krakenite, you’ll join us on our mission to accelerate the global adoption of crypto, so that everyone can achieve financial freedom and inclusion. For over a decade, Kraken’s focus on our mission and crypto ethos has attracted many of the most talented crypto experts in the world.

Before you apply, please read the [Kraken Culture](https://www.kraken.com/culture) page to learn more about our internal culture, values, and mission. We also expect candidates to familiarize themselves with the Kraken app. Learn how to create a Kraken account [here](https://support.kraken.com/hc/en-us/articles/226090548-How-to-create-an-account-on-Kraken).

As a fully remote company, we have Krakenites in 70+ countries who speak over 50 languages. Krakenites are industry pioneers who develop premium crypto products for experienced traders, institutions, and newcomers to the space. Kraken is committed to [industry-leading security](https://blog.kraken.com/crypto-education/security-at-kraken), [crypto education](https://blog.kraken.com/category/crypto-education), and [world-class client support](https://blog.kraken.com/crypto-education/support-at-kraken) through our products like [Kraken Pro](https://pro.kraken.com/), [Desktop](https://www.kraken.com/desktop), [Wallet](https://www.kraken.com/wallet), and [Kraken Futures](https://www.kraken.com/features/futures).

**Become a Krakenite and build the future of crypto!**
Responsibilities
## The opportunity
- Own and operate GPU and accelerator clusters used for training, inference, evaluation, and experimentation, including drivers, runtimes, kernels, device plugins, node configuration, scheduling primitives, and workload isolation.
- Design infrastructure that enables Kraken teams to run models locally on GPUs where it is strategically and economically preferable, reducing unnecessary dependency on external providers and containing compute costs.
- Build and improve scheduling, orchestration, placement, quota management, and utilization systems across heterogeneous accelerator environments.
- Optimize inference pipelines for latency, throughput, reliability, memory efficiency, and cost using frameworks such as vLLM, Triton Inference Server, TensorRT, or equivalent serving stacks.
- Partner with ML engineers and researchers to remove bottlenecks in training, evaluation, batch inference, online inference, deployment, and production debugging workflows.
- Build observability for GPU utilization, memory pressure, queue depth, saturation, token throughput, request latency, failed workloads, capacity pressure, and spend.
- Drive reliability, incident response, alerting, runbooks, and post-incident improvements for always-on AI compute infrastructure.
- Evaluate and integrate new hardware, cloud instance families, specialized accelerators, runtimes, schedulers, and serving frameworks as the AI infrastructure landscape evolves.
- Build tooling that makes GPU usage visible, accountable, and easier for internal teams to consume without needing to become infrastructure experts.
- Contribute to long-term architecture decisions that balance performance, cost efficiency, scalability, operational simplicity, and production safety.
Requirements
## Skills you should HODL
- 5+ years of infrastructure engineering experience, with significant time spent on GPU compute, ML infrastructure, distributed systems, high-performance computing, or large-scale production platforms.
- Hands-on experience operating GPU clusters or accelerator-backed infrastructure in production or production-like environments, including scheduling, orchestration, utilization monitoring, and cost optimization.
- Strong systems engineering fundamentals across Linux, networking, storage, containers, Kubernetes, distributed runtimes, and production debugging.
- Experience with ML serving frameworks such as vLLM, Triton Inference Server, TensorRT, TorchServe, KServe, Ray Serve, or equivalent systems.
- Proficiency in Python for infrastructure automation, tooling, debugging, integration, and operational workflows.
- Practical understanding of performance tradeoffs across batching, concurrency, memory usage, GPU utilization, model size, latency, throughput, availability, and cost.
- Track record of optimizing compute costs while maintaining clear performance, reliability, and availability expectations.
- Experience building observable systems with useful metrics, logs, traces, dashboards, alerts, and incident workflows.
- Comfortable working in high-stakes, always-on environments where uptime, throughput, correctness, and operational discipline are critical.
- Clear communicator who can translate infrastructure tradeoffs for researchers, product teams, platform engineers, security stakeholders, and engineering leadership.
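The cost-versus-performance reasoning these requirements describe often starts from simple unit economics. A minimal sketch, with made-up instance prices and throughput figures (nothing here reflects real benchmarks or actual spend):

```python
def cost_per_million_tokens(instance_price_per_hour: float,
                            tokens_per_second: float) -> float:
    """Convert an hourly instance price and sustained token throughput
    into a cost per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return instance_price_per_hour / tokens_per_hour * 1_000_000

# Illustrative only: an $8/hr multi-GPU instance sustaining 4,000 tokens/s
print(round(cost_per_million_tokens(8.0, 4000), 3))  # 0.556
```

Framing serving changes this way (e.g. does larger batching raise tokens/s enough to offset its latency cost?) keeps utilization and spend discussions concrete across teams.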
About Kraken
Kraken (legally Payward, Inc.) is a US-based cryptocurrency exchange that facilitates trading of cryptocurrencies, stocks, futures, and ETFs in most US states. It serves over 10 million clients worldwide with $207 billion in quarterly trading volume and has expanded to tokenized equities for non-US customers.