Job description
At Exness, we are not just a leading trading broker—we’ve reimagined what it takes to be a leader. With 40M+ trades a day and 2,000+ people across 13 countries, we combine scale, care, and real tech to make trading better for 1M+ clients worldwide. Recognised globally as a Best Place to Work, we’re a people-first company where long-term wins always matter more. As part of our team, you will shape the future of fintech with real technology, care, and purpose.
Responsibilities
### What you'll actually do
- Collaborate closely with infrastructure teams on selecting and configuring GPU servers, high-performance networking, and RDMA-enabled clusters.
- Configure and manage GPU MIG (Multi-Instance GPU) partitions based on workload requirements and model characteristics.
- Ensure reliable and scalable GPU operations in Kubernetes, including runtime integration, device plugins, and GPU scheduling capabilities.
- Design, deploy, and maintain model serving runtimes, including vLLM, ONNX, SGLang, NVIDIA Triton, and KServe, ensuring high performance, scalability, and efficient GPU utilization.
- Build and maintain CI/CD pipelines and tooling for model packaging, versioning, and deployment, enabling reliable model delivery for internal teams.
- Build and maintain platform tooling for model lifecycle management, including experiment tracking, model versioning, and registry systems (e.g. MLflow).
- Enable infrastructure and workflows for model fine-tuning and adaptation (e.g. LoRA), focusing on scalability, reproducibility, and automation within the platform.
- Develop and support internal tooling for managing model inputs and configurations (e.g. prompt templates), enabling consistent and reusable model usage patterns.
- Conduct performance testing and evaluation of multi-node GPU clusters to identify and resolve bottlenecks.
- Build and maintain observability for GPU clusters and model workloads, including metrics such as GPU utilization, memory usage, throughput, and latency.
- Integrate tracing for model inference workflows to provide end-to-end visibility into requests and model behavior.
- Ensure compliance with security requirements for platform development.
- Evaluate and benchmark model inference performance across different runtimes, hardware setups, and configurations to guide platform optimization.
Requirements
### Who we’re looking for
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- 5+ years of experience in infrastructure, platform engineering, or distributed systems.
- Hands-on experience working with GPU infrastructure, including NVIDIA or AMD stack and multi-GPU environments.
- Strong experience with Kubernetes, including deploying and operating production workloads.
- Experience with Linux-based environments.
- Strong programming skills in Python and/or Go.
- Understanding of distributed systems and multi-node workloads.
- Experience with model serving and inference systems (e.g. vLLM, ONNX, SGLang, NVIDIA Triton, KServe).
- Experience with CI/CD pipelines and automation for deploying services or models.
- Experience with monitoring and observability tools (metrics, tracing, logging).
- Nice to have: familiarity with networking concepts relevant to distributed systems (e.g. RDMA, high-performance networking).
- Good communication and problem-solving skills.
- Advanced English proficiency for a range of work and business purposes.
- Critical thinking and attention to detail.
- Decision-making skills and the ability to adapt to change.
Conditions
### What we offer
- Full relocation support for you and your family to make your move smooth and worry-free.
About Exness
Exness is a global multi-asset retail broker founded in 2008 that provides online trading services to over 1 million clients worldwide. The company processes 40+ million trades daily and operates as an ethical trading platform across 13 countries.