Senior Cluster Site Reliability Engineer

Berkeley, CA | Full-Time | Senior | DevOps

Skills

Python, Ruby, AWS, GCP, Kubernetes, Docker, Terraform, Ansible, S3, Prometheus, Grafana, ELK, Machine learning, TensorFlow, PyTorch, Spark, Security, OpenTelemetry, Airflow, MLflow, Kubeflow, DeepSpeed

Responsibilities

Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise
Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams
Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies
Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability

Requirements

5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
Experience with cloud infrastructure (AWS or GCP)
Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
Experience with distributed storage technologies (Lustre, Ceph, S3)
Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation
Bachelor's degree in computer science

Preferred Qualifications

Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)
Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)
Familiarity with hybrid/on-prem environments
Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments
Experience with HPC networking (InfiniBand, RDMA)
Solid security/IAM foundations (Identity management systems, AWS/GCP IAM, Zero Trust)

Job Summary

Company: Voleon

Location: Berkeley, CA

Type: Full-Time

Level: Senior

Domain: DevOps
