Senior Cluster Site Reliability Engineer
Berkeley, CA | Full-Time | Senior | DevOps
Responsibilities
- Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise
- Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
- Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams
- Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
- Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies
- Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability
Requirements
- 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
- Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
- Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
- Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
- Experience with cloud infrastructure (AWS or GCP)
- Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
- Experience with distributed storage technologies (Lustre, Ceph, S3)
- A "systems engineer" rather than "systems administrator" mindset: thinking systematically and leveraging automation
- Bachelor's degree in computer science
Preferred Qualifications
- Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)
- Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)
- Familiarity with hybrid/on-prem environments
- Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments
- Experience with HPC networking (InfiniBand, RDMA)
- Solid security/IAM foundations (identity management systems, AWS/GCP IAM, Zero Trust)
