Staff Cloud Operations Engineer
Ireland | Full-Time | Staff | Operations
What You'll Do:
- Architect & Scale Infrastructure: Design and implement multi-cluster, multi-region Kubernetes deployments using EKS, GKE, and AKS. Build infrastructure that scales across regions and cloud providers.
- Own Production Systems: Take end-to-end ownership of production infrastructure. Drive incident response, postmortems, and improvements to prevent recurrence.
- Infrastructure as Code at Scale: Build and maintain Terraform modules for complex infrastructure patterns. Manage thousands of configuration files across clusters, regions, and environments using GitOps principles.
- GitOps & Deployment Excellence: Design and optimize ArgoCD ApplicationSets and Helm chart architectures. Build deployment pipelines that enable safe, automated releases across hundreds of microservices.
- Performance & Reliability Engineering: Analyze system performance, identify bottlenecks, and implement optimizations. Improve SLOs through capacity planning, autoscaling, and architectural improvements.
- Observability & Monitoring: Build and enhance monitoring, alerting, and observability using Prometheus, Grafana, Loki, and custom tooling. Drive visibility into complex distributed systems.
- Security & Compliance: Implement security controls, compliance frameworks, and best practices across cloud infrastructure. Design secure multi-tenant architectures.
- Technical Leadership: Mentor engineers, establish best practices, and drive technical decisions. Collaborate with platform, SRE, and product teams to deliver reliable infrastructure.
What We're Looking For:
- 5+ years in cloud infrastructure engineering, with deep expertise in at least one major cloud provider (AWS preferred)
- Strong Kubernetes experience: cluster design, operators, controllers, and multi-cluster management
- Proficiency with Infrastructure as Code: Terraform, CloudFormation, or similar
- GitOps expertise: ArgoCD, Flux, or similar; experience with ApplicationSets and complex deployment patterns
- Deep Linux and networking knowledge
- Experience with distributed systems: Elasticsearch, PostgreSQL, Redis, Kafka, RabbitMQ
- Monitoring and observability: Prometheus, Grafana, ELK stack, or similar
- Strong problem-solving skills and experience debugging complex distributed systems
- Experience with cloud security, compliance (SOC 2, ISO 27001), and secure-by-design practices
- Excellent communication skills for working across time zones and with distributed teams
- Self-directed with a track record of owning problems end-to-end
- Ability to participate in the team's on-call rotation
Nice to Have:
- Experience with multi-cloud architectures and cloud-agnostic patterns
- Contributions to open-source infrastructure projects
- Experience with service mesh technologies (Istio, Linkerd)
- Knowledge of chaos engineering and reliability testing
- Experience with cost optimization and FinOps practices
Why This Role:
- Work on infrastructure at scale: hundreds of clusters, thousands of services, global reach
- Deep technical ownership: design, build, and operate critical systems
- Modern stack: Kubernetes, GitOps, Infrastructure as Code, cloud-native tools
- Impact: infrastructure decisions affect millions of users
- Growth: work with experienced engineers and tackle complex challenges
