Principal Cloud Operations Engineer (10166)
San Jose, California, United States | Full-Time | Staff | Operations
Responsibilities:
- Provide technical leadership in cloud architecture, operational excellence, reliability, and cost optimization across large-scale production environments.
- Stay current with industry trends and best practices, and leverage AI technologies and cloud service provider platforms (AWS, Google Cloud, and Azure) to improve operational efficiency, scalability, security, and resiliency.
- Design and ensure secure, reliable, and high-performance communication across multiple regions and cloud service providers.
- Configure, tune, and operate middleware services, including SQL and NoSQL databases, messaging and streaming platforms, and related infrastructure components.
- Evaluate, recommend, and lead the adoption of CloudOps and DevOps tools, platforms, and automation solutions.
- Troubleshoot complex production infrastructure and application issues, providing deep technical expertise and hands-on support when required.
- Drive root cause analysis (RCA), implement corrective actions, and establish preventive measures to avoid recurrence.
- Collaborate closely with engineering cloud architects in system design discussions, architecture reviews, and whiteboard sessions.
- Partner with Development, QA, SRE, and external service providers or carriers to resolve issues and improve system reliability.
- Design, implement, and evolve deployment automation platforms for Kubernetes-based microservices.
- Improve service availability, performance, and scalability through automation, tooling, capacity planning, and process improvements.
- Analyze system and service performance, identify bottlenecks, and deliver actionable recommendations to improve efficiency and resilience.
Qualifications:
- BS-level technical degree required; Computer Science or Engineering background preferred.
- 8+ years of experience in a CloudOps / DevOps role.
- Hands-on experience with AWS or another public cloud (Azure, GCP, etc.).
- Knowledge of Linux, security, and networking fundamentals.
- Working knowledge of container-based architecture and deployment (Docker, Kubernetes).
- Working knowledge of deployment automation development (Terraform, Helm, ArgoCD).
- Experience in diagnosing and resolving complex application problems.
- Working knowledge of Elasticsearch, PostgreSQL, Redis, Ignite, Flink, Kafka, and RabbitMQ.
- Experience with monitoring tools (Nagios, Grafana, Prometheus).
- Experience with cloud security and compliance implementation is a plus.
- Strong follow-through and initiative to stay with issues until they are resolved.
- Comfortable working within a distributed team located in multiple time zones.
- Salary: USD 160,000 - 200,000, depending on region, qualifications, and experience.
