Cloud Infrastructure Staff Engineer

San Francisco, CA | Full-Time | Staff | Software Engineering

Responsibilities

  • Build and Manage Kubernetes Platform on AWS: Develop, maintain, and scale an AWS and Kubernetes-based infrastructure that supports all backend applications at PayJoy. Design, implement, and optimize cloud and container-based solutions to ensure high availability, resilience, and cost-effectiveness. Conduct code reviews, manage infrastructure as code (IaC), and implement CI/CD pipelines to promote best practices for code quality, reliability, and security.
  • Develop and Enable Code Quality Standards: Design and implement platform features and enhancements to meet application and developer needs, prioritizing code scalability, cost optimization, and automated testing. Write and review code to ensure it meets high standards of quality, robustness, and scalability. Act as a technical mentor, guiding team members on writing efficient, maintainable code through code reviews and pair programming as necessary.
  • Lead CI/CD and DevOps Practices: Design and maintain CI/CD pipelines, including Docker image creation, Kubernetes deployment artifacts, and environment provisioning. Establish and maintain logging, monitoring, and alerting systems to streamline application development and deployment processes. Collaborate with teams to reduce downtime and improve deployment speed and reliability.
  • Cross-Team Collaboration and Application Onboarding: Partner with product and engineering teams to understand new backend applications and identify onboarding requirements for our platform. Work closely with cross-functional teams to plan, architect, and implement solutions that fit within the broader platform strategy. Evaluate side effects of onboarding new applications, address any compatibility issues, and create seamless pathways for new app integration.
  • On-Call Rotation and Incident Management: Participate in the on-call rotation, providing technical leadership in incident response and resolution. Conduct thorough post-mortem analysis of incidents, document findings, and implement process improvements to prevent future issues. Work alongside SREs and other engineers to troubleshoot issues, triage incidents, and perform root-cause analysis to continuously improve platform reliability.
  • Optimize Developer Productivity and Resource Utilization: Develop tools, templates, and documentation to make it easier for developers to work on the platform, enhancing productivity. Monitor and analyze platform performance to identify cost-saving opportunities and ensure efficient resource usage. Drive automation initiatives to reduce manual intervention, streamline operations, and free up developer time for core product work.

Requirements

  • 12+ years of experience in software engineering or DevOps, with a strong background in cloud infrastructure (AWS preferred), Kubernetes, and CI/CD.
  • Technical proficiency in Python, Go, or another backend language, with experience writing production-grade code and implementing scalable solutions.
  • Hands-on experience with Docker, Kubernetes, and container orchestration; strong familiarity with Helm, Terraform, and other IaC tools.
  • Proven experience leading or mentoring engineering teams, fostering a collaborative and productive work environment.
  • Solid knowledge of modern DevOps practices, including infrastructure as code, monitoring, logging, and security best practices.
  • Excellent problem-solving skills, with a proactive approach to incident resolution and performance optimization.
  • Ability to communicate technical concepts effectively to non-technical stakeholders, with strong collaboration skills across diverse teams.

Preferred skills

  • Experience with DataDog, Prometheus, Grafana, or similar monitoring and logging tools.
  • Knowledge of financial technology (fintech) and the specific requirements around security, compliance, and data protection.
  • Prior experience in incident management and participating in on-call rotations.

Benefits

  • 100% company-funded health insurance for employee and immediate family
  • Company-funded employee life and disability insurance
  • 20 paid vacation days, flexible sick leave
  • $2,000 USD annual co-working travel perk
  • $2,000 USD annual professional development perk
  • Headphone benefit, home office equipment allowance, and wellness perks
  • $340 company-funded commuter benefit
  • Catered lunches

Job Summary

Company: PayJoy
Location: San Francisco, CA
Type: Full-Time
Level: Staff
Domain: Software Engineering