Principal Architect - Infrastructure
Remote US, USA | Full-Time | Staff | Other
Responsibilities
- Architect and scale enterprise-grade AKS clusters built for high concurrency, performance, and real-time AI inference, ensuring the platform is globally distributed and highly available.
- Leverage Crossplane to provision Azure services through a Kubernetes-native control plane, enabling rapid scaling of AI services.
- Champion GitOps practices with Argo CD to standardize deployments across multiple environments and regions, enabling reliable, automated delivery of mission-critical SaaS workloads.
- Engineer infrastructure that supports data-intensive AI/ML pipelines, integrating compute, storage, and messaging with Kubernetes to power real-time decision intelligence use cases.
- Optimize scalability and concurrency with autoscaling, pod disruption budgets, and advanced workload scheduling, ensuring millions of daily requests are served with low latency.
- Develop and maintain automation, tooling, and integrations using Python, Ruby, and Terraform, enabling teams to scale infrastructure and AI services efficiently.
- Design and enforce secure, compliant, multi-tenant architectures with Azure AD SSO, managed identities, RBAC, and Key Vault integration.
- Build resilient networking topologies with VNets, VNet peering, Private Link, service mesh technologies (e.g., Istio, Linkerd), and Emissary-ingress for advanced security and reliability.
- Integrate observability frameworks at scale using Prometheus, Grafana, Azure Monitor, and OpenTelemetry, providing deep visibility into performance, availability, and latency.
- Collaborate closely with AI/ML engineering teams to align infrastructure with real-time inference and streaming data requirements, enabling cutting-edge decision automation.
- Mentor engineering and operations teams while documenting and evangelizing Kubernetes-native and Azure-native best practices, driving innovation across the organization.
About You
- 10+ years of cloud infrastructure experience with expert-level skills in Kubernetes and Azure.
- Proven experience designing and operating multi-tenant SaaS platforms where performance, scalability, and security are critical.
- Hands-on expertise with Crossplane for Kubernetes-controlled Azure service provisioning.
- Deep familiarity with Azure services: AKS, Azure Database for MySQL (Flexible Server), Blob Storage, Event Hubs, Key Vault, etc.
- Strong coding and automation background with GitHub Actions, Python, and Terraform, plus experience with other high-level programming and scripting languages.
- Skilled in Infrastructure as Code (Terraform, Crossplane, Helm) and GitOps (Argo CD).
- In-depth knowledge of Kubernetes networking, autoscaling, and workload orchestration for AI/ML inference workloads.
- Proficiency with observability tooling: Prometheus, Grafana, Azure Monitor, and OpenTelemetry.
- A collaborative leader who thrives on mentoring and enabling teams, with excellent communication and documentation skills.
- Motivated to build the core infrastructure behind AI-powered decision intelligence at global scale, driving meaningful impact for some of the world’s most recognized brands.
Nice to Have
- Background with large-scale, real-time data streaming platforms.
- Prior collaboration on AI/ML infrastructure platforms or decision intelligence systems.
- Contributions to open-source projects, especially in Kubernetes or the cloud-native ecosystem.
