Senior Machine Learning Infrastructure Engineer
Santa Clara, CA | Full-Time | Senior | AI / Data Science
Responsibilities:
- Design and develop scalable, high-performance systems for training, inference, deployment, and monitoring of ML models at scale.
- Build and maintain efficient data pipelines, model versioning systems, and experiment tracking frameworks.
- Collaborate with cross-functional teams, including ML researchers and engineers, to identify bottlenecks and improve platform usability.
- Implement distributed systems and storage solutions optimized for machine learning workloads.
- Drive improvements in CI/CD workflows for ML models and infrastructure.
- Ensure high availability and reliability of the ML platform by implementing robust monitoring, logging, and alerting systems.
- Stay current with industry trends and integrate relevant tools and frameworks to enhance the platform.
- Mentor junior engineers and contribute to a culture of technical excellence.
- Ensure that your work is performed in accordance with the company’s Quality Management System (QMS) requirements and contribute to continuous improvement efforts.
- Ensure team compliance with QMS, monitor quality, and drive process improvements.
Required Skills:
- PhD or MS in Computer Science, Electrical Engineering, or related field
- Good oral and written communication skills
- PhD new graduate, or Master's degree with 3+ years of software engineering experience focused on ML infrastructure or distributed systems.
- Proficiency in Python, C++, and SQL
- Deep understanding of containerization, orchestration technologies, distributed ML workloads, and experiment tracking tools (e.g., Docker, Kubernetes, multiprocessing, Kubeflow, and MLflow)
- Experience deploying and managing resources across multiple platforms (AWS, GCP, or on-prem environments)
- Proficiency in at least one deep learning framework, such as PyTorch, and in data pipeline tools (e.g., Apache Airflow, Prefect).
- Strong knowledge of distributed systems, databases, and storage solutions.
- Extensive software design and development skills.
- Ability to learn and adapt to new technologies and contribute in a productive environment.
Preferred Skills:
- Familiarity with fundamental deep learning architectures, such as Convolutional Neural Networks (CNNs) and Transformer models
- Experience in building large-scale ML datasets, MLOps pipelines, and distributed computing frameworks like Ray
- Experience working with autonomous vehicles or robotics
Salary Range:
- $160,000 - $200,000 a year
