We are seeking a highly skilled and passionate MLOps Engineer to join our remote-first team. This full-time role is based out of San Francisco, CA, and centers on building and maintaining robust, scalable machine learning infrastructure. You will bridge data science, ML engineering, and operations, ensuring our AI models move smoothly from development to production and run reliably at scale. If you thrive on automating complex ML workflows and optimizing performance, this role is for you.
Key Responsibilities:
- Design, implement, and maintain end-to-end CI/CD pipelines for machine learning models, ensuring automation, reproducibility, and version control from experimentation to deployment.
- Administer and optimize our machine learning infrastructure primarily on Google Cloud Platform (GCP), leveraging services like Vertex AI, GKE, and Cloud Build.
- Develop and manage infrastructure as code using tools such as Terraform and Ansible to provision and configure cloud resources efficiently.
- Implement monitoring, logging, and alerting solutions for ML models in production, using tools like Prometheus and Grafana to ensure high availability and performance.
- Collaborate closely with data scientists and ML engineers to understand model requirements and translate them into robust, production-ready MLOps solutions.
- Automate model deployment, scaling, and rollback strategies using containerization (Docker) and orchestration (Kubernetes, Helm charts).
- Ensure best practices for security, data governance, and compliance are integrated into all MLOps processes and systems.
Required Skills & Tools:
- Strong experience with cloud platforms, specifically GCP, including services relevant to ML (e.g., Vertex AI, GKE, Cloud Storage, BigQuery).
- Proficiency in MLOps principles and practices, with hands-on experience building and managing ML pipelines.
- Expertise with CI/CD tools such as Jenkins, GitLab CI, or GitHub Actions.
- Solid experience with Infrastructure as Code (IaC) tools like Terraform and configuration management with Ansible.
- Proficiency in scripting and programming, primarily Python, for automation and ML model integration.
- Extensive experience with containerization technologies (Docker) and orchestrators (Kubernetes), including package management with Helm.
- Familiarity with ML experiment tracking and pipeline tooling (e.g., MLflow, Kubeflow) and with data/model versioning.
- Understanding of data engineering concepts and distributed data processing frameworks.
Nice-to-Have Skills:
- Experience with other major cloud providers like AWS or Azure.
- Familiarity with advanced ML frameworks and libraries (e.g., TensorFlow Extended, PyTorch Lightning).
- Knowledge of streaming data technologies (e.g., Kafka, Pub/Sub).
- Experience with MLOps platforms beyond GCP's native offerings.
What We Offer:
- A competitive annual salary ranging from USD 80,000 to USD 110,000, commensurate with experience.
- Comprehensive health, dental, and vision insurance for you and your family.
- Generous paid time off, including holidays and sick leave.
- Flexible remote work environment that values work-life balance.
- Opportunities for professional growth, continuous learning, and career development.
- A collaborative, inclusive, and innovative culture where your contributions have a significant impact on groundbreaking AI products.
- Access to cutting-edge technologies and a budget for professional development courses/certifications.