We are seeking a highly skilled and motivated Cloud Infrastructure Engineer with a strong focus on MLOps to join our innovative team. In this fully remote role, you will be instrumental in designing, building, and maintaining the scalable, reliable, and secure cloud infrastructure that powers our cutting-edge machine learning initiatives. You will work within the Americas time zones, collaborating closely with ML Engineers and Data Scientists to optimize our MLOps platforms and ensure seamless deployment and operation of ML models from experimentation to production.
Your expertise will directly impact our ability to deliver advanced AI/ML solutions, ensuring our infrastructure is robust, efficient, and future-proof. If you are passionate about cloud technologies, MLOps, and automation, and thrive in a dynamic, collaborative environment, we encourage you to apply.
Key Responsibilities:
- Design, implement, and manage highly available and scalable cloud infrastructure (primarily AWS or GCP) to support machine learning workloads using Infrastructure as Code (IaC) principles with Terraform.
- Deploy, configure, and optimize MLOps platforms such as Kubeflow and integrate with experiment tracking tools like MLflow.
- Develop, maintain, and enhance data pipelines and workflow orchestration using tools like Apache Airflow to automate ML model training, evaluation, and deployment.
- Implement and manage CI/CD pipelines for infrastructure changes and ML model deployments, ensuring rapid and reliable delivery.
- Monitor the performance, security, and cost-efficiency of our cloud environments and MLOps tools using solutions like Prometheus, Grafana, or cloud-native monitoring services.
- Collaborate with Machine Learning Engineers and Data Scientists to understand their infrastructure needs, troubleshoot issues, and provide expert guidance on cloud best practices.
- Ensure compliance with security policies and industry best practices for data governance and privacy within the cloud infrastructure.
Required Skills & Tools:
- Strong hands-on experience with a major cloud provider (AWS, GCP, or Azure), including core compute, storage, networking, and managed services.
- Proficiency in containerization technologies (Docker) and orchestration (Kubernetes).
- Extensive experience with Infrastructure as Code (IaC) tools, particularly Terraform.
- Demonstrable experience with MLOps platforms and tools such as Kubeflow, MLflow, or similar.
- Solid understanding of, and experience with, workflow orchestration tools like Apache Airflow.
- Proficiency in scripting languages (e.g., Python, Bash).
- Experience building and maintaining CI/CD pipelines for infrastructure and application deployment (e.g., GitLab CI, Jenkins, ArgoCD).
- Familiarity with monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, ELK Stack).
- Strong understanding of networking concepts, security principles, and cost optimization in cloud environments.
Nice-to-Haves:
- Experience with data warehousing solutions (e.g., Snowflake, BigQuery, Redshift).
- Knowledge of advanced security practices, including identity and access management (IAM) and network security groups.
- Certifications in cloud platforms (e.g., AWS Certified DevOps Engineer - Professional, Google Cloud Professional Cloud DevOps Engineer).
- Contributions to open-source MLOps projects or the MLOps community.
What We Offer:
We offer a competitive rate of USD 60–90 per hour, commensurate with experience. This is a fully remote position, allowing you to work from any location within the Americas time zones. Join a forward-thinking team where innovation is at the core of everything we do. You'll have the opportunity to work with cutting-edge technologies, contribute to impactful projects, and grow your skills in a supportive and collaborative environment that values continuous learning and professional development.