We are seeking a highly skilled and passionate Site Reliability Engineer (SRE) to join our globally distributed team. In this remote role based in Stockholm, Sweden, you will play a pivotal role in ensuring the reliability, performance, and scalability of our mission-critical, cloud-native platforms. If you thrive on optimizing complex systems, automating everything, and fostering a culture of operational excellence, we want to hear from you.
As an SRE, you will be at the intersection of development and operations, applying software engineering principles to infrastructure and operational problems. You will be instrumental in building and maintaining robust, observable, and resilient systems that serve millions of users worldwide. This role requires a proactive approach to identifying and resolving potential issues before they impact our customers, and a commitment to continuous improvement.
Key Responsibilities:
- Design and Implement Automation: Develop and maintain automation tools and pipelines (CI/CD) to streamline deployments, infrastructure provisioning (IaC with Terraform), and operational tasks across our cloud environment (primarily AWS, with some GCP).
- Monitor and Alert: Establish and refine comprehensive monitoring, alerting, logging, and tracing solutions using tools like Prometheus, Grafana, the ELK stack, and Jaeger to provide deep insights into system health and performance.
- Incident Response & Post-Mortem: Participate in on-call rotations, respond to and resolve critical incidents swiftly, and conduct thorough post-mortems to prevent recurrence and improve system resilience.
- Performance Optimization: Identify and address performance bottlenecks, latency issues, and scalability challenges within our microservices architecture running on Kubernetes.
- Capacity Planning: Work closely with development teams to forecast resource needs, implement efficient auto-scaling strategies, and manage cloud costs effectively.
- Define & Enforce SLOs/SLIs: Collaborate with product and engineering teams to establish meaningful Service Level Objectives (SLOs) and Service Level Indicators (SLIs), and ensure systems meet these targets.
- Knowledge Sharing & Mentorship: Document operational procedures, share best practices, and contribute to a culture of continuous learning and improvement within the team.
Required Skills & Tools:
- Strong proficiency in Linux operating systems and shell scripting (Bash).
- Expertise in at least one high-level programming language (Python or Go preferred).
- Extensive experience with cloud platforms (AWS is a must, GCP or Azure a plus).
- Deep understanding and hands-on experience with containerization and orchestration technologies, especially Kubernetes.
- Proficiency with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
- Experience with CI/CD pipelines (e.g., GitLab CI, Jenkins, ArgoCD).
- Solid understanding of monitoring and alerting systems (e.g., Prometheus, Grafana, Datadog).
- Strong knowledge of networking concepts, distributed systems, and microservices architecture.
- Excellent problem-solving, analytical, and communication skills.
Nice-to-Have:
- Experience with service mesh technologies (e.g., Istio, Envoy).
- Familiarity with message queues (e.g., Kafka, RabbitMQ) and NoSQL databases.
- Understanding of SRE principles, practices, and methodologies.
- Previous experience working in a fully remote, distributed team environment.
What We Offer:
Join a forward-thinking company that values innovation, collaboration, and work-life balance. We offer a competitive salary (USD 90,000 – 120,000 / year), a comprehensive benefits package, and the flexibility of a full-time remote role. You will have access to cutting-edge technologies, opportunities for professional growth and development, and a supportive environment where your contributions directly impact our global success.