Site Reliability Engineer
Experience: 3-8 years
Location: Remote (permanent work from home)
Employment Type: Full-Time
About Us: We are a forward-thinking technology company committed to delivering high-performance, scalable, and reliable systems. We are seeking an experienced Site Reliability Engineer (SRE) to join our team and ensure the stability and efficiency of our infrastructure and services.
Key Responsibilities:
System Reliability and Performance:
- Design, implement, and maintain highly available and scalable systems.
- Monitor system performance, identify issues, and proactively resolve them.
- Conduct root cause analysis for incidents and implement preventive measures.
Automation and Efficiency:
- Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention.
- Implement infrastructure as code (IaC) practices using tools such as Terraform or Ansible.
Collaboration and Support:
- Work closely with development and operations teams to enhance system reliability and performance.
- Provide technical support and guidance to other team members on best practices and troubleshooting techniques.
- Participate in on-call rotations to ensure 24/7 support for critical systems.
Monitoring and Incident Management:
- Set up and maintain monitoring and alerting systems so that incidents are detected promptly.
- Manage and respond to incidents, ensuring timely resolution and minimal impact on users.
- Write incident reports and contribute to post-mortem analysis to drive continuous improvement.
Capacity Planning and Optimization:
- Perform capacity planning to ensure systems can handle peak loads and future growth.
- Optimize resource utilization and performance to reduce costs and improve efficiency.
Qualifications:
Education:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
Experience:
- 3-8 years of experience in site reliability engineering, DevOps, or a related role.
- Proven experience in managing large-scale, high-availability systems.
Skills:
- Proficiency in scripting languages such as Python, Bash, or similar.
- Strong knowledge of Linux/Unix systems and networking.
- Experience with cloud platforms such as AWS, Azure, or Google Cloud.
- Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes.
- Experience with CI/CD pipelines and tools like Jenkins, GitLab CI, or similar.
- Strong problem-solving skills and attention to detail.
- Excellent communication and collaboration skills.
Preferred Qualifications:
- Experience with configuration management tools like Ansible, Puppet, or Chef.
- Knowledge of database systems and caching technologies.
- Familiarity with observability tools like Prometheus, Grafana, ELK stack, or similar.
- Understanding of security best practices and compliance requirements.