Join Us

Site Reliability Engineer


Experience: 3-8 years
Location: Remote (permanent work from home)
Employment Type: Full-Time

About Us: We are a forward-thinking technology company committed to delivering high-performance, scalable, and reliable systems. We are seeking an experienced Site Reliability Engineer (SRE) to join our team and ensure the stability and efficiency of our infrastructure and services.

Key Responsibilities:

  • System Reliability and Performance:
    • Design, implement, and maintain highly available and scalable systems.
    • Monitor system performance, identify issues, and proactively resolve them.
    • Conduct root cause analysis for incidents and implement preventive measures.
  • Automation and Efficiency:
    • Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention.
    • Implement infrastructure as code (IaC) practices using tools such as Terraform or Ansible.
  • Collaboration and Support:
    • Work closely with development and operations teams to enhance system reliability and performance.
    • Provide technical support and guidance to other team members on best practices and troubleshooting techniques.
    • Participate in on-call rotations to ensure 24/7 support for critical systems.
  • Monitoring and Incident Management:
    • Set up and maintain monitoring and alerting systems that detect incidents promptly.
    • Manage and respond to incidents, ensuring timely resolution and minimal impact on users.
    • Write incident reports and contribute to post-mortem analysis to drive continuous improvement.
  • Capacity Planning and Optimization:
    • Perform capacity planning to ensure systems can handle peak loads and future growth.
    • Optimize resource utilization and performance to reduce costs and improve efficiency.

Qualifications:

  • Education:
    • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • Experience:
    • 3-8 years of experience in site reliability engineering, DevOps, or a related role.
    • Proven experience managing large-scale, high-availability systems.
  • Skills:
    • Proficiency in a scripting language such as Python or Bash.
    • Strong knowledge of Linux/Unix systems and networking.
    • Experience with cloud platforms such as AWS, Azure, or Google Cloud.
    • Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes.
    • Experience with CI/CD pipelines and tools such as Jenkins or GitLab CI.
    • Strong problem-solving skills and attention to detail.
    • Excellent communication and collaboration skills.

Preferred Qualifications:

  • Experience with configuration management tools like Ansible, Puppet, or Chef.
  • Knowledge of database systems and caching technologies.
  • Familiarity with observability tools such as Prometheus, Grafana, or the ELK stack.
  • Understanding of security best practices and compliance requirements.