Toronto-based Jatheon Technologies Inc, a provider of premium technological solutions for the long-term archival and monitoring of enterprise information, is in need of an experienced
Senior Site Reliability Engineer
Serbia, RS
Remote
Full-Time
Your role will be to ensure that our email archiving cloud service, Jatheon Cloud, delivers the reliability and uptime our users need and improves at a fast rate, while keeping capacity and performance impeccable. You will use your engineering expertise to run better production systems, optimize existing ones, build and improve infrastructure, make deployments easier, and eliminate manual work through automation.
You will implement practices that limit the time spent on operational work, reduce AWS costs, and proactively identify potential outages. These practices are key to product quality and will keep your day-to-day work interesting and dynamic.
Responsibilities
- Monitor site stability and performance by measuring and monitoring availability, latency, and overall system health
- Create solutions to improve performance, scalability, and reliability
- Mitigate issues on production systems and build automated solutions to prevent them from recurring
- Plan capacity, reliability, and security while optimizing costs
- Define our SLA (Service Level Agreement) and make sure we meet it
- Reduce the number of incidents
- Reduce AWS cost by investigating the optimal usage of available resources and suggesting improvements
- Monitor and observe the platform, anticipate potential problems, and apply preventive measures
- Automate common, recurring tasks using scripting languages
- Secure the infrastructure and product with industry-standard best practices
- Practice sustainable incident response and blameless postmortems
- Conduct post-outage investigations and create reports
- Suggest and provide procedures for various types of incidents so that issues are resolved rapidly
- Occasionally organize "fire alarm drills" to check the Support Team’s response time
Must Haves
- 5+ years of hands-on systems administration experience on Linux platforms
- Experience using AWS in a production environment (alternatively Azure, GCE)
- Experience building and orchestrating containerized services (Kubernetes and Docker)
- Knowledge of Shell scripting
- Experience with automation tools like Ansible or Chef
- Ability to systematically test and improve production based on tests
- Ability to minimize downtime and downtime risk by implementing a set of tools/procedures
- Ability to collect, correlate and report based on metrics in a centralized logging platform (ELK, Splunk)
- Strong communication skills with an ability to relay incident details concisely and accurately
- Highly motivated, quality-conscious self-starter who requires little to no supervision
- Motivation to learn new technology and the ability to adapt
- Excellent verbal and written communication skills in English
Bonus points
- Working knowledge of Elasticsearch
- Knowledge of programming languages such as Python or Go
What we offer
- The opportunity to work anytime ‒ flexible working hours
- The opportunity to work anywhere ‒ remote position
- A computer and other equipment
- Great working atmosphere with regular team building activities
- A chance to be part of a casual but highly professional international team
- The opportunity to learn and share knowledge with experienced colleagues
- Conferences and events
- Competitive compensation depending on experience and skills
- Exposure to an outstanding set of new technologies
Deadline for applications: 27.02.2019.