Chainstack is the leading suite of services connecting developers with Web3 infrastructure, powering applications in DeFi, NFT, gaming, analytics, and everything in between.
From startups to large enterprises, Chainstack enables thousands of companies to cut down the time to market, costs, and risks associated with creating and scaling decentralized applications. By offering fast, reliable, and easy-to-use infrastructure solutions distributed globally, we make sure innovators can focus on what’s important.
We are looking for an enthusiastic Cloud Operations Lead with a passion for reliability to lead the Cloud Operations team responsible for keeping all user-facing services and other Chainstack production systems running smoothly.
- Providing leadership and technical guidance to the Cloud Operations team of 6-10 people in multiple time zones across APAC, EU, and LATAM regions
- Owning the reliability of Chainstack production services within the scope of the Cloud Operations team
- Managing day-to-day operational tasks, such as maintenance, troubleshooting, automation, and improvement projects
- Driving reliability initiatives around Chainstack production and representing these activities outside the Cloud Operations team
- Collaborating effectively and cross-functionally to drive production issues to resolution at all levels
- Identifying automation points and driving efficiency improvement
- Identifying changes needed from a reliability perspective, using a data-driven approach
- Identifying parts of the system that do not scale, providing immediate workaround measures, and driving long-term resolution
- Generating and implementing process improvements within the Cloud Operations team
- Contributing to the hiring process by conducting technical interviews
- Improving documentation all around, explaining the why, not stopping with the what
- 3 or more years of experience in an SRE, Cloud Operations, or Infrastructure Engineering function supporting large-scale services
- Experience in operating mission-critical services, which includes being responsible for reliability (SLA/SLO) and managing incidents (monitoring, troubleshooting, escalation)
- Strong production experience with Kubernetes, Helm, Terraform, monitoring solutions (Grafana, Prometheus, InfluxDB, etc.), and public cloud providers (AWS, GCP, Azure, etc.)
- Proficient with Linux and the shell
- Able to collaborate effectively across the organization
- Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
- Have the urge to document all the things, so you don't need to learn the same thing twice
- Enthusiasm for providing feedback, teaching others, and learning new techniques
- Professional or personal exposure to Web3 technologies
- Salary in USD
- Stock options
- Bleeding edge tech stack
- Lack of bureaucracy
- Flexible schedule
- Global fast-growing market
- Multinational team