As a Senior Performance and Capacity Engineer, you will work closely with the Site Reliability Engineering team to provide accurate and insightful capacity projections for the senior management team. This role is critical for maintaining the server and network resources needed to serve customers across the Edge Delivery and Edge Compute platforms. You will lead the effort to deliver accurate and timely capacity checks for new and growing customer deployments. You will also troubleshoot performance problems to identify and remove live bottlenecks in the delivery environment.
This role will report to: VP Site Reliability Engineering
Essential Duties and Responsibilities
- Handle complex enterprise issues, which often cross system, network, and software boundaries.
- Design, develop and maintain internal service metrics (SLA, SLO, SLI) in cross-team collaborations.
- Design, develop and maintain dashboards, tooling, alarms, and playbooks in collaboration with operations teams to support service-level objectives.
- Design, develop and maintain reusable monitoring and canary infrastructure.
- Design, execute and evaluate performance experiments.
- Collaborate with operations and engineering teams in determining the root cause of major incidents, performance anomalies, or other customer-impacting issues.
- Discover and analyze system performance-related bottlenecks.
- Discover and analyze anomalies and system issues, with the goal of identifying and mitigating their root causes.
- Write ETL jobs to extract performance-related KPIs and present those KPIs in a systematic manner.
- Perform capacity planning using regression-based machine learning models and other statistical methods where applicable.
- Automate routine, repeatable tasks.
- Model traffic growth and make server purchasing recommendations.
- Develop enterprise client traffic flow modeling, distribution, and capacity checks.
- Direct and participate in automating performance and capacity checks and in identifying the need for capacity augmentation.
Desired Skills and Experience
- High-level knowledge of Linux and operating systems.
- Strong WAN networking knowledge.
- Proficiency in scripting languages (Bash, Python, PHP, Perl).
- Experience with Prometheus.
- Experience with Grafana, Docker, GCP, Telegraf, and Tableau.
- Database knowledge (MySQL, PostgreSQL, TimescaleDB, and others).
- High-level understanding of statistics.
- High-level understanding of machine learning.
- Experience with traffic analysis tools (Catchpoint, Kentik, Cedexis...)
- Experience with CI/CD/CM tools (Jenkins, Ansible, Puppet, Chef...)
- Experience with virtualization (KVM, QEMU...)
What we offer
- Office conveniently located near all major public transportation lines
- Training sessions on the product and tools we use
- Plenty of office events such as happy hours and learning sessions
- Plenty of opportunities as we grow and scale