
Designation: Senior Application Performance Engineer
Preferred Experience: 5+ years
Summary:
We are looking for a Senior Application Performance Engineer - a software engineer first - to build and own our observability foundation, then use it to systematically find and eliminate performance problems across our platform. This role requires someone who writes code, reads application code fluently, and treats performance and observability as engineering problems rather than configuration tasks.
The primary mandate is instrumentation and monitoring: you will design SLOs, build out Datadog observability, and establish the visibility layer that engineering teams need to make data-driven decisions. The secondary focus is diagnostic and corrective - once the signals are in place, you will investigate slow API response times, JVM misconfiguration, and application bottlenecks in our Java/GKE stack, building proof-of-concept fixes and driving improvements through to resolution.
Our current monitoring is shallow - we have basic uptime checks but no meaningful SLOs, and JVM settings across services are not consistently tuned to their container environments. This role builds that foundation from the ground up.
Responsibilities:
Observability & SLO Design (Primary)
● Design and implement SLOs and SLIs across core services - API latency, error rates, availability, and throughput - establishing baselines that make performance regressions visible
● Build Datadog APM instrumentation, distributed tracing, custom metrics, dashboards, and alert configurations that give engineering teams actionable signal rather than noise
● Evolve uptime monitoring beyond basic endpoint checks toward meaningful user journey and service dependency coverage
● Instrument Java services running on GKE with appropriate tracing and metrics, ensuring Datadog reflects real application behavior including JVM internals, GCS operations, and MySQL query performance
Performance Investigation & Optimization (Secondary)
● Use observability data to identify and prioritize performance bottlenecks across Java services, MySQL queries, and GCS interactions
● Profile Java services to investigate CPU, memory, and latency issues - including JVM misconfiguration where heap sizing, Metaspace limits, or GC tuning are misaligned with GKE container limits
● Investigate OOMKill events on batch and long-running workloads; distinguish JVM heap exhaustion from Metaspace or native memory pressure
● Build proof-of-concept fixes; implement smaller optimizations directly and hand off larger-scope changes to product engineering teams with clear documentation
Reliability Partnership
● Partner with engineering teams during incident investigations to root-cause application-level issues using observability tooling
● Contribute to capacity planning discussions backed by real performance data
● Participate in architecture reviews from a performance and observability perspective
● Identify and reduce toil through automation where observability gaps create recurring manual work
Requirements and skills:
● 5+ years as a software engineer with a focus on performance and observability. Candidates from pure DevOps or cloud operations backgrounds without strong application engineering experience are not a fit
● Hands-on Datadog expertise: APM instrumentation from scratch, distributed tracing, custom metrics, monitor and alert design, and dashboard construction - not just reading existing dashboards
● Proven ability to define SLOs and SLIs from scratch across multiple services; understands what makes a meaningful SLI versus a vanity metric
● Strong Java proficiency - you can read, profile, and optimize Java application code, not just configure infrastructure around it
● Deep understanding of JVM internals: heap vs. non-heap memory, GC behavior, container-aware JVM flags (-Xmx, -XX:MaxMetaspaceSize, -XX:+UseContainerSupport), and how different Java versions handle cgroup memory limits in GKE
● Experience instrumenting Java services on Kubernetes; understanding of pod resource limits, OOMKill forensics, and GKE-specific behavior
● Familiarity with MySQL query performance analysis - slow query identification, index usage, and how DB latency surfaces in application traces
● Experience with Google Cloud Storage (GCS) in an application context - understanding how GCS operations contribute to latency and how to instrument them
● Strong written communication in English; able to document findings, SLO definitions, and handoff artifacts that engineering teams can act on independently
Nice to Have
● Experience with Google Cloud Platform broadly (GKE, Cloud SQL, GCS, Cloud Logging)
● Familiarity with Kafka or other async messaging in a reliability context
● Experience with Firebase in a performance or observability context
Company Overview:
Aventior is a leading provider of innovative technology solutions for businesses across a wide range of industries. At Aventior, we leverage cutting-edge technologies like AI, ML Ops, DevOps, and many more to help our clients solve complex business problems and drive growth.
We also provide a full range of data development and management services, including Cloud Data Architecture, Universal Data Models, Data transformation and ETL, Data Lakes, User Management, Analytics and visualization, and automated data capture (for scanned documents and unstructured/semi-structured data sources). Our team of experienced professionals combines deep industry knowledge with expertise in the latest technologies to deliver customized solutions that meet the unique needs of each of our clients. Whether you are looking to streamline your operations, enhance your customer experience, or improve your decision-making process, Aventior has the skills and resources to help you achieve your goals.
We bring a well-rounded, cross-industry, and multi-client perspective to our client engagements. Our strategy is grounded in design, implementation, innovation, migration, and support. We have a global delivery model, a multi-country presence, and a team well-equipped with professionals and experts in the field.
Job ID: 149882751