Site Reliability Engineer

5-7 Years

This job is no longer accepting applications

Job Description

Who We Are

Finance leaders choose Billtrust to get paid faster, control costs, and maximize customer satisfaction. As the leader in B2B accounts receivable workflow and payment software, we provide the world's leading brands with AI-powered solutions across the full AR lifecycle—from invoice presentment and payment processing to cash application and collections. With over 2,600 global customers, more than $1 trillion in invoice dollars processed, and a proprietary network of 13 million buyers, Billtrust delivers business value through deep industry expertise and a culture relentlessly focused on meaningful customer outcomes.

We're an AI-first company, not just in what we build for our customers, but in how we work. Across every function, our teams use AI tools daily to work faster, make better decisions, and deliver higher-quality outcomes. We hire exceptional people, give them cutting-edge AI capabilities, and measure success by the impact they create. If you want to do the best work of your career at the frontier of AI and fintech, Billtrust is the place to do it.

Our Values

Customers

We relentlessly increase value for our customers and do the right thing for them.

Action

We make thoughtfully fast decisions, act quickly, cut through red tape, deliver progress not perfection, and take ownership and accountability.

Team Spirit

We put the team ahead of ourselves, foster trust and respect, collaborate with passion, despise toxic politics, value our differences, and celebrate together.

Innovation

We challenge the status quo, experiment thoughtfully, and are novel and brilliant in what we create.

Excellence

We love to win, but we hate losing even more. We aspire to be the best and take pride in our work. When we fall short, we own it and come back stronger.

Site Reliability Engineer

As a Site Reliability Engineer within our Operations Engineering Center, you'll ensure the reliability, scalability, and performance of Billtrust's infrastructure that powers mission-critical order-to-cash operations. You'll participate in our follow-the-sun SRE coverage across time zones. You'll respond to incidents, implement monitoring and alerting strategies, and engineer autonomous incident response systems through agentic runbooks and intelligent triage. Your work will directly impact billions of dollars in transactions processed through our platform while pioneering AI-driven operational excellence.

Key Responsibilities

Respond to incidents, perform root cause analysis, and lead post-mortem discussions
Implement and maintain comprehensive monitoring, alerting, and observability across infrastructure
Establish and maintain SLO frameworks, tracking and improving reliability metrics
Engineer autonomous alert triage agents and agentic runbooks for incident response
Design and build intelligent incident correlation engines using AI/ML techniques
Develop and maintain infrastructure automation, CI/CD pipelines, and deployment procedures
Manage Kubernetes clusters, container orchestration, and cloud platform resources (AWS)
Lead toil reduction initiatives through automation, focusing on high-impact pain points
Collaborate with platform and product teams on infrastructure requirements and capacity planning

Required Qualifications

Experience & Technical Background

5+ years of hands-on experience in Site Reliability Engineering or infrastructure operations
Strong proficiency with Linux/Unix systems administration and shell scripting
Experience with cloud platforms (AWS preferred, Azure or GCP acceptable)
Hands-on Kubernetes and container orchestration experience
Demonstrated expertise in incident response, troubleshooting, and post-mortem analysis
Strong background with monitoring and alerting tools (Datadog, Prometheus, Grafana, PagerDuty)
Experience with infrastructure automation and infrastructure-as-code tools (Terraform)
Proficiency with at least one programming/scripting language (Python, Go, Bash preferred)
Proficiency with Claude Code, GitHub Copilot, or similar AI coding assistants

Soft Skills & Attributes

Excellent communication skills, particularly during high-stress incident situations
Problem-solving mindset with a focus on automated solutions over manual workarounds
Reliability-first mentality with attention to detail and systems thinking
Ability to thrive in a distributed, follow-the-sun team environment
Comfort with on-call responsibilities and 24x7 operational commitment

About Company

Billtrust

Job ID: 148884871
