The vendor pitch is consistent: "Replace 80% of your customer service team with AI agents." After deploying AI customer service for half a dozen e-commerce brands, we've found the real number is closer to 30-60%, and that's only after careful setup, with significant human oversight, and only for certain types of inquiry. Here's where AI agents actually pay back, where they fall flat, and how to deploy them sensibly.
Past the hype
AI customer service tools (Gorgias AI, Intercom Fin, Zendesk AI, custom GPT deployments) have improved dramatically in 2025-2026. Modern agents can:
- Look up specific order details from Shopify
- Check inventory and answer "is it back in stock?"
- Process refunds for orders under a defined threshold
- Handle returns including label generation
- Answer product questions from documentation
- Triage and route complex issues to humans
What they're still bad at: nuanced emotional situations, complex multi-issue tickets, anything requiring policy judgment, and any case where the customer is already frustrated.
Where AI agents win
The clearest wins are inquiry types that are high-volume, repetitive, and answerable with a data lookup:
1. WISMO (Where Is My Order)
Typically 30-50% of customer service volume in e-commerce. The customer wants tracking information. AI looks up the order in Shopify, retrieves tracking from the carrier, responds in seconds. Resolution rate: 92-98% without human handoff.
2. Sizing and product questions
"What size am I if I'm 5'10 and 75kg?" Trained on your size guide, AI handles this consistently. Resolution rate: 70-85%, depending on product complexity.
3. Returns and exchanges
"How do I return this?" AI confirms eligibility (order date, condition policy), generates the return label, sends the instructions. Resolution rate: 80-90% for standard returns.
4. Stock and availability
"Is X back in stock?" AI checks inventory, gives an honest answer, optionally signs the customer up for restock notification. Resolution rate: 90%+.
5. Order modifications (limited)
"Can I change my shipping address?" AI checks if order has shipped, modifies if possible, escalates if not. Resolution rate: 70-80%.
For our chatbot deployment on Sleepy UK, these five categories accounted for 76% of total ticket volume — and AI resolved 73% of that without human handoff. Net deflection: roughly 56% of total volume.
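The net deflection figure is just the product of the two percentages above. A minimal sketch of that arithmetic, assuming both figures are expressed as fractions of ticket volume (the function name is ours, for illustration only):

```python
def net_deflection(category_share: float, ai_resolution_rate: float) -> float:
    """Fraction of ALL tickets resolved by AI without a human handoff.

    category_share: fraction of total volume in AI-suitable categories.
    ai_resolution_rate: fraction of those tickets AI resolves on its own.
    """
    for value in (category_share, ai_resolution_rate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must be fractions between 0 and 1")
    return category_share * ai_resolution_rate


# Rounded figures from the deployment described above.
deflection = net_deflection(category_share=0.76, ai_resolution_rate=0.73)
print(f"Net deflection: {deflection:.1%}")  # prints "Net deflection: 55.5%"
```

With the rounded inputs the product comes to about 55.5%, consistent with the rounded figure quoted above. The useful point is that both levers matter: widening the set of categories AI can handle and raising the resolution rate within them each move the net number.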
Where they fail
Equally important are the inquiry types where AI is currently worse than a human:
1. Damaged or defective products
The customer wants empathy, ownership, and a clear resolution path. AI can technically process the replacement, but the customer experience suffers. Negative reviews of AI handling damaged-product complaints are common.
2. Multi-issue tickets
"I ordered three items, two arrived but one's the wrong size, and also I never got my newsletter discount." AI struggles to track multiple sub-issues simultaneously. Resolution rate drops below 30%.
3. Complaints about policy
"Why did you charge me a restocking fee?" AI either capitulates (and the customer learns to escalate) or rigidly defends policy (and the customer hates you). Humans can navigate this grey zone better.
4. VIP and high-value customers
The customer who's spent £3,000 with you deserves a human. AI doesn't (yet) reliably read the room well enough to know when to escalate immediately.
5. Anything requiring judgment
"My package is at the wrong address — should we wait for delivery or send a replacement?" Requires risk assessment, not policy lookup. AI guesses, badly.
If the optimal response requires you to think for more than 30 seconds about the right answer, AI is going to handle it badly. Route to human.
The hybrid model that works
The deployment pattern we've settled on after multiple builds:
- AI first contact — every inbound ticket goes to AI initially
- AI handles or escalates — within 1-2 exchanges, AI either resolves or escalates with a summary
- Smart escalation triggers — specific keywords ("damaged", "broken", "manager"), customer LTV > £500, or detected frustration trigger immediate human routing
- Human handles emotional and complex — humans focus on the 30-40% of tickets that need actual judgment
- AI summarises for human — when escalating, AI gives the human a summary so they don't have to re-read the conversation
This pattern preserves the deflection benefit (most volume handled by AI) without the customer-experience downsides of pure AI handling.
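The escalation triggers can be sketched as a small routing function. This is an illustration under assumptions, not how any particular tool implements it: the `Ticket` fields, the `frustration_score` (imagined as a 0-1 output from some sentiment model), and the 0.7 cut-off are all hypothetical. The keywords and the £500 LTV threshold are the ones listed above.

```python
from dataclasses import dataclass

ESCALATION_KEYWORDS = ("damaged", "broken", "manager")  # from the list above
LTV_THRESHOLD_GBP = 500.0                               # from the list above
FRUSTRATION_THRESHOLD = 0.7  # assumption: depends on your sentiment model


@dataclass
class Ticket:
    message: str
    customer_ltv_gbp: float
    frustration_score: float  # hypothetical 0-1 score from a sentiment model


def escalation_reasons(ticket: Ticket) -> list[str]:
    """Return every trigger that fired; an empty list means AI keeps the ticket."""
    reasons = []
    text = ticket.message.lower()
    # Plain substring matching: crude, but easy to audit.
    hits = [kw for kw in ESCALATION_KEYWORDS if kw in text]
    if hits:
        reasons.append("keyword: " + ", ".join(hits))
    if ticket.customer_ltv_gbp > LTV_THRESHOLD_GBP:
        reasons.append("high LTV")
    if ticket.frustration_score >= FRUSTRATION_THRESHOLD:
        reasons.append("frustration detected")
    return reasons


def route(ticket: Ticket) -> str:
    """Route to a human as soon as any trigger fires, otherwise to AI."""
    return "human" if escalation_reasons(ticket) else "ai"


print(route(Ticket("Where is my order?", 120.0, 0.1)))       # ai
print(route(Ticket("My lamp arrived damaged", 50.0, 0.2)))   # human
```

Returning the list of reasons, rather than a bare yes/no, gives the human agent the "why" alongside the AI's conversation summary.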
Real cost comparison
For a brand with 1,000 support tickets/month:
| Scenario | Monthly cost | Notes |
|---|---|---|
| 1 full-time CS rep, no AI | ~£2,800 (salary + tools) | Handles ~1,000 tickets/month at typical pace |
| 1 part-time rep + Gorgias AI | ~£1,800 (salary + tool fees) | AI deflects ~50%, rep handles rest |
| Custom AI agent + 1 part-time rep | ~£1,600 (build amortised + LLM costs + salary) | Higher deflection, requires upfront build |
| AI-only (no human backup) | ~£400 (tool fees + LLM costs) | Customer experience suffers, 20%+ refund requests |
For brands with under 200 tickets/month, AI doesn't pay back — the time savings don't justify setup complexity. For brands with 500+ tickets/month, AI is essentially mandatory to scale support cost-effectively.
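To compare the scenarios on a like-for-like basis, it can help to convert the table into cost per ticket and monthly saving against the no-AI baseline. The sketch below uses only the approximate figures from the table; it deliberately does not price in the refund and churn cost of the AI-only scenario, which the table flags but does not quantify.

```python
TICKETS_PER_MONTH = 1_000

# Approximate monthly costs in GBP, taken from the table above.
MONTHLY_COST_GBP = {
    "1 full-time CS rep, no AI": 2800,
    "1 part-time rep + Gorgias AI": 1800,
    "Custom AI agent + 1 part-time rep": 1600,
    "AI-only (no human backup)": 400,
}


def cost_per_ticket(monthly_cost: float, tickets: int) -> float:
    """Blended support cost per ticket for one scenario."""
    if tickets <= 0:
        raise ValueError("tickets must be positive")
    return monthly_cost / tickets


baseline = MONTHLY_COST_GBP["1 full-time CS rep, no AI"]
for scenario, cost in MONTHLY_COST_GBP.items():
    per_ticket = cost_per_ticket(cost, TICKETS_PER_MONTH)
    saving = baseline - cost
    print(f"{scenario}: GBP {per_ticket:.2f}/ticket, saves GBP {saving}/month")
```

On these numbers the hybrid options land at roughly £1.60-£1.80 per ticket against £2.80 for the human-only baseline. The AI-only row looks cheapest only because its customer-experience costs sit outside the table.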
Practical implementation
Off-the-shelf path
Gorgias AI or Intercom Fin. Setup time: 2-4 weeks. Cost: £200-£600/month plus per-resolution fees. Best for brands wanting AI capability without engineering effort.
Custom path
Build your own using OpenAI or Anthropic APIs, integrated with your support stack. Setup time: 6-12 weeks. Cost: £15-40k upfront, then £100-400/month in LLM tokens. Best for brands with unique workflows or wanting to own the data layer.
What we recommend
Start with off-the-shelf. Live with it for 6 months. If the deflection rate plateaus below 50% and you have technical constraints the tool can't handle, then consider custom. Most brands never reach that threshold and shouldn't build custom.
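The recommendation can be written down as a simple decision rule. This is a sketch of our rule of thumb, not a formula: the function and parameter names are hypothetical, "plateau" is simplified to a single observed deflection rate, and the 200-ticket floor comes from the cost section above.

```python
from typing import Optional


def recommend(
    tickets_per_month: int,
    months_on_off_the_shelf: int = 0,
    deflection_rate: Optional[float] = None,  # observed plateau, as a fraction
    unmet_technical_constraints: bool = False,
) -> str:
    """Map the article's rules of thumb to a recommendation string."""
    # Below roughly 200 tickets/month, setup effort outweighs the savings.
    if tickets_per_month < 200:
        return "skip AI for now"
    # Not enough evidence yet: run off-the-shelf for about 6 months first.
    if months_on_off_the_shelf < 6 or deflection_rate is None:
        return "use off-the-shelf"
    # Custom only when BOTH conditions hold.
    if deflection_rate < 0.50 and unmet_technical_constraints:
        return "consider custom"
    return "stay on off-the-shelf"


print(recommend(150))                                   # skip AI for now
print(recommend(800))                                   # use off-the-shelf
print(recommend(800, 6, 0.42, True))                    # consider custom
print(recommend(800, 6, 0.55, True))                    # stay on off-the-shelf
```

Note that a low deflection rate on its own does not justify a custom build; it usually points to setup or content gaps that are cheaper to fix inside the existing tool.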
"AI customer service is a margin lever, not a strategy. Treat it as one input into your support model, not the model itself."