Your AI voice agent is impressive. It handles intent recognition, sentiment analysis, conversational routing, customer lookup against your CRM, scheduling, and escalation — all in real time, all without a human agent. Then the customer says: "Yes, I'd like to pay now."
And your stack hits a wall.
Taking a card payment during a live AI voice call isn't a product problem or a UX problem. It's an infrastructure and compliance problem. This guide explains exactly what the architecture looks like, why the naive approaches don't work, and what a correct integration actually involves.
The Problem: AI Can Do Everything Except Take a Payment
The moment a customer agrees to pay, your voice agent needs to capture a 16-digit card number, a 4-digit expiry, and a 3-digit CVV. That's sensitive cardholder data under PCI DSS. And PCI DSS has a very clear rule: any system that stores, processes, or transmits cardholder data is in scope for full compliance.
Here's what that means in practice for a CCaaS platform:
If card data enters your AI model — even as audio that gets transcribed — your entire voice infrastructure is in PCI scope. That includes your ASR pipeline, your LLM inference layer, your call recording system, your data lake, your transcription storage, your model training pipelines, and every network segment they touch.
PCI DSS Level 1 certification for that kind of footprint costs roughly $500,000 in the first year. Ongoing annual costs run $200,000 or more, plus quarterly vulnerability scans, annual penetration testing, and a Qualified Security Assessor (QSA) who will not be cheap or fast.
That's the compliance burden of getting this wrong. Most CCaaS companies — even well-funded ones — cannot absorb it, nor should they. The goal is to keep cardholder data out of your platform entirely.
The Architecture: Secure Payment Handoff
The correct architecture is a clean handoff: your AI agent orchestrates the conversation, a separate payment layer handles all card data capture, and your platform never sees, stores, or processes cardholder data. Here's the step-by-step flow:
1. Intent recognition — The AI agent identifies payment intent from the customer's speech ("I'd like to pay my bill" / "Can I settle this now?").
2. Amount confirmation — The agent confirms the payment amount with the customer and explains that they'll be prompted to enter their card details via their keypad.
3. Session initiation — Your platform makes an API call to the payment layer: POST /payment-session with the amount, currency, and the end-customer's PSP configuration. The payment layer returns a session token and signals that it's ready to capture.
4. Audio stream split — This is the critical step. The audio stream bifurcates. The payment layer takes control of the DTMF capture channel. The main call audio — the AI agent's voice, the conversation — is either paused or continues on a separate path that is explicitly isolated from the card capture channel.
5. Card data entry — The payment layer plays a secure prompt to the caller ("Please enter your 16-digit card number followed by the hash key"). The caller enters digits via their phone keypad.
6. DTMF capture in isolation — The DTMF tones are captured exclusively by the payment layer. They do not enter your CCaaS platform. They do not reach your AI model. They do not appear in call recordings. The main audio stream receives masked flat tones where the keypad presses would otherwise appear.
7. Tokenisation and authorisation — The payment layer tokenises the card data and sends it to the customer's PSP for authorisation. This entire operation happens within the payment layer's PCI DSS Level 1 certified environment.
8. Result returned — The PSP returns an auth result to the payment layer. The payment layer fires a webhook to your platform: payment_completed with success/failure, a transaction ID, and a masked card reference (e.g., ****4242). No card data.
9. Conversation resumes — Your AI agent picks up: "Your payment of £150 has been processed. Your reference number is TXN-9821. Is there anything else I can help you with?"
The entire card capture happens in a sandboxed environment your platform never touches. Your PCI scope is limited to the API calls between your platform and the payment layer — which is a dramatically smaller, more defensible surface area.
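The nine steps above can be read as a small state machine on the platform side. The sketch below is illustrative only: the state names and every event name except `payment_completed` are assumptions made for this example, not part of any documented API. It shows how a platform might refuse out-of-order events and keep the AI pipeline deaf while the caller is entering card details.

```python
from enum import Enum, auto


class CallState(Enum):
    CONVERSATION = auto()      # AI agent is talking to the caller
    AWAITING_SESSION = auto()  # payment session requested, not yet ready
    CARD_CAPTURE = auto()      # payment layer owns the DTMF channel
    AWAITING_RESULT = auto()   # card submitted, waiting on the PSP
    RESUMED = auto()           # agent has the result and continues the call


# Allowed transitions: (current state, event) -> next state.
# Event names other than "payment_completed" are hypothetical.
TRANSITIONS = {
    (CallState.CONVERSATION, "amount_confirmed"): CallState.AWAITING_SESSION,
    (CallState.AWAITING_SESSION, "session_ready"): CallState.CARD_CAPTURE,
    (CallState.CARD_CAPTURE, "capture_complete"): CallState.AWAITING_RESULT,
    (CallState.AWAITING_RESULT, "payment_completed"): CallState.RESUMED,
}


def next_state(state: CallState, event: str) -> CallState:
    """Return the next state, or raise if the event arrives out of order."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None


def agent_may_listen(state: CallState) -> bool:
    """The AI pipeline must not receive caller audio during card entry."""
    return state is not CallState.CARD_CAPTURE
```

Modelling the flow explicitly makes the isolation rule testable: any code path that feeds audio to the ASR or LLM can check `agent_may_listen` first.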
DTMF vs Speech Recognition for Card Capture
When engineers first think about AI voice agent payments, the obvious question is: "Why can't the AI just listen to the customer read out their card number?" It understands speech. It can transcribe numbers.
It can, technically. But compliance architecture doesn't care what the AI can do — it cares what data flows where.
If a customer reads their card number aloud and your ASR pipeline transcribes it, that audio and that transcript both contain cardholder data. Your entire ASR infrastructure is now in PCI scope. Your call recording system is in scope. Your transcription storage is in scope. Your model training data — if you're using call audio for fine-tuning — is in scope.
DTMF (Dual-Tone Multi-Frequency) keypad entry solves this at the architecture level:
Channel isolation: DTMF tones can be captured on a separate audio path that is entirely managed by the payment layer. The main call audio stream never carries card data.
Tone masking: Standard practice is to replace DTMF tones in the main audio stream with flat replacement tones (sometimes called "beeping"). Call recordings contain no card data — they contain silence or flat tones during the card entry window.
No transcription: There's no speech-to-text step for card data. The payment layer decodes the tones directly. No LLM, no ASR, no transcript.
Caller familiarity: Customers are used to entering card details via keypad. It's the standard IVR flow they've been using for 20 years, so it adds little UX friction.
Speech capture of card numbers is a compliance anti-pattern. DTMF capture is the industry-standard, compliance-correct approach. Any architecture that routes spoken card numbers through your AI pipeline is building a very expensive PCI scope problem.
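Channel isolation and tone masking can be modelled conceptually as a splitter over call events. This is a sketch under stated assumptions: the `AudioEvent` type and its `kind` values are invented for illustration, and in a real deployment the split happens at the telephony edge inside the payment layer, before any audio reaches your platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioEvent:
    kind: str     # "speech" or "dtmf" (hypothetical event kinds)
    payload: str  # a speech fragment, or a single keypad digit


# Placeholder for the flat tone heard on the main stream; it carries no digit.
MASK = AudioEvent(kind="mask", payload="")


def split_stream(events, capture_active: bool):
    """Split events into (main_stream, payment_stream).

    While capture is active, keypad digits go only to the payment stream
    and the main stream receives a mask in their place.
    """
    main, payment = [], []
    for ev in events:
        if capture_active and ev.kind == "dtmf":
            payment.append(ev)
            main.append(MASK)
        else:
            main.append(ev)
    return main, payment
```

The property worth asserting in tests is that, with capture active, no event of kind `dtmf` ever appears in the main stream, which is the stream your recordings and ASR consume.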
PCI Scope: What Changes and What Doesn't
This is worth being precise about, because "PCI compliance" gets hand-waved in a lot of vendor conversations.
Without a payment handoff architecture:
If your agents — human or AI — hear or process card numbers, PCI DSS scope expands to include:
All call recording infrastructure
All transcription services and storage
All ASR pipelines
All AI model inference infrastructure
All data warehouses or lakes that receive call data
All networks connecting these systems
All personnel with access to those systems
That's essentially your entire platform. PCI DSS Level 1 certification for a footprint that size is not a checkbox exercise — it's a multi-year program with dedicated compliance staff.
With a payment handoff architecture:
Your PCI scope shrinks to:
The API connection between your platform and the payment layer (TLS in transit — table stakes)
The payment layer itself (which carries its own PCI DSS Level 1 certification)
You don't handle card data. You don't store it. You don't transmit it. You send a payment session request and receive a success/failure webhook. Your QSA scope is minimal. Your compliance burden is minimal.
The payment layer — Shuttle, in this context — carries the PCI DSS Level 1 certification. That's the certification that covers the card capture, tokenisation, vault, and PSP routing. You inherit the compliance posture without the certification cost.
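Even with the handoff in place, a caller may occasionally read a card number aloud despite the keypad prompt. One defensive measure is to scan anything your platform is about to persist (transcripts, logs, CRM notes) for card-like digit runs. The sketch below is an assumption-laden example rather than a compliance control: it flags runs of 13 to 19 digits that pass the Luhn checksum and replaces them with a placeholder.

```python
import re

# Runs of 13 to 19 digits, allowing a single space or hyphen between digits.
CANDIDATE = re.compile(r"\d(?:[ -]?\d){12,18}")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pans(text: str) -> str:
    """Replace Luhn-valid card-like digit runs with a placeholder."""
    def _sub(match):
        digits = re.sub(r"[ -]", "", match.group())
        return "[REDACTED]" if luhn_valid(digits) else match.group()

    return CANDIDATE.sub(_sub, text)
```

The Luhn gate keeps false positives down for things like order numbers, but it is a heuristic: it will miss numbers spoken with filler words between digits, and it is no substitute for keeping card data out of the stream in the first place.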
Multi-PSP: Why Your Customers' Gateway Matters
Here's a practical problem that most "just add Stripe" thinking ignores: your enterprise CCaaS customers already have PSP relationships.
An insurance company processing 50,000 premium collections a month has a negotiated rate with their acquirer. A utility company has a direct integration with a specific gateway. A debt collection agency is contractually required to process through a particular payment provider. None of them want to move off their existing PSP to use whatever you've embedded.
A correct payment layer needs to be PSP-agnostic. When a payment session is initiated for a given end-customer, the payment layer routes to that customer's configured PSP — not to a single hardcoded gateway.
This is why "add Stripe" doesn't solve the problem for CCaaS operators. Stripe is a single gateway. Your enterprise customers need their own gateway. The payment infrastructure needs to support multi-tenancy at the PSP level: each customer of your platform routes through their own PSP, using their own merchant credentials, with their own settlement.
Shuttle supports 40+ PSPs out of the box. When you initiate a payment session, you pass the end-customer's PSP configuration. The payment layer handles the routing. You never need to build a new PSP integration for a new customer.
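On the platform side, multi-tenancy at the PSP level usually reduces to a per-tenant lookup when the session request is built. The following is a minimal sketch: the top-level field names (`amount`, `currency`, `merchant_id`, `psp_config`) follow the request shape shown later in this guide, while the registry, the tenant IDs, the gateway names, and the inner shape of `psp_config` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PspConfig:
    psp: str           # gateway identifier (placeholder values below)
    merchant_ref: str  # a reference to credentials, never the secret itself


# Hypothetical registry: each end-customer routes through its own PSP.
PSP_BY_TENANT = {
    "acme-insurance": PspConfig(psp="example-gateway-a", merchant_ref="acme-ins-01"),
    "northern-utility": PspConfig(psp="example-gateway-b", merchant_ref="nu-main"),
}


def build_session_request(tenant_id: str, amount_minor: int, currency: str) -> dict:
    """Build the body for POST /payment-session for one tenant."""
    try:
        cfg = PSP_BY_TENANT[tenant_id]
    except KeyError:
        raise LookupError(f"no PSP configured for tenant {tenant_id!r}") from None
    if amount_minor <= 0:
        raise ValueError("amount must be a positive integer in minor units")
    return {
        "amount": amount_minor,
        "currency": currency,
        "merchant_id": tenant_id,
        "psp_config": {"psp": cfg.psp, "merchant_ref": cfg.merchant_ref},
    }
```

Failing loudly on an unknown tenant matters here: silently falling back to a default gateway would settle one customer's payments through another merchant account.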
Build vs Buy
Let's be direct about what building this in-house actually requires:
Build:
DTMF capture with audio stream isolation (non-trivial telephony engineering)
PCI DSS Level 1 certification: ~$500K in year one, $200K+ annually thereafter
Tokenisation vault design, implementation, and auditing
PSP integrations: each one is 2-4 weeks of engineering, plus ongoing maintenance as PSP APIs change
Ongoing quarterly vulnerability scans, annual penetration tests, key rotation schedules
A dedicated compliance function or expensive external QSA relationship
Timeline to first production payment: 12-18 months minimum
Buy (integrate a payment layer):
Single API integration: a few weeks of engineering
PCI compliance carried by the payment layer; your own scope shrinks to the API connection
40+ PSP integrations available on day one
Compliance, auditing, pen testing, key rotation: the payment layer's problem
Timeline to first production payment: weeks
For a CCaaS company under 500 people — and most CCaaS companies are — this calculus is not close. The build path is a multi-year distraction from your core product. The buy path lets you ship a payments feature, close enterprise deals that require payment capabilities, and let your engineering team stay focused on the AI and conversation capabilities that actually differentiate your product.
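The build-side figures above can be turned into a rough floor with simple arithmetic. This sketch uses only the numbers quoted in this section; because the annual figure is stated as "$200K+", the certification total is a lower bound, and it leaves out staffing, vault engineering, and telephony work entirely.

```python
CERT_YEAR_ONE = 500_000  # approximate first-year certification cost, USD
CERT_ANNUAL = 200_000    # lower bound for each subsequent year, USD
WEEKS_PER_PSP = (2, 4)   # engineering weeks per PSP integration (low, high)


def certification_cost(years: int) -> int:
    """Lower-bound certification spend over the given number of years."""
    if years < 1:
        raise ValueError("years must be at least 1")
    return CERT_YEAR_ONE + CERT_ANNUAL * (years - 1)


def psp_engineering_weeks(psp_count: int) -> tuple[int, int]:
    """Low and high estimates of engineering weeks for psp_count integrations."""
    low, high = WEEKS_PER_PSP
    return psp_count * low, psp_count * high
```

For example, `certification_cost(3)` gives 900,000, and `psp_engineering_weeks(40)` gives (80, 160): matching the 40 gateways available on the buy path would take 80 to 160 engineering weeks before any maintenance.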
What the Integration Actually Looks Like
Stripped to its essentials, the integration is one API call, an audio handoff, and a webhook:
```
1. POST /payment-session
   Body: { amount, currency, merchant_id, psp_config }
   Response: { session_id, dtmf_ready: true }

2. [Audio handoff: DTMF capture handled by payment layer]

3. Webhook received: POST /your-webhook-endpoint
   Body: {
     event: "payment_completed",
     session_id: "sess_abc123",
     status: "success",
     transaction_id: "txn_xyz789",
     masked_card: "4242",
     amount: 15000,
     currency: "GBP"
   }

4. [AI agent resumes conversation using status from webhook]
```
No card data flows through your system at any point. The session ID ties the payment to the conversation. The webhook fires within seconds of the PSP authorisation. Your AI agent reads the status and continues the call.
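A webhook handler for this flow can be quite small. The sketch below assumes the payload shape shown above and assumes that `amount` is expressed in minor units (15000 for £150.00), which is an inference from the example rather than a documented guarantee. The function names, the two-decimal assumption, and the failure wording are illustrative.

```python
from decimal import Decimal

CURRENCY_SYMBOLS = {"GBP": "£", "USD": "$", "EUR": "€"}

# Fields this sketch expects on a payment_completed webhook.
REQUIRED_FIELDS = {
    "event", "session_id", "status", "transaction_id",
    "masked_card", "amount", "currency",
}


def format_amount(amount_minor: int, currency: str) -> str:
    """Format an integer amount in minor units, assuming two decimal places."""
    symbol = CURRENCY_SYMBOLS.get(currency, currency + " ")
    return f"{symbol}{Decimal(amount_minor) / 100:.2f}"


def agent_reply(payload: dict) -> str:
    """Turn a payment_completed webhook into the agent's next utterance."""
    missing = REQUIRED_FIELDS - set(payload)
    if missing:
        raise ValueError(f"webhook missing fields: {sorted(missing)}")
    if payload["event"] != "payment_completed":
        raise ValueError(f"unexpected event: {payload['event']!r}")
    if payload["status"] == "success":
        amount = format_amount(payload["amount"], payload["currency"])
        return (
            f"Your payment of {amount} has been processed. "
            f"Your reference number is {payload['transaction_id']}."
        )
    return "That payment did not go through. Would you like to try again?"
```

Note that nothing in the handler ever needs the card number: the masked reference and transaction ID are enough for the agent to confirm the payment and for your CRM to record it. A production handler would also authenticate the webhook before trusting it, using whatever mechanism the payment layer documents.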
The same integration works for human agents via an agent-assist interface. Same API, same DTMF flow, same PCI boundary. You build the integration once and it serves both your AI and human agent channels.
Summary
AI voice agents are fully capable of handling payments, but the architecture has to be right. The LLM must never hear or process card data. Speech recognition of card numbers creates a compliance catastrophe. DTMF capture with a dedicated payment layer keeps card data entirely out of your platform.
The architecture is:
1. AI agent handles conversation and identifies payment intent
2. Platform initiates a payment session via API
3. Payment layer takes control of card capture via DTMF
4. Card data never enters your platform, your recordings, or your AI pipeline
5. Payment layer handles tokenisation and PSP routing
6. Webhook returns success/failure, and the AI agent resumes the call
PCI scope stays with the payment layer. Your engineering team stays focused on your product. Your enterprise customers use their existing PSPs.
FAQ: AI Voice Agent Payments
Can AI agents handle PCI-compliant payments?
Yes. An AI voice agent can take a card payment during a call and stay PCI-compliant, as long as the card data never enters the AI pipeline. The agent runs the conversation, then hands off to a PCI DSS Level 1 certified payment layer that captures the card via DTMF keypad tones in an isolated environment, charges it, and returns only a masked result. The AI model never sees or hears the card number.
Can voice agents process payments and complete transactions?
Yes. A voice agent can take the customer through to a completed, authorised payment in the same call. The agent confirms the amount, the payment layer captures the card via the keypad, the transaction is authorised against your gateway, and the agent confirms success, with no transfer and no callback.
Which voice AI tools support PCI-compliant payments over the phone?
Most voice AI platforms, including Retell, Vapi, Bland, Synthflow, ElevenLabs, PolyAI, and Cognigy, do not process card payments natively. They run the conversation; a dedicated payment layer handles the card. Shuttle adds PCI-compliant in-call payment capture to any of these platforms, taking the card via DTMF or an SMS payment link and routing it to 40+ payment gateways. See the per-platform guides linked below.
Do PCI DSS and PSD3 apply to AI agent payments?
Yes. PCI DSS applies the moment cardholder data is captured, whether a human or an AI agent is on the call. European payment-services rules (PSD2 today, PSD3 once it takes effect), including strong customer authentication (SCA) where it is required, apply to the underlying transaction. Using a PCI-certified payment layer keeps the card data, and therefore most of the compliance burden, off your systems and on the provider.
What PCI rules should I watch out for when letting an AI agent collect payments?
The main risks are card data landing in call recordings or transcripts, DTMF tones reaching your transcription or LLM, and card numbers being written to your CRM or logs. Avoid all three by capturing the card in an isolated PCI environment and suppressing the tones from the audio that reaches your stack. Done correctly, your self-assessment burden can drop from SAQ-D towards SAQ-A, depending on how your assessor classifies your environment.
Can an AI voice agent split a past-due balance into a payment plan during the call?
Yes. For collections and bill-pay, the agent can agree a payment plan, take the first instalment immediately via DTMF, and tokenise the card for scheduled future charges. See AI Voice Payments for Debt Collection for the collections-specific workflow.
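The instalment arithmetic behind a payment plan is worth getting exactly right, because rounding errors on a debt balance are visible to the customer. A minimal sketch, assuming amounts are held as integers in minor units and that any remainder is loaded onto the earliest instalments (a design choice for this example, not a stated rule):

```python
def split_instalments(balance_minor: int, count: int) -> list[int]:
    """Split a balance into `count` instalments that sum exactly to the balance."""
    if balance_minor <= 0 or count <= 0:
        raise ValueError("balance and count must be positive")
    base, remainder = divmod(balance_minor, count)
    # The first `remainder` instalments each carry one extra minor unit.
    return [base + 1 if i < remainder else base for i in range(count)]
```

For a £100.00 balance over three instalments, `split_instalments(10000, 3)` returns `[3334, 3333, 3333]`. The first amount would be taken immediately via DTMF and the rest charged later against the stored token.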
Related Reading
Embedded Payments for CCaaS — the business case for CCaaS operators
The CCaaS Payments Revenue Opportunity — the math on monetising payment volume through your platform
AI Payment Security: How AI Agents Handle Card Data — broader AI agent PCI guide
Voice AI Is Booming — But Can It Take a Payment? — the market context
Agentic Payments in 2026: The Infrastructure Guide — the broader infrastructure landscape for AI agent payments
Chat Agent Payments — the chat channel equivalent
What Is Embedded Payments? — the fundamentals
PCI-Compliant Payments for Contact Centres — the broader contact centre payment guide (human agents, AI agents, evaluation criteria)
Secure Payment Collection for Debt Agencies — AI agent payments applied to debt collection
PCI-Compliant Payment Architecture for Insurance Platforms — zero-scope architecture for insurance voice payments
What Are Voice Payments? — the complete guide to IVR, agent-assisted, and AI voice payments
AI Voice Payments for Hotels & Travel — voice payment architecture for hotel chains and OTAs
How Voice AI Ordering Platforms Handle Payments — PCI challenges specific to restaurant voice AI (SoundHound, ConverseNow, Kea)
The AI Payment Consent Problem Nobody Is Talking About — why approval prompts controlled by agents are a social engineering surface
Agent-Native Checkout: Why AI Commerce Needs New Payment APIs — direct programmatic checkout interfaces for AI agents
Talkdesk Payments — PCI-compliant payment capture for Talkdesk and Autopilot
RingCentral Payments — secure voice payments for RingCX and RingEX
Five9 Payments — PCI-compliant payment processing for Five9
Genesys Cloud Payments — voice and IVR payment capture for Genesys Cloud
NICE CXone Payments — secure payment processing for NICE CXone
Vapi Payments: PCI-compliant payment capture for Vapi voice agents
Bland AI Payments: secure payment capture for Bland AI phone agents
ElevenLabs Payments: native Stripe vs multi-PSP for ElevenLabs agents
Synthflow Payments: native Stripe vs multi-PSP for Synthflow voice agents
Phonely Payments: PCI-compliant payment capture for Phonely AI phone agents
Sierra AI Payments: native Level 1 PCI payments vs a multi-PSP layer
Parloa Payments: native Payment Skill vs multi-PSP capture
Decagon Payments: verified PCI-compliant capture for Decagon AI agents
Kore.ai Payments: PCI-compliant in-call capture for Kore.ai agents
Yellow.ai Payments: in-call DTMF capture vs payment links on VoiceX
Regal.ai Payments: in-call capture for outbound collections calls
Voiceflow Payments: PCI-compliant capture for Voiceflow voice and chat agents
Thoughtly Payments: in-call card capture for Thoughtly AI voice agents
boost.ai Payments: multi-PSP capture for boost.ai virtual agents
Gupshup Payments: native UPI and WhatsApp Pay vs multi-PSP card capture
Ada Payments: PCI-compliant capture for Ada AI voice and chat agents
Lindy Payments: in-call capture for Lindy Gaia phone agents