Transcribe Audio

How to Use the “Transcribe Audio” Function in Botassium
If your users are sending voice notes or audio recordings on WhatsApp, the “Transcribe Audio” function in Botassium allows you to automatically convert those audio messages into written text—with language detection built in. This makes it easy to analyze, store, or act on voice input just like typed messages.
This guide explains how to configure transcription and use the result in your automation flow.
What is the “Transcribe Audio” Function?
The Transcribe Audio node is a function that listens to the most recent incoming audio message, detects its spoken language, and converts the content into text. The transcribed text is saved into a named variable, which can then be used in later nodes for decisions, replies, or storage.
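Botassium's internal transcription engine isn't documented here, but conceptually the node behaves like a speech-to-text call with automatic language detection. As a rough mental model only, here is a minimal Python sketch using the open-source Whisper model (an assumption; Botassium may use a different engine entirely):

```python
import whisper  # assumption: openai-whisper as a stand-in engine

# Load a speech-to-text model once; "base" trades accuracy for speed.
model = whisper.load_model("base")

# Transcribe the most recent audio message; the language is auto-detected.
result = model.transcribe("voice_note.ogg")

print(result["language"])  # detected language code, e.g. "en"
print(result["text"])      # the transcribed text, ready to store in a variable
```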
This is ideal for:
Supporting users who prefer to speak rather than type
Capturing voice-based form responses
Logging audio content in a readable format
Combining voice input with AI or condition logic
How to Set It Up
Add the “Transcribe Audio” Node
Place this node immediately after an event that receives audio, such as Attachment Message Received or Voice Note Detected.
Set the Variable Name
Enter the name of the parameter where the transcribed text should be stored.
Examples: transcribed_text, user_voice, voice_input
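To make the "named variable" idea concrete, here is a hypothetical sketch of what the node does with that name. The variable store, the function name, and the file name are all illustrative stand-ins, not Botassium internals:

```python
import whisper  # assumption: openai-whisper as a stand-in engine

# Hypothetical stand-in for Botassium's per-conversation variable store.
flow_vars: dict[str, str] = {}

def transcribe_audio_node(audio_path: str, variable_name: str) -> None:
    """Transcribe the incoming audio and save the text under the configured name."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)  # spoken language is auto-detected
    flow_vars[variable_name] = result["text"]

transcribe_audio_node("voice_note.ogg", "transcribed_text")
# flow_vars["transcribed_text"] now holds the transcription
```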
Continue Your Flow Using the Transcription
Once stored, you can:
Insert the variable into a message: “You said: @transcribed_text”
Pass it to an AI node
Use it in conditions (e.g., detect keywords)
Transcription auto-detects the spoken language, so you don’t need to specify it beforehand.
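For illustration, the sketch below mimics the two most common follow-ups: interpolating the stored variable into a reply using the @variable syntax shown above, and branching on a keyword. The render helper and the branch names are hypothetical; inside message and condition nodes, Botassium performs this substitution for you.

```python
import re

flow_vars = {"transcribed_text": "I have a question about pricing"}

def render(template: str) -> str:
    """Replace @variable placeholders with stored values (assumed syntax)."""
    return re.sub(r"@(\w+)", lambda m: flow_vars.get(m.group(1), m.group(0)), template)

reply = render("You said: @transcribed_text")
# -> "You said: I have a question about pricing"

# Condition-node equivalent: route the flow based on a keyword.
if "pricing" in flow_vars["transcribed_text"].lower():
    next_node = "pricing_branch"   # hypothetical branch name
else:
    next_node = "general_branch"
```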
Example Use Case
You're collecting customer feedback via voice note:
User sends a voice message
Transcribe Audio node runs and stores the result in feedback_text
A follow-up message says: “Thanks for your feedback! You said: @feedback_text”
Optionally, the text is stored or analyzed further
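Tying the steps together, a handler for this flow might look like the sketch below. It reuses the hypothetical transcribe_audio_node and render helpers from the earlier sketches, and the feedback log is a stand-in for whatever storage or analysis you attach:

```python
feedback_log: list[str] = []  # stand-in for your own storage step

def on_voice_note(audio_path: str) -> str:
    """Hypothetical end-to-end handler mirroring the feedback flow above."""
    transcribe_audio_node(audio_path, "feedback_text")   # transcribe and store
    feedback_log.append(flow_vars["feedback_text"])      # optional storage
    return render("Thanks for your feedback! You said: @feedback_text")
```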
The Transcribe Audio function works in all WhatsApp automation flows, and is especially powerful for voice-first experiences, audio surveys, multilingual feedback collection, and accessible interaction design where speaking is faster or easier than typing.