
Teaching AI to detect answering machines might seem like a straightforward task at first. After all, a human listener can usually tell a prerecorded message from a real person on the line with ease. What we often underestimate, however, is how many signals the brain processes in a fraction of a second. When listening to a call, we rely on contextual cues, pauses, and speech patterns to understand whether we are hearing a voicemail greeting or a real conversation. Reproducing this kind of audio understanding is much more challenging for AI.

Here’s how we built MightyCall’s AI-powered answering machine detection (AMD) system, which reaches an industry-leading 97% accuracy in production. Accuracy here means the share of calls in which the system correctly identifies whether a human or a voicemail system answered. We walk through the key stages of this work: how we selected training data, how the model evolved through multiple iterations, the challenges we faced along the way, and the responsible approach we followed throughout the development process.

Why traditional AMD models fail in production

Many traditional template-based answering machine detection systems were designed before AI became widely used. Instead of learning from live call traffic, these systems rely on a set of fixed rules that attempt to predict whether a call was answered by a human or a voicemail system.

Nikolai Kalinin, MightyCall Product Manager, commented: “Typically, such models analyze several predefined signals. These may include the timing of the response after the call is answered, the duration and structure of the first phrase, or certain patterns of speech and silence. Some systems also monitor the presence of DTMF (Dual-Tone Multi-Frequency) signals, which can appear when a phone system interacts with automated menus.”

In some cases, detection logic is also built around libraries of common voicemail prompts.
These prompts are broken down into expected combinations of speech and silence that the system tries to match during the first seconds of the call. While this approach can work reasonably well with standard voicemail greetings, it becomes unreliable in real-world conditions. Custom voicemail prompts recorded by individuals often differ significantly from predefined patterns. The same issue appears with modern AI voice assistants, whose speech structure and timing do not follow the predictable templates that rule-based systems expect.

To overcome these limitations, we needed a detection approach that could learn from real call behavior instead of relying on static rules. This is what led us to build MightyCall’s AI-powered AMD system.

Reaching 97% accuracy in production AI AMD

Developing the AI AMD model was not a single breakthrough but a sequence of iterations based on real production observations.

Step 1: Building the baseline model

The first step focused on establishing a reliable baseline. We started with a model trained on typical voicemail prompts, which allowed the system to reach over 80% accuracy from the very first training cycle.

Step 2: Teaching the model real voicemail behavior

In the second iteration, we collected hundreds of real voicemail recordings from production calls and used them to retrain the model. This additional dataset alone improved detection quality by roughly 10% on the evaluated samples.

Step 3: Solving the hardest cases

The final step focused on edge cases that continued to confuse the system. These included difficult samples with noise, interruptions, distorted speech, short greetings, and other unusual patterns where the voicemail prompt was still understandable but harder to detect. After we correctly labeled this data and retrained the model, the system achieved an industry-leading 97% accuracy.

“The system evolved through rapid iteration.
Early versions were intentionally simple prototypes that allowed the team to observe call patterns and quickly improve the model based on failures,” said Peter Vlasov, CTO at MightyCall.

Reaching that level of accuracy required more than model iteration alone. It also depended on the quality of the training data, the way we prepared it, and the safeguards we put in place around its use.

Training on real call center data (and doing it responsibly)

Real call center traffic involves background noise, dropped audio, and distorted speech. These can interfere with the signal, so voicemail greetings sound fragmented or uneven and become harder for the detection system to recognize. Once we exposed the model to real production traffic, several recurring failure patterns quickly became clear.

The main issues we came across during the early stages of MightyCall's AI AMD development:

Custom voicemails: Greetings recorded by a person carry the typical characteristics of natural speech, which can make the AI AMD think a real person is talking.

IVR (Interactive Voice Response), also known as voice menus: Voice menus also turned out to be an area where the system struggled, because their audio is almost always custom-recorded.

Audio interference: The system tended to treat extraneous sounds as a sign of a live conversation, but such sounds also result from call quality issues or poor recordings.

Complete silence: There is nothing to detect if no sound is produced, and this can happen in both scenarios. A person may pick up (probably by mistake) and say nothing, or a glitch in the voicemail configuration may produce silence instead of the actual message.

Non-standard voicemail greetings: Some voicemail greetings include music, song fragments, or even cartoon audio. These unconventional formats lead to misclassification in both directions, with voicemails detected as a human response or background audio mistaken for a voicemail.

Once we identified these as the main issues causing drops in accuracy, they became the basis for the entire plan to improve MightyCall’s AI AMD.

How we prepared and handled high-quality training data

The next step was to prepare training data that accurately reflected these real-world scenarios while remaining structured enough for effective model training. Here is what we did:

Curated training dataset: Instead of relying on massive datasets, we built a carefully curated collection of roughly 400 voicemail recordings that represented the most challenging real-world detection scenarios. Nikolai commented: “The preparation of the dataset was handled very carefully, because a reliable input ultimately leads to a more reliable output. The quality of training data matters more than its volume.”

Manual verification of recordings: We reviewed incorrectly detected recordings one by one, listening to each to confirm the type of greeting and identify the cause of the model’s confusion.

Removing leading silence and selecting short audio segments: Many voicemail greetings begin with a short silent pause. This leading silence was removed before training. After that, only the first 2–3 seconds of the greeting were used as the training sample. This approach standardized the input, bringing different voicemail formats to a consistent structure regardless of whether they originally began with silence. It also focused the AI AMD on the first seconds of the greeting, which enables voicemail detection within two seconds.

Training the system on real call center data ultimately allowed us to improve AMD accuracy in ways that directly benefit real customers.

How we approach responsible AI and data privacy

Using anonymized audio data

The training process follows a strict data minimization approach. The model was trained on small WAV audio fragments that contained no information about the caller, client, or phone numbers involved.
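
As an illustration of how such short fragments could be produced, the sketch below applies the two preprocessing steps described earlier: removing leading silence, then keeping only the first seconds of the greeting. It is not MightyCall's actual pipeline. The function names, the 20 ms frame size, and the RMS silence threshold are assumptions made for the example, and the input is assumed to be 16-bit mono PCM samples already decoded from a WAV file.

```python
import math
from array import array


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def trim_leading_silence(samples, sample_rate, silence_rms=500.0, frame_ms=20):
    """Drop leading frames whose RMS stays below the silence threshold.

    `silence_rms` and `frame_ms` are illustrative values, not tuned ones.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    for start in range(0, len(samples), frame_len):
        if frame_rms(samples[start:start + frame_len]) >= silence_rms:
            return samples[start:]
    return samples[:0]  # the whole recording is silent


def prepare_fragment(samples, sample_rate, max_seconds=3.0):
    """Trim leading silence, then keep at most `max_seconds` of audio."""
    voiced = trim_leading_silence(samples, sample_rate)
    return voiced[: int(sample_rate * max_seconds)]


if __name__ == "__main__":
    rate = 8000  # assumed telephony sample rate
    silence = array("h", [0] * rate)  # 1 second of silence
    tone = array(
        "h",
        (int(10000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(5 * rate)),
    )  # 5 seconds of a 440 Hz tone standing in for speech
    fragment = prepare_fragment(silence + tone, rate)
    print(len(fragment) / rate)  # 3.0
```

Because the function returns only raw samples, nothing about the call itself (numbers, timestamps, client) travels with the fragment. A recording that is silent from start to finish yields an empty fragment, which a real pipeline would need to handle separately.
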
No additional metadata, such as call time or other call details, was included in the training samples.

No voice profiling or biometric identification: The detection model analyzes acoustic patterns associated with voicemail greetings and answering machine behavior. It does not attempt to identify speakers, generate voiceprints, or perform biometric recognition.

Processing within MightyCall’s internal infrastructure: All model training, evaluation, and data processing take place inside MightyCall’s internal environment, ensuring that the audio data remains within a controlled and secure system.

No external AI training services: Customer call data is not sent to external AI training platforms or third-party model providers.

By combining real call data with careful preprocessing and responsible data handling, we built a training dataset that reflects real production conditions and led to 97% voicemail detection accuracy.

What’s next for the AI AMD model

While the current system already achieves 97% detection accuracy, we continue refining the model using a human-in-the-loop approach, where uncertain calls are reviewed and used for further training. By analyzing these cases, we can identify patterns and gradually introduce threshold-based logic. For instance, if the model predicts “voicemail” but the confidence score drops below a certain level, such as 70%, the system can treat the call as potentially human and route it to an agent instead.

Peter commented: “We still follow the same approach: we listen to the recordings ourselves and closely monitor the model’s confidence scores. When we see calls where the model is less certain, we review those recordings and use them to understand what went wrong and improve the system.”

The long-term target for MightyCall’s AI AMD is “three nines” accuracy, meaning 99.9% correct detection.

Another practical limitation relates to language coverage.
The current system focuses primarily on English, since the base model it builds on was trained on English voicemail prompts. Spanish recordings are occasionally included in the training data, but English remains the primary language for detection. As more multilingual samples are introduced during training, detection accuracy for Spanish and other languages is expected to improve as well.

The rapid adoption of AI technologies is also creating new challenges for answering machine detection. Voicemail systems increasingly use cloned voices, making them almost indistinguishable from a real human response. Addressing such cases will likely require an additional detection layer designed to identify AI-generated voice prompts.

Continuing to adapt the system to these emerging patterns will remain a key part of how we improve detection accuracy going forward.

:::tip
This article is published under HackerNoon's Business Blogging program.
:::
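
The threshold-based logic mentioned under “What’s next for the AI AMD model” can be sketched in a few lines. This is a minimal illustration, assuming the model returns a label and a confidence score between 0 and 1. The `Prediction` type, the function names, and the returned action strings are hypothetical, and the 0.70 cutoff is simply the example value from the text.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str         # "voicemail" or "human"
    confidence: float  # model confidence for `label`, between 0.0 and 1.0


def route_call(prediction, voicemail_threshold=0.70):
    """Decide what to do with a call, preferring an agent when unsure."""
    if prediction.label == "voicemail" and prediction.confidence >= voicemail_threshold:
        return "handle_as_voicemail"
    # Either the model said "human", or it said "voicemail" without
    # enough confidence, so the call is treated as potentially human.
    return "route_to_agent"


def needs_review(prediction, review_threshold=0.70):
    """Flag uncertain calls so a person can listen to them later."""
    return prediction.confidence < review_threshold


if __name__ == "__main__":
    for p in (Prediction("voicemail", 0.95), Prediction("voicemail", 0.65)):
        print(p.label, p.confidence, route_call(p), needs_review(p))
```

The asymmetry is deliberate in this sketch: a low-confidence “voicemail” prediction falls back to an agent, so a real person is less likely to be dropped, at the cost of agents occasionally hearing a greeting.
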

