ASE-TM (Active Speech Enhancement Transformer-Mamba) model

In this work, we introduce a new paradigm for active sound modification: Active Speech Enhancement (ASE). Active Noise Cancellation (ANC) algorithms focus on suppressing external interference, whereas ASE goes further by actively shaping the speech signal, both attenuating unwanted noise components and amplifying speech-relevant frequencies, to improve intelligibility and perceptual quality. To enable this, we propose a novel Transformer-Mamba-based architecture, along with a task-specific loss function designed to jointly optimize interference suppression and signal enrichment. Our method supports multiple speech processing tasks, including denoising, dereverberation, and declipping.

Model

Active denoising task

Samples from the VoiceBank-DEMAND test set (T60 = 0.25 s and λ2 = ∞). The first audio column, Noisy-speech, contains speech with additive noise and no enhancement applied. The second column, Clean-speech, provides the ground-truth clean speech signal for reference. The third column, ASE-TM, presents the enhanced signal produced by our proposed model. The remaining columns show the results of the ARN, DeepANC, and THF-FxLMS methods, each adapted to the speech enhancement task and applied to the same input signal. The Noisy-speech and Clean-speech signals are given after the primary path.

Noisy-speech Clean-speech ASE-TM ARN DeepANC* THF-FxLMS

Active dereverberation task

Samples from the VoiceBank-DEMAND test set (T60 = 0.25 s and λ2 = ∞), with room impulse responses (RIRs) applied to generate the Reverbed-speech. The first audio column, Reverbed-speech, contains speech with RIRs applied and no enhancement applied. The second column, Clean-speech, provides the ground-truth clean speech signal for reference. The third column, ASE-TM, presents the enhanced signal produced by our proposed model. The remaining columns show the results of the ARN, DeepANC, and THF-FxLMS methods, each adapted to the speech enhancement task and applied to the same input signal. The Reverbed-speech and Clean-speech signals are given before the primary path.

Reverbed-speech Clean-speech ASE-TM ARN DeepANC* THF-FxLMS
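
As a rough illustration of how a reverberant input can be produced from clean speech, the sketch below convolves a signal with an impulse response. It is a minimal example, not the data pipeline used for these samples: the function name convolve, the toy impulse response toy_rir, and the four-sample clean signal are all hypothetical, and real RIRs are far longer than five taps.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two sample lists (plain Python, no dependencies)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# Hypothetical toy impulse response: a direct path followed by two decaying echoes.
toy_rir = [1.0, 0.0, 0.5, 0.0, 0.25]

# Hypothetical stand-in for a clean speech excerpt.
clean = [1.0, 0.5, -0.5, 0.0]

# Each clean sample is smeared over the following samples by the echoes.
reverbed = convolve(clean, toy_rir)
```

The output is longer than the input by len(toy_rir) - 1 samples, which corresponds to the reverberant tail that continues after the dry signal ends.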

Active declipping task

Samples from the VoiceBank-DEMAND test set (T60 = 0.25 s and λ2 = ∞), with a clip value of 0.25 applied to generate the Clipped-speech. The first audio column, Clipped-speech, contains speech with clipping applied and no enhancement applied. The second column, Clean-speech, provides the ground-truth clean speech signal for reference. The third column, ASE-TM, presents the enhanced signal produced by our proposed model. The remaining columns show the results of the ARN, DeepANC, and THF-FxLMS methods, each adapted to the speech enhancement task and applied to the same input signal. The Clipped-speech and Clean-speech signals are given before the primary path.

Clipped-speech Clean-speech ASE-TM ARN DeepANC* THF-FxLMS
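
To make the clip value concrete, the sketch below applies clipping at 0.25 to a test tone. It assumes symmetric hard clipping on a waveform normalized to the range [-1, 1], which this page does not state explicitly; the function name hard_clip and the 440 Hz test tone are hypothetical and stand in for a real speech excerpt.

```python
import math


def hard_clip(signal, clip_value=0.25):
    """Limit every sample to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, s)) for s in signal]


# Hypothetical test tone: 10 ms of a 440 Hz sine at 16 kHz with peak amplitude 1.0.
sample_rate = 16000
clean = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(160)]

# Samples whose magnitude exceeds 0.25 are flattened; quieter samples pass through unchanged.
clipped = hard_clip(clean, clip_value=0.25)
```

Under this assumption, every part of the waveform louder than a quarter of full scale is flattened, and a declipping model has to reconstruct those missing peaks.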

* The DeepANC model is not open-source; therefore, we re-implemented it based on the available descriptions. This may explain the observed reduction in performance. Additionally, all baseline models were adapted to the active enhancement setting, which could also contribute to their suboptimal performance.