SEATTLE — Imagine being able to focus on a single speaker’s voice in a noisy crowd just by looking at them once. That’s the promise of a new intelligent headphone system developed by researchers at the University of Washington. By combining cutting-edge artificial intelligence with binaural headphones, this system enables users to zero in on a target speaker while filtering out all other voices and background noise.

The key to the system is what the researchers call “Look Once to Hear.” The user simply looks at the person they want to listen to for a few seconds. During this brief enrollment phase, the headphones capture a short, noisy binaural recording of the target speaker. From this recording, the system learns the unique speech characteristics of the target, even in the presence of interfering speakers and ambient noise. Armed with this “speaker embedding,” the device can then extract and amplify the target speaker’s voice in real time, allowing the user to keep listening to that person even if the user looks away or the speaker moves.

This technology, presented at the ACM CHI Conference on Human Factors in Computing Systems, could have wide-ranging applications. Imagine being able to hear your friend clearly in a bustling cafe, follow a tour guide’s narration in a crowded museum, or listen to a colleague during a walk down a noisy street – all without straining to pick out their voice from the cacophony. For people with hearing impairments, this technology could be game-changing, making it easier to participate in conversations and navigate noisy environments.

“We tend to think of AI now as web-based chatbots that answer questions,” says senior author Shyam Gollakota, a UW professor in the Paul G. Allen School of Computer Science & Engineering, in a media release. “But in this project, we develop AI to modify the auditory perception of anyone wearing headphones, given their preferences. With our devices you can now hear a single speaker clearly even if you are in a noisy environment with lots of other people talking.”

Methodology: From Noisy Examples to Clear Speech

To achieve this feat, the researchers tackled two main challenges. First, they needed a way to enroll the target speaker using a noisy, binaural recording rather than a clean audio sample. Second, they needed to extract the target speech in real time on an embedded device with minimal latency.

For enrollment, the team developed two approaches. The first uses a beamforming network to estimate the clean speech of the target speaker from the noisy input, exploiting the fact that the target speaker is directly in front of the user during enrollment. The second approach uses knowledge distillation, training a network to directly predict the target speaker embedding from the noisy input.
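Why does it help that the target is directly in front of the listener? A rough intuition, sketched below, is that sound from straight ahead reaches both ears at the same instant, while sound from the side reaches one ear slightly later. The snippet is not the researchers' beamforming network; it is a toy stand-in that simply averages the left and right channels. The sample rate, tone frequencies, and 0.5 ms delay are all assumptions chosen for illustration, and the interferer frequency is deliberately picked so that the cancellation is complete. At other frequencies the attenuation would only be partial.

```python
import math

SAMPLE_RATE = 16_000  # Hz, assumed for illustration


def tone(freq_hz, n_samples, delay_samples=0):
    """Sine tone, optionally delayed by a whole number of samples."""
    return [
        math.sin(2 * math.pi * freq_hz * (n - delay_samples) / SAMPLE_RATE)
        for n in range(n_samples)
    ]


def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def average_channels(left, right):
    """Toy 'beamformer': average the two ear signals sample by sample."""
    return [(l + r) / 2 for l, r in zip(left, right)]


n = 1600  # 100 ms of audio at 16 kHz

# Target straight ahead: identical at both ears (no interaural delay).
target_left = tone(500, n)
target_right = tone(500, n)

# Interferer off to the side: reaches the right ear 8 samples (0.5 ms) late.
# For a 1 kHz tone that is half a period, so the two ears are out of phase.
interferer_left = tone(1000, n)
interferer_right = tone(1000, n, delay_samples=8)

target_rms = rms(average_channels(target_left, target_right))
interferer_rms = rms(average_channels(interferer_left, interferer_right))

print(f"target level after averaging:     {target_rms:.4f}")
print(f"interferer level after averaging: {interferer_rms:.4f}")
```

In this contrived case the frontal tone passes through unchanged while the side tone cancels. A learned beamforming network has a much harder job with real speech and reverberation, but it exploits the same spatial cue.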

For real-time speech extraction, the researchers started with a state-of-the-art speech separation network called TFGridNet. However, this network was too computationally intensive for real-time use on an embedded device. Through a series of optimizations, including caching intermediate outputs and converting the model to ONNX (Open Neural Network Exchange), a portable format that supports efficient inference, they were able to achieve real-time performance with a latency of just 18.24 ms.
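The idea of caching intermediate outputs can be illustrated with a much simpler system than a neural network. In the sketch below, a moving-average filter stands in for the model. Audio arrives in small chunks, and instead of reprocessing earlier audio each time, the filter keeps the last few samples as cached state and carries them into the next chunk. The class and function names, the chunk size, and the window length are all hypothetical and do not come from the researchers' code.

```python
from collections import deque


class StreamingMovingAverage:
    """Causal moving average that carries cached samples across chunks."""

    def __init__(self, window):
        self.window = window
        # Cache of the previous (window - 1) samples, initially silence.
        self.history = deque([0.0] * (window - 1), maxlen=window - 1)

    def process_chunk(self, chunk):
        out = []
        for sample in chunk:
            # Combine cached past samples with the new one.
            out.append((sum(self.history) + sample) / self.window)
            self.history.append(sample)  # oldest cached sample drops out
        return out


def offline_moving_average(signal, window):
    """Reference result computed over the whole signal at once."""
    padded = [0.0] * (window - 1) + list(signal)
    return [sum(padded[i:i + window]) / window for i in range(len(signal))]


signal = [float(i % 7) for i in range(40)]
chunk_size = 8  # samples per chunk, assumed for illustration
window = 5

stream = StreamingMovingAverage(window)
streamed = []
for start in range(0, len(signal), chunk_size):
    streamed.extend(stream.process_chunk(signal[start:start + chunk_size]))

offline = offline_moving_average(signal, window)
matches = all(abs(a - b) < 1e-9 for a, b in zip(streamed, offline))
print("chunked output matches offline output:", matches)
```

The chunked result is identical to processing the whole signal at once, yet each chunk only touches its own samples plus a small cache. The same principle, applied to the internal activations of a large network, is what lets a model keep up with live audio.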

To ensure the system could handle real-world conditions, the training data included variations in speech characteristics, room acoustics, and background noise. The researchers also fine-tuned the model to handle moving speakers and errors in the listener’s head orientation during enrollment.

Results: Robust Performance in the Wild

The researchers evaluated their system in a variety of real-world scenarios, including indoor and outdoor environments with previously unseen speakers. In a user study, participants rated the system’s ability to suppress background noise and interfering speakers, as well as the overall listening experience. The results were promising, with the knowledge distillation enrollment method outperforming the beamforming approach.

Importantly, the system was able to handle scenarios where the listener or speaker was moving, thanks to the fine-tuning with simulated motion. This robustness to movement is crucial for real-world usability.

The researchers also conducted a user study to evaluate different interfaces for the enrollment process. Participants preferred a physical button on the headphones for its clear haptic feedback and found a five-second enrollment duration to be acceptable.

Limitations and Future Directions

While groundbreaking, the system has some limitations. Currently, it is designed to focus on a single speaker at a time. Future work could extend this to support multiple target speakers by enrolling each one separately. The system also assumes that there are no other strong interfering speakers directly in line with the target during enrollment, though training could potentially mitigate this.

The speech characteristics of the target speaker are assumed to remain relatively constant between enrollment and playback. Significant changes due to factors like emotion or health could impact performance. However, the short time between enrollment and use mitigates this issue.

Looking forward, the researchers envision numerous applications for this technology. Beyond the scenarios mentioned above, it could be valuable in educational settings, allowing students to hear the teacher clearly. It could also aid professionals who need to communicate in noisy environments, such as factory floors or construction sites.

As hearable devices become increasingly sophisticated, technologies like this will open up new possibilities for augmenting human hearing. By giving users the ability to focus on what matters to them, these intelligent systems can help people navigate an increasingly noisy world. The “Look Once to Hear” system is an exciting step in this direction, showing the potential for AI to enhance our sensory experiences in powerful and practical ways.

StudyFinds Editor-in-Chief Steve Fink contributed to this report.
