DOA (Direction-of-Arrival) for Conference Speakerphone


DOA refers to the direction of arrival; it determines the speaker's location from the voices picked up by the microphones. DOA serves at least two purposes: 1) driving the azimuth indicator lights to enhance interaction; 2) acting as a precursor task for beamforming, supplying the parameters for spatial filtering. DOA is widely used in Bluetooth speakerphones and in the microphones of video conferencing cameras. In this article, we will look further into the principle of DOA and the method of separating speech segments from non-speech segments.

What is Direction-of-Arrival (DOA)?

It is the direction from which a propagating wave arrives at a set of sensors. The sensors are arranged in an array called a sensor array. DOA estimation is closely associated with beamforming, which estimates the signal coming from a given direction. It has critical applications in human-computer interfaces such as video conferencing, speech recognition and speech enhancement.

The principle of DOA estimation is to exploit the spatial information present in the signals picked up by the microphones. The estimated DOA becomes more accurate as the number of microphones grows, but a larger array naturally imposes heavier computation, so the processing load rises rapidly with the number of microphones.

There are two common types of DOA, and they use different methods: TDOA (Time Difference of Arrival) and Capon DOA (Capon Direction-of-Arrival).

1. Time Difference-of-Arrival (TDOA)

The principle of TDOA is simple: it needs few microphones, its computational load is light, and it is easy to implement. The core of TDOA is to calculate the difference in the time the sound takes to reach each microphone, which requires the system's sampling rate to be high enough. This time difference is usually computed with a cross-correlation or generalized cross-correlation algorithm, which may not be suitable for periodic signals.

There are also layout requirements for the microphone array: all the microphones must face the same direction to locate the sound source accurately. If the microphones face different directions, they cannot locate the sound source.
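The generalized cross-correlation mentioned above can be sketched in a few lines of Python. This is a minimal GCC-PHAT illustration, not EMEET's implementation; the function name `gcc_phat` and the simulated 25-sample delay are assumptions for the demo.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref`, in seconds (GCC-PHAT)."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12            # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-center the correlation so negative and positive lags are both visible
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Two microphones hear the same broadband burst, one delayed by 25 samples
fs = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(2048) * np.hanning(2048)
delay = 25
mic1 = src
mic2 = np.concatenate((np.zeros(delay), src))[:2048]
tau = gcc_phat(mic2, mic1, fs)        # tau * fs is close to 25 samples
```

Given the estimated delay tau and the microphone spacing, the arrival angle follows from simple geometry (arcsin of c·tau divided by the spacing), which is why the sampling rate must be high enough to resolve small delays.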

2. Capon Direction-of-Arrival (DOA)

Capon DOA (Capon Direction-of-Arrival) is a beamforming method. It forms a beam by filtering, weighting and superimposing the signals collected by each sensor in the array, scans the entire receiving space, and visualizes the sound pressure distribution on a plane. The method is robust, requires no prior knowledge, and is simple and intuitive to use, so devices of this type are also vividly called acoustic cameras.

It uses a different calculation method than TDOA. Instead of computing the time difference between microphones to find the sound source, Capon DOA steers a beam through a full 360 degrees, searching for audio. It is like sweeping a flashlight around a dark room.
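The "flashlight sweep" can be sketched as scanning a grid of candidate angles and evaluating the Capon spatial spectrum at each one. This is a narrowband textbook sketch with a simulated uniform linear array, not EMEET's implementation; the array geometry, source angle and noise level are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 8                  # number of microphones
d = 0.5                # element spacing, in wavelengths
true_doa = 30.0        # simulated source direction, degrees
snapshots = 400

def steering(theta_deg):
    """Array response of a uniform linear array toward angle theta."""
    m = np.arange(M)
    return np.exp(-2j * np.pi * d * m * np.sin(np.radians(theta_deg)))

# Simulate one narrowband source plus sensor noise
s = rng.standard_normal(snapshots) + 1j * rng.standard_normal(snapshots)
x = np.outer(steering(true_doa), s)
x += 0.1 * (rng.standard_normal((M, snapshots)) +
            1j * rng.standard_normal((M, snapshots)))

# Sample covariance, with light diagonal loading for numerical stability
R = x @ x.conj().T / snapshots
R += 1e-3 * np.trace(R).real / M * np.eye(M)
Rinv = np.linalg.inv(R)

# Sweep the look direction and evaluate the Capon spectrum 1 / (a^H R^-1 a)
grid = np.arange(-90, 91)
p = np.array([1.0 / np.real(steering(t).conj() @ Rinv @ steering(t))
              for t in grid])
est = grid[np.argmax(p)]   # peak of the spectrum lands near the true 30 degrees
```

The peak of the scanned spectrum marks the estimated direction; plotting `p` over `grid` gives exactly the sound-pressure-versus-angle picture that acoustic-camera displays visualize.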

2.1 Capon DOA Enhancement

Capon DOA is already an excellent and accurate sound source localization method, but adding voice activity detection (VAD) further improves the separation of speech from background noise. The device can then locate the current speaker more accurately, pick up the speech and filter out the background noise.

2.2 Voice Activity Detection (VAD)

The purpose of voice activity detection is to accurately detect the starting position of the speech segment of the audio signal to separate the speech segment and the non-speech segment (silence or noise) signal. Since it can filter out incoherent non-speech signals, an efficient and accurate VAD can reduce the computational load of subsequent processing and improve the overall real-time performance.

VAD algorithms can be roughly divided into three categories:

  • Threshold-based VAD: The traditional VAD method can separate speech from non-speech by extracting features in the time domain (short-term energy, short-term zero-crossing rate, etc.) or frequency domain (MFCC, spectral entropy, etc.).
  • VAD as a classifier: The detection of speech can be viewed as a two-classification problem, in which speech and non-speech are classified, and then a classifier is trained by machine learning to detect speech.
  • Model VAD: Using global information based on decoding, a complete acoustic model can discriminate speech from non-speech segments (the granularity of the model can be very coarse).
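The first category above, threshold-based VAD, can be illustrated with a short-term-energy detector. This is a minimal sketch, not EMEET's algorithm; the frame size, hop and threshold ratio are assumptions, and a pure tone stands in for voiced speech.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=200, ratio=0.1):
    """Flag frames whose short-term energy exceeds `ratio` of the peak frame energy."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energy = np.array([np.sum(f ** 2) for f in frames])
    threshold = ratio * energy.max()
    return energy > threshold          # True = speech frame, False = silence/noise

fs = 16000
rng = np.random.default_rng(2)
noise = 0.01 * rng.standard_normal(fs)                  # 1 s of low-level noise
speech = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)   # stand-in for voiced speech
audio = np.concatenate((noise, speech, noise))
flags = energy_vad(audio)   # False at the edges, True in the middle (speech) third
```

Real systems add a zero-crossing-rate check and hangover smoothing so that quiet consonants and short pauses are not chopped out, but the core idea is this frame-by-frame threshold decision.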

EMEET Products with Capon DOA

The EMEET Meeting Capsule is the latest conference camera with a built-in speaker and microphone array; it uses 8 omnidirectional beamforming microphones. The beam is steered in 10-degree steps, so 36 beams are formed to pick up every word within a radius of 18 ft (5.5 m) with high fidelity. Voice activity detection then runs its calculations, and the background noise is removed from the output audio, so the party on the other side hears the speaker clearly with minimal background noise.

Another product is the EMEET OfficeCore M2Max, a conference speakerphone with 4 highly sensitive directional mics that provide up to 5-meter, 360° voice pickup, and a 48 kHz sampling rate that ensures exceptional clarity and realism.

Conclusion

To sum up, people have higher expectations of speakerphone quality when conducting hybrid meetings. DOA is a crucial technology for conference speakerphones, as it locates the sound source and transmits the audio to the other party. Adding voice activity detection also eliminates background noise, ensuring the other party has the best possible experience.
