**One of the most crucial challenges the telephone industry faces is source separation – making speech clear and easy to understand while minimising the bandwidth required.**

This is not a unique problem, however: every industry from marketing to engineering wants to remove unwanted background noise from audio clips. Some methods include adding redundant microphones to subtract noise from recordings, identifying the spatial locations of sounds to distinguish important sources, and deep learning approaches that incorporate video – but in this blog, we will focus on single-channel recordings.

The conundrum of speech de-noising is at the centre of a great deal of academic research thanks to its connection to telephony, and we therefore had a lot of material to learn from. Using machine learning to test out different technologies and methods to separate noise, we investigated a number of algorithms, with promising results.

This blog details our test-and-learn approach, and how – as with many of our projects – AI has been crucial in enabling us to develop affordable and efficient solutions at scale.

### Different approaches

**So, what can we do to remove background noise from audio clips, to make them easier to transcribe and interpret?**

Based on our research, we identified a number of well-established approaches:

- *Wiener filtering* is a signal processing approach which takes the expected statistical properties of a signal and filters noisy signals to conform. This is particularly suitable where the noise spectrum does not overlap with the signal, such as equipment noise or static caused by radio interference.
- *Spectral subtraction* attempts to model the spectrum of the noise, and then subtracts this from the signal. Since this assumes the noise signal can be averaged over time, it works well if the noise is stationary (its spectrum is largely constant over time), such as crowd noise or background hums.
- *Non-negative matrix factorisation* improves on spectral subtraction by taking a sample of the noise signal and training a dictionary of noise spectra. We then match these against the signal and reconstruct the remainder. This was developed to address problems like wind noise, which varies with time and overlaps with the speech signal, but has a relatively small set of characteristic sounds.
- *Computational Auditory Scene Analysis* is a family of techniques which aims to model the characteristics of human hearing. The goal of this research, aside from understanding the biology, is to replicate the human ability to choose between multiple overlapping signals – the “cocktail party problem”. These techniques tend to be complex and contain many stages.
- *Audio fingerprinting* deals with a different but related problem – how to pick out specific known sound signals from a recording, such as recognising a popular music track playing in a video. These techniques can be used to detect copyrighted music in uploaded videos.

### Obtaining data

In order to de-noise audio clips at scale and speed, we needed machine learning algorithms trained against example datasets. The algorithms need training in order to learn how to separate noise, and the resulting models formed the basis of our de-noising software.

Therefore, a good dataset was crucial to assess the project’s effectiveness. Fortunately, the Signal Separation Evaluation Campaign (SiSEC) has published a standard dataset for assessing source separation in the presence of real-world background noise, which was perfect for our needs. This dataset is a collection of 10-second audio clips of speech superimposed on a range of real-world noisy environments.

For each case, the dataset contains a clean speech clip, a clip of just the noise, and a clip combining the two. The mixtures are actually synthetic – the speech and noise are recorded separately and combined, because the separate tracks are needed for assessing algorithms. This does mean that these clips can only model *additive noise*, with no non-linear interactions between the noise and speech. This was a theoretical assumption of all our models, but did mean that our tests would not account for some phenomena, such as *microphone saturation*.
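Because the mixtures are purely additive, clips like these can be synthesised with a few lines of code. A minimal sketch in Python, assuming 1-D NumPy arrays at the same sample rate (`mix_at_snr` is our own illustrative helper, not part of the SiSEC tooling):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Additively mix a noise clip into a speech clip at a target SNR.

    Both inputs are 1-D float arrays at the same sample rate; the noise
    clip must be at least as long as the speech clip.
    """
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power)
    # equals the requested SNR in decibels.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

Keeping the clean speech, the scaled noise, and the mixture separately is exactly what makes it possible to score a separation algorithm afterwards.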

### Implementation

Based on this research, we decided to implement the spectral subtraction and non-negative matrix factorisation algorithms – these enabled us to look at the effectiveness of these approaches in real-world scenarios. Future work could look into more specialised and modern developments on these approaches, as well as optimising the implementations and tuning their hyperparameters.

A key concept in this kind of work is the Short-Time Fourier Transform (STFT) – this applies a windowing function to divide the audio sample into overlapping “buckets”, then applies a Fourier Transform to each to obtain a frequency spectrum. This gave us a representation of the signal in *time-frequency space*.
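A minimal sketch of this step using SciPy's `stft`/`istft` on a synthetic signal (the 32 ms window and 50% overlap are common choices for speech, not values from our original experiments):

```python
import numpy as np
from scipy.signal import stft, istft

# 10 seconds of a synthetic signal at 16 kHz, standing in for an audio clip.
fs = 16000
t = np.arange(10 * fs) / fs
x = np.sin(2 * np.pi * 440 * t)

# A 512-sample (32 ms) Hann window with 50% overlap.
f, times, X = stft(x, fs=fs, window="hann", nperseg=512, noverlap=256)

# X is the time-frequency representation: one column of complex
# coefficients (magnitude and phase) per window position.
print(X.shape)  # (frequency bins, time frames)

# istft inverts the transform; with this window choice the
# reconstruction is near-perfect.
_, x_rec = istft(X, fs=fs, window="hann", nperseg=512, noverlap=256)
```

All of the algorithms below operate on the magnitudes in `X` and re-use the original phase when inverting back to audio.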

### Spectral subtraction

Spectral subtraction takes a sample noise signal and measures an average spectrum for the noise over time. This is then subtracted from the mixed signal, and what remains is the speech part.

Firstly, we took the STFT of the noise sample, and for each frequency took the average magnitude across all time buckets - this gave us a time-independent average noise spectrum. Then, we took the STFT of the mixed signal, and for each time step subtracted the magnitude of the noise spectrum at each frequency from the measured magnitude. If this came up negative, we set the magnitude to 0, since it couldn’t be negative. Finally, we inverted the STFT to recover the remaining signal.

This approach works well for *stationary noise*, but it runs into problems when the measured magnitude drops below the noise estimate – whenever this happened, we had to clamp the magnitude to 0. This loses some of the sound information, and because clamping is a non-linear operation it also produces characteristic “tinny” artefacts in the audio, which can interfere with the intelligibility of the speech. One improvement is *over-subtraction* – rather than simply removing the noise spectrum, we removed a multiple of it. This tended to wipe out frequency bands where noise dominates: we lost some speech information but generated fewer audible artefacts, which could lead to a more intelligible result. The over-subtraction factor can be treated as a hyperparameter, tuned to balance noise reduction against distortion.
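The whole procedure, including the over-subtraction factor, fits in a few lines. A sketch assuming NumPy/SciPy (the function name and default values are ours, not the exact implementation):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(mixed, noise_sample, fs, alpha=1.0, nperseg=512):
    """Magnitude spectral subtraction.

    `alpha` > 1 gives over-subtraction: more noise removed at the cost
    of losing some speech energy.
    """
    # Average magnitude spectrum of the noise, taken across time frames.
    _, _, N = stft(noise_sample, fs=fs, nperseg=nperseg)
    noise_mag = np.abs(N).mean(axis=1, keepdims=True)

    # STFT of the mixed signal: keep the phase, work on the magnitude.
    _, _, X = stft(mixed, fs=fs, nperseg=nperseg)
    mag = np.abs(X)
    phase = np.exp(1j * np.angle(X))

    # Clamp at zero: a magnitude cannot be negative.  This clamping is
    # the source of the "tinny" musical-noise artefacts.
    clean_mag = np.maximum(mag - alpha * noise_mag, 0.0)

    _, y = istft(clean_mag * phase, fs=fs, nperseg=nperseg)
    return y[: len(mixed)]
```

Note that the noisy phase is re-used unchanged – only the magnitudes are modified, which is one reason the method can only approximate the clean speech.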

### Non-negative sparse coding

A more sophisticated approach is to try to model the noise signal by learning a dictionary of spectra, then attempting to encode the noise in the mixed signal using this dictionary. This approach was taken by Schmidt, Larsen & Hsiao in **Wind Noise Reduction using Non-negative Sparse Coding**, and we implemented the algorithm described there.

This approach is appropriate when the noise signal is non-stationary and overlaps the frequency range of speech, but is relatively simple and can be easily modelled – wind noise is a perfect example. However, this is unlikely to work well for overlapping speech (i.e. the cocktail party problem), where the noise cannot be easily modelled, or the model is likely to match the clean signal.

In this approach, the aim was to construct sparse, non-negative matrices W and H such that WH reconstructs X, the time-frequency matrix for the mixed signal’s STFT. In particular, we minimised the mean-square distance between WH and X, subject to the constraint that all entries must be non-negative.

Performing this factorisation gave us W, a time-invariant dictionary of noise spectra, and H, an encoding of the signal using this dictionary. This is a known problem in linear algebra, and Schmidt implements a process described by Eggert in **Sparse coding and NMF**. Eggert’s procedure solves it iteratively, alternating between updates of H and W, and converges in a small number of steps in a process similar to gradient descent.

Once the noise training was complete, we extended the dictionary W with a new set of vectors and repeated the training procedure on the mixed signal, with the constraint that the noise vectors are not updated in this phase. This allowed us to learn a new set of spectra which encode the voice part of the signal, while keeping them separate from the noise part of the encoding.

For our implementation, we found the training time in this process prohibitive – to perform this process for a 10-second clip took several minutes, and this limited the number of training steps possible. This was particularly slow because the summations in the update rule for W required a triply-nested loop, which is computationally expensive.

However, Schmidt’s conclusions show that the results are best when the sparsity parameters are set to 0 – removing the need for sparsity allowed us to come up with a simplified version.

### Non-negative matrix factorisation

In his paper, Eggert also describes another factorisation algorithm which does not take sparsity into account (see Eggert, section II). This more naïve approach is much simpler because we didn’t need a cost term to encourage sparsity, which in turn meant we didn’t need to worry about *normalising* the dictionary entries. This gave us a much simpler update rule, and crucially allowed us to replace the nested loop with simple matrix products.

With this simplification we got much faster training, and could afford to carry out many more training steps to achieve convergence. This approach gave a much cleaner result, with far fewer artefacts.
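The simplified procedure can be sketched with standard multiplicative updates (the Lee & Seung rules for the Euclidean cost, which have the matrix-product form described above). The dictionary sizes, step counts, and the reconstruction step below are illustrative assumptions, not the exact implementation:

```python
import numpy as np

def nmf(X, W, H, steps, fixed_cols=0):
    """Euclidean NMF via multiplicative updates.

    The first `fixed_cols` columns of the dictionary W are frozen, which
    lets us keep a pre-trained noise dictionary unchanged while new
    speech atoms are learned from the mixed signal.
    """
    eps = 1e-9  # guards against division by zero
    for _ in range(steps):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W_new = W * (X @ H.T) / (W @ H @ H.T + eps)
        W_new[:, :fixed_cols] = W[:, :fixed_cols]  # keep noise atoms fixed
        W = W_new
    return W, H

def separate(noise_mag, mixed_mag, k_noise=8, k_speech=8, steps=200, seed=0):
    """Two-phase separation on STFT magnitude matrices (freq x time)."""
    rng = np.random.default_rng(seed)
    F = noise_mag.shape[0]

    # Phase 1: learn a dictionary of noise spectra from the noise sample.
    Wn = rng.random((F, k_noise))
    Hn = rng.random((k_noise, noise_mag.shape[1]))
    Wn, _ = nmf(noise_mag, Wn, Hn, steps)

    # Phase 2: extend with fresh speech atoms and factorise the mixture,
    # updating only the speech part of the dictionary.
    W = np.hstack([Wn, rng.random((F, k_speech))])
    H = rng.random((k_noise + k_speech, mixed_mag.shape[1]))
    W, H = nmf(mixed_mag, W, H, steps, fixed_cols=k_noise)

    # Reconstruct the speech magnitude from its atoms and activations.
    return W[:, k_noise:] @ H[k_noise:, :]
```

Each update is a handful of matrix products, which is what made this version fast enough to run many more training steps than the triply-nested loop allowed.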

This left us with a number of hyperparameters to tune:

- Size of the noise and signal dictionaries
- Number of training steps for the noise and signal phases.

### Silence detection

All of these techniques rely on a pure noise sample, but we have not addressed how to obtain one – in these experiments we simply used the known noise files, which risks overfitting. This would need to be addressed before a practical implementation. There are a couple of options:

One is to make this a manual step, where an operator selects a time period with no speech to use as the basis. Another is to assume that any recording will contain a period of silence before the speech starts, and use this as the sample. This does not need to be long – the research recommends ~100ms, depending on the recording. This carries the risk that the speech starts early (in which case we may remove important segments), or that the early part of the recording is not representative of the noise encountered in the main section.

A more sophisticated approach is to use a separate “speech detector” component, and take the non-speech parts of the recording for modelling. Of course, a speech detector is itself a complex algorithm and the subject of much research! Fortunately, telephony has developed some simple heuristics which can be used for this.
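As an illustration of the kind of heuristic telephony relies on, here is a crude short-time-energy silence detector (the frame length and threshold ratio are illustrative guesses, not tuned values):

```python
import numpy as np

def noise_sample_from_silence(signal, fs, frame_ms=20, threshold_ratio=0.1):
    """Crude energy-based silence detector.

    Frames whose short-time energy falls below a fraction of the clip's
    peak frame energy are treated as speech-free, and concatenated into
    a noise sample for the separation algorithms above.
    """
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    silent = energy < threshold_ratio * energy.max()
    return frames[silent].ravel()
```

A real voice activity detector would add smoothing and hangover logic so that quiet stretches within speech are not misclassified, but this captures the basic idea.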

### Conclusion

We investigated a number of algorithms for practical speech separation, and had some promising results. Spectral subtraction is a good approach for constant background noise, but creates artefacts which might interfere with intelligibility. Non-negative matrix factorisation (NNMF) is a more sophisticated approach which can lead to better results, but adds more parameters which may need tuning. In either case, we will need to investigate how to obtain representative noise samples.

At **CACI Information Intelligence**, the Machine Learning Guild constantly explores ways to solve some of the world’s most challenging technical issues, and this project is a great example of our approach. Finding the best ways to perform speech separation through audio de-noising at scale, while remaining affordable and within suitable timeframes, has powerful applications in real life. Speech de-noising is just one example of how artificial intelligence is enabling solutions that would otherwise be far too expensive, painstaking and time-consuming to create, showing the possibilities of what machine learning can achieve when applied to the right projects and solutions.