Introduction:
We encounter text and image data every day, but have you ever thought about what audio data looks like? Is it just a simple audio file like .mp3 or .wav, and how can we use it for audio analysis tasks such as audio classification, speech recognition, environmental sound classification, or voice assistants like Alexa? This series of blogs will give you an insight into all of that.
What Is Edge AI?
Edge AI is on-device AI for mobile, web, and embedded applications. You can perform common vision, audio, and language tasks using a built-in pipeline, or build a custom one based on your needs.
Audio data format:
We know that machine learning models only understand numbers: whatever the data is, it must be represented as real numbers before it can be fed into an ML or DL model. Whether your data is text, images, or audio, it is first converted into a numeric format the model can take as input. The catch is that text, image, and audio data are represented differently, and each representation has its own advantages and disadvantages. If that sounds confusing, let me walk through an example.
1. Audio data representation as real numbers:
Like text and images, audio is unstructured data, meaning it is not arranged in tables with connected rows and columns. Audio is usually recorded in an analog format (a physical tape, disk, etc.) and then converted into a digital format. Why? Digital audio is easy to share, edit, and manipulate without hassle. How do we manipulate it? You should definitely look into the Fourier series and the Fourier transform. Coming back to the point, diagram 1 below gives a short overview of what analog and digital data look like.
(Note: in the diagram above, the red curve is the analog data, and the blue vertical bars over it are the digital data obtained by sampling the analog curve at regular intervals.) Diagram 2 below compares analog and digital audio. The digital sound wave may look slightly different from the analog one, but it is a very close approximation that preserves most of the information in the audio, and we can manipulate it easily. How? The digital audio is ultimately a sequence of numbers (stored as bits, zeros and ones), which is exactly the representation we can work with.
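To make that concrete, here is a minimal sketch that loads an audio file and prints the raw amplitude values the digital representation is made of. It assumes the librosa library and a placeholder file path, neither of which comes from the original post.

```python
import librosa

# "my_audio.wav" is a placeholder path; substitute any .wav/.mp3 file you have.
# sr=None keeps the file's original sampling rate instead of resampling.
y, sr = librosa.load("my_audio.wav", sr=None)

print(f"Sampling rate: {sr} Hz")
print(f"Number of samples: {len(y)}")
print(f"First 10 amplitude values: {y[:10]}")
```

Each entry in `y` is one floating-point amplitude value; on disk these values are stored as bits, which is where the "zeros and ones" come from.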
Now, coming to the main part: audio data can be represented in many different ways, such as the raw waveform, STFT, FFT, MFCC, chromagram, Mel-spectrogram, etc. Each has its own advantages and disadvantages. In the upcoming blog we will discuss the most commonly used representations, STFT and MFCC; a quick preview appears in the sketch below.
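The following sketch (again assuming librosa and the same placeholder file) computes two of the representations named above, a Mel-spectrogram and MFCCs, from the raw waveform and prints their shapes.

```python
import librosa

y, sr = librosa.load("my_audio.wav", sr=None)  # placeholder path

# Mel-spectrogram: energy per Mel-frequency band per time frame.
mel = librosa.feature.melspectrogram(y=y, sr=sr)

# MFCCs: a compact summary of the spectral envelope, 13 coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(f"Raw waveform shape:    {y.shape}")     # (num_samples,)
print(f"Mel-spectrogram shape: {mel.shape}")   # (n_mels, num_frames)
print(f"MFCC shape:            {mfcc.shape}")  # (13, num_frames)
```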
How Edge AI Assists With Audio Classification
Leveraging the capabilities of AI and deep learning, one can analyze a scene in all its elements, including audio. Here are some scenarios showing how Edge AI can assist with audio classification.
Audio Scene Classification
Audio Scene Classification allows for discerning environments to enable specific functionalities, like noise reduction tailored to a location or voice interfaces. For instance, it can deactivate touch or typing capabilities on a smartphone in a car (driver mode).
Audio Event Detection
Audio Event Detection involves identifying specific sounds such as a baby crying, glass breaking, or a gunshot, which can trigger actions such as notifications or location triangulation. AI at the Edge is particularly advantageous here due to its rapid response in identifying these critical audio events amidst overlapping sources. This capability is crucial in scenarios where immediate recognition of events like approaching vehicles or screeching brakes can be lifesaving.
Before we begin, we must know some basics of audio characteristics.
Sound is a wave of vibrations traveling through a medium, such as air or water, and finally reaching our ears. When analyzing audio data, three key characteristics are considered: time period, amplitude, and frequency.
- Time Period: This refers to the duration of a sound or how long it takes to complete one cycle of vibrations, measured in seconds.
- Amplitude: Amplitude indicates the sound's intensity, measured in decibels (dB), which we perceive as loudness.
- Frequency: Frequency, measured in Hertz (Hz), denotes the number of sound vibrations per second. It is interpreted by humans as pitch, with low frequencies perceived as low pitch and high frequencies as high pitch.
While frequency is an objective parameter, pitch is subjective. The human hearing range spans from 20 to 20,000 Hz. Scientists state that most people perceive sounds below 500 Hz (like the roar of a plane engine) as low pitch, and sounds above 2,000 Hz (like a whistle) as high pitch.
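To tie these three characteristics together, here is a small sketch (pure NumPy, with an arbitrarily chosen 440 Hz tone and 0.5 peak amplitude, both my own illustrative values) that generates a sine wave and relates its frequency, time period, and amplitude.

```python
import numpy as np

sample_rate = 16_000   # samples per second
frequency = 440.0      # Hz (roughly the pitch of the musical note A4)
amplitude = 0.5        # peak amplitude of the wave
duration = 1.0         # seconds

t = np.arange(int(sample_rate * duration)) / sample_rate
wave = amplitude * np.sin(2 * np.pi * frequency * t)

period = 1.0 / frequency            # time to complete one cycle, in seconds
peak_db = 20 * np.log10(amplitude)  # peak level relative to full scale, in dB

print(f"Time period of one cycle: {period * 1000:.2f} ms")
print(f"Peak amplitude: {amplitude} ({peak_db:.1f} dBFS)")
print(f"Vibrations per second (frequency): {frequency} Hz")
```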
Fourier series:
The following diagram represents a digital audio signal. Suppose this data is sampled at 16 kHz, meaning there are 16,000 amplitude values (samples) in one second. For a 10-second audio clip, that is 160,000 amplitude values, which is quite large. How can we extract the information we need from this huge set of amplitudes? This is where the Fourier series and Fourier transform come into play.
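The arithmetic above is easy to check in code. This sketch generates a 10-second signal at 16 kHz (values chosen to match the example; the 220 Hz test tone is arbitrary) and confirms the sample count.

```python
import numpy as np

sample_rate = 16_000  # 16 kHz sampling rate
duration = 10.0       # seconds

# One amplitude value per sampling instant.
t = np.arange(int(sample_rate * duration)) / sample_rate
signal = np.sin(2 * np.pi * 220.0 * t)  # arbitrary 220 Hz test tone

print(len(signal))  # 160000 amplitude values for 10 seconds of audio
```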
In simple terms, the Fourier series is a mathematical representation (a sum of sine and cosine functions) of periodic waves (audio sound) in the Time-Amplitude domain.
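As an illustration of "a sum of sine and cosine functions", here is a sketch of a standard textbook example (not from the original post): a periodic square wave is approximated by summing its first few sine harmonics.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1000, endpoint=False)  # one second of time
f0 = 5.0  # fundamental frequency of the square wave, in Hz

# Fourier series of a square wave: odd harmonics with amplitude 4/(pi*k).
approx = np.zeros_like(t)
for k in (1, 3, 5, 7, 9):
    approx += (4 / (np.pi * k)) * np.sin(2 * np.pi * k * f0 * t)

target = np.sign(np.sin(2 * np.pi * f0 * t))  # the ideal square wave

# The more harmonics we add, the closer the sum gets to the square wave.
print(f"Mean absolute error with 5 harmonics: {np.mean(np.abs(approx - target)):.3f}")
```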
Fourier transform:
The Fourier transform is a mathematical representation of aperiodic (non-periodic) waves in the Frequency-Amplitude domain. How does it work? Simply put, the Fourier transform takes a time-domain signal and converts it into a frequency-domain representation.
The Fourier transform computes the complex amplitudes of individual frequency components, revealing the frequency content of an entire signal. The resulting spectrum provides a detailed view of the signal's frequency content but does not indicate how this frequency content changes over time.
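Here is a minimal sketch of that idea using NumPy's FFT (the 50 Hz and 120 Hz components are arbitrary choices for illustration): a signal built from two sine waves is transformed, and the resulting spectrum shows peaks at exactly those two frequencies with their original amplitudes.

```python
import numpy as np

sample_rate = 1000                        # Hz
t = np.arange(sample_rate) / sample_rate  # one second of samples

# A signal composed of two single-frequency sine waves.
signal = 1.0 * np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

# Fourier transform: time domain -> frequency domain.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
magnitudes = np.abs(spectrum) / (len(signal) / 2)  # normalize to recover amplitudes

# The two largest peaks sit at 50 Hz and 120 Hz, with amplitudes ~1.0 and ~0.5.
top = np.argsort(magnitudes)[-2:]
for i in sorted(top):
    print(f"{freqs[i]:.0f} Hz -> amplitude {magnitudes[i]:.2f}")
```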
Here's how individual frequencies look in the Fourier transform:
In layman's terms, an audio signal consists of several single-frequency sound waves (in the above diagram, we see three single-frequency sound waves). In the time-amplitude domain, we only see the combined red squiggly wave, which results from adding the amplitudes of all other waves at different frequencies at each time step. The Fourier transform helps by decomposing a signal into its individual frequencies and the corresponding amplitudes in the Frequency-Amplitude domain.
But what if we want to find out how different frequencies change over time? This is where the Short-Time Fourier Transform (STFT) is used.
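Here is a sketch of that next step, assuming librosa and the same placeholder file as before: the STFT slices the signal into short overlapping frames and applies a Fourier transform to each, producing a matrix of frequency content over time.

```python
import numpy as np
import librosa

y, sr = librosa.load("my_audio.wav", sr=None)  # placeholder path

# Short-Time Fourier Transform: Fourier transform of short overlapping windows.
stft = librosa.stft(y, n_fft=2048, hop_length=512)
magnitude = np.abs(stft)

# Rows are frequency bins, columns are time frames:
# each column tells us which frequencies are present at that moment.
print(f"STFT shape (freq_bins, time_frames): {magnitude.shape}")
print(f"Frequency resolution: {sr / 2048:.1f} Hz per bin")
print(f"Time step between frames: {512 / sr * 1000:.1f} ms")
```

We will dig into the STFT (and MFCC) in detail in the next post.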
Key benefits of Edge AI in audio classification
Real-Time Processing: Immediate classification of sound events allows for quick responses in applications like voice assistants and emergency detection systems.
Reduced Latency: Eliminates delays associated with sending data to remote servers, providing faster results for time-sensitive applications.
Enhanced Privacy: Local data processing reduces the risk of data breaches by keeping audio data on the device.
Lower Bandwidth Usage: Minimizes the amount of data transmitted over the network, beneficial in environments with limited bandwidth.
Operational Independence: Enables devices to function without a continuous internet connection, ensuring reliable performance in areas with unstable connectivity.
Efficient Resource Utilization: Leverages the device’s processing power, reducing reliance on cloud infrastructure and leading to cost savings.
Summary
We are only in the early innings of Edge AI, and its possible applications are endless. This rising field can elevate user experiences and enable outcomes that were previously unimaginable. If you have a specific need or want to explore the possibilities, feel free to reach out to us at ideas2it.com.
Credit Links:
- https://medium.com/analytics-vidhya/simplifying-audio-data-fft-stft-mfcc-for-machine-learning-and-deep-learning-443a2f962e0e
- https://towardsdatascience.com/all-you-need-to-know-to-start-speech-processing-with-deep-learning-102c916edf62
- https://www.quora.com/Digital-Signal-Processing-What-is-the-relationship-between-spectrum-and-spectrogram
- https://www.youtube.com/watch?v=4_SH2nfbQZ8