Each cat has its own unique vocabulary that it uses consistently with its owner in a given context. For example, each cat has a distinct meow for “food” or “let me out.” This is not a language in the strict sense, since different cats do not share the same meows for the same meanings, but we can use Machine Learning to interpret the meows of an individual cat.
In this article, we provide an overview of the MeowTalk app along with a description of the process we used to implement the YAMNet acoustic detection model for the app.
MeowTalk Project and Application Overview
In this section, we provide an overview of the MeowTalk project and app along with how we use the YAMNet acoustic detection model. In short, our goal was to translate cat vocalizations (meows) to intents and emotions.
We use the YAMNet acoustic detection model (converted to a TFLite model) with transfer learning to make predictions on audio streams while operating on a mobile device. There are two model types in the project:
- A general cat vocalization model that detects whether a sound is a cat vocalization;
- A cat-specific intent model that detects intents and emotions for an individual cat (e.g., Angry, Hungry, Happy).
If the general model returns a high score for a cat vocalization, then we send features from the general model to the cat-specific intent model.
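To make the cascade concrete, here is a sketch of the two-stage flow with TFLite. The file names, the 0.8 threshold, and the assumption that the general model emits a vocalization score followed by the YAMNet features are all illustrative, not the actual MeowTalk implementation:

```python
import numpy as np
import tensorflow as tf

# Hypothetical model files and threshold -- substitute your own.
GENERAL_MODEL_PATH = "cat_vocalization.tflite"
INTENT_MODEL_PATH = "cat_intent.tflite"
VOCALIZATION_THRESHOLD = 0.8

def load_interpreter(path):
    interpreter = tf.lite.Interpreter(model_path=path)
    interpreter.allocate_tensors()
    return interpreter

general = load_interpreter(GENERAL_MODEL_PATH)
intent = load_interpreter(INTENT_MODEL_PATH)

def run(interpreter, x):
    """Run a single-input TFLite model and return its outputs in index order."""
    inp = interpreter.get_input_details()[0]
    interpreter.set_tensor(inp["index"], x.astype(np.float32))
    interpreter.invoke()
    return [interpreter.get_tensor(o["index"])
            for o in interpreter.get_output_details()]

def classify(waveform):
    # Assumed output order: vocalization score first, YAMNet features second.
    score, features = run(general, waveform)
    if score.max() < VOCALIZATION_THRESHOLD:
        return None  # not confident this is a cat vocalization
    # Only confident vocalizations reach the cat-specific intent model.
    (intent_scores,) = run(intent, features)
    return intent_scores
```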
I highly recommend becoming familiar with the YAMNet project; it is a fast, lightweight, and highly accurate model. You can find all the details about installation and setup in the TensorFlow repo.
Next, I will go over the main stages of the model development, training, and conversion to the TFLite format.
How to change the YAMNet architecture for transfer learning
The YAMNet model predicts 521 classes from the AudioSet-YouTube corpus. To use this model in our app, we need to remove the network’s final Dense layer and replace it with one that fits our classes.
(Image: the last layers of the YAMNet model.)
As you can see in the image, the network ends with a global average pooling layer, which produces feature tensors of size 1024. To train the new final dense layers, we have to create a set of inputs and outputs.
First, we need to choose the type of input. I propose training the last layers on our own data and connecting them to the YAMNet model after training. In other words, we will extract YAMNet features from our audio samples, add a label to each feature set, train the network on these features, and attach the resulting model to YAMNet. Depending on the length of the audio sample, we will get a different number of feature vectors. The relevant properties are defined in the “params.py” file.
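For reference, here are the values from params.py at the time of writing (check the repo for the current defaults):

```python
# Audio and feature parameters from YAMNet's params.py
SAMPLE_RATE = 16000           # YAMNet expects 16 kHz mono audio
STFT_WINDOW_SECONDS = 0.025
STFT_HOP_SECONDS = 0.010
MEL_BANDS = 64
PATCH_WINDOW_SECONDS = 0.96   # each feature vector covers ~1 second of audio
PATCH_HOP_SECONDS = 0.48      # consecutive patches overlap by 50%
```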
Given these parameters, a two-second audio sample yields four feature vectors from the YAMNet model: with a 0.96-second window, a 0.48-second hop, and YAMNet’s end padding, the patches start at 0, 0.48, 0.96, and 1.44 seconds.
How to get features from YAMNet?
Update: after a recent change, the authors added feature extraction to the model’s outputs, so we no longer need to change the structure.
The outputs of the model are:
- predictions — scores for each of the 521 classes;
- embeddings — YAMNet features;
- log_mel_spectrogram — the log mel spectrogram of the input waveform.
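With the TF Hub distribution of YAMNet, you can get all three outputs in a few lines (the two seconds of silence here are just a stand-in for a real recording):

```python
import numpy as np
import tensorflow_hub as hub

# Load YAMNet from TF Hub (you can also build it from the research repo).
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

# YAMNet expects mono float32 audio in [-1, 1], sampled at 16 kHz.
waveform = np.zeros(2 * 16000, dtype=np.float32)

scores, embeddings, log_mel_spectrogram = yamnet(waveform)
print(scores.shape)      # (num_patches, 521)
print(embeddings.shape)  # (num_patches, 1024)
```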
To create the training dataset, we need to build a set of embeddings paired with labels. We assume that all the sounds in a file belong to one class and that the samples of each class are stored in a directory named after that class, as sketched below. You can easily adapt the pipeline to a multi-label classification problem.
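Here is a minimal version of that step, assuming the layout data/<class_name>/*.wav and the yamnet model loaded above (librosa resamples each file to the 16 kHz mono input YAMNet expects):

```python
import pathlib
import numpy as np
import librosa

# Assumed layout: data/<class_name>/*.wav, one directory per class.
DATA_DIR = pathlib.Path("data")
class_names = sorted(p.name for p in DATA_DIR.iterdir() if p.is_dir())

features, labels = [], []
for label_idx, name in enumerate(class_names):
    for wav_path in (DATA_DIR / name).glob("*.wav"):
        # Resample to 16 kHz mono, as YAMNet requires.
        waveform, _ = librosa.load(str(wav_path), sr=16000, mono=True)
        _, embeddings, _ = yamnet(waveform)
        # Every patch extracted from this file gets the file's class label.
        for emb in embeddings.numpy():
            features.append(emb)
            labels.append(label_idx)

X = np.stack(features)  # (total_patches, 1024) YAMNet embeddings
y = np.array(labels)    # (total_patches,) integer class labels
```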
After this step, we have a training dataset.
Note: if your audio files contain more than just the target sound, I advise implementing silence removal to improve the training process; it increases accuracy significantly.
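One simple, energy-based way to do this (an illustrative option, not necessarily what MeowTalk ships) is librosa.effects.split, where top_db is the tunable silence threshold:

```python
import numpy as np
import librosa

def remove_silence(waveform, top_db=30):
    """Drop the near-silent stretches of a waveform.

    Anything more than top_db below the peak level is treated as silence.
    """
    intervals = librosa.effects.split(waveform, top_db=top_db)
    if len(intervals) == 0:
        return waveform  # nothing detected as non-silent; keep the input as-is
    return np.concatenate([waveform[start:end] for start, end in intervals])
```

Applying this to each file before extracting embeddings keeps the feature vectors focused on actual vocalizations.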
Training the model
First of all, we need to define the model. We assume that each cat audio sample has only one label. I propose using two dense layers, with a softmax activation on the output. During the experiment stage, we concluded that this is the best configuration for our task, but you can freely change the network structure depending on your own experiment results.
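A minimal Keras sketch of that configuration, reading the two layers as a ReLU hidden layer plus a softmax output; X, y, and class_names come from the dataset step above, and the hidden-layer size, optimizer, and epoch count are assumptions to tune on your own data:

```python
import tensorflow as tf

NUM_CLASSES = len(class_names)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),            # one YAMNet embedding
    tf.keras.layers.Dense(256, activation="relu"),   # hidden layer; size is a guess
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # integer labels from the dataset step
    metrics=["accuracy"],
)

model.fit(X, y, epochs=30, batch_size=32, validation_split=0.2)
```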