Rainforest Connection Species Audio Detection

Priteshlunkad
9 min read · Jan 24, 2021

Contents:

  1. Business Problem
  2. Problem Statement
  3. Source of data
  4. About the dataset
  5. Deep learning problem formulation
  6. Existing solutions
  7. First cut solution
  8. Univariate analysis on dataset
  9. Bivariate analysis on dataset
  10. Creating data pipelines
  11. Training Densenet121 model
  12. Quantization of model
  13. Deployment
  14. Kaggle scores
  15. Future work
  16. References

1. Business Problem:

In this challenge, we have to predict which bird or frog species a sound recording belongs to. Traditional methods of assessing the diversity and abundance of species are costly and limited in space and time, so a deep learning based approach will be very helpful for accurately detecting species in noisy landscapes. Rainforest Connection (RFCx) created the world’s first real-time monitoring system for protecting and supporting remote ecosystems, and unlike visual tracking systems such as drones or satellites, RFCx relies on acoustic sensors that monitor the ecosystem soundscape at different locations all year round. The system built by RFCx also has the capacity to create convolutional neural network (CNN) models for analysis. In this problem, we have to automate the detection of bird and frog species based on their sound recordings.

The full problem statement is available on the Kaggle competition page.

2. Problem Statement:

In this problem, we have to predict the species id of a bird or frog based on soundscape recordings. The resulting real-time information could enable earlier detection of human environmental impacts, making environmental conservation swifter and more effective.

3. Source of data:

This data is provided by Rainforest Connection (RFCx) for the Kaggle competition and can be downloaded from the competition's data page on Kaggle.

4. About the dataset:

The dataset contains a train_tp CSV file with 1216 rows of true-positive annotations, each giving a file id along with species_id, songtype_id, f_min, f_max, t_min and t_max values; the corresponding .flac audio files are in the train folder, and TFRecord versions are provided as well. Another CSV file, train_fp, contains false-positive (incorrect) species ids, again with corresponding .flac files in the train folder. There are also 1992 files in the test folder on which predictions have to be made with the trained model.

5. Deep learning problem formulation:

In this problem we have to predict the species id from the audio files. There are 24 different species of birds and frogs in the dataset, and a single audio file can contain multiple species, so this is a multi-label classification problem.

Performance Metric:

The metric used for evaluation is label ranking average precision (LRAP). This metric is related to the average precision score, but it is based on the notion of label ranking instead of precision and recall.
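As a quick illustration (not from the original post), this metric is available in scikit-learn as label_ranking_average_precision_score; a score of 1.0 means that, for every sample, every true label is ranked above every false one:

```python
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

# Toy example: 3 clips, 4 possible species (multi-label ground truth)
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [0, 0, 1, 1]])

# Model scores for each species on each clip
y_score = np.array([[0.9, 0.1, 0.2, 0.8],
                    [0.3, 0.6, 0.1, 0.2],
                    [0.1, 0.2, 0.7, 0.9]])

# Every true label outranks every false one here, so LRAP = 1.0
print(label_ranking_average_precision_score(y_true, y_score))
```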

6. Existing Solutions:

This problem is a Kaggle competition, so there are many different solutions available in the Notebooks section of the competition on Kaggle.

7. First cut solution:

  1. Performing univariate analysis on dataset
  2. Performing bivariate analysis on dataset
  3. Creating data pipelines
  4. Training the model
  5. Quantization of model
  6. Deployment of model

8. Univariate analysis on dataset:

For our model we are going to use only the true-positive data for training, so we will focus only on the train_tp data.

Plotting a countplot of species ids,
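A count plot like this can be produced with seaborn; this is a sketch rather than the original plotting code, with df and the column name following the CSV described earlier:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Annotation dataframe (train_tp.csv or train_fp.csv)
df = pd.read_csv("train_tp.csv")

plt.figure(figsize=(12, 4))
sns.countplot(x="species_id", data=df)   # one bar per species_id
plt.title("Number of annotations per species_id")
plt.show()
```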

The above plot is a count plot of the number of times each species id occurs in the false-positive data. We can see that species 23 occurs the most, almost 400 times, species ids such as 6, 10, 17 and 22 occur 230–250 times, and all other species occur around 100–150 times in the dataset.

Plotting a countplot of songtype ids,

From the above count plot we can see that songtype id 1 occurs the most, around 1050 times, whereas the other songtype id, 4, occurs only about 120 times in the dataset.

Plotting a PDF of f_min and f_max values,
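A minimal sketch of such a distribution plot with seaborn, assuming the same df and imports as above (styling choices are arbitrary):

```python
# Overlaid histograms + KDE curves for the annotated frequency bounds
plt.figure(figsize=(10, 4))
sns.histplot(df["f_min"], kde=True, color="tab:blue", label="f_min")
sns.histplot(df["f_max"], kde=True, color="tab:orange", label="f_max")
plt.xlabel("frequency (Hz)")
plt.legend()
plt.show()
```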

This distribution plot shows the PDF of the f_min and f_max values in the dataset. As we can see from the distributions and histograms, f_min values occur most often in the 0–200 range, as shown by the tall histogram bar, while f_max values occur most often around 5000.

Plotting a PDF of t_min and t_max values,

The above plot represents the PDF of the t_min and t_max values. As we can see, the two distributions are almost identical, with only a slight difference in value range between them.

9. Bivariate Analysis on dataset :

Plotting the distribution of f_min values for various species_ids,

In this plot we show f_min for the true-positive data against the species id, which gives the range of f_min values associated with each species_id. We can see that for species id 14, f_min lies between 2300 and 3700, and for species 20 it lies around 1500–2200, while species represented by a single straight line, such as 1 and 5, have f_min in a very small range or even a single value.

Plotting the distribution of f_max values for various species_ids,

In this plot we show the f_max values for the true-positive data against the species_id. From the plot we can see that species 14 is spread out over a large range, from about 4000–6200, while most other species are represented by a narrow histogram, indicating that their values lie in a small range; species 13 has the lowest maximum frequency, around 800.

Plotting a distribution plot of t_min values for various species_ids,

This plot represents the distribution of t_min values for each species_id. As we can see, the distributions for the different species ids overlap heavily, so we can say that t_min does not vary much with species_id.

Plotting a distribution plot of t_max values with various species_ids,

The above plot represents the distribution of t_max values for the various species_ids. As we can see, the distributions largely overlap, meaning that the t_max values do not depend very much on the species_id.

10. Creating data pipelines:

We read the audio files, convert them into mel-spectrograms, and then pass these mel-spectrograms to the model.

Mel-spectrograms:

A mel spectrogram is a spectrogram with the mel scale as its y-axis. The mel scale, mathematically speaking, is the result of a non-linear transformation of the frequency scale. It is constructed such that sounds that are an equal distance apart on the mel scale also “sound” equally far apart to human listeners.

source (medium)

We can compute the mel spectrogram with a snippet along the lines of the one below.
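This is a minimal sketch using librosa; the sampling rate, n_mels and frequency limits below are assumptions, not necessarily the values used for the final model:

```python
import librosa
import numpy as np

def flac_to_melspec(path, sr=48000, n_mels=128, fmin=40, fmax=12000):
    """Load a .flac recording and turn it into a log-scaled mel-spectrogram.

    The mel scale is a non-linear transform of frequency,
    mel(f) = 2595 * log10(1 + f / 700), chosen so that equal distances
    on it sound roughly equally far apart to human listeners.
    """
    y, sr = librosa.load(path, sr=sr)                        # decode the audio
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         fmin=fmin, fmax=fmax)
    return librosa.power_to_db(mel, ref=np.max)              # convert power to dB

# Hypothetical usage:
# spec = flac_to_melspec("train/recording_id.flac")
```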

Data Augmentations:

We modify the training data (the mel-spectrograms) with augmentations so that the model becomes more robust.

The various types of augmentations used here are,

  1. flip_left_right:- This augmentation flips the spectrogram image from left to right
  2. flip_up_down:- This augmentation flips the spectrogram image from up to down
  3. random_contrast:- This augmentation randomly changes the contrast of the image
  4. random_brightness:- This augmentation changes the brightness of the spectrogram image

The above augmentations are applied randomly to the training data with a function like the sketch below.
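A minimal version of such a function, assuming each spectrogram is stored as an image-like [height, width, channels] tensor (the contrast and brightness ranges are assumptions):

```python
import tensorflow as tf

def augment(spec, label):
    """Randomly apply the augmentations listed above to a spectrogram image."""
    spec = tf.image.random_flip_left_right(spec)                 # flip_left_right
    spec = tf.image.random_flip_up_down(spec)                    # flip_up_down
    spec = tf.image.random_contrast(spec, lower=0.8, upper=1.2)  # random_contrast
    spec = tf.image.random_brightness(spec, max_delta=0.1)       # random_brightness
    return spec, label
```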

Using the above functions, we create train, validation and test datasets with the tf.data module, as sketched below.

We also one-hot encode the species_ids and return them along with the spectrograms when creating the dataset.
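A sketch of such a pipeline, assuming the spectrograms and integer species ids are already loaded into arrays and using the augment function above (batch size and shuffle buffer size are assumptions):

```python
import tensorflow as tf

NUM_SPECIES = 24
BATCH_SIZE = 32
AUTOTUNE = tf.data.AUTOTUNE

def make_dataset(specs, species_ids, training=True):
    """Build a tf.data pipeline of (spectrogram, one-hot label) batches."""
    labels = tf.one_hot(species_ids, depth=NUM_SPECIES)     # one-hot encode species_id
    ds = tf.data.Dataset.from_tensor_slices((specs, labels))
    if training:
        ds = ds.shuffle(1024).map(augment, num_parallel_calls=AUTOTUNE)
    return ds.batch(BATCH_SIZE).prefetch(AUTOTUNE)

# train_ds = make_dataset(train_specs, train_species_ids, training=True)
# val_ds   = make_dataset(val_specs,   val_species_ids,   training=False)
```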

11. Training Densenet121 model:

After training various models, DenseNet121 gave the best results, so it was selected as the final model. The model was trained starting from ImageNet weights (transfer learning).

Loss function:

The loss function used here is SigmoidFocalCrossEntropy. It down-weights well-classified examples and focuses on hard examples: the loss value is much higher for a sample that is misclassified than for a well-classified one. This loss function is implemented in the tensorflow-addons module, so it was used from there.
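It can be instantiated as below; alpha and gamma here are the tensorflow-addons defaults, not necessarily the values used for this model:

```python
import tensorflow_addons as tfa

# Focal loss down-weights easy, well-classified examples and keeps the
# gradient signal focused on hard, misclassified ones.
loss_fn = tfa.losses.SigmoidFocalCrossEntropy(alpha=0.25, gamma=2.0)
```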

Optimizer:

The optimizer used for this model is RectifiedAdam. The RectifiedAdam technique was proposed in a paper whose authors found that models struggled to generalize in the first few epochs because the adaptive learning rate had very high variance early in training; they overcame this by applying a warmup with a low initial learning rate and effectively turning off the adaptive term for the first few training batches. This optimizer was found to give better results in fewer epochs. It is also implemented in the tensorflow-addons module.

Model Architecture:

The architecture used for the DenseNet121 model is as follows,

The ImageNet weights were used to initialize the model, all of these weights were updated during training, and the last few prediction layers were batch normalization, dense and dropout layers to avoid overfitting.
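Following the description above, a sketch of the model could look like this (the head sizes, dropout rate, learning rate and input shape are assumptions):

```python
import tensorflow as tf
import tensorflow_addons as tfa

NUM_SPECIES = 24

def build_model(input_shape=(224, 224, 3)):
    # DenseNet121 backbone initialized with ImageNet weights; all layers stay trainable
    backbone = tf.keras.applications.DenseNet121(
        include_top=False, weights="imagenet", input_shape=input_shape)

    # Prediction head: batch normalization, dense and dropout layers
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dense(512, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    out = tf.keras.layers.Dense(NUM_SPECIES, activation="sigmoid")(x)  # multi-label output

    model = tf.keras.Model(backbone.input, out)
    model.compile(optimizer=tfa.optimizers.RectifiedAdam(learning_rate=1e-3),
                  loss=tfa.losses.SigmoidFocalCrossEntropy(),
                  metrics=[tf.keras.metrics.AUC(multi_label=True)])
    return model
```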

Results:

This model gave a validation label ranking average precision score of around 0.70.

12. Quantization of model:

The size of the model obtained after training was very large, so to reduce the model size and improve its inference speed, various types of quantization were performed after training. They are discussed below.

Float16 Quantization:

Float16 quantization reduces the weights of the model from float32 to float16, which halves the size of the model with a slight decrease in performance. In this case the performance of the model decreased by only 0.004, which is very small compared to the size and speed gains.
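With the TensorFlow Lite converter this looks roughly as follows (the model variable and file name are assumptions):

```python
import tensorflow as tf

# Convert the trained Keras model to TFLite with float16 weights
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()

with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_fp16)
```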

Dynamic Range Quantization:

Dynamic range quantization quantizes the weights of the model from floating-point values to integers with 8 bits of precision. It reduced the performance of the model by 0.005 but reduced its size by almost 4 times.
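The conversion differs from the float16 case only in that no float16 target type is set, so the default optimization falls back to dynamic range quantization:

```python
# Dynamic range quantization: weights are stored as 8-bit integers
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_dynamic = converter.convert()

with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_dynamic)
```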

The dynamic range quantized model is optimized mainly for ARM-based CPUs and for use on microcontrollers; there is no optimization for the x86 architecture yet, so even though the model size was reduced, the prediction time with this model was very high, as mentioned on GitHub.

13. Deployment:

A web app was created using Streamlit for all the quantized and unquantized models and was deployed on Heroku.
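A minimal Streamlit sketch along these lines, assuming the float16 TFLite model file and the flac_to_melspec helper from earlier (resizing the spectrogram to the model's input shape is omitted for brevity):

```python
import numpy as np
import streamlit as st
import tensorflow as tf

from preprocessing import flac_to_melspec   # hypothetical module wrapping the helper above

st.title("Rainforest species audio detection")

uploaded = st.file_uploader("Upload a .flac recording", type=["flac"])
if uploaded is not None:
    spec = flac_to_melspec(uploaded).astype(np.float32)
    spec = np.expand_dims(spec, 0)           # add a batch dimension

    # Run inference with the quantized TFLite model
    interpreter = tf.lite.Interpreter(model_path="model_fp16.tflite")
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], spec)
    interpreter.invoke()
    probs = interpreter.get_tensor(out["index"])[0]

    st.write("Most likely species_id:", int(np.argmax(probs)))
```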

14. Kaggle Scores:

After further optimization of the various models, the best public leaderboard score I was able to get on the Kaggle test data was 0.845.

15. Future work:

  1. Try various other model architectures to improve the model performance.
  2. Try some ensemble techniques to improve the model performance.

16. References:

  1. https://www.tensorflow.org/guide
  2. https://www.kaggle.com/khoongweihao/resnet34-more-augmentations-mixup-tta-inference
  3. https://www.appliedaicourse.com/

Github Link:

LinkedIn Profile:
