Masters programme | E-portfolio
Semester independent

Internship – UAV-based regression of biomass

Comparison of three different machine/deep learning approaches to perform regression for fractional vegetation coverage, vegetation height and vegetation volume. Tackling the issue of sparse training data vs. high model complexity


During an eight-week internship, I had the opportunity to compare a set of methods for UAV-based retrieval of vegetation parameters. This research was conducted within a larger project called SixP, which deals with ecological analyses of plants growing in metal-rich soils in former mining areas. The specific aim of this part was to investigate computer vision approaches for the regression of fractional vegetation cover, vegetation height and vegetation volume. A particular focus was on exploring the potential of deep learning approaches given that only very few in-situ measurements were available as training data. Scarce training data limits the applicability of complex computer vision approaches because of the risk of overfitting. This problem was tackled by designing models that have as few parameters as possible and/or rely on extended training datasets.

Data basis & preprocessing

UAV and in-situ measurements were taken at 5 sites in different altitude zones in the Pyrenees between spring and summer 2021. All sites are located within a 5 km radius and extend less than 100 m in each direction. The UAVs were equipped with R, G, B and NIR sensors, and the images obtained were post-processed into orthophotos with a resolution of either 2 or 3 mm. To obtain a homogeneous data set across sites, the UAV data was resampled to 3 mm resolution and contrast stretching was applied to each site individually. In-situ measurements were taken for the three variables “fractional vegetation coverage”, “vegetation volume” and “vegetation height”. A spatially stratified random sampling design with about 40 measurements per site was applied, leading to a total of 198 measurements across all sites. For each point, fractional vegetation coverage, vegetation volume and plant heights were recorded based on a 25 x 25 cm square representative of the 1 x 1 m area centered on the point. The following figure provides some examples of these 1 x 1 m tiles together with the corresponding in-situ measurements. Below each tile, a selection of other tiles with comparable in-situ measurements is shown to allow comparison between measurements and their corresponding tiles. This method was also used to manually screen the data set and exclude tiles with unreliable in-situ measurements from further evaluation. After this screening, a set of 160 measurements and associated UAV tiles remained.
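
The per-site contrast stretching mentioned above could look roughly like the following sketch. It assumes a percentile-based linear stretch applied to each band separately; the function name `stretch_contrast` and the 2nd/98th percentile limits are illustrative assumptions, since the exact stretching method is not specified here.

```python
import numpy as np


def stretch_contrast(image, lower_pct=2.0, upper_pct=98.0):
    """Linearly stretch each band of an (H, W, C) image to the range [0, 1].

    The percentile limits are illustrative assumptions, not values taken
    from the study.
    """
    image = np.asarray(image, dtype=np.float64)
    # Per-band percentiles, computed over all pixels of one site.
    low = np.percentile(image, lower_pct, axis=(0, 1), keepdims=True)
    high = np.percentile(image, upper_pct, axis=(0, 1), keepdims=True)
    # Guard against flat bands to avoid a division by zero.
    scale = np.where(high > low, high - low, 1.0)
    # Values outside the percentile limits are clipped to 0 or 1.
    return np.clip((image - low) / scale, 0.0, 1.0)
```

Calling the function once per site orthophoto (rather than once on all sites combined) mirrors the site-wise application described above, so that differences in illumination between flights are evened out before tiles from different sites are compared.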

Statistics on the distribution of the values and the correlation of the variables are displayed below. It is evident that the vegetation variables vary not only within but also across sites. La Plagne, the lowest-altitude site, shows the strongest vegetation, whereas Chichoue Milieu Bas, for example, shows only sparse vegetation.


The following three approaches are evaluated systematically. They are ordered according to increasing model complexity and corresponding requirements regarding data amounts and computing resources:
  1. Model A: Given the small number of data points, one way to solve the regression problem is to rely on less complex machine learning models instead of deep learning based computer vision. In remote sensing studies, the random forest (RF) classifier has proved to be a robust algorithm that has been used many times across a wide range of applications. To apply its regression equivalent, the RF regressor, a set of meaningful features first has to be engineered. A simple automated tile cover classification distinguishing between vegetation, bare ground and shadowed areas is introduced first so that the analysis can focus on the vegetated parts of the tiles. To this end, image segments with mean NDVI values above a certain threshold are classified as vegetation. The classification of shadowed areas is based on consistently low reflectance values across the R, G and B channels. Given the classified tiles, engineering two types of features seems reasonable in order to approximate the human vision process. First, spectral features and their derivatives, i.e. the shares of vegetation, bare ground and shadow, may be used for the regression of vegetation parameters. Second, the spatial arrangement of these classes may be helpful as well. Different proxies may be used to characterise this spatial aspect (e.g. local and global correlation measures, textural metrics). For this study, the size and shape properties of objects derived from a Felzenszwalb segmentation are used. Specifically, the mean value and standard deviation of the size and shape index of all objects classified as vegetation are taken into account as features. In total, this yields a set of fewer than 50 image-derived features that are fed into the RF regressor.
  2. Model B: As a second approach, one could automate the feature extraction process by using an autoencoder (AE) on the unlabelled dataset. AEs are neural networks consisting of an encoder and a decoder part that aim at reconstructing the input image from an intermediate latent (i.e. encoded) representation. Building on the encoding part, a small follow-up CNN performing the actual regression task can be constructed. The benefit of performing a dimensionality reduction via AEs is that unlabelled image tiles can also be used for this pre-task. Therefore, the complete UAV images tiled into 1 x 1 m patches can be used to train this model. For the given task, simple AE architectures with an encoding part consisting of 3 convolutional layers and a latent representation of size 32 x 42 x 42 were chosen. This dimensionality of the latent feature space makes it possible to create a simple regression CNN with fewer than 10 K parameters for the second step, which is trained on those tiles for which in-situ measurements are available (i.e. labelled tiles). The regression CNN starts with a 1D convolutional layer intended to reduce the number of feature maps to the ones needed for the task at hand. Subsequently, 3 further 2D convolutional layers are used. Average pooling is applied to the output, the features are flattened and fed into a linear layer, and a sigmoid is applied to force the output to take values between 0 and 1. Multiplying by a factor of 100 is the final step, which rescales the output to the common value range of the variables of interest.
  3. Model C: An alternative to reducing the model complexity is an extensive expansion of the data basis. Applying simple standard architectures for image analysis such as ResNet-18 or EfficientNet-B0 usually requires at least a few thousand labelled images as input. Such an expansion of the database by up to two orders of magnitude can hardly be achieved using standard data augmentation procedures. However, given the current setting with a great number of unlabelled tiles, one can rely on a procedure of generating pseudo-labels for these tiles (subsequently called “data synthesis”). The assignment of pseudo-values is based on image similarity analyses. Using a ResNet-18 pretrained on ImageNet with its last layer cut off, 512 deep features are calculated for each UAV tile. The proximity of two vectors in this feature space is assumed to indicate the structural similarity of the images, so that unlabelled tiles whose vectors are close to those of labelled tiles can be given their values. However, to circumvent the issues arising from the high dimensionality of the feature space and also to visualise and thereby validate results, the vectors are first mapped to 2D space before this kind of nearest neighbour assignment of values is performed. The mean nearest neighbour (NN) distance between all low-dimensional points is calculated and used as a distance threshold for assigning values to unlabelled tiles. If the distance between an unlabelled point and one or more labelled ones is lower than the mean NN distance, a pseudo-value is calculated for the unlabelled one as the weighted average of the values of all labelled tiles within that distance. Points whose NN distance exceeds the mean NN distance are left unlabelled, so that pseudo-values are assigned only to those tiles that are highly likely to be similar to the labelled ones.
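
The pseudo-label assignment described for model C can be sketched as follows. This is a minimal illustration, not the project's actual script: it assumes the tiles have already been mapped to 2D coordinates, and it assumes inverse-distance weights for the weighted average, because the weighting scheme is not specified above. The function names and the `eps` parameter are hypothetical.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def mean_nn_distance(points):
    """Mean distance of each point to its nearest neighbour."""
    d = pairwise_distances(points, points)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    return d.min(axis=1).mean()


def assign_pseudo_labels(labelled_xy, labelled_values, unlabelled_xy, eps=1e-12):
    """Return (pseudo_values, threshold); NaN marks tiles left unlabelled."""
    labelled_xy = np.asarray(labelled_xy, dtype=np.float64)
    unlabelled_xy = np.asarray(unlabelled_xy, dtype=np.float64)
    labelled_values = np.asarray(labelled_values, dtype=np.float64)

    # Threshold: mean NN distance over all low-dimensional points.
    threshold = mean_nn_distance(np.vstack([labelled_xy, unlabelled_xy]))

    d = pairwise_distances(unlabelled_xy, labelled_xy)  # shape (U, L)
    within = d < threshold
    # Assumption: inverse-distance weights for labelled tiles within range.
    weights = np.where(within, 1.0 / (d + eps), 0.0)
    weight_sum = weights.sum(axis=1)

    pseudo = np.full(len(unlabelled_xy), np.nan)
    has_neighbour = weight_sum > 0
    pseudo[has_neighbour] = (
        (weights @ labelled_values)[has_neighbour] / weight_sum[has_neighbour]
    )
    return pseudo, threshold
```

The dense distance matrix keeps the sketch short but grows quadratically with the number of tiles, so a spatial index would be the more practical choice for complete UAV scenes.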

Experimental Settings

Both the intra- and inter-site generalisability and transferability of the trained models were tested. For inter-site testing, all measurements and tiles for the site Chichoue Milieu Bas were excluded from training and used for testing. For intra-site testing, tiles were split into train, validation and test portions in a spatially non-overlapping manner. Whereas the labelled tiles barely overlap anyway, considering overlap is important for the spatially contiguous unlabelled tiles in order to avoid bias in the evaluation of the results. For the regression part, train-validation-test splits were made 10 times in order to cross-validate the results. Prior to training and evaluating the final models, several tests on the validation data were performed in order to find suitable hyperparameter settings for the neural networks (models B and C). In particular, the learning rate, loss function and optimiser were analysed by comparing the mean absolute error (MAE) for different settings. For the regression task, the following final metrics were used to assess the accuracy of the results: mean absolute error (MAE), mean squared error (MSE), Pearson’s correlation coefficient (r), coefficient of determination (R2) and Spearman’s correlation coefficient (s). For the evaluation of the AE-based reconstruction abilities, five different metrics are used: MAE, MSE, the angle between spectra (spectral angle mapper, SAM), peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). Training of the final models with the determined hyperparameters as well as the corresponding evaluation on the test data set was generally performed multiple times. This served to assess the stability and reproducibility of the training runs and made it possible to select the best performing model based on the validation metrics, avoiding solutions stuck in unfavourable local minima.
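
As an illustration of the regression metrics listed above, the following sketch computes all five with NumPy only. It assumes that R2 is defined as one minus the ratio of the residual to the total sum of squares, and that Spearman's coefficient is computed as Pearson's r on average ranks; the function names are hypothetical and the project scripts may well rely on library implementations instead.

```python
import numpy as np


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    values = np.asarray(values, dtype=np.float64)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)    # highest rank within each tie group
    first = last - counts + 1   # lowest rank within each tie group
    return ((first + last) / 2.0)[inverse]


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, Pearson's r, R2 and Spearman's coefficient."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    ranks_true = average_ranks(y_true)
    ranks_pred = average_ranks(y_pred)
    return {
        "mae": float(np.mean(np.abs(residuals))),
        "mse": float(np.mean(residuals ** 2)),
        "pearson_r": float(np.corrcoef(y_true, y_pred)[0, 1]),
        # Assumed definition: 1 - SS_res / SS_tot (can become negative).
        "r2": float(1.0 - ss_res / ss_tot),
        # Spearman's coefficient is Pearson's r applied to the ranks.
        "spearman": float(np.corrcoef(ranks_true, ranks_pred)[0, 1]),
    }
```

With this definition, R2 is not simply the square of Pearson's r: a model with a systematic offset can have a high r and still a low or even negative R2, which is one reason for reporting both.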

Results & Discussion

An overview of the performance of all models is shown below. The different metrics essentially agree with each other in what they indicate about model quality. Vegetation coverage is predicted most reliably by all models, with mean R2 values consistently above 0.85. For the other variables, lower shares of explained variance in the range of 0.5 – 0.6 (vegetation volume) and 0.3 – 0.5 (vegetation height) are obtained. Comparisons across the different models indicate the superiority of the data synthesis based approach (model C) in all cases. Its mean MAE is slightly lower than those achieved by the random forest (model A) or the AE-based approach (model B). Correspondingly, its correlation coefficients are higher as well. Models A and B tend to perform similarly for vegetation volume and height, whereas A performs better than B for the vegetation coverage regression. However, differences between models are rarely significant. Given the small number of test samples and the limited number of folds for cross-validation (10 train-val-test splits), the 95 % confidence intervals around the mean metrics are relatively large.
Accuracies for all models (rows) & variables (columns).
Means across the 10 train-val-test splits together with 95 % confidence intervals (CI) are displayed.
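
One common way to obtain such intervals is a Student's t interval around the mean of the per-split metric values, sketched below. Whether the reported intervals were computed this way is an assumption; the function name is hypothetical, and the default critical value of 2.262 is the two-sided 95 % quantile for 9 degrees of freedom, so it only applies to exactly 10 splits.

```python
import math


def mean_confidence_interval(values, t_crit=2.262):
    """Return (mean, lower, upper) of a t-based confidence interval.

    t_crit = 2.262 is the two-sided 95 % critical value for 9 degrees of
    freedom, i.e. it is only valid for 10 values.
    """
    n = len(values)
    mean = sum(values) / n
    # Sample variance with Bessel's correction (n - 1).
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    half_width = t_crit * math.sqrt(variance / n)
    return mean, mean - half_width, mean + half_width
```

Because the half-width shrinks only with the square root of the number of splits, halving the interval would require roughly four times as many splits, which helps explain why the intervals remain wide with 10 splits.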
From a user’s point of view, it has to be decided whether the performance gain of model C relative to the other models justifies the associated processing effort in terms of time and resources. To support this holistic perspective, the following table gives a general overview of relevant comparison criteria beyond accuracy. The machine learning based model A, for example, requires very little setup effort and still achieves good results. Only the prediction effort at runtime is comparatively high due to the need to calculate object-oriented features for each tile of interest. Based on the analyses conducted so far, the autoencoder-based approach (model B) has not proven successful, as it does not outperform model A despite considerable additional development effort.
Summarising comparison of the models

Appendix - Scripts & Full report

In order to make the analyses reproducible and fully comprehensible, the scripts used for processing as well as an extended report on the project are added below.