Agricultural parcel delineation based on multitemporal Sentinel-2 data
Machine and deep learning approaches for instance segmentation

Executive summary

Using machine and deep learning techniques to automate the delineation of agricultural parcels based on satellite data is a promising yet challenging task in computer vision. In this context, two specific approaches from the field of supervised learning are presented in this work. Specifically, it is analysed how the problem, which may most naturally be framed as an instance segmentation task, can be approached by leveraging traditional knowledge-based computer vision techniques embedded in a machine learning based framework in comparison to more recently evolving convolutional deep learning methods. For the former branch, a computationally lightweight edge-based, hierarchical segmentation is used, whose parametrisation is carried out in an efficient manner using Bayesian optimisation. Regarding the deep learning approach, Mask-RCNN is used as a well-known baseline model and refined towards a new architecture that incorporates modular improvements including backbone attention mechanisms, rotational invariance and mask scoring. Both branches of methods are benchmarked by utilising one of the largest open-source datasets consisting of multitemporal Sentinel-2 data and correspondingly labelled ground truth for a diverse set of agricultural production systems across Europe. In line with expectations, the results of this study show that the machine learning model with a small number of parameters is capable of producing instance segmentations from just a few samples – it reaches a performance close to its optimum with a mean Average Precision (mAP) of 0.07 when using 10 labelled tiles of size 256×256. However, if labelled data is abundant, the presented deep learning-based approach with its higher capacity to learn the specifics of the given task exceeds the accuracies of the machine learning model by a large margin – an mAP of 0.17 is achieved when using 4000 labelled tiles for training.
Using a rigorous evaluation schema with an extensive set of metrics, this study provides valuable indications regarding the distinct circumstances in which both methods prevail or falter.

Topic relevance & Research objectives

The automated delineation of agricultural land is pivotal for a number of remote sensing based downstream tasks. Among the most prominent are land use classifications with the differentiation of crop types, but also analyses of the phenological development of plants. In both cases, the existence of plot boundaries enables analyses at the level of meaningful field objects with the aggregation of pixel-wise information. Beyond immediate remote sensing based downstream tasks, automated parcel delineations may provide value for the creation of cadastral information in developing countries and thus contribute to tenure security in turn linked to an efficient exploitation of land as an economic resource. Finally, ecological analyses on biodiversity, which is strongly linked to landscape structures, may benefit from the availability of parcel delineations.

Downstream tasks benefiting from automated parcel delineations

Existing research on automated parcel delineations exhibits several deficiencies. Most studies focus on developing an approach for either one of the two major branches of knowledge-based, traditional Artificial Intelligence (AI) approaches or purely data-driven deep-learning approaches. Classical AI approaches often neglect elements of supervised machine learning. They tend to be based on complex ruleset-based workflows that require manual reparametrisation to apply them in other Areas of Interest (AoIs). Meanwhile, deep learning based approaches mostly frame the problem as a semantic segmentation task instead of the more intuitive conceptualisation as an instance segmentation task. A comparison between non-deep learning and deep learning based approaches on large scale data sets is also largely missing. Many studies use small-scale, self-labelled data sets. This correlates with a limited diversity of the represented landscapes and field structures. Therefore, the question of how a certain algorithm performs in different AoIs remains unanswered.

Based on the research gaps identified before, the aim of this work is to develop and compare two supervised learning approaches for instance segmentation of agricultural parcels. Specifically, the main research objectives and contributions are two-fold:

  1. Analyse to what extent machine and deep learning approaches built on existing works can be improved by incorporating architectural changes aiming to resolve identified shortcomings in the corresponding models.
  2. Estimate the models' performances by training, testing and comparing them on subsets of different sizes that are drawn from a large open dataset covering a diverse set of landscape structures.

Data & Study Area

This work relies on the AI4Boundaries data set containing pairs of preprocessed multi-temporal satellite imagery and label information on field extents, field boundaries, distance transform masks and enumerations of fields. The monthly Sentinel-2 (S-2) composites cover a six-month period in 2019 and contain five bands with a spatial resolution of 10 m: R, G, B, Near Infrared (NIR) and the derived Normalized Difference Vegetation Index (NDVI). The S-2 data was preprocessed to level-2A surface reflectance products and cloud-free composites were derived by interpolation. Specifically, cloud masking was performed using the level-2A scene classification layer complemented by an outlier filter. The resulting time series of non-masked pixels were smoothed and their median value was taken. The label information was created using a vector data set that was compiled based on the annual geospatial aid application declarations made by farmers to receive support from the common agricultural policy of the European Union. Images and labels were compiled for a total of seven European countries.
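The compositing step described above – masking cloudy observations and taking the per-pixel median of the remaining time series – can be sketched in a few lines of numpy. This is a minimal illustration of the principle only, not the actual AI4Boundaries preprocessing chain (which additionally smooths and interpolates the time series); the function name is illustrative.

```python
import numpy as np

def median_composite(timeseries, cloud_mask):
    """Cloud-free composite: per pixel, take the median over all
    non-masked observations in a (T, H, W) time series."""
    valid = np.where(cloud_mask, np.nan, timeseries)  # masked observations become NaN
    return np.nanmedian(valid, axis=0)                # median over the time axis
```

With one fully cloudy acquisition out of three, the composite is the median of the two remaining dates, so single outlier observations are suppressed.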

S-2 time series (true-color RGB and false-color infrared) with label information. Tiles have an extent of 256×256 pixels

During the exploratory data analyses, some flaws in the AI4Boundaries data set were noticed. The most serious ones – introducing many false positives to the data sets – concern…

  • mislabelling of non-agricultural areas as agricultural parcels (frequently occurring for scenes in the Alpine region)
  • fragmentations of parcels, i.e. extreme oversegmentations into sets of pixelwise segments

Accordingly, extended pre-processing based on CORINE Land Cover (CLC) products as auxiliary data and geometric/shape-related filter criteria was applied to derive an enhanced version of the AI4Boundaries data set. This filtered dataset contains 4955 tiles of 256×256 pixels each, which is about 2/3 of the original size of AI4Boundaries.

A unique feature of the AI4Boundaries dataset compared to other datasets is its heterogeneity with the coverage of different landscape structures and management patterns. To adequately describe this diversity of the dataset, three complementary representations can be used:

  1. The geographical localisation of the tiles with their dispersion across countries and correspondingly different biogeographical zones and climates
  2. Deep feature clusterings of the satellite image timeseries
  3. Distributions of label-specific information such as shape characteristics of the parcels

Whereas the 1st representation is intuitive, it only gives indications of the diversity of the data set. For example, it is likely, but not guaranteed, that management patterns differ depending on location. Intensive arable farming, for example, could be practised in similar ways in different European countries. In order to confirm that the spatially stratified sampling design with its coverage of different countries does indeed constitute a diverse data set, a targeted analysis of the image and label information itself is required (2nd & 3rd representations). Deep feature clustering provides an expressive means to systematically screen and represent the thousands of multispectral and -temporal images. In this process, feature descriptions for each image are obtained by applying a pre-trained CNN and extracting the high-dimensional vector of the penultimate layer. By subsequently applying a dimensionality reduction algorithm (such as t-Distributed Stochastic Neighbour Embedding (t-SNE)), each image’s vector can be mapped as a point into a 2D space. Possible clustering within this space informs about the structural similarity of images. For the statistical analysis of the label information, the distribution of field sizes (in ha) provides a natural description of the heterogeneity of label information. Other shape-related descriptors such as eccentricity (deviation of an ellipse from a perfect circle, defined as the ratio of the distance between its focal points to the length of its major axis) and solidity (ratio between area of an object to the area of its convex hull) can complement this characterisation by enabling statements about the compactness of fields.
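The two shape descriptors can be computed directly from a binary parcel mask. The following sketch derives eccentricity from the second-order central moments (the eigenvalues of the pixel covariance matrix correspond to the squared semi-axes of the equivalent ellipse) and solidity from the area of the convex hull of the pixel corners; the function names are illustrative and not taken from any particular library.

```python
import numpy as np
from scipy.spatial import ConvexHull

def eccentricity(mask):
    """Eccentricity of the ellipse with the same second-order central
    moments as the object (0 = perfect circle, -> 1 = line segment)."""
    ys, xs = np.nonzero(mask)
    x, y = xs - xs.mean(), ys - ys.mean()
    mu20, mu02, mu11 = (x * x).mean(), (y * y).mean(), (x * y).mean()
    # eigenvalues of the covariance matrix: major / minor axis scales
    common = np.sqrt(((mu20 - mu02) / 2) ** 2 + mu11 ** 2)
    lam1 = (mu20 + mu02) / 2 + common
    lam2 = (mu20 + mu02) / 2 - common
    return np.sqrt(1 - lam2 / lam1)

def solidity(mask):
    """Ratio of object area to the area of its convex hull."""
    ys, xs = np.nonzero(mask)
    # use pixel corners so that a filled rectangle has solidity exactly 1
    corners = np.concatenate([np.column_stack((ys + dy, xs + dx))
                              for dy in (-0.5, 0.5) for dx in (-0.5, 0.5)])
    hull = ConvexHull(corners)     # hull.volume is the 2-D hull area
    return mask.sum() / hull.volume
```

A compact square field yields eccentricity near 0 and solidity 1, while a narrow strip or an L-shaped parcel scores high eccentricity or low solidity, respectively – exactly the distinctions needed to characterise field compactness across the dataset.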


Model A: Machine learning driven OBIA

As for the machine learning based approach, this work utilises an edge-based segmentation with a watershed tree at its core. This is followed by a Random Forest (RF) classifier to distinguish between agricultural and non-agricultural land. Due to the two stage character of this approach consisting of segmentation followed by classification – both parts incorporating knowledge about scene understanding and carried out in a supervised manner – we refer to this as a machine learning driven OBIA approach.

Machine-learning driven OBIA workflow

Hierarchical watersheds are multi-scale representations of an image describing a sequence of watershed segmentations. First, a gradient map is represented as an edge-weighted graph with the mean gradient intensities as its vertex weights. The edge-weighted graph can be turned into a hierarchical tree representation by ranking the minima of the edge-weighted graph according to a criterion such as dynamics. Any thresholding of the tree below the root node’s altitude results in a segmentation of the image into a set of disjoint regions that collectively cover the whole image. Starting from image gradients, such as those given by field boundaries, hierarchical watersheds thus enable scale-adaptive segmentation by choosing a suitable height for cutting the tree. Given this core element for the segmentation stage, additional modifications to refine the segmentation result can be added and systematically tested. Among these modifications are for example approaches to include geometrical compactness constraints in the creation of the hierarchical watershed. Further enhancements relate to preprocessing steps for image enhancement – Fuzzy Histogram Hyperbolisation (FHH) and Contrast-Limited Adaptive Histogram Equalisation (CLAHE) – or postprocessing steps such as contour simplifications. In the experiments carried out in the course of this thesis, most of these modifications turned out to result in no or only marginal improvements. The segmentation pipeline depicted above was therefore finally reduced to its core part of gradient calculation and hierarchical watershed creation while omitting pre- & postprocessing steps. The parameters guiding the segmentation, such as the altitude to cut the watershed tree, were determined by running a Bayesian optimisation with a loss function relying on the F1 score between the predicted segmentation map and the ground truth.
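The principle of cutting an edge-weighted graph at a chosen altitude can be illustrated with a minimal union-find sketch: pixels are vertices, 4-adjacency edges carry the local gradient, and merging all edges below a threshold produces one flat segmentation out of the hierarchy. Note that this is a strong simplification of the actual watershed-tree construction used in this work (it omits, for instance, the ranking of minima by dynamics); the function is purely didactic.

```python
import numpy as np

def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def threshold_cut_segmentation(image, threshold):
    """Merge 4-adjacent pixels whose gradient (absolute intensity
    difference) lies below `threshold`; return a label image."""
    h, w = image.shape
    idx = np.arange(h * w).reshape(h, w)
    parent = np.arange(h * w)
    # collect 4-adjacency edges with their gradient weights
    edges = []
    for a, b in [(idx[:, :-1], idx[:, 1:]), (idx[:-1, :], idx[1:, :])]:
        weights = np.abs(image.ravel()[a.ravel()] - image.ravel()[b.ravel()])
        edges.append(np.column_stack((weights, a.ravel(), b.ravel())))
    edges = np.concatenate(edges)
    edges = edges[np.argsort(edges[:, 0])]  # weakest gradients first
    for wgt, a, b in edges:
        if wgt >= threshold:
            break  # all remaining edges lie above the cut altitude
        ra, rb = find(parent, int(a)), find(parent, int(b))
        if ra != rb:
            parent[ra] = rb
    roots = np.array([find(parent, i) for i in range(h * w)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels.reshape(h, w)
```

Raising the threshold merges more regions, i.e. moves the cut higher up the hierarchy – the same scale-adaptivity that the Bayesian optimisation exploits when tuning the tree-cutting altitude.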

To perform the binary classification step, an RF model as an ensemble model characterised by its computational efficiency and its robustness to overfitting is used. To perform classification on the segment level, two basic options exist: Either one performs pixelwise classification and aggregates the results afterwards, or object-specific features are calculated and the RF is trained to classify whole segments. The former is computationally more efficient since the post-aggregation step is fast, whereas object-specific feature calculation, depending on the choice of features, tends to be slow. Conversely, object-specific feature sets can be more comprehensive as shape characteristics of segments and their neighborhood relationships can be included. Consequently, both feature set categories and their corresponding ways of information aggregation were assessed. In the course of the hyperparameter experiments, the pixel-specific feature sets with post-aggregation proved to be superior to the object-specific pre-aggregation strategy. A feature set consisting of spectral and textural features computed on the pixel level complemented by environmental covariates at the tile level (precipitation, temperature and biogeographical region) was thus chosen for the final model configuration.
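The post-aggregation strategy can be sketched concisely: given pixelwise class probabilities (e.g. from an RF) and a segment label image, the probabilities are averaged within each segment and the winning class is assigned to all of the segment's pixels. This is a minimal numpy illustration of the aggregation step only, not the full classification pipeline; the function name is illustrative.

```python
import numpy as np

def aggregate_to_segments(pixel_probs, segment_labels):
    """Aggregate pixelwise class probabilities (H, W, C) to one class
    per segment by averaging probabilities within each segment."""
    h, w, c = pixel_probs.shape
    flat = segment_labels.ravel()
    n_segments = flat.max() + 1
    # per-segment sums of probabilities, one bincount per class channel
    sums = np.zeros((n_segments, c))
    for k in range(c):
        sums[:, k] = np.bincount(flat, weights=pixel_probs[..., k].ravel(),
                                 minlength=n_segments)
    counts = np.bincount(flat, minlength=n_segments)
    mean_probs = sums / counts[:, None]
    segment_class = mean_probs.argmax(axis=1)  # one class per segment
    return segment_class[segment_labels]       # broadcast back to pixel grid
```

A useful side effect of the averaging is noise suppression: isolated pixels with a deviating prediction are outvoted by the rest of their segment.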

Model B: Deep instance segmentation

The basis of the deep learning approach is a Mask-RCNN model. It builds on Faster RCNN as a model for object detection. Both architectures belong to the family of two-stage detectors, which predict rectangular object proposals, so-called Regions of Interest (RoIs), and refine these results in a second stage. Mask-RCNN expands the functionality of Faster RCNN by including a mask branch that performs semantic segmentation to derive instance masks and not just bounding boxes for each region proposal. Building on this foundation of Mask-RCNN, several partial improvements to the architecture can be examined for their potential for the given task of parcel delineation (B,C,D,E in the figure below).

Architectural elements for the deep learning approach

  • The most basic modification addresses the enhancement of the backbone’s feature extraction capabilities by including an attention module. To this end, Efficient Channel Attention (ECA) is used to recalibrate channel-wise feature responses based on their interdependencies.
  • A more advanced modification concerns the introduction of an oriented Region Proposal Network (RPN) head followed by oriented bounding box and mask heads. The common Mask RCNN architecture extracts axis-aligned region proposals and bounding boxes, which poses problems in the case of diagonally oriented, adjacent, elongated, narrow field strips. Enabling rotated region proposals and bounding boxes aims at increasing the expressiveness of features extracted for each object and avoiding the suppression of too many objects in cases where they are close to each other.
  • The replacement of the standard mask head by a PointRend head aims to increase the boundary precision of the mask predictions. Due to the limited dimension of the RoI-aligned feature maps (7×7) and the mask prediction (28×28), the mask boundaries given by the common Mask-RCNN are often oversmoothed. The PointRend head tries to counteract this by refining coarse scale mask predictions iteratively using an adaptive subdivision algorithm and making point-wise predictions for the most uncertain regions close to the mask boundaries.
  • An alternative head modification, the mask scoring head, is directed towards improving the informative value of the mask score. The mask score is usually used in the evaluation and postprocessing of the results as a confidence of instance classification. In the original Mask-RCNN architecture, however, it simply corresponds to the bounding box classification score while neglecting shape information on the mask prediction. The mask scoring head solves this problem by using a second block of convolutional and fully connected layers to regress the Intersection over Union (IoU) between mask and ground truth and include this in the mask score calculation.
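The ECA mechanism from the first bullet point is simple enough to sketch in plain numpy: global average pooling yields one descriptor per channel, a small 1-D convolution models local cross-channel interaction, and a sigmoid gate rescales the channels. The sketch below uses a fixed averaging kernel as a stand-in for the learned convolution weights, so it illustrates the data flow rather than a trained module.

```python
import numpy as np

def eca(feature_map, kernel_size=3):
    """Efficient Channel Attention on a (C, H, W) feature map:
    global average pooling, a 1-D convolution across the channel
    dimension, a sigmoid gate, and channel-wise rescaling."""
    gap = feature_map.mean(axis=(1, 2))               # (C,) channel descriptors
    kernel = np.full(kernel_size, 1.0 / kernel_size)  # stand-in for learned weights
    mixed = np.convolve(gap, kernel, mode='same')     # local cross-channel interaction
    gate = 1.0 / (1.0 + np.exp(-mixed))               # sigmoid attention weights in (0, 1)
    return feature_map * gate[:, None, None]
```

Because the gate is computed from a single pooled vector and a tiny 1-D convolution, ECA adds almost no parameters to the backbone – the main reason it is an attractive, lightweight recalibration module.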

The experiments on the choice of the final model architecture showed that especially the modifications regarding the oriented RPN head and mask scoring led to significant increases in accuracy. Smaller improvements are achieved by the ECA module and the PointRend head. Accordingly, a combination of the elements A, B, C & E was chosen for the final model constituting a new architecture for instance segmentation that is called oriented mask scoring RCNN (OMS-RCNN).

Results & Discussion

To evaluate the models, two sets of metrics are used – classical instance segmentation metrics quantifying the global performance of a model and geometric segmentation metrics allowing to report detailed results on the individual field level and to analyse different types of errors that are occurring.

  • For instance segmentation metrics, mean Average Precision (mAP) is the central metric summarising the precision-recall tradeoff of detecting objects over a range of IoU thresholds. Specifically, IoU values are used as a criterion to match predictions with ground truth objects and categorise them as true positives, false positives or false negatives. Ranking the predictions according to a confidence score, a precision-recall curve is created. The area under this curve gives the Average Precision (AP) for a given IoU threshold, e.g. AP50 and AP75 for IoU thresholds of 0.5 and 0.75, respectively. The average of APs over a range of IoU thresholds in the interval from 0.5 to 0.95 finally results in the mAP. Commonly, mAP scores are also reported for different object sizes. For the given study, parcels are defined as small when their area is below 1 ha, medium when they are 1-2.5 ha and large when they are above 2.5 ha.
  • For geometric accuracy measures, the usual IoU measure is supplemented by a more restrictive Boundary IoU (BIoU). BIoU quantifies the IoU for mask pixels within a certain distance from the boundaries of a prediction and its corresponding ground truth, thereby avoiding the insensitivity of the usual IoU to errors in larger objects. The distance tolerance was set to 20 m for this study. While IoU and BIoU are symmetric metrics, i.e. a swap of ground truth and prediction masks does not change their values, the metrics Intersection over Reference (Ioref) and Intersection over Prediction (Iopred) are included to quantify tendencies of over- and undersegmentation. The same applies to the Mean Absolute Error with respect to the Reference (MAEref) and Mean Absolute Error with respect to the Prediction (MAEpred), which allow an intuitive quantification of the error in the absolute unit of meters.
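The distinction between the symmetric IoU and the asymmetric Ioref/Iopred pair can be made concrete for a single pair of binary masks; the sketch below is a minimal illustration (it omits the instance matching and the boundary restriction of BIoU), and the function name is illustrative.

```python
import numpy as np

def overlap_metrics(reference, prediction):
    """IoU (symmetric) plus Intersection over Reference / Prediction
    (asymmetric) for one pair of binary masks."""
    ref, pred = reference.astype(bool), prediction.astype(bool)
    inter = np.logical_and(ref, pred).sum()
    union = np.logical_or(ref, pred).sum()
    return {
        'IoU': inter / union,         # penalises both error directions equally
        'Ioref': inter / ref.sum(),   # low: reference only partly covered
        'Iopred': inter / pred.sum(), # high with low Ioref: prediction is a
    }                                 # fragment of the reference (oversegmentation)
```

For a prediction covering only half of a reference field, Iopred is 1.0 while Ioref drops to 0.5 – the Iopred > Ioref signature of oversegmentation discussed in the results.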

The overall summary on achieved accuracies depending on the number of training samples is depicted below. As expected, model A as a robust machine learning model performs decently even with very small sample sizes. In fact, the maximum achievable performance for model A is almost reached with a sample budget of 10 tiles. For the AP metrics as well as the summary IoU and BIoU, there is a slight increase in accuracy when using 50 tiles. From then on, the curve of accuracy as a function of sample size reaches its plateau and further increases in sample size do not lead to significant improvements. Conversely, for model B as a deep learning model with high complexity and expressivity, there is a steady increase in accuracy as a function of the training set size, culminating in the clear superiority described above at a sample size of 4000 tiles. Across the various metrics, slightly different sample size ranges can be identified at which model A and model B perform on par. For the balanced summary metrics, the critical sample size ranges are about 50 tiles (mAP), 100 tiles (IoU) and between 100-250 tiles (BIoU). Due to the differing tendency of both models to oversegment, for some of the non-symmetric geometric measures of accuracy such as Iopred the deep learning model requires more than 250 tiles to outperform the machine learning model.

Model accuracies depending on number of training samples.
For the geometric accuracy metrics both, the mean (solid line) and the median values (dashed line), are shown.

To allow a more detailed analysis of types of errors and spatial patterns, the following figure provides a less aggregated view on the results based on the geometric accuracies measured on the instance level. While many different pieces of information can be extracted from these representations, the most prominent and relevant ones in light of the research questions are the following:

  • In accordance with the above findings, the accuracies for the deep learning model are significantly higher than those of the machine learning model. Improvements are seen across all geometric accuracy metrics as evident from the histograms and the corresponding statistics on mean and median values.
  • Despite different accuracy levels, both models show a high similarity in spatial patterns. For example, the predictions measured by the summary metrics IoU and BIoU for northern and central France are most accurate in both models, while the worst results are obtained in case of the Alpine region, Slovenia and northern Sweden. The increased capacity of the deep learning model increases the prediction accuracy compared to the machine learning model relatively evenly across all regions, whereas the absolute level of accuracies seems to be strongly tied to the inherent difficulty of field delineation depending on a given landscape structure. It seems intuitive that for any supervised framework the recognition of well-structured intensively managed agricultural areas in the Continental and Atlantic biogeographical zones is easier than the delineation of parcels in areas of alpine grassland management. Likewise, complex agricultural management patterns with fields of below-average size, as they are predominantly occurring in Slovenia, are naturally hard to segment.
  • Both models have a tendency to oversegment as evidenced by the higher values of Iopred compared to Ioref. This tendency is more pronounced in model A than in model B, which can be attributed to the design of the loss function in model A. The F1 score based loss is oriented towards the optimisation of the average IoU across the tiles measured between the pairs of ground truth and predictions. An undersegmentation of a few objects influences the average IoU more than a slight oversegmentation of many objects, such that the corresponding parameter combination optimised at the tile level tends towards oversegmentation by choosing lower tree-cutting thresholds. In addition, a possibly high number of false positive predictions in the sense of fragmentation of a single ground truth object into many predictions is not penalised as the IoU is only evaluated for the best matching prediction per ground truth. A less pronounced tendency to oversegmentation (as in the case of model B) may generally be seen as a positive property for a number of downstream tasks. Compared to potential undersegmentation, oversegmentation has fewer adverse effects, for example, on remote sensing-based analyses of vegetation phenology or crop classifications.

Geometric accuracy metrics for model A (left) and model B (right). Frequency distributions and statistics are shown together with spatial patterns of median values. For the latter, metrics for individual fields are aggregated on a hexagonal grid with a side length of 32.5 km. The frequency distributions and statistics are based on the spatially non-aggregated results for individual instances. For both models, all depicted accuracies refer to the scenarios with the largest training sets (i.e. 250 samples for model A and 4000 samples for model B).

To complement the quantitative evaluation at the aggregate level with a qualitative visual examination of individual results, the following figures present a selection of predictions of different qualities for each biogeographical zone. Note that the oversegmentation tendency of model A is clearly evident and often occurs in cases of highly textured satellite imagery (e.g. top row Atlantic, bottom row Mediterranean). The Canny edge detector used to obtain the gradient information as input for the hierarchical watershed is too simple to exclude this textural noise. In contrast, the deep learning model is able to adapt to the heterogeneity of the input images and to learn the necessary criteria for extracting relevant boundaries from a diverse set of scene structures. The high quality of the predictions that represent medium performing examples (middle row) and best performing examples (bottom row) indicates the overall broad applicability of the deep learning model. Note that the comparison of the predictions for individual tiles with the corresponding ground truth also reveals a number of cases where low accuracies are due to quality deficiencies of the ground truth rather than inaccurate predictions.

Exemplary parcel delineations for model A (left) and model B (right). For every biogeographical region, worst (top row), average (middle row) and best (bottom row) predictions measured by the median IoU with the ground truth are visualised. The ground truth is shown in white and the statistics represent the median values for the given tile.

Supplementary resources