Ranking will be based on the three metrics below, which are combined into a final score.

  1. Teeth localization accuracy (TLA): the mean normalized Euclidean distance between each ground truth (GT) tooth centroid and the closest localized tooth centroid. Each distance is normalized by the size of the corresponding GT tooth. If no centroid is available (e.g. the algorithm crashes or produces no output for a given scan), a nominal penalty of 5 is assigned per GT tooth, which corresponds to a distance of 5 times the GT tooth size. Because the number of teeth varies between patients, the mean is computed over all GT teeth pooled from the two testing sets.
  2. Teeth identification rate (TIR): the percentage of true identifications relative to all GT teeth in the two testing sets. A GT tooth counts as a true identification when the closest detected tooth centroid (a) lies at a distance of less than half the GT tooth size and (b) carries the same label as the GT tooth.
  3. Teeth segmentation accuracy (TSA): the average F1-score over all tooth instance point clouds. The F1-score of each tooth instance is computed as: F1 = 2 * (precision * recall) / (precision + recall)
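The TLA and TIR definitions above can be illustrated with a minimal sketch. It is not the official evaluation code: the function name `tla_and_tir`, the input layout, and the use of a single scalar "size" per GT tooth are assumptions made for illustration, and TIR is returned as a fraction in [0, 1] rather than a percentage.

```python
import math

def tla_and_tir(scans, penalty=5.0):
    """Pool all GT teeth over all scans and return (TLA, TIR).

    `scans` is a list of (gt_teeth, predictions) pairs (hypothetical layout):
      gt_teeth    -> list of (label, centroid, size)
      predictions -> list of (label, centroid); empty if the algorithm failed
    """
    normalized_distances = []
    true_identifications = 0
    for gt_teeth, predictions in scans:
        for gt_label, gt_centroid, gt_size in gt_teeth:
            if not predictions:
                # No centroid available: nominal penalty per GT tooth.
                normalized_distances.append(penalty)
                continue
            # Closest predicted centroid to this GT tooth.
            pred_label, pred_centroid = min(
                predictions, key=lambda p: math.dist(p[1], gt_centroid)
            )
            d = math.dist(pred_centroid, gt_centroid) / gt_size
            normalized_distances.append(d)
            # True identification: within half the tooth size, same label.
            if d < 0.5 and pred_label == gt_label:
                true_identifications += 1
    n_gt = len(normalized_distances)
    return sum(normalized_distances) / n_gt, true_identifications / n_gt

# Example: two scans, the second one has no output at all.
scan_a = (
    [(11, (0.0, 0.0, 0.0), 10.0), (12, (10.0, 0.0, 0.0), 10.0)],
    [(11, (1.0, 0.0, 0.0)), (13, (10.0, 0.0, 0.0))],
)
scan_b = ([(21, (0.0, 0.0, 0.0), 8.0)], [])
tla, tir = tla_and_tir([scan_a, scan_b])
```

In this example the first tooth is well localized and correctly labeled, the second is well localized but mislabeled, and the third receives the penalty, so only one of the three GT teeth counts toward TIR.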

TLA and TIR together capture the identification and localization capabilities of the algorithm. TSA is a fine-grained evaluation of 3D point labeling and directly reflects the quality of the teeth segmentation.
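As a rough illustration of TSA, the sketch below computes a per-tooth F1-score from point index sets and averages it over GT instances. How predicted instances are matched to GT instances is not specified here, so matching by a shared instance key is an assumption, as are the names `f1_score` and `tsa`.

```python
def f1_score(gt_points, pred_points):
    """F1 between two sets of point indices for one tooth instance."""
    true_positives = len(gt_points & pred_points)
    if true_positives == 0:
        # Covers empty predictions and disjoint sets (avoids division by zero).
        return 0.0
    precision = true_positives / len(pred_points)
    recall = true_positives / len(gt_points)
    return 2 * (precision * recall) / (precision + recall)

def tsa(gt_instances, pred_instances):
    """Average F1 over all GT tooth instances.

    Both arguments map an instance key to a set of point indices
    (hypothetical layout); a missing prediction scores 0 for that tooth.
    """
    scores = [
        f1_score(points, pred_instances.get(key, set()))
        for key, points in gt_instances.items()
    ]
    return sum(scores) / len(scores)

# Example: tooth 1 is partly recovered, tooth 2 is missed entirely.
gt = {1: {0, 1, 2, 3}, 2: {4, 5}}
pred = {1: {0, 1, 2, 7}}
tsa_value = tsa(gt, pred)
```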

Global ranking score

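The global ranking will be based on the average of three terms derived from the teeth localization accuracy (TLA), the teeth identification rate (TIR), and the teeth segmentation accuracy (TSA). Because TLA is a distance, it enters through exp(-TLA), so that a smaller localization error yields a larger contribution.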

Score = [Exp(-TLA) + TIR + TSA]/3.
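A small sketch of the formula follows. It assumes TIR and TSA are expressed as fractions in [0, 1], so that each of the three terms lies in (0, 1] or [0, 1]; the function name `global_score` is illustrative only.

```python
import math

def global_score(tla, tir, tsa):
    """Score = [exp(-TLA) + TIR + TSA] / 3, with TIR and TSA assumed in [0, 1]."""
    return (math.exp(-tla) + tir + tsa) / 3

# A perfect method (zero localization error, all teeth identified and
# segmented exactly) reaches the maximum score of 1.
perfect = global_score(0.0, 1.0, 1.0)

# Increasing the localization error lowers the score, all else being equal.
worse = global_score(1.0, 1.0, 1.0)
```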

In addition, we will provide intermediate rankings based on each metric to highlight the performance of the competing methods in each of the three subtasks: localization, identification, and segmentation.

Winners will be the top-ranked methods in the global ranking. Although the challenge tasks are not expected to run in real time, we will add runtime statistics for each participant's algorithm, as estimated on the challenge platform.