Mar 12, 2026

Dental AI: building a tooth segmentation model with YOLOv12

Applying modern object detection to dental radiographs is harder than it looks. Here is the dataset challenge, the model architecture choices, and what the deployment pipeline looks like.

ai, computer-vision, healthcare, python

This started as part of a larger dental AI platform. The goal was to take dental X-rays (panoramic and periapical radiographs) and automatically segment and classify individual teeth. Sounds straightforward. It is not.

Dental imaging is a genuinely hard computer vision domain. The data is scarce, the annotations are expensive, and the images from different X-ray machines look completely different. Here is what building this actually looked like.

the dataset problem

Annotated dental radiographs are rare. Medical imaging datasets in general are limited by privacy regulations and annotation cost. Getting a radiologist to manually draw segmentation masks around 32 teeth in a panoramic X-ray takes real time and real money. We could not just scrape the internet for training data.

The dataset we ended up with had several challenges:

Class imbalance: some tooth types (central incisors, first molars) appear frequently, while third molars (wisdom teeth) are often absent or only partially erupted and are therefore underrepresented.
Equipment variation: panoramic X-rays from different machines have different contrast profiles, zoom levels, and distortion characteristics. A model trained only on images from one scanner generalizes poorly to others.
Pathology: teeth with crowns, implants, root canals, or heavy decay look significantly different from healthy teeth. These need to be represented in training data or the model fails on the most interesting cases.
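One common way to counter the class imbalance described above is to weight the loss (or the sampler) by inverse class frequency. The post does not say which mitigation was used. This is a minimal sketch that assumes per-annotation class labels are available as a flat list. The function name and the example labels are hypothetical and are not figures from this dataset.

```python
from collections import Counter


def inverse_frequency_weights(labels, smoothing=1.0):
    """Return {class: weight} with weight = N / (K * (count + smoothing)).

    N is the number of labels and K the number of distinct classes, so rare
    classes (e.g. third molars) get larger weights. `smoothing` keeps a class
    with only a handful of examples from receiving an extreme weight.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    k = len(counts)
    return {cls: total / (k * (n + smoothing)) for cls, n in counts.items()}


# Hypothetical labels: one entry per annotated tooth.
example = ["incisor"] * 90 + ["first_molar"] * 85 + ["third_molar"] * 10
weights = inverse_frequency_weights(example)
```

The resulting weights can feed a weighted loss or a weighted sampler. Oversampling images that contain rare classes is an alternative with a similar effect.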

The augmentation pipeline ended up being as important as the dataset itself. It combined random brightness and contrast adjustments to simulate different machines, geometric augmentations within physiologically reasonable bounds, and synthetic tooth overlays to add pathology examples we did not have enough of in the real data.
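The photometric and geometric parts of such a pipeline can be sketched in plain Python. This sketch assumes grayscale images stored as rows of floats in [0, 1]. Every bound below (±15% brightness, ±20% contrast, ±5° rotation, 0.9 to 1.1 scale, 5% shift) is a hypothetical placeholder, since the post does not state the actual limits. Function names are illustrative.

```python
import random


def jitter_brightness_contrast(pixels, rng, max_brightness=0.15, max_contrast=0.2):
    """Apply one random brightness/contrast change to a grayscale image.

    Contrast is scaled around the image mean, a brightness offset is added,
    and the result is clipped back to [0, 1]. This roughly mimics the
    different exposure profiles of different X-ray machines.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    contrast = 1.0 + rng.uniform(-max_contrast, max_contrast)
    brightness = rng.uniform(-max_brightness, max_brightness)
    return [
        [min(1.0, max(0.0, (p - mean) * contrast + mean + brightness)) for p in row]
        for row in pixels
    ]


def sample_geometric_params(rng, max_rotation_deg=5.0, scale_range=(0.9, 1.1),
                            max_shift_frac=0.05):
    """Sample a small affine transform, kept within narrow (placeholder) bounds
    so the result still looks like a plausible radiograph."""
    return {
        "rotation_deg": rng.uniform(-max_rotation_deg, max_rotation_deg),
        "scale": rng.uniform(*scale_range),
        "shift_x": rng.uniform(-max_shift_frac, max_shift_frac),
        "shift_y": rng.uniform(-max_shift_frac, max_shift_frac),
    }
```

The geometric parameters would then be applied to the image, boxes, and masks together. Horizontal flips are deliberately absent here. Mirroring a radiograph swaps left and right teeth, so any per-tooth labels would have to be remapped as well.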

why YOLOv12

YOLO for medical imaging might sound like an odd choice. Detection architectures are fast, but are they accurate enough? For tooth segmentation the answer is yes, for a few reasons.

Teeth are spatially distinct objects with reasonably consistent size relationships to each other in a given image, and in healthy dentition they overlap only slightly, if at all. YOLO handles this kind of structured detection well.

More importantly, inference speed matters for clinical deployment. A model that takes 2 seconds per image is a tool. A model that takes 20 seconds is a friction point that practitioners stop using. YOLOv12 at 12ms per image on a reasonable GPU is fast enough to feel real-time during a clinical workflow.
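A per-image figure like the 12 ms above is only meaningful with a measurement method attached. Below is a minimal sketch of one way to measure it: time repeated calls and report percentiles rather than a single average. `fn` stands in for whatever callable runs the model. The warmup count is an arbitrary choice.

```python
import statistics
import time


def measure_latency_ms(fn, inputs, warmup=3):
    """Time fn(x) for each x in `inputs`; return p50/p95/p99/max in ms.

    A few warmup calls run first and are discarded, because the first GPU
    calls usually include one-off setup cost. Needs at least two inputs.
    """
    for x in inputs[:warmup]:
        fn(x)
    samples = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        samples.append((time.perf_counter() - start) * 1000.0)
    # n=100 yields the 99 cut points p1..p99; "inclusive" keeps them
    # within the observed range instead of extrapolating past the max.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98], "max": max(samples)}
```

The tail percentiles matter more than the median here. An occasional slow request is what a practitioner actually notices.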

The model configuration:

Input: 1024x1024
Backbone: YOLOv12 standard
Training: A100, mixed precision (fp16)
mAP@0.5: 0.89
Inference: 12ms per image
Parameters: 28M

An mAP@0.5 of 0.89 is decent for this domain, but there is room to improve, particularly on pathological teeth and partially erupted wisdom teeth. Those are the long tail of the dataset, and they are also exactly where a clinical tool needs to be reliable.
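The "@0.5" in mAP@0.5 is an IoU threshold. A predicted box counts as a true positive only if it overlaps an unmatched ground-truth box of the same class with intersection-over-union of at least 0.5. Below is a minimal sketch of that matching step, similar in spirit to COCO-style greedy matching and assuming boxes as (x1, y1, x2, y2). Precision/recall accumulation into AP is omitted, and the function names are illustrative.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_at_iou(preds, gts, thresh=0.5):
    """Greedily match predictions to ground truths of one class in one image.

    preds: list of (confidence, box); gts: list of boxes.
    Highest-confidence predictions claim ground truths first, and each
    ground truth can be matched at most once; duplicates become false
    positives. Returns a list of (confidence, is_true_positive).
    """
    used = set()
    results = []
    for conf, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j in used:
                continue
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= thresh:
            used.add(best_j)
            results.append((conf, True))
        else:
            results.append((conf, False))
    return results
```

AP for a class is then the area under the precision/recall curve built from these flags across all images. mAP averages that over classes, so weak classes such as wisdom teeth pull the headline number down directly.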

the deployment pipeline

The model runs as a FastAPI service:

POST /segment
Content-Type: multipart/form-data

Response: {
  "teeth": [
    {"id": 11, "bbox": [...], "mask": "...", "confidence": 0.94},
    ...
  ],
  "processing_time_ms": 47
}
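The response shape above can be pinned down with typed models. FastAPI would typically use Pydantic models for this, but it also accepts dataclasses, and plain dataclasses keep this sketch dependency-free. The class and function names are hypothetical. The bbox format and mask encoding are elided in the post, so they are passed through as-is.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class ToothResult:
    id: int            # tooth identifier, as in the "id" field of the response
    bbox: list         # box coordinates; exact format not specified in the post
    mask: str          # encoded mask; encoding not specified in the post
    confidence: float


@dataclass
class SegmentResponse:
    teeth: list = field(default_factory=list)
    processing_time_ms: int = 0


def build_response(detections, started_at):
    """Serialize detections into the /segment response shape.

    `detections` is an iterable of (id, bbox, mask, confidence) tuples;
    `started_at` is a time.perf_counter() value taken when the request
    arrived, so processing_time_ms covers the whole request.
    """
    teeth = [ToothResult(int(i), list(b), m, float(c)) for i, b, m, c in detections]
    elapsed_ms = int((time.perf_counter() - started_at) * 1000)
    return json.dumps(asdict(SegmentResponse(teeth, elapsed_ms)))
```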

ONNX Runtime handles the inference on GPU. Exporting from PyTorch to ONNX was not entirely painless (a few operators need careful handling), but the resulting model runs on any ONNX-compatible runtime without the full PyTorch stack installed.
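Below is a minimal sketch of both sides of that boundary. It uses `torch.onnx.export` with a dynamic batch dimension (so concurrent requests can be batched) and an `onnxruntime.InferenceSession` that prefers CUDA and falls back to CPU. The input name `"images"`, opset 17, and the (N, 3, 1024, 1024) layout are assumptions about the export, not details from the post. Imports are deferred so each function only needs its own library.

```python
def export_to_onnx(model, path, size=1024, opset=17):
    """Export a PyTorch model with a dynamic batch axis (export side only)."""
    import torch  # only required on the machine doing the export

    model.eval()
    dummy = torch.zeros(1, 3, size, size)
    torch.onnx.export(
        model, dummy, path,
        input_names=["images"],
        # Batch dimension left symbolic; models with several output heads
        # may also need output_names/dynamic_axes declared for each output.
        dynamic_axes={"images": {0: "batch"}},
        opset_version=opset,
    )


def load_session(model_path):
    """Create an ONNX Runtime session, preferring CUDA and falling back to CPU."""
    import onnxruntime as ort  # the serving side needs no PyTorch

    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    available = set(ort.get_available_providers())
    providers = [p for p in preferred if p in available]
    return ort.InferenceSession(model_path, providers=providers)


def run_model(session, batch):
    """Run one forward pass on a float32 array shaped (N, 3, H, W)."""
    input_name = session.get_inputs()[0].name
    # output_names=None asks ONNX Runtime for every model output.
    return session.run(None, {input_name: batch})
```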

Automatic batching groups concurrent requests together. Redis caches results keyed by image hash, which is useful for demo environments where the same test images get submitted repeatedly. Prometheus metrics track inference latency by quantile, because tail latency, not the average, is the metric that actually matters for a clinical tool.
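The image-hash cache is a cache-aside lookup. Below is a minimal sketch, assuming a client with `get(key)` and `set(key, value, ex=seconds)`, which is the shape of a `redis.Redis` client in redis-py. The key prefix, the `model_version` component, and the one-hour TTL are hypothetical choices. `run_inference` stands in for the model call.

```python
import hashlib
import json


def image_cache_key(image_bytes, model_version):
    """Key on the exact uploaded bytes plus the model version, so deploying
    a new model never serves results computed by the old one."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    return f"segment:{model_version}:{digest}"


def segment_with_cache(cache, image_bytes, model_version, run_inference, ttl_s=3600):
    """Return a cached result if present, otherwise run inference and store it."""
    key = image_cache_key(image_bytes, model_version)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)  # redis-py returns bytes; json.loads accepts them
    result = run_inference(image_bytes)
    cache.set(key, json.dumps(result), ex=ttl_s)
    return result
```

Only byte-identical uploads hit the cache. The same radiograph re-exported or re-compressed produces a different hash, which is fine for the demo use case described above.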

the iOS integration

This backend powers the SkinScan iOS app on the analysis side. The iOS client captures the image, uploads it to the segmentation service, receives the bounding boxes and masks back, and overlays them on the displayed radiograph.

Getting the coordinate systems aligned between the original image, the model input (resized to 1024x1024), and the iOS display coordinates required careful attention. An off-by-one in any of the scaling calculations puts the mask overlays in the wrong place. This is the kind of bug that looks fine until you test on an image with an unusual aspect ratio.
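The two mappings can be written down explicitly. The post only says the input is resized to 1024x1024. This sketch assumes an aspect-preserving letterbox (scale to fit, pad evenly), which is common in YOLO pipelines, and a display that shows the image aspect-fit and centered, as UIKit's `.scaleAspectFit` does. The iOS client itself would be Swift; the arithmetic is the same. Names are illustrative. If the real preprocessing rounds the resized size or the padding, the inverse must use the same rounded values, which is exactly where off-by-one errors creep in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Letterbox:
    """Aspect-preserving resize into a square model input, padded evenly."""
    scale: float
    pad_x: float
    pad_y: float


def letterbox_params(orig_w, orig_h, size=1024):
    scale = min(size / orig_w, size / orig_h)
    return Letterbox(scale, (size - orig_w * scale) / 2, (size - orig_h * scale) / 2)


def model_to_original(box, lb):
    """Map (x1, y1, x2, y2) from model-input pixels back to original pixels."""
    x1, y1, x2, y2 = box
    return ((x1 - lb.pad_x) / lb.scale, (y1 - lb.pad_y) / lb.scale,
            (x2 - lb.pad_x) / lb.scale, (y2 - lb.pad_y) / lb.scale)


def original_to_display(box, orig_w, orig_h, view_w, view_h):
    """Map an original-pixel box into a view showing the image aspect-fit
    and centered (letterboxed inside the view)."""
    s = min(view_w / orig_w, view_h / orig_h)
    off_x = (view_w - orig_w * s) / 2
    off_y = (view_h - orig_h * s) / 2
    x1, y1, x2, y2 = box
    return (x1 * s + off_x, y1 * s + off_y, x2 * s + off_x, y2 * s + off_y)
```

For a 2000x1000 panoramic image, the letterbox scale is 0.512 with 256 px of padding above and below. A box that ignores that vertical padding would be shifted by 256 model pixels, or 500 original pixels. Masks need the same transform as boxes.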

what is next

Multi-class segmentation to distinguish tooth types from each other (not just "tooth" vs "background"). Pathology classification as a secondary head on the same backbone. Integration into a full clinical workflow prototype. The model is the foundation. The hard work is the product on top of it.