
Read Detection Boxes from Model Output


| Field | Value |
| --- | --- |
| Difficulty | Intermediate |
| Estimated Read Time | 15-20 minutes |
| Labels | postprocessing, boxdecode, detection |

A detector doesn't return boxes directly. Its raw output is a stack of feature maps that still needs thresholding, non-maximum suppression, and coordinate mapping before it means anything. SimaBoxDecode is the postprocessing stage that does all three in one optimized step, turning inference tensors into final detections in source-image pixels.

This chapter configures that decode — picking the model family with decode_type, gating confidence with the score threshold, suppressing overlaps with the NMS IoU threshold, and capping output with top_k — then runs the model and reads how many detections came back. By the end you will have a configured detector pipeline and a detection count read from its output, plus (in the In Practice reference below) the full wire format so you can parse boxes yourself in any runtime.

Walkthrough

Configure the decode

These options set both the input contract and the postprocessing behavior. decode_type (YoloV8 here) selects the model-family decode path. The confidence threshold drops weak candidates before NMS; the NMS IoU threshold controls how aggressively overlapping boxes are merged; top_k caps the final count for deterministic downstream cost; and boxdecode_original_width/boxdecode_original_height map decoded coordinates back into source-image pixels. Tuning guidance for each of these is in In Practice below.
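To make the order of operations concrete, here is a minimal pure-Python sketch of the three gates: confidence threshold, then NMS, then top_k. It is not SimaBoxDecode's implementation. The function names, the `>=` comparison, and the choice of greedy per-class NMS are assumptions made for illustration only.

```python
from typing import List, Tuple

# (x1, y1, x2, y2, score, class_id) in corner format; hypothetical type alias.
Box = Tuple[float, float, float, float, float, int]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def postprocess(candidates: List[Box], detection_threshold: float,
                nms_iou_threshold: float, top_k: int) -> List[Box]:
    # 1. Confidence gate: drop weak candidates before NMS sees them.
    kept = [c for c in candidates if c[4] >= detection_threshold]
    # 2. Greedy NMS (assumed per-class): visit boxes from highest score down
    #    and keep one only if it does not overlap an already-kept box of the
    #    same class by more than the IoU threshold.
    kept.sort(key=lambda c: c[4], reverse=True)
    survivors: List[Box] = []
    for c in kept:
        if all(c[5] != s[5] or iou(c, s) <= nms_iou_threshold for s in survivors):
            survivors.append(c)
    # 3. Cap the output so downstream cost is bounded.
    return survivors[:top_k]
```

With the tutorial's values (0.55, 0.5, 100), two heavily overlapping boxes of the same class collapse to the higher-scoring one, and anything scoring below 0.55 never reaches NMS.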

decode_type takes the BoxDecodeType::YoloV8 enum. The threshold/NMS/top_k values are passed later through stages::BoxDecodeOptions, not on Model::Options.

tutorials/007_read_detection_boxes/read_detection_boxes.cpp
simaai::neat::Model::Options opt;
opt.preprocess.color_convert.input_format = simaai::neat::PreprocessColorFormat::BGR;
opt.preprocess.input_max_width = bgr.cols;
opt.preprocess.input_max_height = bgr.rows;
opt.preprocess.input_max_depth = bgr.channels();
opt.decode_type = simaai::neat::BoxDecodeType::YoloV8;

Build the model

Constructing the Model from the archive plus options binds the decode configuration to the model so the inference and postprocessing stages derived from it use the settings above.

tutorials/007_read_detection_boxes/read_detection_boxes.cpp
simaai::neat::Model model(model_path, opt);

Run preprocess, inference, and decode

This is where a frame flows through preprocess, the MLA inference, and the box decoder to produce the detection output.

The path is made explicit stage-by-stage: stages::Preproc produces the input tensor, stages::Infer runs the model, and a stages::BoxDecodeOptions (with detection_threshold = 0.55, nms_iou_threshold = 0.5, top_k = 100) configures the decode that runs next.

tutorials/007_read_detection_boxes/read_detection_boxes.cpp
simaai::neat::TensorList pre = simaai::neat::stages::Preproc(std::vector<cv::Mat>{bgr}, model);
simaai::neat::Sample infer_samples = simaai::neat::stages::Infer(
    simaai::neat::Sample{simaai::neat::sample_from_tensors(pre)}, model);
if (infer_samples.empty())
  throw std::runtime_error("infer stage returned no samples");
simaai::neat::Sample infer = infer_samples.front();

simaai::neat::stages::BoxDecodeOptions box(simaai::neat::BoxDecodeType::YoloV8);
(void)box.decode_type;
(void)bgr.cols;
(void)bgr.rows;
box.detection_threshold = 0.55;
box.nms_iou_threshold = 0.5;
box.top_k = 100;

Read the boxes

Finally, turn the decode output into something you can use.

stages::BoxDecodeResults(...) returns a BoxDecodeResultList; the front result's boxes vector is already parsed into {x1, y1, x2, y2, score, class_id} clamped to source pixels, so decoded.boxes.size() is the detection count.

tutorials/007_read_detection_boxes/read_detection_boxes.cpp
// BoxDecode parses the "BBOX" tensor into {x1, y1, x2, y2, score, class_id}
// entries clamped to original_width x original_height source pixels.
simaai::neat::BoxDecodeResultList decoded_results =
    simaai::neat::stages::BoxDecodeResults(simaai::neat::Sample{infer}, model, box);
if (decoded_results.empty())
  throw std::runtime_error("boxdecode result parser returned no results");
const simaai::neat::BoxDecodeResult& decoded = decoded_results.front();

Run

Run the Python and C++ (prebuilt) commands from the Neat install root (the directory that contains share/ and lib/); run the build from source commands from the repo root.

C++ (prebuilt):

./lib/sima-neat/tutorials/tutorial_007_read_detection_boxes \
    --model /tmp/yolo_v8s.tar.gz --image /path/to/frame.jpg

C++ (build from source):

./build.sh --target tutorial_007_read_detection_boxes
./build/tutorials-standalone/tutorial_007_read_detection_boxes \
    --model /tmp/yolo_v8s.tar.gz --image /path/to/frame.jpg

Expected output (the box count depends on the frame; a synthetic frame yields zero):

boxes=0
[OK] 007_read_detection_boxes

(The Python build prints detections=..., or raw_output_heads=... if the runtime does not wire BoxDecode into model.run.) To integrate this chapter's C++ source into your own project with a custom CMakeLists.txt (no extras folder required), see How to Run Tutorials on the landing page.

In Practice

SimaBoxDecode emits a single output tensor tagged BBOX. The tensor carries a packed byte buffer that the runtime parser interprets into floating-point detections. Understanding that two-level contract (wire buffer vs. parsed Box records) is the key to reading the output from either Python or C++.

BBOX tensor

The decode stage produces one BBOX tensor per input frame with:

| Field | Value |
| --- | --- |
| semantic.detection.format | "BBOX" |
| dtype | UInt8 |
| shape | rank-1: [N_bytes], where N_bytes is the model archive-packed buffer capacity (for example [20160] on the stock YOLOv8 pack) |

The tensor shape is a byte count, not a detection count. The packed bytes hold both a small header and a contiguous array of fixed-size box records. N_bytes is determined by the model archive's buffers.input[0].size field (inside the boxdecode stage's config JSON) and bounds the maximum number of detections the decoder can emit in a single frame (see "Override contract" below for how runtime dims interact with packaged values).
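As a quick sanity check on that bound, the sketch below computes how many records fit in a buffer, assuming the layout described under "Packed wire format" (a 4-byte count header followed by 24-byte records). The constant and function names are illustrative, not part of the runtime.

```python
HEADER_BYTES = 4   # uint32 detection count at the start of the buffer
RECORD_BYTES = 24  # size of one RawBox record


def max_records(n_bytes: int) -> int:
    """Number of whole RawBox records that fit after the count header."""
    if n_bytes < HEADER_BYTES:
        return 0
    return (n_bytes - HEADER_BYTES) // RECORD_BYTES


print(max_records(20160))  # 839
```

Under these assumptions the stock 20160-byte buffer leaves room for far more records than a top_k of 100 will ever use, so top_k rather than buffer capacity is the effective cap in the tutorial configuration.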

Packed wire format

The uint8 buffer is laid out little-endian:

offset  size  content
------  ----  -------
0       4     uint32 N = number of valid detections in this frame
4       24    RawBox[0]
28      24    RawBox[1]
.       .     ...
.       .     RawBox[N-1]
(trailing bytes up to buffer capacity are padding, ignored)

Each RawBox record is 24 bytes:

| Offset in record | Size | Type | Field | Meaning |
| --- | --- | --- | --- | --- |
| 0 | 4 | int32 | x | top-left x, in source pixels |
| 4 | 4 | int32 | y | top-left y, in source pixels |
| 8 | 4 | int32 | w | width, in source pixels |
| 12 | 4 | int32 | h | height, in source pixels |
| 16 | 4 | float32 | score | post-NMS detection confidence in [0.0, 1.0] (the value detection_threshold gates on) |
| 20 | 4 | int32 | class_id | predicted class id (model-defined; 0-indexed; class-name map lives in the model archive metadata) |

The canonical Python struct format matching one record is "<iiiifi" (little-endian, 4 signed ints, one float, one signed int).
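For unit tests or offline tooling it can help to build a synthetic payload in the same layout. The sketch below uses only the standard `struct` module; `pack_bbox_payload` is a hypothetical helper written for this example, not a runtime function.

```python
import struct

RAW_BOX = struct.Struct("<iiiifi")  # x, y, w, h, score, class_id
assert RAW_BOX.size == 24           # "<" disables padding, so 6 * 4 bytes


def pack_bbox_payload(boxes, capacity):
    """Build a count header plus RawBox records, zero-padded to capacity."""
    body = struct.pack("<I", len(boxes))
    body += b"".join(RAW_BOX.pack(*box) for box in boxes)
    if len(body) > capacity:
        raise ValueError("boxes do not fit in the requested capacity")
    return body + bytes(capacity - len(body))


# One detection: 100x50 box at (10, 20), score 0.5, class 3.
payload = pack_bbox_payload([(10, 20, 100, 50, 0.5, 3)], capacity=20160)
```

A payload built this way can be fed to any parser you write against the wire format, which makes it easy to test edge cases such as N = 0 without running a model.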

The runtime's parsing helpers, parse_bbox_bytes and decode_bbox_tensor in include/pipeline/DetectionTypes.h, expand each RawBox into a Box struct for downstream code (tests/unit_testing/unit_detection_types_bbox_test.cpp pins the wire contract):

struct Box {
  float x1, y1, x2, y2; // x2 = x + w, y2 = y + h; clamped to [0, img_w|h]
  float score;
  int class_id;
};

Coordinate space

Coordinates decoded from BBOX are in original-image pixels, the same coordinate system you passed as original_width / original_height (or that the model archive was packaged with). They are not normalized to [0, 1], and they are not expressed in the model's internal letterboxed input space. The parser clamps (x1, y1, x2, y2) to [0, original_width] / [0, original_height] so caller code can draw them directly on the source frame.
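If you parse the wire format yourself, you need to reproduce that conversion and clamp. The following is a minimal sketch of the idea in Python; the `Box` tuple and `to_box` helper are illustrative names, and the runtime's own parser may differ in details such as rounding.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    class_id: int


def to_box(x, y, w, h, score, class_id, original_width, original_height):
    """Convert an (x, y, w, h) record to clamped corner coordinates."""
    def clamp(value, upper):
        return float(min(max(value, 0), upper))

    return Box(
        clamp(x, original_width),
        clamp(y, original_height),
        clamp(x + w, original_width),
        clamp(y + h, original_height),
        score,
        class_id,
    )
```

For example, a record that starts left of the frame and extends past its right edge comes back with x1 = 0 and x2 = original_width, so it can be drawn on the source image without further checks.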

Worked example

With the tutorial's runtime configuration (original_width = 640, original_height = 640, top_k = 100) and the stock YOLOv8 pack (buffers.input[0].size = 20160 in the boxdecode config), a single decoded frame yields:

  • out.kind == SampleKind.Tensor
  • out.payload_tag == "BBOX"
  • out.tensor.dtype == UInt8, out.tensor.shape == [20160]
  • Bytes [0:4] give N in little-endian; 0 <= N <= 100 because top_k = 100. An N of 0 means "no detections above threshold this frame" — iterate zero times and emit nothing.
  • Bytes [4 : 4 + 24 * N] hold the valid detections; everything after that offset is zero/padding and must be ignored.

Reading a box in Python is a struct.unpack_from:

import struct
payload = out.tensor.copy_payload_bytes()
count = struct.unpack_from("<I", payload, 0)[0]
for i in range(count):
    x, y, w, h, score, cls = struct.unpack_from("<iiiifi", payload, 4 + 24 * i)
    # (x, y, w, h) in source pixels; x2 = x + w, y2 = y + h

In C++ the stages::BoxDecode helper returns a BoxDecodeResult that has already done this unpack for you: result.boxes[i] is a Box with (x1, y1, x2, y2) already populated from (x, y, x+w, y+h) and clamped to the image.
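When the payload comes from an untrusted or possibly truncated source, it is worth validating the count header before reading records. The sketch below is one defensive way to do that with the standard `struct` module; `parse_bbox_payload` is a name chosen for this example and is not the runtime's parse_bbox_bytes.

```python
import struct

HEADER = struct.Struct("<I")        # detection count
RAW_BOX = struct.Struct("<iiiifi")  # x, y, w, h, score, class_id


def parse_bbox_payload(payload: bytes):
    """Return a list of (x, y, w, h, score, class_id) tuples."""
    if len(payload) < HEADER.size:
        raise ValueError("payload too short for the count header")
    (count,) = HEADER.unpack_from(payload, 0)
    needed = HEADER.size + RAW_BOX.size * count
    if needed > len(payload):
        # A count that overruns the buffer indicates corruption or a
        # layout mismatch; fail instead of reading garbage.
        raise ValueError(f"count {count} needs {needed} bytes, "
                         f"payload has {len(payload)}")
    return [
        RAW_BOX.unpack_from(payload, HEADER.size + RAW_BOX.size * i)
        for i in range(count)
    ]
```

Trailing padding after the last valid record is never touched, which matches the rule that bytes beyond 4 + 24 * N must be ignored.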

Override contract: runtime dims vs. packaged model archive defaults

SimaBoxDecode is constructed from a trained model archive that ships with packaged defaults for decode_type, detection_threshold, nms_iou_threshold, top_k, original_width, and original_height. The public constructor

SimaBoxDecode(const Model& model,
              const std::string& decode_type = "",
              int original_width = 0, int original_height = 0,
              double detection_threshold = 0.0,
              double nms_iou_threshold = 0.0,
              int top_k = 0);

and its Python twin pyneat.nodes.sima_box_decode(model, ...) use a simple "positive overrides, zero/empty preserves" rule per field.

Naming note. detection_threshold is the name used by SimaBoxDecode's constructor. ModelOptions.score_threshold (used in the Python tutorial) is plumbed into that same argument. The two names refer to the same underlying control.

| Runtime argument | Value passed | Behavior |
| --- | --- | --- |
| decode_type | "" (empty) | preserve model archive / model-path inference |
| decode_type | non-empty string | override the model archive value for this run |
| original_width / original_height | 0 | preserve model archive packaged dimension |
| original_width / original_height | positive int | rewrite original_width / original_height in the effective config |
| detection_threshold | 0.0 | preserve model archive packaged threshold |
| detection_threshold | > 0.0 | override (also triggers the YOLOv8 cliff-warning below) |
| nms_iou_threshold | 0.0 | preserve model archive packaged NMS IoU |
| nms_iou_threshold | > 0.0 | override |
| top_k | 0 | preserve model archive packaged top-K |
| top_k | > 0 | override |

The rule is strictly per-field:

  • Python path: the tutorial overrides every field because ModelOptions sets positive values.
  • C++ path: read_detection_boxes.cpp passes 0.55f, 0.5f, 100 (so detection_threshold, nms_iou_threshold, and top_k are overridden) plus bgr.cols, bgr.rows positively (so original_width / original_height are overridden too).
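The "positive overrides, zero/empty preserves" rule can be modeled in a few lines. This is an illustrative sketch of the rule as stated above, not the runtime's code; `effective_config` and the packaged values in the example are hypothetical.

```python
def effective_config(packaged: dict, overrides: dict) -> dict:
    """Apply per-field overrides: non-empty strings and positive numbers win."""
    result = dict(packaged)
    for key, value in overrides.items():
        if isinstance(value, str):
            if value:            # "" preserves the packaged value
                result[key] = value
        elif value > 0:          # 0 / 0.0 preserves the packaged value
            result[key] = value
    return result


# Hypothetical packaged defaults, for illustration only.
packaged = {
    "decode_type": "yolov8",
    "detection_threshold": 0.25,
    "nms_iou_threshold": 0.45,
    "top_k": 300,
    "original_width": 640,
    "original_height": 640,
}
overrides = {
    "decode_type": "",
    "detection_threshold": 0.55,
    "nms_iou_threshold": 0.0,
    "top_k": 100,
    "original_width": 1920,
    "original_height": 1080,
}
config = effective_config(packaged, overrides)
```

One consequence of this rule is that a field cannot be overridden to exactly zero, since zero is reserved to mean "keep the packaged value".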

Practical consequences:

  • If your model archive was packed for a different resolution than your source frames, pass original_width and original_height explicitly so coordinates land in source pixels.
  • Leaving detection_threshold and nms_iou_threshold at 0.0 is the safest way to get the model archive's validated defaults; only override when you are deliberately retuning.
  • Be deliberate with a low detection_threshold. The lower it is, the more candidate boxes survive thresholding, and NMS cost grows with the square of the surviving-box count — so a very low threshold can sharply increase postprocess compute and latency. Lower it only as far as you need to catch weak detections; pair it with top_k to cap the worst case.
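To put the quadratic growth in numbers, the sketch below counts the worst-case pairwise overlap checks for a naive NMS that compares every surviving box against every other. Real implementations may prune comparisons, so treat these as upper bounds under that assumption.

```python
def worst_case_pairs(n: int) -> int:
    """Number of unordered box pairs among n candidates: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (100, 1000, 10000):
    print(n, worst_case_pairs(n))
# 100 4950
# 1000 499500
# 10000 49995000
```

Going from 100 to 10,000 surviving candidates multiplies the candidate count by 100 but the pair count by roughly 10,000, which is why a modest threshold increase can recover a lot of postprocess time.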

Decode types and tensor contracts

BoxDecodeType is a typed API (simaai::neat::BoxDecodeType / neat.BoxDecodeType) and should always be set explicitly for decode stages. The runtime contract below comes from internals/gst_plugins/genericboxdecode_v2/gstneatboxdecode.cpp (infer_num_classes, infer_yolo_decoupled_classes, infer_yolo_packed_classes, compute_required_output_size).

Core tensor contract rules:

  • YOLO-family decode types (yolo, yolov5*, yolov7*, yolov8*, yolov9*, yolov10*):
    • Decoupled heads: class-head depths must be repeatable and > 4.
    • Packed heads: each head depth must satisfy depth = 3 * (num_classes + 5) and be consistent across heads.
  • yolo26: decoupled grouped heads with 4-channel raw l/t/r/b bbox tensors and repeatable class-head depths > 4.
  • detr: class channels are inferred from the maximum depth across heads, and must be > 4.
  • Other non-YOLO decode types (effdet, rcnn-stage1, centernet): fallback class inference uses max depth and requires > 4.
  • Segmentation decode tokens (*-seg) enable segmentation-like output sizing in v2 (adds mask payload per detection).
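The packed-head rule above can be checked by hand. The sketch below inverts depth = 3 * (num_classes + 5) and enforces consistency across heads; it illustrates the stated formula only and is not the plugin's infer_yolo_packed_classes implementation.

```python
def packed_num_classes(head_depths):
    """Infer num_classes from packed YOLO head depths, or raise ValueError."""
    inferred = set()
    for depth in head_depths:
        if depth % 3 != 0:
            raise ValueError(f"depth {depth} is not a multiple of 3")
        num_classes = depth // 3 - 5
        if num_classes <= 0:
            raise ValueError(f"depth {depth} leaves no room for classes")
        inferred.add(num_classes)
    if len(inferred) != 1:
        raise ValueError(f"inconsistent head depths: {sorted(head_depths)}")
    return inferred.pop()


print(packed_num_classes([255, 255, 255]))  # 80
```

Three heads of depth 255 correspond to an 80-class model, since 255 / 3 = 85 and 85 - 5 = 80.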

| API enum | Backend token | Expected contract |
| --- | --- | --- |
| BoxDecodeType::Yolo | yolo | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV5 | yolov5 | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV5Seg | yolov5-seg | YOLO depth contract + segmentation path |
| BoxDecodeType::YoloV7 | yolov7 | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV7Seg | yolov7-seg | YOLO depth contract + segmentation path |
| BoxDecodeType::YoloV8 | yolov8 | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV8Seg | yolov8-seg | YOLO depth contract + segmentation path |
| BoxDecodeType::YoloV8Pose | yolov8-pose | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV9 | yolov9 | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV9Seg | yolov9-seg | YOLO depth contract + segmentation path |
| BoxDecodeType::YoloV10 | yolov10 | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV10Seg | yolov10-seg | YOLO depth contract + segmentation path |
| BoxDecodeType::YoloV26 | yolo26 | YOLO26 grouped raw l/t/r/b bbox heads + class-score heads |
| BoxDecodeType::Detr | detr | num_classes = max(depth) (must be > 4) |
| BoxDecodeType::EffDet | effdet | fallback max-depth inference (> 4) |
| BoxDecodeType::RcnnStage1 | rcnn-stage1 | fallback max-depth inference (> 4) |
| BoxDecodeType::Centernet | centernet | fallback max-depth inference (> 4) |

Fail-fast behavior:

  • stages::BoxDecodeOptions requires explicit construction with a decode type.
  • stages::BoxDecode(...) and nodes::SimaBoxDecode(...) fail fast on BoxDecodeType::Unspecified.

Setting the decode type explicitly:

C++:

simaai::neat::stages::BoxDecodeOptions opt(simaai::neat::BoxDecodeType::YoloV8);
opt.detection_threshold = 0.25;
opt.nms_iou_threshold = 0.5;
opt.top_k = 100;

Python:

opt = neat.ModelOptions()
opt.decode_type = neat.BoxDecodeType.YoloV8

Full source

tutorials/007_read_detection_boxes/read_detection_boxes.cpp
// Decompose model execution into stages: Preproc -> Infer -> BoxDecode.
//
// Usage:
// tutorial_007_read_detection_boxes --model /path/to/yolo_v8s.tar.gz --image /path/to.jpg

#include "neat.h"

#include "pipeline/StageRun.h"

#include <opencv2/imgcodecs.hpp>

#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

namespace {

bool get_arg(int argc, char** argv, const std::string& key, std::string& out) {
  for (int i = 1; i + 1 < argc; ++i) {
    if (key == argv[i]) {
      out = argv[i + 1];
      return true;
    }
  }
  return false;
}

} // namespace

int main(int argc, char** argv) {
  try {
    std::string model_path, image;
    if (!get_arg(argc, argv, "--model", model_path) || !get_arg(argc, argv, "--image", image)) {
      std::cerr << "Usage: tutorial_007_read_detection_boxes --model <path> --image <path>\n";
      return 1;
    }

    cv::Mat bgr = cv::imread(image, cv::IMREAD_COLOR);
    if (bgr.empty())
      throw std::runtime_error("failed to load image: " + image);

    simaai::neat::Model::Options opt;
    opt.preprocess.color_convert.input_format = simaai::neat::PreprocessColorFormat::BGR;
    opt.preprocess.input_max_width = bgr.cols;
    opt.preprocess.input_max_height = bgr.rows;
    opt.preprocess.input_max_depth = bgr.channels();
    opt.decode_type = simaai::neat::BoxDecodeType::YoloV8;

    simaai::neat::Model model(model_path, opt);

    // CORE LOGIC
    // Stage-by-stage: each stages::* call runs one piece of the model pipeline.
    simaai::neat::TensorList pre = simaai::neat::stages::Preproc(std::vector<cv::Mat>{bgr}, model);
    simaai::neat::Sample infer_samples = simaai::neat::stages::Infer(
        simaai::neat::Sample{simaai::neat::sample_from_tensors(pre)}, model);
    if (infer_samples.empty())
      throw std::runtime_error("infer stage returned no samples");
    simaai::neat::Sample infer = infer_samples.front();

    simaai::neat::stages::BoxDecodeOptions box(simaai::neat::BoxDecodeType::YoloV8);
    (void)box.decode_type;
    (void)bgr.cols;
    (void)bgr.rows;
    box.detection_threshold = 0.55;
    box.nms_iou_threshold = 0.5;
    box.top_k = 100;

    // BoxDecode parses the "BBOX" tensor into {x1, y1, x2, y2, score, class_id}
    // entries clamped to original_width x original_height source pixels.
    simaai::neat::BoxDecodeResultList decoded_results =
        simaai::neat::stages::BoxDecodeResults(simaai::neat::Sample{infer}, model, box);
    if (decoded_results.empty())
      throw std::runtime_error("boxdecode result parser returned no results");
    const simaai::neat::BoxDecodeResult& decoded = decoded_results.front();

    std::cout << "boxes=" << decoded.boxes.size() << "\n";
    std::cout << "[OK] 007_read_detection_boxes\n";
    return 0;
  } catch (const std::exception& e) {
    std::cerr << "[FAIL] " << e.what() << "\n";
    return 1;
  }
}
