Run a VLM
| Field | Value |
|---|---|
| Difficulty | Beginner |
| Estimated Read Time | 10-15 minutes |
| Labels | genai, vlm, image, cache, multimodal |
Vision-language models (VLMs) accept text together with image tensors. For a single question, attach the image directly to GenerationRequest.images. For repeated questions about the same image, encode the image once and reuse the cached image embeddings in follow-up requests.
Walkthrough
Load the VLM and image
Load a VisionLanguageModel from a deployed LLiMa model directory and decode an image from disk.
Use OpenCV to read the image. Neat treats three-channel cv::Mat inputs as BGR and converts them to RGB internally.
genai::VisionLanguageModel model(args.model);
cv::Mat image = cv::imread(args.image.string(), cv::IMREAD_COLOR);
if (image.empty()) {
  throw std::runtime_error("failed to read image: " + args.image.string());
}
Ask with a direct image
Attach the image directly to the first request. This is the simplest path and is often enough for one-shot visual questions.
genai::GenerationRequest direct;
direct.prompt = "Describe this image in one sentence.";
direct.images = {image};
direct.max_new_tokens = 96;
const genai::GenerationResult first = model.run(direct);
std::cout << "direct image: " << first.text << "\n\n";
Cache the image embedding
Call encode(...) to cache image embeddings in the model. The call returns true when the image was accepted and cached.
if (!model.encode(image)) {
  throw std::runtime_error("VLM did not accept the image for caching");
}
std::cout << "cached_images=" << model.cached_image_count() << "\n";
Ask follow-up questions
Set use_cached_images = true on each request that should reuse the cached image. You can ask multiple questions about the same cached image. Requests without that flag behave normally: text-only requests use no image, and direct-image requests use their own images. Calling encode(...) again replaces the cached image.
genai::GenerationRequest cached;
cached.prompt = "What details should I inspect more closely?";
cached.use_cached_images = true;
cached.max_new_tokens = 96;
const genai::GenerationResult follow_up = model.run(cached);
std::cout << "cached image: " << follow_up.text << "\n\n";
genai::GenerationRequest second_cached;
second_cached.prompt = "Summarize the image in three keywords.";
second_cached.use_cached_images = true;
second_cached.max_new_tokens = 48;
const genai::GenerationResult second_follow_up = model.run(second_cached);
std::cout << "cached image keywords: " << second_follow_up.text << "\n\n";
Attach an image to a chat message
When you use messages, attach images to the user message that needs them. This keeps the image next to the exact text it belongs to.
genai::ChatMessage image_message;
image_message.role = "user";
image_message.content = "What is the main subject of this image?";
image_message.images = {image};
genai::GenerationRequest message_request;
message_request.messages = {image_message};
message_request.max_new_tokens = 96;
const genai::GenerationResult message_result = model.run(message_request);
std::cout << "message image: " << message_result.text << "\n";
Run
First, download a VLM such as Qwen3-VL 4B from Hugging Face using the LLiMa CLI:
llima pull Qwen3-VL-4B-Instruct-GPTQ-a16w4
Run the tutorial on Modalix with the deployed model directory and a local image:
C++ (prebuilt):
./lib/sima-neat/tutorials/tutorial_020_run_a_vlm \
--model /media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4 \
--image tests/images/people.jpg
C++ (build from source):
./build.sh --target tutorial_020_run_a_vlm
./build/tutorials-standalone/tutorial_020_run_a_vlm \
--model /media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4 \
--image tests/images/people.jpg
The expected output is one answer from the direct-image request, a cached_images= count line, two follow-up answers that reuse the cached image, and one answer from the message-level image request.
In Practice
Use image caching when the user asks several questions about the same frame, product image, diagram, or document page. Avoid caching when each request uses a different image because the direct-image path is simpler and keeps prompt state obvious.
Some model families may not support cached reuse. In that case, use direct images on each request.
Use ChatMessage.images when you are building a conversation and only one message should carry the image. Use top-level GenerationRequest.images for the simpler one-prompt shape.
Full source
#include "neat/genai.h"

#include <opencv2/imgcodecs.hpp>

#include <filesystem>
#include <iostream>
#include <stdexcept>
#include <string>

namespace genai = simaai::neat::genai;

struct Args {
  std::filesystem::path model;
  std::filesystem::path image;
};

Args parse_args(int argc, char** argv) {
  Args args;
  for (int i = 1; i < argc; ++i) {
    const std::string arg = argv[i];
    if (arg == "--model" && i + 1 < argc) {
      args.model = argv[++i];
    } else if (arg == "--image" && i + 1 < argc) {
      args.image = argv[++i];
    } else {
      throw std::runtime_error("usage: run_a_vlm --model <vlm_model_dir> --image <image>");
    }
  }
  if (args.model.empty() || args.image.empty()) {
    throw std::runtime_error("missing required --model <vlm_model_dir> or --image <image>");
  }
  return args;
}

int main(int argc, char** argv) {
  try {
    const Args args = parse_args(argc, argv);

    genai::VisionLanguageModel model(args.model);
    cv::Mat image = cv::imread(args.image.string(), cv::IMREAD_COLOR);
    if (image.empty()) {
      throw std::runtime_error("failed to read image: " + args.image.string());
    }

    genai::GenerationRequest direct;
    direct.prompt = "Describe this image in one sentence.";
    direct.images = {image};
    direct.max_new_tokens = 96;
    const genai::GenerationResult first = model.run(direct);
    std::cout << "direct image: " << first.text << "\n\n";

    if (!model.encode(image)) {
      throw std::runtime_error("VLM did not accept the image for caching");
    }
    std::cout << "cached_images=" << model.cached_image_count() << "\n";

    genai::GenerationRequest cached;
    cached.prompt = "What details should I inspect more closely?";
    cached.use_cached_images = true;
    cached.max_new_tokens = 96;
    const genai::GenerationResult follow_up = model.run(cached);
    std::cout << "cached image: " << follow_up.text << "\n\n";

    genai::GenerationRequest second_cached;
    second_cached.prompt = "Summarize the image in three keywords.";
    second_cached.use_cached_images = true;
    second_cached.max_new_tokens = 48;
    const genai::GenerationResult second_follow_up = model.run(second_cached);
    std::cout << "cached image keywords: " << second_follow_up.text << "\n\n";

    genai::ChatMessage image_message;
    image_message.role = "user";
    image_message.content = "What is the main subject of this image?";
    image_message.images = {image};
    genai::GenerationRequest message_request;
    message_request.messages = {image_message};
    message_request.max_new_tokens = 96;
    const genai::GenerationResult message_result = model.run(message_request);
    std::cout << "message image: " << message_result.text << "\n";

    return 0;
  } catch (const std::exception& e) {
    std::cerr << "error: " << e.what() << "\n";
    return 1;
  }
}