Serve GenAI Models
| Field | Value |
|---|---|
| Difficulty | Intermediate |
| Estimated Read Time | 15-20 minutes |
| Labels | genai, server, llm, vlm, asr, http |
Direct model.run(request) is the best starting point for embedded application logic. Use GenAIServer when the application boundary is HTTP: a browser UI, a companion service, or a remote client that should not link against the Neat runtime.
Walkthrough
Configure the server
Choose the host and port. The default host is 0.0.0.0, which accepts connections from other machines that can reach the Modalix device.
genai::GenAIServerOptions options;
options.host = args.host;
options.port = args.port;
genai::GenAIServer server(options);
Register model directories
Add each deployed model directory with a served name. This tutorial registers llm, vlm, and asr; the served name is what clients send in the model field.
if (!args.llm.empty()) {
server.add_model(args.llm, "llm");
}
if (!args.vlm.empty()) {
server.add_model(args.vlm, "vlm");
}
if (!args.asr.empty()) {
server.add_model(args.asr, "asr");
}
std::cout << "registered models:";
for (const auto& name : server.model_names()) {
std::cout << " " << name;
}
std::cout << "\n";
Start serving
Call serve() for a blocking foreground process or start() when your application owns the rest of the process lifetime.
After the server starts, verify the registered model names with GET /v1/models:
curl http://<modalix-ip>:9998/v1/models
The response should include the served names registered in this tutorial: llm, vlm, and asr.
std::cout << "serving on http://" << options.host << ":" << options.port << "\n";
std::cout << "try: curl http://<modalix-ip>:" << options.port << "/v1/models\n";
server.serve();
Run
First, download the LLM, VLM, and ASR models from Hugging Face using the LLiMa CLI:
llima pull Qwen3-4B-Instruct-2507-GPTQ-a16w4
llima pull Qwen3-VL-4B-Instruct-GPTQ-a16w4
llima pull whisper-small-a16w8
Start the server on Modalix with all three deployed model directories:
C++ (prebuilt):
./lib/sima-neat/tutorials/tutorial_021_serve_genai_models \
--llm /media/nvme/llima/models/Qwen3-4B-Instruct-2507-GPTQ-a16w4 \
--vlm /media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4 \
--asr /media/nvme/llima/models/whisper-small-a16w8
C++ (build from source):
./build.sh --target tutorial_021_serve_genai_models
./build/tutorials-standalone/tutorial_021_serve_genai_models \
--llm /media/nvme/llima/models/Qwen3-4B-Instruct-2507-GPTQ-a16w4 \
--vlm /media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4 \
--asr /media/nvme/llima/models/whisper-small-a16w8
Drop --vlm or --asr if you only want to serve a subset during development.
After the server is running, first verify that all served names are registered:
curl http://<modalix-ip>:9998/v1/models
Then call the endpoints from a client. Replace <modalix-ip> with the IP address or hostname of your Modalix device.
The request clients below use Python requests, stream the response, and print server-side TTFT/TPS when reported.
Text request to the LLM
python3 share/sima-neat/tutorials/021_serve_genai_models/request_chat_completion_text.py \
--server-ip <modalix-ip> \
--model llm \
"Give me three tips for designing a small REST API."
Text and image request to the VLM
The request script base64-encodes the image and sends it as an OpenAI-compatible image_url content part.
python3 share/sima-neat/tutorials/021_serve_genai_models/request_chat_completion_image.py \
--server-ip <modalix-ip> \
--model vlm \
image.jpg \
"What is the main subject of this image?"
Audio request to the ASR model
python3 share/sima-neat/tutorials/021_serve_genai_models/request_audio_transcription.py \
--server-ip <modalix-ip> \
--model asr \
speech.wav
In Practice
Use the server when a network boundary is useful. Use direct GenAIModel, VisionLanguageModel, and ASRModel calls for lower-overhead application code inside the same process.
The /v1/models endpoint is the quickest smoke check: if it returns the served names, the server is reachable and the model registry is populated.
Full source
Show the complete C++ and Python programs
#include "neat/genai.h"
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>
namespace genai = simaai::neat::genai;
struct Args {
std::string host = "0.0.0.0";
std::uint16_t port = 9998;
std::filesystem::path llm;
std::filesystem::path vlm;
std::filesystem::path asr;
};
Args parse_args(int argc, char** argv) {
Args args;
for (int i = 1; i < argc; ++i) {
const std::string arg = argv[i];
if (arg == "--host" && i + 1 < argc) {
args.host = argv[++i];
} else if (arg == "--port" && i + 1 < argc) {
args.port = static_cast<std::uint16_t>(std::stoi(argv[++i]));
} else if (arg == "--llm" && i + 1 < argc) {
args.llm = argv[++i];
} else if (arg == "--vlm" && i + 1 < argc) {
args.vlm = argv[++i];
} else if (arg == "--asr" && i + 1 < argc) {
args.asr = argv[++i];
} else {
throw std::runtime_error("usage: serve_genai_models [--host <host>] [--port <port>] "
"[--llm <dir>] [--vlm <dir>] [--asr <dir>]");
}
}
if (args.llm.empty() && args.vlm.empty() && args.asr.empty()) {
throw std::runtime_error("provide at least one of --llm <dir>, --vlm <dir>, or --asr <dir>");
}
return args;
}
int main(int argc, char** argv) {
try {
const Args args = parse_args(argc, argv);
genai::GenAIServerOptions options;
options.host = args.host;
options.port = args.port;
genai::GenAIServer server(options);
if (!args.llm.empty()) {
server.add_model(args.llm, "llm");
}
if (!args.vlm.empty()) {
server.add_model(args.vlm, "vlm");
}
if (!args.asr.empty()) {
server.add_model(args.asr, "asr");
}
std::cout << "registered models:";
for (const auto& name : server.model_names()) {
std::cout << " " << name;
}
std::cout << "\n";
std::cout << "serving on http://" << options.host << ":" << options.port << "\n";
std::cout << "try: curl http://<modalix-ip>:" << options.port << "/v1/models\n";
server.serve();
return 0;
} catch (const std::exception& e) {
std::cerr << "error: " << e.what() << "\n";
return 1;
}
}