
Serve GenAI Models

Difficulty: Intermediate
Estimated Read Time: 15-20 minutes
Labels: genai, server, llm, vlm, asr, http

Direct model.run(request) is the best starting point for embedded application logic. Use GenAIServer when the application boundary is HTTP: a browser UI, a companion service, or a remote client that should not link against the Neat runtime.

Walkthrough

Configure the server

Choose the host and port. The default host is 0.0.0.0, which binds the server to all network interfaces, so it accepts connections from any machine that can reach the Modalix device. This tutorial defaults to port 9998.

tutorials/021_serve_genai_models/serve_genai_models.cpp
genai::GenAIServerOptions options;
options.host = args.host;
options.port = args.port;
genai::GenAIServer server(options);
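The tutorial's parse_args converts --port with std::stoi followed by a cast to std::uint16_t, so an out-of-range value such as 70000 would silently wrap to a different port. A stricter parse could look like the sketch below. parse_port is a hypothetical helper for illustration, not part of the Neat API, and it assumes C++17.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper (not part of the Neat API): parse a TCP port from text.
// Returns std::nullopt for empty input, signs, non-digits, trailing
// characters, or values outside 1..65535.
std::optional<std::uint16_t> parse_port(std::string_view text) {
  unsigned long value = 0;
  const char* first = text.data();
  const char* last = first + text.size();
  const auto [ptr, ec] = std::from_chars(first, last, value);
  if (ec != std::errc{} || ptr != last) {
    return std::nullopt;  // not a number, or followed by extra characters
  }
  if (value < 1 || value > 65535) {
    return std::nullopt;  // does not fit a usable 16-bit port
  }
  return static_cast<std::uint16_t>(value);
}
```

An argument parser could call parse_port on the --port value and raise its usage error when the result is empty, instead of casting unconditionally.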

Register model directories

Add each deployed model directory with a served name. This tutorial registers llm, vlm, and asr; the served name is what clients send in the model field.

tutorials/021_serve_genai_models/serve_genai_models.cpp
if (!args.llm.empty()) {
  server.add_model(args.llm, "llm");
}
if (!args.vlm.empty()) {
  server.add_model(args.vlm, "vlm");
}
if (!args.asr.empty()) {
  server.add_model(args.asr, "asr");
}

std::cout << "registered models:";
for (const auto& name : server.model_names()) {
  std::cout << " " << name;
}
std::cout << "\n";
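The tutorial does not show how add_model reports a mistyped or missing directory, so a pre-flight check before registration can give a clearer error message. The sketch below uses only std::filesystem. is_existing_directory is a hypothetical helper, not part of the Neat API, and it checks only that the path exists as a directory, not that it contains a valid model.

```cpp
#include <filesystem>
#include <system_error>

// Hypothetical pre-flight check (not part of the Neat API): true only when
// `dir` is non-empty and names an existing directory. The error_code overload
// keeps the check from throwing on permission or I/O errors.
bool is_existing_directory(const std::filesystem::path& dir) {
  if (dir.empty()) {
    return false;
  }
  std::error_code ec;
  return std::filesystem::is_directory(dir, ec) && !ec;
}
```

In the tutorial's main, a check like this could guard each server.add_model call and throw std::runtime_error naming the offending path, so that the existing catch block prints it.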

Start serving

Call serve() for a blocking foreground process or start() when your application owns the rest of the process lifetime.

After the server starts, verify the registered model names with GET /v1/models:

curl http://<modalix-ip>:9998/v1/models

The response should include the served names registered in this tutorial: llm, vlm, and asr.

tutorials/021_serve_genai_models/serve_genai_models.cpp
std::cout << "serving on http://" << options.host << ":" << options.port << "\n";
std::cout << "try: curl http://<modalix-ip>:" << options.port << "/v1/models\n";
server.serve();

Run

First, download the LLM, VLM, and ASR models from Hugging Face using the LLiMa CLI:

llima pull Qwen3-4B-Instruct-2507-GPTQ-a16w4
llima pull Qwen3-VL-4B-Instruct-GPTQ-a16w4
llima pull whisper-small-a16w8

Start the server on Modalix with all three deployed model directories:

C++ (prebuilt):

./lib/sima-neat/tutorials/tutorial_021_serve_genai_models \
  --llm /media/nvme/llima/models/Qwen3-4B-Instruct-2507-GPTQ-a16w4 \
  --vlm /media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4 \
  --asr /media/nvme/llima/models/whisper-small-a16w8

C++ (build from source):

./build.sh --target tutorial_021_serve_genai_models
./build/tutorials-standalone/tutorial_021_serve_genai_models \
  --llm /media/nvme/llima/models/Qwen3-4B-Instruct-2507-GPTQ-a16w4 \
  --vlm /media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4 \
  --asr /media/nvme/llima/models/whisper-small-a16w8

Drop --vlm or --asr if you only want to serve a subset during development.

After the server is running, first verify that all served names are registered:

curl http://<modalix-ip>:9998/v1/models

Then call the endpoints from a client. Replace <modalix-ip> with the IP address or hostname of your Modalix device. The client scripts below use the Python requests library, stream the response, and print the server-side time to first token (TTFT) and tokens per second (TPS) when the server reports them.
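The tutorial does not define how the server computes TTFT and TPS. As a rough illustration of the arithmetic, the sketch below assumes one common convention: TTFT is the time from sending the request to receiving the first token, and TPS is the decode rate over the tokens that follow the first one. StreamTiming and both functions are hypothetical names, and the server's own definitions may differ.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical record of timestamps a streaming client might collect.
struct StreamTiming {
  std::chrono::steady_clock::time_point request_sent;
  std::chrono::steady_clock::time_point first_token;
  std::chrono::steady_clock::time_point last_token;
  std::size_t token_count = 0;  // tokens received, including the first
};

// Time to first token, in seconds.
double ttft_seconds(const StreamTiming& t) {
  return std::chrono::duration<double>(t.first_token - t.request_sent).count();
}

// Decode rate in tokens per second, measured from the first token to the
// last. Returns 0 when fewer than two tokens arrived or no time elapsed.
double decode_tps(const StreamTiming& t) {
  if (t.token_count < 2) {
    return 0.0;
  }
  const double span =
      std::chrono::duration<double>(t.last_token - t.first_token).count();
  if (span <= 0.0) {
    return 0.0;
  }
  return static_cast<double>(t.token_count - 1) / span;
}
```

Client-side numbers computed this way include network latency, so they will not match server-side values exactly.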

Text request to the LLM
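For orientation, the sketch below builds the kind of JSON body a chat completion client typically sends. It assumes the OpenAI-style shape (model, messages, stream), which the script names suggest but the tutorial does not spell out. The served name goes in the model field, as described earlier. json_escape and chat_request_body are hypothetical helpers, not part of the Neat API or the tutorial scripts.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Escape text for use inside a JSON string literal.
std::string json_escape(std::string_view text) {
  std::string out;
  out.reserve(text.size());
  for (const unsigned char c : text) {
    switch (c) {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n";  break;
      case '\r': out += "\\r";  break;
      case '\t': out += "\\t";  break;
      default:
        if (c < 0x20) {
          // Remaining control characters use the \u00XX form.
          char buf[7];
          std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
          out += buf;
        } else {
          out += static_cast<char>(c);
        }
    }
  }
  return out;
}

// Assumed OpenAI-style body: one user message, streaming enabled.
std::string chat_request_body(std::string_view model, std::string_view prompt) {
  std::string body = "{\"model\":\"";
  body += json_escape(model);
  body += "\",\"stream\":true,\"messages\":[{\"role\":\"user\",\"content\":\"";
  body += json_escape(prompt);
  body += "\"}]}";
  return body;
}
```

The tutorial's Python client handles this for you. Run it as follows: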

python3 share/sima-neat/tutorials/021_serve_genai_models/request_chat_completion_text.py \
  --server-ip <modalix-ip> \
  --model llm \
  "Give me three tips for designing a small REST API."

Text and image request to the VLM

The request script base64-encodes the image and sends it as an OpenAI-compatible image_url content part.
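The encoding step can be sketched as follows. The base64 routine is the standard algorithm. The data URL form (data:<mime>;base64,<payload>) inside an image_url part is the usual OpenAI-style convention and is assumed here, since the tutorial does not show the exact payload. base64_encode and image_url_part are hypothetical helpers, not taken from the request script.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Standard base64 encoding with '=' padding.
std::string base64_encode(std::string_view bytes) {
  static constexpr char kAlphabet[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  const auto byte = [&](std::size_t k) {
    return static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[k]));
  };
  std::string out;
  out.reserve(((bytes.size() + 2) / 3) * 4);
  std::size_t i = 0;
  // Full 3-byte groups become 4 output characters.
  for (; i + 3 <= bytes.size(); i += 3) {
    const std::uint32_t n = (byte(i) << 16) | (byte(i + 1) << 8) | byte(i + 2);
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += kAlphabet[(n >> 6) & 63];
    out += kAlphabet[n & 63];
  }
  // A trailing group of 1 or 2 bytes is padded with '='.
  const std::size_t rest = bytes.size() - i;
  if (rest == 1) {
    const std::uint32_t n = byte(i) << 16;
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += "==";
  } else if (rest == 2) {
    const std::uint32_t n = (byte(i) << 16) | (byte(i + 1) << 8);
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += kAlphabet[(n >> 6) & 63];
    out += '=';
  }
  return out;
}

// Assumed OpenAI-style content part carrying the image as a data URL.
std::string image_url_part(std::string_view image_bytes, std::string_view mime_type) {
  std::string part = "{\"type\":\"image_url\",\"image_url\":{\"url\":\"data:";
  part.append(mime_type);
  part += ";base64,";
  part += base64_encode(image_bytes);
  part += "\"}}";
  return part;
}
```

Base64 output is about one third larger than the input, so large images increase the request size accordingly.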

python3 share/sima-neat/tutorials/021_serve_genai_models/request_chat_completion_image.py \
  --server-ip <modalix-ip> \
  --model vlm \
  image.jpg \
  "What is the main subject of this image?"

Audio request to the ASR model

python3 share/sima-neat/tutorials/021_serve_genai_models/request_audio_transcription.py \
  --server-ip <modalix-ip> \
  --model asr \
  speech.wav

In Practice

Use the server when a network boundary is useful. Use direct GenAIModel, VisionLanguageModel, and ASRModel calls for lower-overhead application code inside the same process.

The /v1/models endpoint is the quickest smoke check: if it returns the served names, the server is reachable and the model registry is populated.
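A smoke check like this can be automated once the response body has been fetched by whatever HTTP client you use. The tutorial does not show the response schema, so the sketch below avoids parsing it and only looks for each expected served name as a quoted string anywhere in the body. missing_served_names is a hypothetical helper, and a real check would parse the JSON.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Returns the expected served names that do not appear as quoted strings in
// the response body. A plain substring search, not a JSON parse.
std::vector<std::string> missing_served_names(
    std::string_view body, const std::vector<std::string>& expected) {
  std::vector<std::string> missing;
  for (const auto& name : expected) {
    const std::string quoted = "\"" + name + "\"";
    if (body.find(quoted) == std::string_view::npos) {
      missing.push_back(name);
    }
  }
  return missing;
}
```

An empty result means every expected name was found. A non-empty result lists the models to investigate, for example a model that was dropped from the command line.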

Full source

tutorials/021_serve_genai_models/serve_genai_models.cpp
#include "neat/genai.h"

#include <cstdint>
#include <filesystem>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

namespace genai = simaai::neat::genai;

struct Args {
  std::string host = "0.0.0.0";
  std::uint16_t port = 9998;
  std::filesystem::path llm;
  std::filesystem::path vlm;
  std::filesystem::path asr;
};

Args parse_args(int argc, char** argv) {
  Args args;
  for (int i = 1; i < argc; ++i) {
    const std::string arg = argv[i];
    if (arg == "--host" && i + 1 < argc) {
      args.host = argv[++i];
    } else if (arg == "--port" && i + 1 < argc) {
      args.port = static_cast<std::uint16_t>(std::stoi(argv[++i]));
    } else if (arg == "--llm" && i + 1 < argc) {
      args.llm = argv[++i];
    } else if (arg == "--vlm" && i + 1 < argc) {
      args.vlm = argv[++i];
    } else if (arg == "--asr" && i + 1 < argc) {
      args.asr = argv[++i];
    } else {
      throw std::runtime_error("usage: serve_genai_models [--host <host>] [--port <port>] "
                               "[--llm <dir>] [--vlm <dir>] [--asr <dir>]");
    }
  }
  if (args.llm.empty() && args.vlm.empty() && args.asr.empty()) {
    throw std::runtime_error("provide at least one of --llm <dir>, --vlm <dir>, or --asr <dir>");
  }
  return args;
}

int main(int argc, char** argv) {
  try {
    const Args args = parse_args(argc, argv);

    genai::GenAIServerOptions options;
    options.host = args.host;
    options.port = args.port;
    genai::GenAIServer server(options);

    if (!args.llm.empty()) {
      server.add_model(args.llm, "llm");
    }
    if (!args.vlm.empty()) {
      server.add_model(args.vlm, "vlm");
    }
    if (!args.asr.empty()) {
      server.add_model(args.asr, "asr");
    }

    std::cout << "registered models:";
    for (const auto& name : server.model_names()) {
      std::cout << " " << name;
    }
    std::cout << "\n";

    std::cout << "serving on http://" << options.host << ":" << options.port << "\n";
    std::cout << "try: curl http://<modalix-ip>:" << options.port << "/v1/models\n";
    server.serve();

    return 0;
  } catch (const std::exception& e) {
    std::cerr << "error: " << e.what() << "\n";
    return 1;
  }
}
