
Run an LLM


Difficulty: Beginner
Estimated Read Time: 10 minutes
Labels: genai, llm, chat, history, streaming

The classic Model tutorials use .tar.gz MPK archives. GenAI models use LLiMa model directories and the neat::genai API instead. Start with the smallest request: load a model, set request.prompt, run it, and print the answer. Once that works, switch to request.messages when you need conversation state.

Walkthrough

Load the model directory

Point GenAIModel at a deployed LLiMa model directory. This tutorial uses GenAIModel because it auto-detects whether the directory is an LLM, VLM, or ASR model.

Construct simaai::neat::genai::GenAIModel from the model path.

tutorials/019_run_an_llm/run_an_llm.cpp
genai::GenAIModel model(args.model);
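
Before constructing the model, it can help to check that the path really is a directory, so a typo produces a clear message. The helper below is a minimal sketch using only std::filesystem; require_directory is a hypothetical name, not part of neat::genai, and it does not verify that the directory actually contains a valid LLiMa model.

```cpp
#include <filesystem>
#include <stdexcept>
#include <string>
#include <system_error>

// Hypothetical helper (not part of neat::genai): fail early with a readable
// message when the path is missing or is not a directory.
inline void require_directory(const std::filesystem::path& dir) {
  std::error_code ec;  // non-throwing overload; any lookup error counts as "not a directory"
  if (!std::filesystem::is_directory(dir, ec)) {
    throw std::runtime_error("model path is not a directory: " + dir.string());
  }
}
```

In the tutorial program this check would run on args.model just before the GenAIModel constructor.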

Send one prompt

Build a GenerationRequest with prompt and a token budget. This is the shortest path for one-off questions, tests, and scripts.

tutorials/019_run_an_llm/run_an_llm.cpp
genai::GenerationRequest request;
request.prompt = "Give me three practical tips for designing a small REST API.";
request.max_new_tokens = 96;

const genai::GenerationResult first = model.run(request);
std::cout << "assistant: " << first.text << "\n\n";

Define a system prompt

Use a short system instruction to steer the model's behavior. You can attach it to a simple prompt request with system_prompt; when you switch to chat history, carry the same instruction into the message list as a system message.

tutorials/019_run_an_llm/run_an_llm.cpp
const std::string system_prompt = "You are concise and practical.";

genai::GenerationRequest concise_request;
concise_request.system_prompt = system_prompt;
concise_request.prompt = "Give me one rule of thumb for designing a small REST API.";
concise_request.max_new_tokens = 64;

const genai::GenerationResult concise = model.run(concise_request);
std::cout << "assistant: " << concise.text << "\n\n";
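
The move from system_prompt plus prompt to a message list can be sketched as a small conversion. The Message struct below is a stand-in with the same two fields the tutorial sets on genai::ChatMessage, and to_messages is a hypothetical helper name; neither comes from the library.

```cpp
#include <string>
#include <vector>

// Stand-in for genai::ChatMessage: only the two fields the tutorial uses.
struct Message {
  std::string role;
  std::string content;
};

// Hypothetical helper: express a system_prompt + prompt pair as a message list.
// An empty system prompt is skipped so the list starts with the user message.
inline std::vector<Message> to_messages(const std::string& system_prompt,
                                        const std::string& prompt) {
  std::vector<Message> out;
  if (!system_prompt.empty()) {
    out.push_back(Message{"system", system_prompt});
  }
  out.push_back(Message{"user", prompt});
  return out;
}
```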

Switch to messages

For chat-style requests, use messages instead of prompt: start with a system message and a user message, run the request, then store the assistant response. The model does not remember earlier run() calls by itself; your application owns the message history.

tutorials/019_run_an_llm/run_an_llm.cpp
std::vector<genai::ChatMessage> messages;
messages.push_back(genai::ChatMessage{
  .role = "system",
  .content = system_prompt,
});
messages.push_back(genai::ChatMessage{
  .role = "user",
  .content = "Give me three practical tips for writing API documentation.",
});

genai::GenerationRequest chat_request;
chat_request.messages = messages;
chat_request.max_new_tokens = 96;

const genai::GenerationResult chat_result = model.run(chat_request);
std::cout << "assistant: " << chat_result.text << "\n\n";
messages.push_back(genai::ChatMessage{.role = "assistant", .content = chat_result.text});

Ask a follow-up with history

Append another user message, send the updated message list, and read the answer. The model now sees the full conversation your application kept.

tutorials/019_run_an_llm/run_an_llm.cpp
messages.push_back(genai::ChatMessage{
  .role = "user",
  .content = "Which tip should I apply first for a prototype?",
});

genai::GenerationRequest follow_up;
follow_up.messages = messages;
follow_up.max_new_tokens = 96;

const genai::GenerationResult second = model.run(follow_up);
std::cout << "assistant: " << second.text << "\n\n";
messages.push_back(genai::ChatMessage{.role = "assistant", .content = second.text});
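
Because the application owns the history, the append, run, append pattern above can be wrapped in one place. The sketch below is an illustration under assumptions: Conversation and Message are hypothetical stand-ins, and generate is any callable that maps the message list to reply text, so the block compiles without the neat::genai headers.

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-in for genai::ChatMessage: only the two fields the tutorial uses.
struct Message {
  std::string role;
  std::string content;
};

// Hypothetical wrapper that owns the conversation history.
class Conversation {
 public:
  explicit Conversation(std::string system_prompt) {
    history_.push_back(Message{"system", std::move(system_prompt)});
  }

  // Appends the user turn, calls generate(history), then stores the reply.
  template <typename Generate>
  std::string ask(const std::string& user_text, Generate&& generate) {
    history_.push_back(Message{"user", user_text});
    std::string reply;
    try {
      reply = generate(history_);
    } catch (...) {
      history_.pop_back();  // drop the unanswered user turn on failure
      throw;
    }
    history_.push_back(Message{"assistant", reply});
    return reply;
  }

  const std::vector<Message>& history() const { return history_; }

 private:
  std::vector<Message> history_;
};
```

With the tutorial's API, the callable would assign the history to GenerationRequest.messages, call model.run(), and return the result's text; real code would store genai::ChatMessage directly instead of the stand-in.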

Stream an answer

For UI-style output, call stream() and iterate the returned GenerationStream. Each token sample contains the latest text fragment.

tutorials/019_run_an_llm/run_an_llm.cpp
messages.push_back(genai::ChatMessage{
  .role = "user",
  .content = "Turn that advice into a short checklist.",
});

genai::GenerationRequest streaming_request;
streaming_request.messages = messages;
streaming_request.max_new_tokens = 96;

genai::GenerationStream stream_handle = model.stream(streaming_request);
std::cout << "assistant: ";
for (const genai::TokenSample& token : stream_handle) {
  std::cout << token.text << std::flush;
}
std::cout << "\n";

Run

First, download an LLM such as Qwen3 4B from Hugging Face using the LLiMa CLI:

llima pull Qwen3-4B-Instruct-2507-GPTQ-a16w4

Run the tutorial on Modalix with the deployed model directory:

C++ (prebuilt):

./lib/sima-neat/tutorials/tutorial_019_run_an_llm \
--model /media/nvme/llima/models/Qwen3-4B-Instruct-2507-GPTQ-a16w4

C++ (build from source):

./build.sh --target tutorial_019_run_an_llm
./build/tutorials-standalone/tutorial_019_run_an_llm \
--model /media/nvme/llima/models/Qwen3-4B-Instruct-2507-GPTQ-a16w4

The program prints five answers: a simple prompt answer, a system-prompted answer, the first chat answer, a context-aware follow-up, and a streamed final response.

In Practice

Keep only the amount of message history your application needs. Long histories consume context tokens and increase time to first token. For persistent chat applications, store the conversation outside the model object and rebuild GenerationRequest.messages for each turn.
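
One simple way to bound history is to keep the system message plus the most recent turns. The sketch below counts messages, not tokens, so it only approximates a context budget; trim_history and the Message stand-in are hypothetical names, not part of neat::genai.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for genai::ChatMessage: only the two fields the tutorial uses.
struct Message {
  std::string role;
  std::string content;
};

// Hypothetical helper: keep a leading system message (if present) plus the
// most recent `max_recent` other messages, in their original order.
inline std::vector<Message> trim_history(const std::vector<Message>& history,
                                         std::size_t max_recent) {
  std::vector<Message> trimmed;
  std::size_t first = 0;
  if (!history.empty() && history.front().role == "system") {
    trimmed.push_back(history.front());
    first = 1;
  }
  const std::size_t rest = history.size() - first;
  const std::size_t start = first + (rest > max_recent ? rest - max_recent : 0);
  trimmed.insert(trimmed.end(),
                 history.begin() + static_cast<std::ptrdiff_t>(start),
                 history.end());
  return trimmed;
}
```

Calling it just before the next user message is appended, with an even max_recent, keeps user and assistant messages paired in the trimmed list.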

Full source

tutorials/019_run_an_llm/run_an_llm.cpp
#include "neat/genai.h"

#include <filesystem>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

namespace genai = simaai::neat::genai;

struct Args {
  std::filesystem::path model;
};

Args parse_args(int argc, char** argv) {
  Args args;
  for (int i = 1; i < argc; ++i) {
    const std::string arg = argv[i];
    if (arg == "--model" && i + 1 < argc) {
      args.model = argv[++i];
    } else {
      throw std::runtime_error("usage: run_an_llm --model <llima_model_dir>");
    }
  }
  if (args.model.empty()) {
    throw std::runtime_error("missing required --model <llima_model_dir>");
  }
  return args;
}

int main(int argc, char** argv) {
  try {
    const Args args = parse_args(argc, argv);

    genai::GenAIModel model(args.model);

    genai::GenerationRequest request;
    request.prompt = "Give me three practical tips for designing a small REST API.";
    request.max_new_tokens = 96;

    const genai::GenerationResult first = model.run(request);
    std::cout << "assistant: " << first.text << "\n\n";

    const std::string system_prompt = "You are concise and practical.";

    genai::GenerationRequest concise_request;
    concise_request.system_prompt = system_prompt;
    concise_request.prompt = "Give me one rule of thumb for designing a small REST API.";
    concise_request.max_new_tokens = 64;

    const genai::GenerationResult concise = model.run(concise_request);
    std::cout << "assistant: " << concise.text << "\n\n";

    std::vector<genai::ChatMessage> messages;
    messages.push_back(genai::ChatMessage{
      .role = "system",
      .content = system_prompt,
    });
    messages.push_back(genai::ChatMessage{
      .role = "user",
      .content = "Give me three practical tips for writing API documentation.",
    });

    genai::GenerationRequest chat_request;
    chat_request.messages = messages;
    chat_request.max_new_tokens = 96;

    const genai::GenerationResult chat_result = model.run(chat_request);
    std::cout << "assistant: " << chat_result.text << "\n\n";
    messages.push_back(genai::ChatMessage{.role = "assistant", .content = chat_result.text});

    messages.push_back(genai::ChatMessage{
      .role = "user",
      .content = "Which tip should I apply first for a prototype?",
    });

    genai::GenerationRequest follow_up;
    follow_up.messages = messages;
    follow_up.max_new_tokens = 96;

    const genai::GenerationResult second = model.run(follow_up);
    std::cout << "assistant: " << second.text << "\n\n";
    messages.push_back(genai::ChatMessage{.role = "assistant", .content = second.text});

    messages.push_back(genai::ChatMessage{
      .role = "user",
      .content = "Turn that advice into a short checklist.",
    });

    genai::GenerationRequest streaming_request;
    streaming_request.messages = messages;
    streaming_request.max_new_tokens = 96;

    genai::GenerationStream stream_handle = model.stream(streaming_request);
    std::cout << "assistant: ";
    for (const genai::TokenSample& token : stream_handle) {
      std::cout << token.text << std::flush;
    }
    std::cout << "\n";

    return 0;
  } catch (const std::exception& e) {
    std::cerr << "error: " << e.what() << "\n";
    return 1;
  }
}
