Inference Architecture
Inference Architecture
Overview
Inference in MLMP is provided by a single application server (duodecillion.ti.bfh.ch) in front of a pool of six Mac Studio machines that host the actual models. The application server runs Open WebUI and LiteLLM, which together handle authentication, access token management, request routing, and usage tracking. All Mac Studios run models via LM Studio.
At a high level:
- Open WebUI is the user-facing entry point, responsible for MFA and token issuance.
- LiteLLM is the single upstream endpoint configured in Open WebUI, responsible for model catalog, routing to inference backends, and usage tracking.
- LM Studio runs on all Mac Studios and serves the actual model inference.
Model Hosting Topology
At a high level, the topology looks like this:
flowchart LR
subgraph AppServer["Application Server: duodecillion.ti.bfh.ch"]
OW["Open WebUI"]
LL["LiteLLM"]
OW --> LL
end
subgraph LMS_CLUSTER["LM Studio Fleet (6× Mac Studio)"]
LM1["kalliope (LM Studio)<br/>minimax2.7"]
LM2["klio (LM Studio)<br/>qwen3.5"]
LM3["erato (LM Studio)<br/>gpt-oss:120b"]
LM4["euterpe (LM Studio)<br/>gpt-oss:120b"]
LM5["urania (LM Studio)<br/>nemotron"]
LM6["thalia (LM Studio)<br/>smaller hot-loaded models"]
end
LL --> LM1
LL --> LM2
LL --> LM3
LL --> LM4
LL --> LM5
LL --> LM6
Components
Open WebUI (on duodecillion.ti.bfh.ch)
- Responsibility:
- Provides the primary UI/API for end users.
- Handles multi-factor authentication (MFA).
- Issues and manages access tokens for users.
- Integration with LiteLLM:
- Configured with a single upstream endpoint, which is the LiteLLM service.
- After authenticating, users send their requests (with access tokens) to Open WebUI; Open WebUI forwards these requests to LiteLLM without exposing the underlying model hosts.
Why authentication is handled in Open WebUI
The free version of LiteLLM only supports MFA for up to five users. To support a larger internal user base with proper multi-factor authentication and token management, we use Open WebUI as a separate, dedicated authentication and access control component in front of LiteLLM.
MFA setup with SwitchID (placeholder)
Open WebUI is integrated with SwitchID to enforce multi-factor authentication for MLMP users.
The following env variables are set in the docker compose file:
- WEBUI_URL=https://inference.mlmp.ti.bfh.ch # defines the callback url
- OAUTH_CLIENT_ID=bfh_oidc_client_48647
- ENABLE_OAUTH_SIGNUP=True # enforces SSO
- OPENID_PROVIDER_URL=https://login.eduid.ch/.well-known/openid-configuration
- ENABLE_OAUTH_PERSISTENT_CONFIG=False
- ENABLE_LOGIN_FORM=False # MUST be set to False if ENABLE_OAUTH_SIGNUP is true
In addition, the following secret must be set in .env:
OAUTH_CLIENT_SECRET=
These information is shared with BF-HIT services (Contact: Olga Kurz), who configured the SSO endpoint.
LiteLLM (on duodecillion.ti.bfh.ch)
- Responsibility:
- Acts as the central routing layer between Open WebUI and all model backends.
- Maintains the model catalog (which models are available, and where).
- Handles usage tracking for requests per model / endpoint.
- Configuration:
- Configuration of LitelLM is done via the config.yml file located in the docker compose root folder. In the config file, system behaviour as well as resources (e.g., models) are defined.
- Fallback or failover logic is not yet implemented; routing is currently direct, based on the configured endpoints.
LM Studio (on Mac Studios)
- Responsibility:
- Runs on all six Mac Studios (
kalliope,klio,erato,euterpe,urania,thalia), providing model serving via LM Studio. - Current usage:
kalliope(LM Studio):- Dedicated to running
minimax2.7.
- Dedicated to running
klio(LM Studio):- Dedicated to running
qwen3.5.
- Dedicated to running
erato(LM Studio):- Dedicated to running
gpt-oss:120b.
- Dedicated to running
euterpe(LM Studio):- Dedicated to running
gpt-oss:120b.
- Dedicated to running
urania(LM Studio):- Dedicated to running
nemotron-3-super.
- Dedicated to running
thalia(LM Studio):- Hosts multiple smaller models that are acceptable to hot-load on demand.
- This node is optimized for flexibility rather than maximum single-model capacity.
- Details:
- Specific model set, hot-loading strategy, and LM Studio configuration are defined and maintained in the infrastructure/deployment repository.
Request Flow
Conceptually, the request flow from a user to a model and back is:
- User authenticates in Open WebUI (MFA) and obtains an access token.
- User sends a request (e.g. chat/completion) to Open WebUI, including the access token.
- Open WebUI forwards the request to the configured LiteLLM endpoint on
duodecillion.ti.bfh.ch. - LiteLLM looks up the requested model in its configuration and routes the request to the appropriate LM Studio backend node (for example
kalliopeforminimax2.7,klioforqwen3.5,erato/euterpeforgpt-oss:120b,uraniafornemotron-3-super, orthaliafor smaller hot-loaded models). - The selected Mac Studio performs inference and returns the response to LiteLLM.
- LiteLLM records usage, then returns the normalized response to Open WebUI.
- Open WebUI returns the final response to the user.
Sequence diagram (high level)
sequenceDiagram
participant U as User
participant OW as Open WebUI (duodecillion.ti.bfh.ch)
participant LL as LiteLLM (duodecillion.ti.bfh.ch)
participant B as Inference Backend (LM Studio)
U->>OW: Authenticate (MFA), obtain access token
U->>OW: Send request + access token
OW->>LL: Forward request to LiteLLM
LL->>B: Route request to selected model endpoint
B-->>LL: Model inference result
LL-->>OW: Normalized response (with usage tracking)
OW-->>U: Response to user
- Hostnames and deployment details (services, ports, resource limits, etc.) for all these nodes live in the infrastructure repository (Docker Compose files and associated scripts).
- Any future additions (new models, new Mac Studios) should be:
- Added to the LiteLLM configuration (model name, endpoint URL, metadata).
- Deployed via the standard Docker Compose / deployment pipeline used for LM Studio.
Data security
Logging of all request and response data must be deactivated on all subcomponents. Therefore, user data is never exposed to admins.
Openwebui
In the compose file, set the following envs to deactivate logging:
- GLOBAL_LOG_LEVEL=WARNING
- ENABLE_ADMIN_CHAT_ACCESS=False
- ENABLE_ADMIN_EXPORT=False
LiteLLM
In the compose file, set the following env: `LITELLM_LOG: "ERROR"``
In the config.yaml, set the following variables:
general_settings:
store_prompts_in_spend_logs: False
disable_error_logs: True
litellm_settings:
global_disable_no_log_param: True
turn_off_message_logging: True
json_logs: False
set_verbose: False
LM Studio
Message logging must be configured via the UI (Developer Logs tab at the bottom of the window) as follows:
- Verbose logging: off
- Redact Content: on
- Log Incoming Tokens: off
- File Logging Mode: off
Monitoring
All inference endpoints (LM Studio hosts) as well as Open WebUI and LiteLLM on duodecillion.ti.bfh.ch are monitored with Uptime Kuma.
For the metrics and dashboard logging path (LiteLLM metrics, Prometheus scrape, external Grafana access, and NGINX allowlisting), see Inference Logging and Metrics.
See Uptime Kuma Setup for details on the monitoring stack.