Inference Architecture

Overview

Inference in MLMP is provided by a single application server (duodecillion.ti.bfh.ch) in front of a pool of six Mac Studio machines that host the actual models. The application server runs Open WebUI and LiteLLM, which together handle authentication, access token management, request routing, and usage tracking. All Mac Studios run models via LM Studio.

At a high level:

Open WebUI is the user-facing entry point, responsible for MFA and token issuance.
LiteLLM is the single upstream endpoint configured in Open WebUI, responsible for model catalog, routing to inference backends, and usage tracking.
LM Studio runs on all Mac Studios and serves the actual model inference.

Model Hosting Topology

At a high level, the topology looks like this:

flowchart LR

    subgraph AppServer["Application Server: duodecillion.ti.bfh.ch"]
        OW["Open WebUI"]
        LL["LiteLLM"]
        OW --> LL
    end

    subgraph LMS_CLUSTER["LM Studio Fleet (6× Mac Studio)"]
        LM1["kalliope (LM Studio)<br/>minimax2.7"]
        LM2["klio (LM Studio)<br/>qwen3.5"]
        LM3["erato (LM Studio)<br/>gpt-oss:120b"]
        LM4["euterpe (LM Studio)<br/>gpt-oss:120b"]
        LM5["urania (LM Studio)<br/>nemotron"]
        LM6["thalia (LM Studio)<br/>smaller hot-loaded models"]
    end

    LL --> LM1
    LL --> LM2
    LL --> LM3
    LL --> LM4
    LL --> LM5
    LL --> LM6

Components

Open WebUI (on `duodecillion.ti.bfh.ch`)

Responsibility:
Provides the primary UI/API for end users.
Handles multi-factor authentication (MFA).
Issues and manages access tokens for users.
Integration with LiteLLM:
Configured with a single upstream endpoint, which is the LiteLLM service.
After authenticating, users send their requests (with access tokens) to Open WebUI; Open WebUI forwards these requests to LiteLLM without exposing the underlying model hosts.

Why authentication is handled in Open WebUI

The free version of LiteLLM only supports MFA for up to five users. To support a larger internal user base with proper multi-factor authentication and token management, we use Open WebUI as a separate, dedicated authentication and access control component in front of LiteLLM.

MFA setup with SwitchID (placeholder)

Open WebUI is integrated with SwitchID to enforce multi-factor authentication for MLMP users.

The following env variables are set in the docker compose file:

  - WEBUI_URL=https://inference.mlmp.ti.bfh.ch # defines the callback url 
  - OAUTH_CLIENT_ID=bfh_oidc_client_48647
  - ENABLE_OAUTH_SIGNUP=True # enforces SSO 
  - OPENID_PROVIDER_URL=https://login.eduid.ch/.well-known/openid-configuration
  - ENABLE_OAUTH_PERSISTENT_CONFIG=False
  - ENABLE_LOGIN_FORM=False # MUST be set to False if ENABLE_OAUTH_SIGNUP is true

In addition, the following secret must be set in .env:

OAUTH_CLIENT_SECRET=

These information is shared with BF-HIT services (Contact: Olga Kurz), who configured the SSO endpoint.

LiteLLM (on `duodecillion.ti.bfh.ch`)

Responsibility:
Acts as the central routing layer between Open WebUI and all model backends.
Maintains the model catalog (which models are available, and where).
Handles usage tracking for requests per model / endpoint.
Configuration:
Configuration of LitelLM is done via the config.yml file located in the docker compose root folder. In the config file, system behaviour as well as resources (e.g., models) are defined.
Fallback or failover logic is not yet implemented; routing is currently direct, based on the configured endpoints.

LM Studio (on Mac Studios)

Responsibility:
Runs on all six Mac Studios (kalliope, klio, erato, euterpe, urania, thalia), providing model serving via LM Studio.
Current usage:
kalliope (LM Studio):
- Dedicated to running minimax2.7.
klio (LM Studio):
- Dedicated to running qwen3.5.
erato (LM Studio):
- Dedicated to running gpt-oss:120b.
euterpe (LM Studio):
- Dedicated to running gpt-oss:120b.
urania (LM Studio):
- Dedicated to running nemotron-3-super.
thalia (LM Studio):
- Hosts multiple smaller models that are acceptable to hot-load on demand.
- This node is optimized for flexibility rather than maximum single-model capacity.
Details:
Specific model set, hot-loading strategy, and LM Studio configuration are defined and maintained in the infrastructure/deployment repository.

Request Flow

Conceptually, the request flow from a user to a model and back is:

User authenticates in Open WebUI (MFA) and obtains an access token.
User sends a request (e.g. chat/completion) to Open WebUI, including the access token.
Open WebUI forwards the request to the configured LiteLLM endpoint on duodecillion.ti.bfh.ch.
LiteLLM looks up the requested model in its configuration and routes the request to the appropriate LM Studio backend node (for example kalliope for minimax2.7, klio for qwen3.5, erato/euterpe for gpt-oss:120b, urania for nemotron-3-super, or thalia for smaller hot-loaded models).
The selected Mac Studio performs inference and returns the response to LiteLLM.
LiteLLM records usage, then returns the normalized response to Open WebUI.
Open WebUI returns the final response to the user.

Sequence diagram (high level)

sequenceDiagram
    participant U as User
    participant OW as Open WebUI (duodecillion.ti.bfh.ch)
    participant LL as LiteLLM (duodecillion.ti.bfh.ch)
    participant B as Inference Backend (LM Studio)

    U->>OW: Authenticate (MFA), obtain access token
    U->>OW: Send request + access token
    OW->>LL: Forward request to LiteLLM
    LL->>B: Route request to selected model endpoint
    B-->>LL: Model inference result
    LL-->>OW: Normalized response (with usage tracking)
    OW-->>U: Response to user

Hostnames and deployment details (services, ports, resource limits, etc.) for all these nodes live in the infrastructure repository (Docker Compose files and associated scripts).
Any future additions (new models, new Mac Studios) should be:
Added to the LiteLLM configuration (model name, endpoint URL, metadata).
Deployed via the standard Docker Compose / deployment pipeline used for LM Studio.

Data security

Logging of all request and response data must be deactivated on all subcomponents. Therefore, user data is never exposed to admins.

Openwebui

In the compose file, set the following envs to deactivate logging:

- GLOBAL_LOG_LEVEL=WARNING
- ENABLE_ADMIN_CHAT_ACCESS=False
- ENABLE_ADMIN_EXPORT=False

LiteLLM

In the compose file, set the following env: `LITELLM_LOG: "ERROR"``

In the config.yaml, set the following variables:

general_settings:
  store_prompts_in_spend_logs: False
  disable_error_logs: True 
litellm_settings:
  global_disable_no_log_param: True
  turn_off_message_logging: True
  json_logs: False
  set_verbose: False

LM Studio

Message logging must be configured via the UI (Developer Logs tab at the bottom of the window) as follows:

Verbose logging: off
Redact Content: on
Log Incoming Tokens: off
File Logging Mode: off

Monitoring

All inference endpoints (LM Studio hosts) as well as Open WebUI and LiteLLM on duodecillion.ti.bfh.ch are monitored with Uptime Kuma.

For the metrics and dashboard logging path (LiteLLM metrics, Prometheus scrape, external Grafana access, and NGINX allowlisting), see Inference Logging and Metrics.

See Uptime Kuma Setup for details on the monitoring stack.

Inference Architecture