Inference Architecture
Overview
Inference in MLMP is provided by a single application server (duodecillion.ti.bfh.ch) in front of a pool of six Mac Studio machines that host the actual models. The application server runs Open WebUI and LiteLLM, which together handle authentication, access token management, request routing, and usage tracking. The Mac Studios run models via Exo and LM Studio.
At a high level:
- Open WebUI is the user-facing entry point, responsible for MFA and token issuance.
- LiteLLM is the single upstream endpoint configured in Open WebUI, responsible for model catalog, routing to inference backends, and usage tracking.
- Exo and LM Studio run on Mac Studios and serve the actual model inference.
Model Hosting Topology
At a high level, the topology looks like this:
```mermaid
flowchart LR
    subgraph AppServer["Application Server: duodecillion.ti.bfh.ch"]
        OW["Open WebUI"]
        LL["LiteLLM"]
        OW --> LL
    end

    %% LiteLLM connects to a single head node
    LL --> EXO1

    subgraph EXO_CLUSTER["Exo Thunderbolt Cluster (4× Mac Studio)"]
        EXO1["Mac Studio 1<br/>(Exo Head / API Endpoint)"]
        EXO2["Mac Studio 2<br/>(Worker)"]
        EXO3["Mac Studio 3<br/>(Worker)"]
        EXO4["Mac Studio 4<br/>(Worker)"]
        %% fully connected fabric
        EXO1 --- EXO2
        EXO1 --- EXO3
        EXO1 --- EXO4
        EXO2 --- EXO3
        EXO2 --- EXO4
        EXO3 --- EXO4
    end

    %% Separate LM Studio machines
    LL --> LM1["Mac Studio (LM Studio)<br/>gpt-oss:120b"]
    LL --> LM2["Mac Studio (LM Studio)<br/>smaller hot-loaded models"]
```
Components
Open WebUI (on duodecillion.ti.bfh.ch)
- Responsibility:
- Provides the primary UI/API for end users.
- Handles multi-factor authentication (MFA).
- Issues and manages access tokens for users.
- Integration with LiteLLM:
- Configured with a single upstream endpoint, which is the LiteLLM service.
- After authenticating, users send their requests (with access tokens) to Open WebUI; Open WebUI forwards these requests to LiteLLM without exposing the underlying model hosts.
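To illustrate this integration, a sketch of the relevant Docker Compose entries for pointing Open WebUI at LiteLLM as its single OpenAI-compatible upstream. The internal hostname, port, and key shown here are assumptions, not the actual deployment values, which live in the infrastructure repository:

```yaml
# docker-compose.yml (excerpt) — service name, port, and key are hypothetical
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      # Single upstream endpoint: the LiteLLM service
      - OPENAI_API_BASE_URL=http://litellm:4000/v1
      # Key used by Open WebUI for its upstream requests to LiteLLM
      - OPENAI_API_KEY=sk-example-litellm-key
```

Because only this one upstream is configured, end users never see the addresses of the underlying model hosts.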
Why authentication is handled in Open WebUI
The free version of LiteLLM only supports MFA for up to five users. To support a larger internal user base with proper multi-factor authentication and token management, we use Open WebUI as a separate, dedicated authentication and access control component in front of LiteLLM.
MFA setup with SwitchID (placeholder)
Open WebUI is integrated with SwitchID to enforce multi-factor authentication for MLMP users.
The following env variables are set in the docker compose file:
- WEBUI_URL=https://inference.mlmp.ti.bfh.ch # defines the callback url
- OAUTH_CLIENT_ID=bfh_oidc_client_48647
- ENABLE_OAUTH_SIGNUP=True # enforces SSO
- OPENID_PROVIDER_URL=https://login.eduid.ch/.well-known/openid-configuration
- ENABLE_OAUTH_PERSISTENT_CONFIG=False
- ENABLE_LOGIN_FORM=False # MUST be set to False if ENABLE_OAUTH_SIGNUP is true
In addition, the following secret must be set in .env:
OAUTH_CLIENT_SECRET=
This information is shared with BF-HIT services (contact: Olga Kurz), who configured the SSO endpoint.
LiteLLM (on duodecillion.ti.bfh.ch)
- Responsibility:
- Acts as the central routing layer between Open WebUI and all model backends.
- Maintains the model catalog (which models are available, and where).
- Handles usage tracking for requests per model / endpoint.
- Configuration:
- Configuration of LiteLLM is done via the `config.yml` file located in the docker compose root folder. The config file defines system behaviour as well as resources (e.g., models).
- Fallback or failover logic is not yet implemented; routing is currently direct, based on the configured endpoints.
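For illustration, a minimal `config.yml` sketch registering one Exo-hosted and one LM Studio-hosted model. The hostnames, ports, model names, and keys below are placeholders; the real values are maintained in the deployment repository:

```yaml
# config.yml (sketch) — all endpoints and names are hypothetical
model_list:
  # Large model served by the Exo cluster head node
  - model_name: kimi-k2
    litellm_params:
      model: openai/kimi-k2            # openai/ prefix = OpenAI-compatible backend
      api_base: http://mac-studio-1.example:52415/v1
      api_key: "none"

  # Model served by one of the LM Studio nodes
  - model_name: gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: http://mac-studio-5.example:1234/v1
      api_key: "none"
```

Each `model_name` is what clients request through Open WebUI; the `api_base` decides which Mac Studio actually serves it.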
Exo (on Mac Studios)
- Responsibility:
- Runs on four of the six Mac Studios.
- Used to host the largest currently available models.
- Current usage:
- These nodes are primarily used to run very large models.
- Each Exo instance exposes one or more HTTP endpoints, which are registered in the LiteLLM configuration as model backends.
- Details:
- Hostnames, ports, and Exo-specific settings are documented in the deployment repository (see Docker Compose / service definitions).
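Since the Exo endpoints registered in LiteLLM are OpenAI-compatible, the backends receive standard chat-completion requests. A minimal sketch of such a request body (the endpoint URL and model name are assumptions; the request is built but deliberately not sent):

```python
import json

# Hypothetical Exo head-node endpoint; the real host/port are in the deployment repo
EXO_ENDPOINT = "http://mac-studio-1.example:52415/v1/chat/completions"

# Standard OpenAI-compatible chat-completion payload
payload = {
    "model": "kimi-k2",
    "messages": [{"role": "user", "content": "Ping"}],
    "stream": False,
}

body = json.dumps(payload)
# An HTTP POST of `body` to EXO_ENDPOINT would return a chat.completion object.
```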
LM Studio (on Mac Studios)
- Responsibility:
- Runs on the remaining two Mac Studios, providing model serving via LM Studio.
- Current usage:
- Mac Studio A (LM Studio):
- Dedicated to running `gpt-oss:120b` as a large model endpoint.
- Mac Studio B (LM Studio):
- Hosts multiple smaller models that are acceptable to hot-load on demand.
- This node is optimized for flexibility rather than maximum single-model capacity.
- Details:
- Specific model set, hot-loading strategy, and LM Studio configuration are defined and maintained in the infrastructure/deployment repository.
Request Flow
Conceptually, the request flow from a user to a model and back is:
1. The user authenticates in Open WebUI (MFA) and obtains an access token.
2. The user sends a request (e.g. chat/completion) to Open WebUI, including the access token.
3. Open WebUI forwards the request to the configured LiteLLM endpoint on `duodecillion.ti.bfh.ch`.
4. LiteLLM looks up the requested model in its configuration and routes the request to the appropriate inference backend:
   - one of the Exo nodes (for large models such as `kimi-k2`), or
   - one of the LM Studio nodes (for `gpt-oss:120b` or smaller hot-loaded models).
5. The selected Mac Studio performs inference and returns the response to LiteLLM.
6. LiteLLM records usage, then returns the normalized response to Open WebUI.
7. Open WebUI returns the final response to the user.
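The flow above can be exercised programmatically against Open WebUI's OpenAI-compatible API. The following sketch only builds the request; the API path, model name, and token are assumptions, not verified deployment values:

```python
import json
import urllib.request

BASE_URL = "https://inference.mlmp.ti.bfh.ch"   # user-facing Open WebUI host
TOKEN = "sk-example-token"                      # access token issued by Open WebUI

payload = {
    "model": "gpt-oss:120b",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Build the request; Open WebUI would forward it to LiteLLM, which routes it to a backend.
req = urllib.request.Request(
    f"{BASE_URL}/api/chat/completions",          # Open WebUI chat endpoint (assumed path)
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here to keep the sketch offline.
```

Note that the user only ever talks to Open WebUI; the Exo and LM Studio hosts are never exposed.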
Sequence diagram (high level)
```mermaid
sequenceDiagram
    participant U as User
    participant OW as Open WebUI (duodecillion.ti.bfh.ch)
    participant LL as LiteLLM (duodecillion.ti.bfh.ch)
    participant B as Inference Backend (Exo / LM Studio)

    U->>OW: Authenticate (MFA), obtain access token
    U->>OW: Send request + access token
    OW->>LL: Forward request to LiteLLM
    LL->>B: Route request to selected model endpoint
    B-->>LL: Model inference result
    LL-->>OW: Normalized response (with usage tracking)
    OW-->>U: Response to user
```
- Hostnames and deployment details (services, ports, resource limits, etc.) for all these nodes live in the infrastructure repository (Docker Compose files and associated scripts).
- Any future additions (new models, new Mac Studios) should be:
- Added to the LiteLLM configuration (model name, endpoint URL, metadata).
- Deployed via the standard Docker Compose / deployment pipeline used for Exo and LM Studio.
Data security
Logging of all request and response data must be deactivated on every subcomponent, so that user data is never exposed to administrators.
Open WebUI
In the compose file, set the following environment variables to deactivate logging:
- GLOBAL_LOG_LEVEL=WARNING
- ENABLE_ADMIN_CHAT_ACCESS=False
- ENABLE_ADMIN_EXPORT=False
LiteLLM
In the compose file, set the following environment variable: `LITELLM_LOG: "ERROR"`
In the `config.yaml`, set the following variables:

```yaml
general_settings:
  store_prompts_in_spend_logs: False
  disable_error_logs: True

litellm_settings:
  global_disable_no_log_param: True
  turn_off_message_logging: True
  json_logs: False
  set_verbose: False
```
LM Studio
Message logging must be configured via the UI (Developer Logs tab at the bottom of the window) as follows:
- Verbose logging: off
- Redact Content: on
- Log Incoming Tokens: off
- File Logging Mode: off