Inference Architecture
Overview
Inference in MLMP is provided by a single application server (duodecillion.ti.bfh.ch) in front of a pool of six Mac Studio machines that host the actual models. The application server runs Open WebUI and LiteLLM, which together handle authentication, access token management, request routing, and usage tracking. The Mac Studios run models via Exo and LM Studio.
At a high level:
- Open WebUI is the user-facing entry point, responsible for MFA and token issuance.
- LiteLLM is the single upstream endpoint configured in Open WebUI, responsible for model catalog, routing to inference backends, and usage tracking.
- Exo and LM Studio run on Mac Studios and serve the actual model inference.
Model Hosting Topology
At a high level, the topology looks like this:
```mermaid
flowchart LR
    subgraph AppServer["Application Server: duodecillion.ti.bfh.ch"]
        OW["Open WebUI"]
        LL["LiteLLM"]
        OW --> LL
    end

    %% LiteLLM connects to a single head node
    LL --> EXO1

    subgraph EXO_CLUSTER["Exo Thunderbolt Cluster (4× Mac Studio)"]
        EXO1["Mac Studio 1<br/>(Exo Head / API Endpoint)"]
        EXO2["Mac Studio 2<br/>(Worker)"]
        EXO3["Mac Studio 3<br/>(Worker)"]
        EXO4["Mac Studio 4<br/>(Worker)"]
        %% fully connected fabric
        EXO1 --- EXO2
        EXO1 --- EXO3
        EXO1 --- EXO4
        EXO2 --- EXO3
        EXO2 --- EXO4
        EXO3 --- EXO4
    end

    %% Separate LM Studio machines
    LL --> LM1["Mac Studio (LM Studio)<br/>gpt-oss:120b"]
    LL --> LM2["Mac Studio (LM Studio)<br/>smaller hot-loaded models"]
```
Components
Open WebUI (on duodecillion.ti.bfh.ch)
- Responsibility:
- Provides the primary UI/API for end users.
- Handles multi-factor authentication (MFA).
- Issues and manages access tokens for users.
- Integration with LiteLLM:
- Configured with a single upstream endpoint, which is the LiteLLM service.
- After authenticating, users send their requests (with access tokens) to Open WebUI; Open WebUI forwards these requests to LiteLLM without exposing the underlying model hosts.
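To illustrate this integration, a sketch of the relevant Docker Compose entries for pointing Open WebUI at LiteLLM as its single OpenAI-compatible upstream. The internal hostname, port, and key shown here are assumptions, not the actual deployment values, which live in the infrastructure repository:

```yaml
# docker-compose.yml (excerpt) — service name, port, and key are hypothetical
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      # Single upstream endpoint: the LiteLLM service
      - OPENAI_API_BASE_URL=http://litellm:4000/v1
      # Key used by Open WebUI for its upstream requests to LiteLLM
      - OPENAI_API_KEY=sk-example-litellm-key
```

Because only this one upstream is configured, end users never see the addresses of the underlying model hosts.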
Why authentication is handled in Open WebUI
The free version of LiteLLM only supports MFA for up to five users. To support a larger internal user base with proper multi-factor authentication and token management, we use Open WebUI as a separate, dedicated authentication and access control component in front of LiteLLM.
MFA setup with SwitchID (placeholder)
Open WebUI is integrated with SwitchID to enforce multi-factor authentication for MLMP users.
The following env variables are set in the docker compose file:
- WEBUI_URL=https://inference.mlmp.ti.bfh.ch # defines the callback url
- OAUTH_CLIENT_ID=bfh_oidc_client_48647
- ENABLE_OAUTH_SIGNUP=True # enforces SSO
- OPENID_PROVIDER_URL=https://login.eduid.ch/.well-known/openid-configuration
- ENABLE_OAUTH_PERSISTENT_CONFIG=False
- ENABLE_LOGIN_FORM=False # MUST be set to False if ENABLE_OAUTH_SIGNUP is true
In addition, the following secret must be set in .env:
OAUTH_CLIENT_SECRET=
This information is shared with BF-HIT services (contact: Olga Kurz), who configured the SSO endpoint.
LiteLLM (on duodecillion.ti.bfh.ch)
- Responsibility:
- Acts as the central routing layer between Open WebUI and all model backends.
- Maintains the model catalog (which models are available, and where).
- Handles usage tracking for requests per model / endpoint.
- Configuration:
- Configuration of LiteLLM is done via the `config.yml` file located in the docker compose root folder. The config file defines system behaviour as well as resources (e.g., models).
- Fallback or failover logic is not yet implemented; routing is currently direct, based on the configured endpoints.
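For illustration, a minimal `config.yml` sketch registering one Exo-hosted and one LM Studio-hosted model. The hostnames, ports, model names, and keys below are placeholders; the real values are maintained in the deployment repository:

```yaml
# config.yml (sketch) — all endpoints and names are hypothetical
model_list:
  # Large model served by the Exo cluster head node
  - model_name: kimi-k2
    litellm_params:
      model: openai/kimi-k2            # openai/ prefix = OpenAI-compatible backend
      api_base: http://mac-studio-1.example:52415/v1
      api_key: "none"

  # Model served by one of the LM Studio nodes
  - model_name: gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: http://mac-studio-5.example:1234/v1
      api_key: "none"
```

Each `model_name` is what clients request through Open WebUI; the `api_base` decides which Mac Studio actually serves it.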
Exo (on Mac Studios)
- Responsibility:
- Runs on four of the six Mac Studios.
- Used to host the largest currently available models.
- Current usage:
- These nodes are primarily used to run very large models.
- Each Exo instance exposes one or more HTTP endpoints, which are registered in the LiteLLM configuration as model backends.
- Details:
- Hostnames, ports, and Exo-specific settings are documented in the deployment repository (see Docker Compose / service definitions).
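Since the Exo endpoints registered in LiteLLM are OpenAI-compatible, the backends receive standard chat-completion requests. A minimal sketch of such a request body (the endpoint URL and model name are assumptions; the request is built but deliberately not sent):

```python
import json

# Hypothetical Exo head-node endpoint; the real host/port are in the deployment repo
EXO_ENDPOINT = "http://mac-studio-1.example:52415/v1/chat/completions"

# Standard OpenAI-compatible chat-completion payload
payload = {
    "model": "kimi-k2",
    "messages": [{"role": "user", "content": "Ping"}],
    "stream": False,
}

body = json.dumps(payload)
# An HTTP POST of `body` to EXO_ENDPOINT would return a chat.completion object.
```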
LM Studio (on Mac Studios)
- Responsibility:
- Runs on the remaining two Mac Studios, providing model serving via LM Studio.
- Current usage:
- Mac Studio A (LM Studio):
- Dedicated to running `gpt-oss:120b` as a large model endpoint.
- Mac Studio B (LM Studio):
- Hosts multiple smaller models that are acceptable to hot-load on demand.
- This node is optimized for flexibility rather than maximum single-model capacity.
- Details:
- Specific model set, hot-loading strategy, and LM Studio configuration are defined and maintained in the infrastructure/deployment repository.
Request Flow
Conceptually, the request flow from a user to a model and back is:
1. The user authenticates in Open WebUI (MFA) and obtains an access token.
2. The user sends a request (e.g. chat/completion) to Open WebUI, including the access token.
3. Open WebUI forwards the request to the configured LiteLLM endpoint on `duodecillion.ti.bfh.ch`.
4. LiteLLM looks up the requested model in its configuration and routes the request to the appropriate inference backend:
   - one of the Exo nodes (for large models such as `kimi-k2`), or
   - one of the LM Studio nodes (for `gpt-oss:120b` or smaller hot-loaded models).
5. The selected Mac Studio performs inference and returns the response to LiteLLM.
6. LiteLLM records usage, then returns the normalized response to Open WebUI.
7. Open WebUI returns the final response to the user.
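The flow above can be exercised programmatically against Open WebUI's OpenAI-compatible API. The following sketch only builds the request; the API path, model name, and token are assumptions, not verified deployment values:

```python
import json
import urllib.request

BASE_URL = "https://inference.mlmp.ti.bfh.ch"   # user-facing Open WebUI host
TOKEN = "sk-example-token"                      # access token issued by Open WebUI

payload = {
    "model": "gpt-oss:120b",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Build the request; Open WebUI would forward it to LiteLLM, which routes it to a backend.
req = urllib.request.Request(
    f"{BASE_URL}/api/chat/completions",          # Open WebUI chat endpoint (assumed path)
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here to keep the sketch offline.
```

Note that the user only ever talks to Open WebUI; the Exo and LM Studio hosts are never exposed.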
Sequence diagram (high level)
```mermaid
sequenceDiagram
    participant U as User
    participant OW as Open WebUI (duodecillion.ti.bfh.ch)
    participant LL as LiteLLM (duodecillion.ti.bfh.ch)
    participant B as Inference Backend (Exo / LM Studio)

    U->>OW: Authenticate (MFA), obtain access token
    U->>OW: Send request + access token
    OW->>LL: Forward request to LiteLLM
    LL->>B: Route request to selected model endpoint
    B-->>LL: Model inference result
    LL-->>OW: Normalized response (with usage tracking)
    OW-->>U: Response to user
```
- Hostnames and deployment details (services, ports, resource limits, etc.) for all these nodes live in the infrastructure repository (Docker Compose files and associated scripts).
- Any future additions (new models, new Mac Studios) should be:
- Added to the LiteLLM configuration (model name, endpoint URL, metadata).
- Deployed via the standard Docker Compose / deployment pipeline used for Exo and LM Studio.
Data security
Logging of all request and response data must be deactivated on every subcomponent, so that user data is never exposed to administrators.
Open WebUI
In the compose file, set the following environment variables to deactivate logging:
- GLOBAL_LOG_LEVEL=WARNING
- ENABLE_ADMIN_CHAT_ACCESS=False
- ENABLE_ADMIN_EXPORT=False
LiteLLM
In the compose file, set the following environment variable: `LITELLM_LOG: "ERROR"`
In the `config.yaml`, set the following variables:

```yaml
general_settings:
  store_prompts_in_spend_logs: False
  disable_error_logs: True

litellm_settings:
  global_disable_no_log_param: True
  turn_off_message_logging: True
  json_logs: False
  set_verbose: False
```
LM Studio
Message logging must be configured via the UI (Developer Logs tab at the bottom of the window) as follows:
- Verbose logging: off
- Redact Content: on
- Log Incoming Tokens: off
- File Logging Mode: off