Requirements for deploying PEFT models
Review supported model architectures, software requirements, and hardware requirements for deploying fine-tuned models that are trained with PEFT techniques.
Supported model architectures
Models that are trained with supported architectures can be deployed by using watsonx.ai.
The following base models are supported for fine-tuning with the LoRA technique and can be deployed with watsonx.ai:
Model architecture | Model | PEFT technique |
---|---|---|
Granite | ibm/granite-3-1-8b-base | LoRA |
Llama | meta-llama/llama-3-1-8b, meta-llama/llama-3-1-70b | LoRA |
Software requirements
You can use the watsonx-cfm-caikit-1.1 software specification, which is based on the vLLM runtime engine, to deploy your fine-tuned model that is trained with a PEFT technique.
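For example, with the ibm-watsonx-ai Python SDK you can look up this software specification by name and keep its ID for later use when you store the fine-tuned model asset. This is a minimal sketch, not a complete deployment flow; the endpoint URL, API key, and space ID are placeholders you must replace with your own values.

```python
from ibm_watsonx_ai import APIClient, Credentials

# Placeholder credentials; replace with your own endpoint, API key, and space ID.
credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key="<YOUR_API_KEY>",
)
client = APIClient(credentials, space_id="<YOUR_SPACE_ID>")

# Look up the ID of the vLLM-based software specification by name.
sw_spec_id = client.software_specifications.get_id_by_name("watsonx-cfm-caikit-1.1")
print(sw_spec_id)
```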
Hardware requirements
Although PEFT uses less memory than instruction fine-tuning, deploying a PEFT model is still resource-intensive and requires GPU resources to be available.
The predefined hardware specifications WX-S, WX-M, WX-L, and WX-XL apply only to these standard supported hardware configurations:
- NVIDIA A100 with 80 GB of GPU memory
- NVIDIA H100 with 80 GB of GPU memory
If your GPU configuration is different (for example, NVIDIA L40S with 48 GB of GPU memory), you must create a custom hardware specification. For details, see Creating a custom hardware specification.
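As an illustration, a custom hardware specification might be created through the SDK's hardware_specifications client. This is a sketch under the assumption that your environment permits storing custom hardware specifications; the name and node sizing values mirror a single L40S GPU and are placeholders, not recommended settings.

```python
# Sketch: create a custom hardware specification for a single NVIDIA L40S GPU.
# Assumes `client` is an authenticated ibm_watsonx_ai APIClient (see earlier sketch);
# the sizing values below are illustrative placeholders.
meta_props = {
    client.hardware_specifications.ConfigurationMetaNames.NAME: "custom-l40s-1gpu",
    client.hardware_specifications.ConfigurationMetaNames.DESCRIPTION: "1x L40S, 48 GB GPU memory",
    client.hardware_specifications.ConfigurationMetaNames.NODES: {
        "cpu": {"units": "2"},
        "mem": {"size": "60Gi"},
        "gpu": {"num_gpu": 1},
    },
}
hw_spec_details = client.hardware_specifications.store(meta_props)
hw_spec_id = client.hardware_specifications.get_id(hw_spec_details)
print(hw_spec_id)
```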
Supported hardware specifications
When you deploy a base foundation model with PEFT adapters (LoRA or QLoRA), choose a hardware specification that matches the parameter count of the base model and the number of adapters that you plan to use.
You can use the following predefined hardware specifications for deployment:
Parameter range | Hardware specification | Resources available |
---|---|---|
1B to 20B | WX-S | 1 GPU, 2 CPUs, 60 GB memory |
21B to 40B | WX-M | 2 GPUs, 3 CPUs, 120 GB memory |
41B to 80B | WX-L | 4 GPUs, 5 CPUs, 240 GB memory |
81B to 200B | WX-XL | 8 GPUs, 9 CPUs, 600 GB memory |
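The following sketch encodes the table above as a small helper that picks a predefined specification from a base model's parameter count. The function is illustrative and not part of any SDK.

```python
# Illustrative helper (not part of any SDK): map a base model's parameter
# count, in billions, to the predefined hardware specification above.
def pick_hardware_spec(params_billion: float) -> str:
    if params_billion <= 20:
        return "WX-S"   # 1 GPU, 2 CPUs, 60 GB memory
    if params_billion <= 40:
        return "WX-M"   # 2 GPUs, 3 CPUs, 120 GB memory
    if params_billion <= 80:
        return "WX-L"   # 4 GPUs, 5 CPUs, 240 GB memory
    if params_billion <= 200:
        return "WX-XL"  # 8 GPUs, 9 CPUs, 600 GB memory
    raise ValueError("Base models above 200B parameters are not supported.")

print(pick_hardware_spec(8))   # WX-S, for example ibm/granite-3-1-8b-base
print(pick_hardware_spec(70))  # WX-L, for example meta-llama/llama-3-1-70b
```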
Supported deployment types
You can create an online deployment for PEFT models. Online deployment allows for real-time inferencing and is suitable for applications that require low-latency predictions.
Batch deployments are not currently supported for deploying PEFT models.
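As a hedged sketch, an online deployment can be created with the ibm-watsonx-ai SDK roughly as follows. Here `model_asset_id` is a placeholder for the stored fine-tuned model asset, and the hardware specification name is one of the predefined values listed above.

```python
# Sketch: create an online deployment for a stored PEFT model asset.
# Assumes `client` is an authenticated APIClient and `model_asset_id`
# refers to the fine-tuned model asset you stored earlier (placeholder).
model_asset_id = "<YOUR_MODEL_ASSET_ID>"

deployment_props = {
    client.deployments.ConfigurationMetaNames.NAME: "peft-lora-online",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
    client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {"name": "WX-S"},
}
deployment = client.deployments.create(model_asset_id, meta_props=deployment_props)
deployment_id = client.deployments.get_id(deployment)
print(deployment_id)
```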
Parent topic: Deploying PEFT models