Requirements for deploying PEFT models
Review supported model architectures, software requirements, and hardware requirements for deploying fine-tuned models that are trained with PEFT techniques.
Supported model architectures
Models that are trained with supported architectures can be deployed by using watsonx.ai.
The following base models are supported for fine-tuning with the LoRA technique and can be deployed with watsonx.ai:
Model architecture | Model | PEFT technique |
---|---|---|
Granite | ibm/granite-3-1-8b-base | LoRA |
Llama | meta-llama/llama-3-1-8b, meta-llama/llama-3-1-70b | LoRA |
Software requirements
You can use the watsonx-cfm-caikit-1.1 software specification, which is based on the vLLM runtime engine, to deploy your fine-tuned model that is trained with a PEFT technique.
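For example, with the ibm-watsonx-ai Python SDK you can look up this software specification by name and keep its ID for later use when you store the fine-tuned model asset. This is a minimal sketch, not a complete deployment flow; the endpoint URL, API key, and space ID are placeholders you must replace with your own values.

```python
from ibm_watsonx_ai import APIClient, Credentials

# Placeholder credentials; replace with your own endpoint, API key, and space ID.
credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key="<YOUR_API_KEY>",
)
client = APIClient(credentials, space_id="<YOUR_SPACE_ID>")

# Look up the ID of the vLLM-based software specification by name.
sw_spec_id = client.software_specifications.get_id_by_name("watsonx-cfm-caikit-1.1")
print(sw_spec_id)
```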
Hardware requirements
Although PEFT uses less memory than instruction fine-tuning, deploying a PEFT model is still resource-intensive and requires GPU resources to be available.
The predefined hardware specifications WX-S, WX-M, WX-L, and WX-XL apply only to these standard supported hardware configurations:
- NVIDIA A100 with 80 GB of GPU memory
- NVIDIA H100 with 80 GB of GPU memory
If your GPU configuration is different (for example, NVIDIA L40S with 48 GB of GPU memory), you must create a custom hardware specification. For details, see Creating a custom hardware specification.
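As an illustration, a custom hardware specification might be created through the SDK's hardware_specifications client. This is a sketch under the assumption that your environment permits storing custom hardware specifications; the name and node sizing values mirror a single L40S GPU and are placeholders, not recommended settings.

```python
# Sketch: create a custom hardware specification for a single NVIDIA L40S GPU.
# Assumes `client` is an authenticated ibm_watsonx_ai APIClient (see earlier sketch);
# the sizing values below are illustrative placeholders.
meta_props = {
    client.hardware_specifications.ConfigurationMetaNames.NAME: "custom-l40s-1gpu",
    client.hardware_specifications.ConfigurationMetaNames.DESCRIPTION: "1x L40S, 48 GB GPU memory",
    client.hardware_specifications.ConfigurationMetaNames.NODES: {
        "cpu": {"units": "2"},
        "mem": {"size": "60Gi"},
        "gpu": {"num_gpu": 1},
    },
}
hw_spec_details = client.hardware_specifications.store(meta_props)
hw_spec_id = client.hardware_specifications.get_id(hw_spec_details)
print(hw_spec_id)
```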
Supported hardware specifications
When you deploy a base foundation model with PEFT adapters (LoRA or QLoRA), choose a hardware specification that matches the parameter count of the base model and the number of adapters that you plan to use.
You can use the following predefined hardware specifications for deployment:
Parameter range | Hardware specification | Resources available |
---|---|---|
1B to 20B | WX-S | 1 GPU, 2 CPUs, 60 GB memory |
21B to 40B | WX-M | 2 GPUs, 3 CPUs, 120 GB memory |
41B to 80B | WX-L | 4 GPUs, 5 CPUs, 240 GB memory |
81B to 200B | WX-XL | 8 GPUs, 9 CPUs, 600 GB memory |
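The following sketch encodes the table above as a small helper that picks a predefined specification from a base model's parameter count. The function is illustrative and not part of any SDK.

```python
# Illustrative helper (not part of any SDK): map a base model's parameter
# count, in billions, to the predefined hardware specification above.
def pick_hardware_spec(params_billion: float) -> str:
    if params_billion <= 20:
        return "WX-S"   # 1 GPU, 2 CPUs, 60 GB memory
    if params_billion <= 40:
        return "WX-M"   # 2 GPUs, 3 CPUs, 120 GB memory
    if params_billion <= 80:
        return "WX-L"   # 4 GPUs, 5 CPUs, 240 GB memory
    if params_billion <= 200:
        return "WX-XL"  # 8 GPUs, 9 CPUs, 600 GB memory
    raise ValueError("Base models above 200B parameters are not supported.")

print(pick_hardware_spec(8))   # WX-S, for example ibm/granite-3-1-8b-base
print(pick_hardware_spec(70))  # WX-L, for example meta-llama/llama-3-1-70b
```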
Supported deployment types
You can create an online deployment for PEFT models. Online deployment allows for real-time inferencing and is suitable for applications that require low-latency predictions.
Batch deployments are not currently supported for deploying PEFT models.
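As a hedged sketch, an online deployment can be created with the ibm-watsonx-ai SDK roughly as follows. Here `model_asset_id` is a placeholder for the stored fine-tuned model asset, and the hardware specification name is one of the predefined values listed above.

```python
# Sketch: create an online deployment for a stored PEFT model asset.
# Assumes `client` is an authenticated APIClient and `model_asset_id`
# refers to the fine-tuned model asset you stored earlier (placeholder).
model_asset_id = "<YOUR_MODEL_ASSET_ID>"

deployment_props = {
    client.deployments.ConfigurationMetaNames.NAME: "peft-lora-online",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
    client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {"name": "WX-S"},
}
deployment = client.deployments.create(model_asset_id, meta_props=deployment_props)
deployment_id = client.deployments.get_id(deployment)
print(deployment_id)
```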
Parent topic: Deploying PEFT models