Large language model (LLM) training has surged in popularity over the last year with the release of several popular models such as Llama 2, Falcon, and Mistral. Customers are now pre-training and fine-tuning LLMs ranging from 1 billion to over 175 billion parameters to optimize model performance for applications across industries, from healthcare to finance and marketing.
Training performant models at this scale can be a challenge. Highly accurate LLMs can require terabytes of training data and thousands or even millions of hours of accelerator compute time to achieve target accuracy. To complete training and launch products in a timely manner, customers rely on parallelism techniques to distribute this enormous workload across up to thousands of accelerator devices. However, these parallelism techniques can be difficult to use: different techniques and libraries are only compatible with certain workloads or restricted to certain model architectures, training performance can be highly sensitive to obscure configurations, and the state of the art is quickly evolving. As a result, machine learning practitioners can spend weeks of preparation scaling their LLM workloads to large clusters of GPUs.
In this post, we highlight new features of the Amazon SageMaker model parallel (SMP) library that simplify the large model training process and help you train LLMs faster. In particular, we cover the SMP library's new simplified user experience that builds on open source PyTorch Fully Sharded Data Parallel (FSDP) APIs, expanded tensor parallel functionality that enables training models with hundreds of billions of parameters, and performance optimizations that reduce model training time and cost by up to 20%.
To learn more about the SageMaker model parallel library, refer to the SageMaker model parallelism library v2 documentation. You can also refer to our example notebooks to get started.
New features that simplify and accelerate large model training
This post discusses the latest features included in the v2.0 release of the SageMaker model parallel library. These features improve the usability of the library, expand functionality, and accelerate training. In the following sections, we summarize the new features and discuss how you can use the library to accelerate your large model training.
Aligning SMP with open source PyTorch
Since its launch in 2020, SMP has enabled high-performance, large-scale training on SageMaker compute instances. With this latest major version release, the library simplifies the user experience by aligning its APIs with open source PyTorch.
PyTorch offers Fully Sharded Data Parallelism (FSDP) as its main method for supporting large training workloads across many compute devices. As demonstrated in the following code snippet, SMP's updated APIs for techniques such as sharded data parallelism mirror those of PyTorch. You can simply run import torch.sagemaker and use it in place of torch.
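As a minimal sketch (the model below is a placeholder, and it assumes SMP v2 is installed in your training container), an existing FSDP script only needs the torch.sagemaker import and initialization added; the rest is the standard open source PyTorch FSDP workflow:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

import torch.sagemaker as tsm
tsm.init()  # reads the SMP configuration passed when the training job is launched

# Placeholder model; your own model definition and training loop stay unchanged
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Wrap with the standard PyTorch FSDP API, exactly as in an open source script
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# ... regular forward/backward/step loop follows ...
```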
With these updates to SMP's APIs, you can now realize the performance benefits of SageMaker and the SMP library without overhauling your existing PyTorch FSDP training scripts. This paradigm also allows you to use the same code base when training on premises as on SageMaker, simplifying the user experience for customers who train in multiple environments.
For more information on how to enable SMP with your existing PyTorch FSDP training scripts, refer to Get started with SMP.
Integrating tensor parallelism to enable training on massive clusters
This release of SMP also expands PyTorch FSDP's capabilities to include tensor parallelism techniques. One problem with using sharded data parallelism alone is that you can encounter convergence problems as you scale up your cluster size. This is because sharding parameters, gradients, and the optimizer state across data parallel ranks also increases your global batch size; on large clusters, this global batch size can be pushed beyond the threshold below which the model would converge. You need to incorporate an additional parallelism technique that doesn't require an increase in global batch size as you scale your cluster.
To mitigate this problem, SMP v2.0 introduces the ability to compose sharded data parallelism with tensor parallelism. Tensor parallelism allows the cluster size to increase without changing the global batch size or affecting model convergence. With this feature, you can safely increase training throughput by provisioning clusters with 256 nodes or more.
Today, tensor parallelism with PyTorch FSDP is only available with SMP v2. SMP v2 allows you to enable this technique with a few lines of code change and unlock stable training even on large clusters. SMP v2 integrates with Transformer Engine for its implementation of tensor parallelism and makes it compatible with the PyTorch FSDP APIs. You can enable PyTorch FSDP and SMP tensor parallelism simultaneously without making any changes to your PyTorch model or PyTorch FSDP configuration. The following code snippets show how to set up the SMP configuration dictionary in JSON format and add the SMP initialization module torch.sagemaker.init(), which accepts the configuration dictionary in the backend when you start the training job, to your training script.
The SMP configuration is as follows:
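A representative example is shown below; the tensor parallel degree of 8 is illustrative and should be chosen to match your model size and cluster. This dictionary is passed to the training job when it is launched.

```json
{
    "tensor_parallel_degree": 8
}
```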
In your training script, use the following code:
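The following sketch assumes a Hugging Face Transformers model and the torch.sagemaker.transform API described in the SMP documentation; the model name is only a placeholder:

```python
import torch.sagemaker as tsm
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoConfig, AutoModelForCausalLM

tsm.init()  # picks up tensor_parallel_degree from the SMP configuration

# Placeholder model configuration; substitute your own model here
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_config(config)

# Apply SMP tensor parallelism to the model, then wrap with PyTorch FSDP as usual
model = tsm.transform(model)
model = FSDP(model)
```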
To learn more about using tensor parallelism in SMP, refer to the tensor parallelism section of our documentation.
Use advanced features to accelerate model training by up to 20%
In addition to enabling distributed training on clusters with hundreds of instances, SMP also offers optimization techniques that can accelerate model training by up to 20%. In this section, we highlight a few of these optimizations. To learn more, refer to the core features section of our documentation.
Hybrid sharding
Sharded data parallelism is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across devices. This smaller memory footprint allows you to fit a larger model into your cluster or increase the batch size. However, sharded data parallelism also increases the communication requirements of your training job because the sharded model artifacts are frequently gathered from different devices during training. In this way, the degree of sharding is an important configuration that trades off memory consumption and communication overhead.
By default, PyTorch FSDP shards model artifacts across all of the accelerator devices in your cluster. Depending on your training job, this method of sharding could increase communication overhead and create a bottleneck. To help with this, the SMP library offers configurable hybrid sharded data parallelism on top of PyTorch FSDP. This feature allows you to set the degree of sharding that is optimal for your training workload. Simply specify the degree of sharding in a configuration JSON object and include it in your SMP training script.
The SMP configuration is as follows:
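As an illustrative example, the following configuration shards model state across groups of 16 devices and replicates it between groups; the value 16 is a placeholder to tune for your workload:

```json
{
    "hybrid_shard_degree": 16
}
```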
To learn more about the advantages of hybrid sharded data parallelism, refer to Near-linear scaling of gigantic-model training on AWS. For more information on implementing hybrid sharding with your existing FSDP training script, see hybrid sharded data parallelism in our documentation.
Use the SMDDP collective communication operations optimized for AWS infrastructure
You can use the SMP library together with the SageMaker distributed data parallelism (SMDDP) library to accelerate your distributed training workloads. SMDDP includes an optimized AllGather collective communication operation designed for best performance on SageMaker p4d and p4de accelerated instances. In distributed training, collective communication operations are used to synchronize information across GPU workers. AllGather is one of the core collective communication operations typically used in sharded data parallelism to materialize the layer parameters before the forward and backward computation steps. For training jobs that are bottlenecked by communication, faster collective operations can reduce training time and cost with no side effects on convergence.
To use the SMDDP library, you only need to add two lines of code to your training script:
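The two additions are the SMDDP import, which registers the smddp backend with PyTorch, and the process group initialization that selects it in place of nccl (the torch.distributed import is shown only for context):

```python
import torch.distributed as dist

# Line 1: import SMDDP so it registers the "smddp" backend with PyTorch
import smdistributed.dataparallel.torch.torch_smddp

# Line 2: initialize the process group with the SMDDP backend instead of "nccl"
dist.init_process_group(backend="smddp")
```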
In addition to SMP, SMDDP supports open source PyTorch FSDP and DeepSpeed. To learn more about the SMDDP library, see Run distributed training with the SageMaker distributed data parallelism library.
Activation offloading
Typically, the forward pass of model training computes activations at each layer and keeps them in GPU memory until the backward pass for the corresponding layer finishes. These stored activations can consume significant GPU memory during training. Activation offloading is a technique that instead moves these tensors to CPU memory after the forward pass and later fetches them back to the GPU when they are needed. This approach can substantially reduce GPU memory usage during training.
Although PyTorch supports activation offloading, its implementation is inefficient and can cause GPUs to be idle while activations are fetched back from CPU during the backward pass. This can cause significant performance degradation when using activation offloading.
SMP v2 offers an optimized activation offloading algorithm that can improve training performance. SMP's implementation pre-fetches activations before they are needed on the GPU, reducing idle time.
Because SMP is built on top of PyTorch's APIs, enabling optimized activation offloading requires just a few lines of code change. Simply add the relevant configurations (the sm_activation_offloading and activation_loading_horizon parameters) and include them in your training script.
The SMP configuration is as follows:
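An illustrative example using the two parameters named above (the horizon value is a placeholder):

```json
{
    "sm_activation_offloading": true,
    "activation_loading_horizon": 2
}
```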
In the training script, use the following code:
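The following is a sketch of how this can be wired up with the open source PyTorch checkpointing utilities referenced below; build_model and TransformerBlock are placeholders for your own model and transformer layer class:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    offload_wrapper,
)

import torch.sagemaker as tsm
tsm.init()  # reads sm_activation_offloading and activation_loading_horizon

model = build_model()  # placeholder: your model definition
model = FSDP(model)

# Offloading builds on activation checkpointing: checkpoint the transformer
# layers, then wrap the model so their activations are offloaded to CPU
apply_activation_checkpointing(
    model,
    check_fn=lambda module: isinstance(module, TransformerBlock),  # placeholder layer class
)
model = offload_wrapper(model)
```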
To learn more about the open source PyTorch checkpoint tools for activation offloading, see the checkpoint_wrapper.py script in the PyTorch GitHub repository and Activation Checkpointing in the PyTorch blog post Scaling Multimodal Foundation Models in TorchMultimodal with PyTorch Distributed. To learn more about SMP's optimized implementation of activation offloading, see the activation offloading section of our documentation.
Beyond hybrid sharding, SMDDP, and activation offloading, SMP offers additional optimizations that can accelerate your large model training workload, including optimized activation checkpointing and delayed parameter initialization. To learn more, refer to the core features section of our documentation.
Conclusion
As datasets, model sizes, and training clusters continue to grow, efficient distributed training is increasingly critical for timely and affordable model and product delivery. The latest release of the SageMaker model parallel library helps you achieve this by reducing code changes and aligning with PyTorch FSDP APIs, enabling training on massive clusters via tensor parallelism, and providing optimizations that can reduce training time by up to 20%.
To get started with SMP v2, refer to our documentation and our sample notebooks.
About the Authors
Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.
Luis Quintela is the Software Developer Manager for the AWS SageMaker model parallel library. In his spare time, he can be found riding his Harley in the SF Bay Area.
Gautam Kumar is a Software Engineer with AWS AI Deep Learning. He is passionate about building tools and systems for AI. In his spare time, he enjoys biking and reading books.
Rahul Huilgol is a Senior Software Development Engineer in Distributed Deep Learning at Amazon Web Services.