This is a guest post by A.K Roy from Qualcomm AI.
Amazon Elastic Compute Cloud (Amazon EC2) DL2q instances, powered by Qualcomm AI 100 Standard accelerators, can be used to cost-efficiently deploy deep learning (DL) workloads in the cloud. They can also be used to develop and validate the performance and accuracy of DL workloads that will be deployed on Qualcomm devices. DL2q instances are the first instances to bring Qualcomm’s artificial intelligence (AI) technology to the cloud.
With eight Qualcomm AI 100 Standard accelerators and 128 GiB of total accelerator memory, customers can also use DL2q instances to run popular generative AI applications, such as content generation, text summarization, and virtual assistants, as well as classic AI applications for natural language processing and computer vision. Additionally, Qualcomm AI 100 accelerators feature the same AI technology used across smartphones, autonomous driving, personal computers, and extended reality headsets, so DL2q instances can be used to develop and validate these AI workloads before deployment.
New DL2q instance highlights
Each DL2q instance contains eight Qualcomm Cloud AI100 accelerators, with an aggregated performance of over 2.8 PetaOps of Int8 inference performance and 1.4 PetaFlops of FP16 inference performance. The instance has an aggregate of 112 AI cores, an accelerator memory capacity of 128 GB, and a memory bandwidth of 1.1 TB per second.
Each DL2q instance has 96 vCPUs, a system memory capacity of 768 GB, and supports a networking bandwidth of 100 Gbps as well as Amazon Elastic Block Store (Amazon EBS) storage bandwidth of 19 Gbps.
Instance name | vCPUs | Cloud AI100 accelerators | Accelerator memory | Accelerator memory BW (aggregated) | Instance memory | Instance networking | Storage (Amazon EBS) bandwidth
DL2q.24xlarge | 96 | 8 | 128 GB | 1.088 TB/s | 768 GB | 100 Gbps | 19 Gbps
Qualcomm Cloud AI100 accelerator innovation
The Cloud AI100 accelerator system-on-chip (SoC) is a purpose-built, scalable multi-core architecture, supporting a wide range of deep-learning use cases spanning from the datacenter to the edge. The SoC employs scalar, vector, and tensor compute cores with an industry-leading on-die SRAM capacity of 126 MB. The cores are interconnected with a high-bandwidth, low-latency network-on-chip (NoC) mesh.
The AI100 accelerator supports a broad and comprehensive range of models and use cases. The table below highlights the range of model support.
Model category | Number of models | Examples
NLP | 157 | BERT, BART, FasterTransformer, T5, Z-code MOE
Generative AI – NLP | 40 | LLaMA, CodeGen, GPT, OPT, BLOOM, Jais, Luminous, StarCoder, XGen
Generative AI – Image | 3 | Stable Diffusion v1.5 and v2.1, OpenAI CLIP
CV – Image classification | 45 | ViT, ResNet, ResNext, MobileNet, EfficientNet
CV – Object detection | 23 | YOLO v2, v3, v4, v5, and v7, SSD-ResNet, RetinaNet
CV – Other | 15 | LPRNet, Super-resolution/SRGAN, ByteTrack
Automotive networks* | 53 | Perception and LIDAR, pedestrian, lane, and traffic light detection
Total | >300 |
* Most automotive networks are composite networks consisting of a fusion of individual networks.
The large on-die SRAM on the DL2q accelerator enables efficient implementation of advanced performance techniques such as MX6 micro-exponent precision for storing the weights and MX9 micro-exponent precision for accelerator-to-accelerator communication. The micro-exponent technology is described in the following Open Compute Project (OCP) industry announcement: AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI » Open Compute Project.
The instance user can use the following strategy to maximize the performance-per-cost:
Store weights using the MX6 micro-exponent precision in the on-accelerator DDR memory. Using the MX6 precision maximizes the utilization of the available memory capacity and the memory bandwidth to deliver best-in-class throughput and latency.
Compute in FP16 to deliver the required use case accuracy, while using the superior on-chip SRAM and spare TOPs on the card to implement high-performance, low-latency MX6-to-FP16 kernels.
Use an optimized batching strategy and a higher batch size by using the large on-chip SRAM available to maximize the reuse of weights, while keeping the activations on-chip to the maximum extent possible.
DL2q AI Stack and toolchain
The DL2q instance is accompanied by the Qualcomm AI Stack, which delivers a consistent developer experience across Qualcomm AI in the cloud and other Qualcomm products. The same Qualcomm AI Stack and base AI technology runs on the DL2q instances and Qualcomm edge devices, providing customers a consistent developer experience, with a unified API across their cloud, automotive, personal computer, extended reality, and smartphone development environments.
The toolchain enables the instance user to quickly onboard a previously trained model, compile and optimize the model for the instance capabilities, and then deploy the compiled models for production inference use cases in the three steps shown in the following figure.
To learn more about tuning the performance of a model, see the Cloud AI 100 Key Performance Parameters Documentation.
Get started with DL2q instances
In this example, you compile and deploy a pre-trained BERT model from Hugging Face on an EC2 DL2q instance using a pre-built DL2q AMI, in four steps.
You can use either a pre-built Qualcomm DLAMI on the instance or start with an Amazon Linux2 AMI and build your own DL2q AMI with the Cloud AI 100 Platform and Apps SDK available in this Amazon Simple Storage Service (Amazon S3) bucket: s3://ec2-linux-qualcomm-ai100-sdks/latest/.
The steps that follow use the pre-built DL2q AMI, Qualcomm Base AL2 DLAMI.
Use SSH to access your DL2q instance with the Qualcomm Base AL2 DLAMI AMI and follow steps 1 through 4.
Step 1. Set up the environment and install required packages
Install Python 3.8.
Set up the Python 3.8 virtual environment.
Activate the Python 3.8 virtual environment.
Install the required packages, listed in the requirements.txt file available on the Qualcomm public GitHub site.
Import the necessary libraries (see the sketch after this list).
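The environment setup itself is standard Python housekeeping: create and activate a Python 3.8 virtual environment, then install the packages from requirements.txt with pip. The block below is only a minimal sketch of the imports used in the remaining steps; it assumes the transformers, torch, onnx, and numpy packages from that requirements file, and model_card is the Hugging Face model used in this example.

```python
# Minimal import sketch for this walkthrough. Run inside the activated
# Python 3.8 virtual environment after installing requirements.txt.
import numpy as np
import onnx  # assumed requirement for the ONNX FP16 fix in step 2
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hugging Face model card used throughout this example.
model_card = "bert-base-cased"
```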
Step 2. Import the model
Import the model and the tokenizer.
Define a sample input and extract the inputIds and attentionMask.
Convert the model to ONNX, which can then be passed to the compiler.
You will run the model in FP16 precision, so you must check whether the model contains any constants beyond the FP16 range. Pass the model to the fix_onnx_fp16 function to generate the new ONNX file with the required fixes (a minimal sketch of these steps follows).
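Here is a minimal sketch of step 2, assuming a fixed sequence length of 128 and an ONNX export with named input_ids/attention_mask inputs and a logits output; the fix_onnx_fp16 helper comes from the Qualcomm tutorial code and is not reproduced here, so its call is only indicated in a comment.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_card = "bert-base-cased"

# Import the model and its tokenizer from Hugging Face.
model = AutoModelForMaskedLM.from_pretrained(model_card)
model.eval()
model.config.return_dict = False  # return a plain tuple so ONNX tracing is simple
tokenizer = AutoTokenizer.from_pretrained(model_card)

# Define a sample input and extract the inputIds and attentionMask.
sentence = "The dog [MASK] on the mat."
encodings = tokenizer(sentence, max_length=128, truncation=True,
                      padding="max_length", return_tensors="pt")
input_ids = encodings["input_ids"]
attention_mask = encodings["attention_mask"]

# Convert the model to ONNX so it can be passed to the qaic-exec compiler.
onnx_path = f"{model_card}.onnx"
torch.onnx.export(
    model,
    (input_ids, attention_mask),
    onnx_path,
    opset_version=13,
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
)

# fix_onnx_fp16 (from the Qualcomm tutorial code, not shown here) clips any
# constants outside the FP16 range and writes the fixed ONNX file, e.g.:
# fixed_onnx_path = fix_onnx_fp16(onnx_path, ...)
```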
Step 3. Compile the model
The qaic-exec command line interface (CLI) compiler tool is used to compile the model. The input to this compiler is the ONNX file generated in step 2. The compiler produces a binary file (called QPC, for Qualcomm program container) in the path defined by the -aic-binary-dir argument.
In the compile command below, you use four AI compute cores and a batch size of one to compile the model.
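The original compile command is not reproduced in this text, so the following is a sketch of what an equivalent invocation could look like, wrapped in Python to match the other snippets. Only -aic-binary-dir is named above; the qaic-exec install path and the remaining flags are assumptions based on the Cloud AI 100 compiler documentation and should be verified with qaic-exec -h on the instance.

```python
import subprocess

# Sketch of a qaic-exec compile call. Flag names other than -aic-binary-dir
# are assumptions; verify them with /opt/qti-aic/exec/qaic-exec -h.
onnx_model = "bert-base-cased_fix_outofrange_fp16.onnx"  # FP16-fixed ONNX from step 2
qpc_dir = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc"

compile_cmd = [
    "/opt/qti-aic/exec/qaic-exec",        # assumed SDK install path
    f"-m={onnx_model}",
    "-aic-hw",
    "-aic-num-cores=4",                   # four AI compute cores
    "-convert-to-fp16",
    "-onnx-define-symbol=batch_size,1",   # batch size of one
    "-onnx-define-symbol=seq_len,128",
    f"-aic-binary-dir={qpc_dir}",         # where the QPC binary is written
    "-compile-only",
]
subprocess.run(compile_cmd, check=True)
```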
The QPC is generated in the bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc folder.
Step 4. Run the model
Set up a session to run inference on a Cloud AI100 Qualcomm accelerator in the DL2q instance.
The Qualcomm qaic Python library is a set of APIs that provides support for running inference on the Cloud AI100 accelerator.
Use the Session API call to create an instance of a session. The Session API call is the entry point to using the qaic Python library.
Restructure the data from the output buffer with output_shape and output_type.
Decode the output produced (see the sketch following these steps).
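A minimal sketch of step 4 follows. The qaic.Session keyword arguments, the model_input_shape_dict/model_output_shape_dict helpers, and the programqpc.bin file name inside the QPC folder are assumptions drawn from the Cloud AI 100 qaic Python API as best understood; check them against the library’s reference documentation before use.

```python
import numpy as np
import qaic  # Qualcomm Cloud AI 100 Python inference library
from transformers import AutoTokenizer

model_card = "bert-base-cased"
qpc_path = f"{model_card}/generatedModels/{model_card}_fix_outofrange_fp16_qpc"

# Tokenize the same sample sentence used in step 2.
tokenizer = AutoTokenizer.from_pretrained(model_card)
sentence = "The dog [MASK] on the mat."
encodings = tokenizer(sentence, max_length=128, truncation=True,
                      padding="max_length", return_tensors="np")

# Create a session on one of the instance's Cloud AI100 accelerators
# (constructor arguments are assumptions; see the qaic API reference).
bert_sess = qaic.Session(model_path=f"{qpc_path}/programqpc.bin", num_activations=1)
bert_sess.setup()

# Query the expected input/output layouts from the compiled program.
_, ids_type = bert_sess.model_input_shape_dict["input_ids"]
_, mask_type = bert_sess.model_input_shape_dict["attention_mask"]
output_shape, output_type = bert_sess.model_output_shape_dict["logits"]

inputs = {
    "input_ids": encodings["input_ids"].astype(ids_type),
    "attention_mask": encodings["attention_mask"].astype(mask_type),
}
outputs = bert_sess.run(inputs)

# Restructure the data from the output buffer with output_shape and output_type,
# then decode the prediction for the [MASK] position.
logits = np.frombuffer(outputs["logits"], dtype=output_type).reshape(output_shape)
mask_index = int(np.where(encodings["input_ids"][0] == tokenizer.mask_token_id)[0][0])
predicted_id = int(logits[0, mask_index].argmax())
print(tokenizer.decode([predicted_id]))
```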
For the input sentence “The dog [MASK] on the mat.”, the decoded output contains the model’s predictions for the masked token.
That’s it. With just a few steps, you compiled and ran a PyTorch model on an Amazon EC2 DL2q instance. To learn more about onboarding and compiling models on the DL2q instance, see the Cloud AI100 Tutorial Documentation.
To learn more about which DL model architectures are a good fit for AWS DL2q instances and the current model support matrix, see the Qualcomm Cloud AI100 documentation.
Available now
You can launch DL2q instances today in the US West (Oregon) and Europe (Frankfurt) AWS Regions as On-Demand, Reserved, and Spot Instances, or as part of a Savings Plan. As usual with Amazon EC2, you pay only for what you use. For more information, see Amazon EC2 pricing.
DL2q instances can be deployed using AWS Deep Learning AMIs (DLAMI), and container images are available through managed services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS ParallelCluster.
To learn more, visit the Amazon EC2 DL2q instance page, and send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.
About the authors
A.K Roy is a Director of Product Management at Qualcomm for Cloud and Datacenter AI products and solutions. He has over 20 years of experience in product strategy and development, with a current focus on best-in-class performance and performance/$ end-to-end solutions for AI inference in the cloud, for a broad range of use cases, including GenAI, LLMs, Auto, and Hybrid AI.
Jianying Lang is a Principal Solutions Architect at the AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI field. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining techniques from the HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.