This is a guest post by A.K Roy from Qualcomm AI.
Amazon Elastic Compute Cloud (Amazon EC2) DL2q instances, powered by Qualcomm AI 100 Standard accelerators, can be used to cost-efficiently deploy deep learning (DL) workloads in the cloud. They can also be used to develop and validate the performance and accuracy of DL workloads that will be deployed on Qualcomm devices. DL2q instances are the first instances to bring Qualcomm’s artificial intelligence (AI) technology to the cloud.
With eight Qualcomm AI 100 Standard accelerators and 128 GiB of total accelerator memory, customers can also use DL2q instances to run popular generative AI applications, such as content generation, text summarization, and virtual assistants, as well as classic AI applications for natural language processing and computer vision. Additionally, Qualcomm AI 100 accelerators feature the same AI technology used across smartphones, autonomous driving, personal computers, and extended reality headsets, so DL2q instances can be used to develop and validate these AI workloads before deployment.
New DL2q instance highlights
Each DL2q instance contains eight Qualcomm Cloud AI100 accelerators, with an aggregated performance of over 2.8 PetaOps of Int8 inference performance and 1.4 PetaFlops of FP16 inference performance. The instance has an aggregate of 112 AI cores, an accelerator memory capacity of 128 GB, and a memory bandwidth of 1.1 TB per second.
Each DL2q instance has 96 vCPUs, a system memory capacity of 768 GB, and supports a networking bandwidth of 100 Gbps as well as Amazon Elastic Block Store (Amazon EBS) storage bandwidth of 19 Gbps.
Instance name | vCPUs | Cloud AI100 accelerators | Accelerator memory | Accelerator memory BW (aggregated) | Instance memory | Instance networking | Storage (Amazon EBS) bandwidth
DL2q.24xlarge | 96 | 8 | 128 GB | 1.088 TB/s | 768 GB | 100 Gbps | 19 Gbps
Qualcomm Cloud AI100 accelerator innovation
The Cloud AI100 accelerator system-on-chip (SoC) is a purpose-built, scalable multi-core architecture, supporting a wide range of deep-learning use cases spanning from the datacenter to the edge. The SoC employs scalar, vector, and tensor compute cores with an industry-leading on-die SRAM capacity of 126 MB. The cores are interconnected with a high-bandwidth, low-latency network-on-chip (NoC) mesh.
The AI100 accelerator supports a broad and comprehensive range of models and use cases. The table below highlights the range of model support.
Model category | Number of models | Examples
NLP | 157 | BERT, BART, FasterTransformer, T5, Z-code MOE
Generative AI – NLP | 40 | LLaMA, CodeGen, GPT, OPT, BLOOM, Jais, Luminous, StarCoder, XGen
Generative AI – Image | 3 | Stable Diffusion v1.5 and v2.1, OpenAI CLIP
CV – Image classification | 45 | ViT, ResNet, ResNext, MobileNet, EfficientNet
CV – Object detection | 23 | YOLO v2, v3, v4, v5, and v7, SSD-ResNet, RetinaNet
CV – Other | 15 | LPRNet, Super-resolution/SRGAN, ByteTrack
Automotive networks* | 53 | Perception and LIDAR, pedestrian, lane, and traffic light detection
Total | >300 |
* Most automotive networks are composite networks consisting of a fusion of individual networks.
The large on-die SRAM on the DL2q accelerator enables efficient implementation of advanced performance techniques such as MX6 micro-exponent precision for storing the weights and MX9 micro-exponent precision for accelerator-to-accelerator communication. The micro-exponent technology is described in the following Open Compute Project (OCP) industry announcement: AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI » Open Compute Project.
The instance user can use the following strategy to maximize the performance-per-cost:
Store weights using the MX6 micro-exponent precision in the on-accelerator DDR memory. Using the MX6 precision maximizes the utilization of the available memory capacity and the memory bandwidth to deliver best-in-class throughput and latency.
Compute in FP16 to deliver the required use case accuracy, while using the superior on-chip SRAM and spare TOPs on the card to implement high-performance, low-latency MX6-to-FP16 kernels.
Use an optimized batching strategy and a higher batch size by using the large on-chip SRAM available to maximize the reuse of weights, while keeping the activations on-chip to the maximum extent possible.
DL2q AI Stack and toolchain
The DL2q instance is accompanied by the Qualcomm AI Stack, which delivers a consistent developer experience across Qualcomm AI in the cloud and other Qualcomm products. The same Qualcomm AI Stack and base AI technology runs on the DL2q instances and Qualcomm edge devices, providing customers a consistent developer experience, with a unified API across their cloud, automotive, personal computer, extended reality, and smartphone development environments.
The toolchain enables the instance user to quickly onboard a previously trained model, compile and optimize the model for the instance capabilities, and then deploy the compiled models for production inference use cases in the three steps shown in the following figure.
To learn more about tuning the performance of a model, see the Cloud AI 100 Key Performance Parameters Documentation.
Get started with DL2q instances
In this example, you compile and deploy a pre-trained BERT model from Hugging Face on an EC2 DL2q instance using a pre-built DL2q AMI, in four steps.
You can use either a pre-built Qualcomm DLAMI on the instance or start with an Amazon Linux2 AMI and build your own DL2q AMI with the Cloud AI 100 Platform and Apps SDK available in this Amazon Simple Storage Service (Amazon S3) bucket: s3://ec2-linux-qualcomm-ai100-sdks/latest/.
The steps that follow use the pre-built DL2q AMI, Qualcomm Base AL2 DLAMI.
Use SSH to access your DL2q instance with the Qualcomm Base AL2 DLAMI AMI and follow steps 1 through 4.
Step 1. Set up the environment and install required packages
Install Python 3.8.
Set up the Python 3.8 virtual environment.
Activate the Python 3.8 virtual environment.
Install the required packages, listed in the requirements.txt file available on the Qualcomm public GitHub site.
Import the necessary libraries (see the sketch after this list).
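The environment setup itself is standard Python housekeeping: create and activate a Python 3.8 virtual environment, then install the packages from requirements.txt with pip. The block below is only a minimal sketch of the imports used in the remaining steps; it assumes the transformers, torch, onnx, and numpy packages from that requirements file, and model_card is the Hugging Face model used in this example.

```python
# Minimal import sketch for this walkthrough. Run inside the activated
# Python 3.8 virtual environment after installing requirements.txt.
import numpy as np
import onnx  # assumed requirement for the ONNX FP16 fix in step 2
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hugging Face model card used throughout this example.
model_card = "bert-base-cased"
```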
Step 2. Import the model
Import the model and the tokenizer.
Define a sample input and extract the inputIds and attentionMask.
Convert the model to ONNX, which can then be passed to the compiler.
You will run the model in FP16 precision, so you must check whether the model contains any constants beyond the FP16 range. Pass the model to the fix_onnx_fp16 function to generate the new ONNX file with the required fixes (a minimal sketch of these steps follows).
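Here is a minimal sketch of step 2, assuming a fixed sequence length of 128 and an ONNX export with named input_ids/attention_mask inputs and a logits output; the fix_onnx_fp16 helper comes from the Qualcomm tutorial code and is not reproduced here, so its call is only indicated in a comment.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_card = "bert-base-cased"

# Import the model and its tokenizer from Hugging Face.
model = AutoModelForMaskedLM.from_pretrained(model_card)
model.eval()
model.config.return_dict = False  # return a plain tuple so ONNX tracing is simple
tokenizer = AutoTokenizer.from_pretrained(model_card)

# Define a sample input and extract the inputIds and attentionMask.
sentence = "The dog [MASK] on the mat."
encodings = tokenizer(sentence, max_length=128, truncation=True,
                      padding="max_length", return_tensors="pt")
input_ids = encodings["input_ids"]
attention_mask = encodings["attention_mask"]

# Convert the model to ONNX so it can be passed to the qaic-exec compiler.
onnx_path = f"{model_card}.onnx"
torch.onnx.export(
    model,
    (input_ids, attention_mask),
    onnx_path,
    opset_version=13,
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
)

# fix_onnx_fp16 (from the Qualcomm tutorial code, not shown here) clips any
# constants outside the FP16 range and writes the fixed ONNX file, e.g.:
# fixed_onnx_path = fix_onnx_fp16(onnx_path, ...)
```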
Step 3. Compile the model
The qaic-exec command line interface (CLI) compiler tool is used to compile the model. The input to this compiler is the ONNX file generated in step 2. The compiler produces a binary file (called QPC, for Qualcomm program container) in the path defined by the -aic-binary-dir argument.
In the compile command below, you use four AI compute cores and a batch size of one to compile the model.
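The original compile command is not reproduced in this text, so the following is a sketch of what an equivalent invocation could look like, wrapped in Python to match the other snippets. Only -aic-binary-dir is named above; the qaic-exec install path and the remaining flags are assumptions based on the Cloud AI 100 compiler documentation and should be verified with qaic-exec -h on the instance.

```python
import subprocess

# Sketch of a qaic-exec compile call. Flag names other than -aic-binary-dir
# are assumptions; verify them with /opt/qti-aic/exec/qaic-exec -h.
onnx_model = "bert-base-cased_fix_outofrange_fp16.onnx"  # FP16-fixed ONNX from step 2
qpc_dir = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc"

compile_cmd = [
    "/opt/qti-aic/exec/qaic-exec",        # assumed SDK install path
    f"-m={onnx_model}",
    "-aic-hw",
    "-aic-num-cores=4",                   # four AI compute cores
    "-convert-to-fp16",
    "-onnx-define-symbol=batch_size,1",   # batch size of one
    "-onnx-define-symbol=seq_len,128",
    f"-aic-binary-dir={qpc_dir}",         # where the QPC binary is written
    "-compile-only",
]
subprocess.run(compile_cmd, check=True)
```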
The QPC is generated in the bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc folder.
Step 4. Run the model
Set up a session to run inference on a Cloud AI100 Qualcomm accelerator in the DL2q instance.
The Qualcomm qaic Python library is a set of APIs that provides support for running inference on the Cloud AI100 accelerator.
Use the Session API call to create an instance of a session. The Session API call is the entry point to using the qaic Python library.
Restructure the data from the output buffer with output_shape and output_type.
Decode the output produced (see the sketch following these steps).
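A minimal sketch of step 4 follows. The qaic.Session keyword arguments, the model_input_shape_dict/model_output_shape_dict helpers, and the programqpc.bin file name inside the QPC folder are assumptions drawn from the Cloud AI 100 qaic Python API as best understood; check them against the library’s reference documentation before use.

```python
import numpy as np
import qaic  # Qualcomm Cloud AI 100 Python inference library
from transformers import AutoTokenizer

model_card = "bert-base-cased"
qpc_path = f"{model_card}/generatedModels/{model_card}_fix_outofrange_fp16_qpc"

# Tokenize the same sample sentence used in step 2.
tokenizer = AutoTokenizer.from_pretrained(model_card)
sentence = "The dog [MASK] on the mat."
encodings = tokenizer(sentence, max_length=128, truncation=True,
                      padding="max_length", return_tensors="np")

# Create a session on one of the instance's Cloud AI100 accelerators
# (constructor arguments are assumptions; see the qaic API reference).
bert_sess = qaic.Session(model_path=f"{qpc_path}/programqpc.bin", num_activations=1)
bert_sess.setup()

# Query the expected input/output layouts from the compiled program.
_, ids_type = bert_sess.model_input_shape_dict["input_ids"]
_, mask_type = bert_sess.model_input_shape_dict["attention_mask"]
output_shape, output_type = bert_sess.model_output_shape_dict["logits"]

inputs = {
    "input_ids": encodings["input_ids"].astype(ids_type),
    "attention_mask": encodings["attention_mask"].astype(mask_type),
}
outputs = bert_sess.run(inputs)

# Restructure the data from the output buffer with output_shape and output_type,
# then decode the prediction for the [MASK] position.
logits = np.frombuffer(outputs["logits"], dtype=output_type).reshape(output_shape)
mask_index = int(np.where(encodings["input_ids"][0] == tokenizer.mask_token_id)[0][0])
predicted_id = int(logits[0, mask_index].argmax())
print(tokenizer.decode([predicted_id]))
```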
For the input sentence “The dog [MASK] on the mat.”, the decoded output contains the model’s predictions for the masked token.
That’s it. With just a few steps, you compiled and ran a PyTorch model on an Amazon EC2 DL2q instance. To learn more about onboarding and compiling models on the DL2q instance, see the Cloud AI100 Tutorial Documentation.
To learn more about which DL model architectures are a good fit for AWS DL2q instances and the current model support matrix, see the Qualcomm Cloud AI100 documentation.
Available now
You can launch DL2q instances today in the US West (Oregon) and Europe (Frankfurt) AWS Regions as On-Demand, Reserved, and Spot Instances, or as part of a Savings Plan. As usual with Amazon EC2, you pay only for what you use. For more information, see Amazon EC2 pricing.
DL2q instances can be deployed using AWS Deep Learning AMIs (DLAMI), and container images are available through managed services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS ParallelCluster.
To learn more, visit the Amazon EC2 DL2q instance page, and send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.
About the authors
A.K Roy is a Director of Product Management at Qualcomm for Cloud and Datacenter AI products and solutions. He has over 20 years of experience in product strategy and development, with a current focus on best-in-class performance and performance/$ end-to-end solutions for AI inference in the cloud, for a broad range of use cases, including GenAI, LLMs, Auto, and Hybrid AI.
Jianying Lang is a Principal Solutions Architect at the AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI field. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining techniques from the HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.