Talk to your slide deck using multimodal foundation models hosted on Amazon Bedrock and Amazon SageMaker – Part 1

[ad_1]

With the arrival of generative AI, in the present day’s basis fashions (FMs), equivalent to the massive language fashions (LLMs) Claude 2 and Llama 2, can carry out a variety of generative duties equivalent to query answering, summarization, and content material creation on textual content information. Nonetheless, real-world information exists in a number of modalities, equivalent to textual content, photographs, video, and audio. Take a PowerPoint slide deck, for instance. It may include data within the type of textual content, or embedded in graphs, tables, and photos.

On this submit, we current an answer that makes use of multimodal FMs such because the Amazon Titan Multimodal Embeddings mannequin and LLaVA 1.5 and AWS providers together with Amazon Bedrock and Amazon SageMaker to carry out comparable generative duties on multimodal information.

Resolution overview

The answer offers an implementation for answering questions utilizing data contained within the textual content and visible parts of a slide deck. The design depends on the idea of Retrieval Augmented Era (RAG). Historically, RAG has been related to textual information that may be processed by LLMs. On this submit, we prolong RAG to incorporate photographs as properly. This offers a robust search functionality to extract contextually related content material from visible parts like tables and graphs together with textual content.

There are other ways to design a RAG resolution that features photographs. We now have offered one method right here and can comply with up with an alternate method within the second submit of this three-part sequence.

This resolution consists of the next elements:

Amazon Titan Multimodal Embeddings mannequin – This FM is used to generate embeddings for the content material within the slide deck used on this submit. As a multimodal mannequin, this Titan mannequin can course of textual content, photographs, or a mixture as enter and generate embeddings. The Titan Multimodal Embeddings mannequin generates vectors (embeddings) of 1,024 dimensions and is accessed through Amazon Bedrock.
Giant Language and Imaginative and prescient Assistant (LLaVA) – LLaVA is an open supply multimodal mannequin for visible and language understanding and is used to interpret the information within the slides, together with visible parts equivalent to graphs and tables. We use the 7-billion parameter model LLaVA 1.5-7b on this resolution.
Amazon SageMaker – The LLaVA mannequin is deployed on a SageMaker endpoint utilizing SageMaker internet hosting providers, and we use the ensuing endpoint to run inferences in opposition to the LLaVA mannequin. We additionally use SageMaker notebooks to orchestrate and show this resolution finish to finish.
Amazon OpenSearch Serverless – OpenSearch Serverless is an on-demand serverless configuration for Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the Titan Multimodal Embeddings mannequin. An index created within the OpenSearch Serverless assortment serves because the vector retailer for our RAG resolution.
Amazon OpenSearch Ingestion (OSI) – OSI is a completely managed, serverless information collector that delivers information to OpenSearch Service domains and OpenSearch Serverless collections. On this submit, we use an OSI pipeline to ship information to the OpenSearch Serverless vector retailer.

Resolution structure

The answer design consists of two elements: ingestion and person interplay. Throughout ingestion, we course of the enter slide deck by changing every slide into a picture, generate embeddings for these photographs, after which populate the vector information retailer. These steps are accomplished previous to the person interplay steps.

Within the person interplay section, a query from the person is transformed into embeddings and a similarity search is run on the vector database to discover a slide that might doubtlessly include solutions to person query. We then present this slide (within the type of a picture file) to the LLaVA mannequin and the person query as a immediate to generate a solution to the question. All of the code for this submit is offered within the GitHub repo.

The next diagram illustrates the ingestion structure.

The workflow steps are as follows:

Slides are transformed to picture recordsdata (one per slide) in JPG format and handed to the Titan Multimodal Embeddings mannequin to generate embeddings. On this submit, we use the slide deck titled Practice and deploy Steady Diffusion utilizing AWS Trainium & AWS Inferentia from the AWS Summit in Toronto, June 2023, to show the answer. The pattern deck has 31 slides, so we generate 31 units of vector embeddings, every with 1,024 dimensions. We add further metadata fields to those generated vector embeddings and create a JSON file. These further metadata fields can be utilized to carry out wealthy search queries utilizing OpenSearch’s highly effective search capabilities.
The generated embeddings are put collectively in a single JSON file that’s uploaded to Amazon Easy Storage Service (Amazon S3).
By way of Amazon S3 Occasion Notifications, an occasion is put in an Amazon Easy Queue Service (Amazon SQS) queue.
This occasion within the SQS queue acts as a set off to run the OSI pipeline, which in flip ingests the information (JSON file) as paperwork into the OpenSearch Serverless index. Observe that the OpenSearch Serverless index is configured because the sink for this pipeline and is created as a part of the OpenSearch Serverless assortment.

The next diagram illustrates the person interplay structure.

The workflow steps are as follows:

A person submits a query associated to the slide deck that has been ingested.
The person enter is transformed into embeddings utilizing the Titan Multimodal Embeddings mannequin accessed through Amazon Bedrock. An OpenSearch vector search is carried out utilizing these embeddings. We carry out a k-nearest neighbor (okay=1) search to retrieve essentially the most related embedding matching the person question. Setting okay=1 retrieves essentially the most related slide to the person query.
The metadata of the response from OpenSearch Serverless accommodates a path to the picture akin to essentially the most related slide.
A immediate is created by combining the person query and the picture path and supplied to LLaVA hosted on SageMaker. The LLaVA mannequin is ready to perceive the person query and reply it by analyzing the information within the picture.
The results of this inference is returned to the person.

These steps are mentioned intimately within the following sections. See the Outcomes part for screenshots and particulars on the output.

Stipulations

To implement the answer supplied on this submit, you must have an AWS account and familiarity with FMs, Amazon Bedrock, SageMaker, and OpenSearch Service.

This resolution makes use of the Titan Multimodal Embeddings mannequin. Be certain that this mannequin is enabled to be used in Amazon Bedrock. On the Amazon Bedrock console, select Mannequin entry within the navigation pane. If Titan Multimodal Embeddings is enabled, the entry standing will state Entry granted.

If the mannequin shouldn’t be out there, allow entry to the mannequin by selecting Handle Mannequin Entry, deciding on Titan Multimodal Embeddings G1, and selecting Request mannequin entry. The mannequin is enabled to be used instantly.

Use an AWS CloudFormation template to create the answer stack

Use one of many following AWS CloudFormation templates (relying in your Area) to launch the answer sources.

AWS Area
Hyperlink

us-east-1

us-west-2

After the stack is created efficiently, navigate to the stack’s Outputs tab on the AWS CloudFormation console and word the worth for MultimodalCollectionEndpoint, which we use in subsequent steps.

The CloudFormation template creates the next sources:

IAM roles – The next AWS Identification and Entry Administration (IAM) roles are created. Replace these roles to use least-privilege permissions.

SMExecutionRole with Amazon S3, SageMaker, OpenSearch Service, and Bedrock full entry.
OSPipelineExecutionRole with entry to particular Amazon SQS and OSI actions.

SageMaker pocket book – All of the code for this submit is run through this pocket book.
OpenSearch Serverless assortment – That is the vector database for storing and retrieving embeddings.
OSI pipeline – That is the pipeline for ingesting information into OpenSearch Serverless.
S3 bucket – All information for this submit is saved on this bucket.
SQS queue – The occasions for triggering the OSI pipeline run are put on this queue.

The CloudFormation template configures the OSI pipeline with Amazon S3 and Amazon SQS processing as supply and an OpenSearch Serverless index as sink. Any objects created within the specified S3 bucket and prefix (multimodal/osi-embeddings-json) will set off SQS notifications, that are utilized by the OSI pipeline to ingest information into OpenSearch Serverless.

The CloudFormation template additionally creates community, encryption, and information entry insurance policies required for the OpenSearch Serverless assortment. Replace these insurance policies to use least-privilege permissions.

Observe that the CloudFormation template identify is referenced in SageMaker notebooks. If the default template identify is modified, ensure you replace the identical in globals.py

Take a look at the answer

After the prerequisite steps are full and the CloudFormation stack has been created efficiently, you’re now prepared to check the answer:

On the SageMaker console, select Notebooks within the navigation pane.
Choose the MultimodalNotebookInstance pocket book occasion and select Open JupyterLab.
In File Browser, traverse to the notebooks folder to see the notebooks and supporting recordsdata.

The notebooks are numbered within the sequence during which they’re run. Directions and feedback in every pocket book describe the actions carried out by that pocket book. We run these notebooks one after the other.

Select 0_deploy_llava.ipynb to open it in JupyterLab.
On the Run menu, select Run All Cells to run the code on this pocket book.

This pocket book deploys the LLaVA-v1.5-7B mannequin to a SageMaker endpoint. On this pocket book, we obtain the LLaVA-v1.5-7B mannequin from HuggingFace Hub, change the inference.py script with llava_inference.py, and create a mannequin.tar.gz file for this mannequin. The mannequin.tar.gz file is uploaded to Amazon S3 and used for deploying the mannequin on SageMaker endpoint. The llava_inference.py script has further code to permit studying a picture file from Amazon S3 and operating inference on it.

Select 1_data_prep.ipynb to open it in JupyterLab.
On the Run menu, select Run All Cells to run the code on this pocket book.

This pocket book downloads the slide deck, converts every slide into JPG file format, and uploads these to the S3 bucket used for this submit.

Select 2_data_ingestion.ipynb to open it in JupyterLab.
On the Run menu, select Run All Cells to run the code on this pocket book.

We do the next on this pocket book:

We create an index within the OpenSearch Serverless assortment. This index shops the embeddings information for the slide deck. See the next code:

session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
hosts = [{‘host’: host, ‘port’: 443}],
http_auth = auth,
use_ssl = True,
verify_certs = True,
connection_class = RequestsHttpConnection,
pool_maxsize = 20
)

index_body = “””
{
“settings”: {
“index.knn”: true
},
“mappings”: {
“properties”: {
“vector_embedding”: {
“kind”: “knn_vector”,
“dimension”: 1024,
“methodology”: {
“identify”: “hnsw”,
“engine”: “nmslib”,
“parameters”: {}
}
},
“image_path”: {
“kind”: “textual content”
},
“metadata”: {
“properties”: {
“slide_filename”: {
“kind”: “textual content”
},
“model_id”: {
“kind”: “textual content”
},
“slide_description”: {
“kind”: “textual content”
}
}
}
}
}
}
“””
index_body = json.masses(index_body)
attempt:
response = os_client.indices.create(index_name, physique=index_body)
logger.information(f”response acquired for the create index -> {response}”)
besides Exception as e:
logger.error(f”error in creating index={index_name}, exception={e}”)

We use Titan Multimodal Embeddings mannequin to transform the JPG photographs created within the earlier pocket book into vector embeddings. These embeddings and extra metadata (such because the S3 path of the picture file) are saved in a JSON file and uploaded to Amazon S3. Observe {that a} single JSON file is created, which accommodates paperwork for all of the slides (photographs) transformed into embeddings. The next code snippet reveals how a picture (within the type of a Base64 encoded string) is transformed into embeddings:

def get_multimodal_embeddings(bedrock: botocore.shopper, picture: str) -> np.ndarray:
physique = json.dumps(dict(inputImage=picture))
attempt:
response = bedrock.invoke_model(
physique=physique, modelId=g.FMC_MODEL_ID, settle for=g.ACCEPT_ENCODING, contentType=g.CONTENT_ENCODING
)
response_body = json.masses(response.get(“physique”).learn())
embeddings = np.array([response_body.get(“embedding”)]).astype(np.float32)
besides Exception as e:
logger.error(f”exception whereas picture(truncated)={picture[:10]}, exception={e}”)
embeddings = None

return embeddings

This motion triggers the OpenSearch Ingestion pipeline, which processes the file and ingests it into the OpenSearch Serverless index. The next is a pattern of the JSON file created. (A vector with 4 dimensions is proven within the instance code. The Titan Multimodal Embeddings mannequin generates 1,024 dimensions.)

[
{
“image_path”: “s3://<your-bucket-name>/path/to/file1.json”,
“metadata”: {
“slide_filename”: “mypowerpoint1.pptx”,
“model_id”: “amazon.titan-embed-image-v1”,
“slide_description”: “This is a test slide deck”
},
“vector_embedding”: [
657.6052386529958,
0.8865137233123771,
763.870264592026
]
}
]

Select 3_rag_inference.ipynb to open it in JupyterLab.
On the Run menu, select Run All Cells to run the code on this pocket book.

This pocket book implements the RAG resolution: we convert the person query into embeddings, discover a comparable picture (slide) from the vector database, and supply the retrieved picture to LLaVA to generate a solution to the person query. We use the next immediate template:

prompt_template: str = “””Faux that you’re a useful assistant that solutions questions on content material in a slide deck.
Utilizing solely the knowledge within the supplied slide picture reply the next query.
If you don’t discover the reply within the picture then say I didn’t discover the reply to this query within the slide deck.

{query}
“””

The next code snippet offers the RAG workflow:

# create immediate and convert to embeddings
query: str = “As per the AI/ML flywheel, what do the AWS AI/ML providers present?”
immediate = prompt_template.format(query=query)
text_embeddings = get_text_embeddings(bedrock, query)

# vector db search
vector_db_response: Dict = find_similar_data(text_embeddings)

# obtain picture for native pocket book show
s3_img_path = vector_db_response.get(‘hits’, {}).get(‘hits’)[0].get(‘_source’).get(‘image_path’)
logger.information(f”going to reply the query=”{query}” utilizing the picture “{s3_img_path}””)

!aws s3 cp {s3_img_path} .
local_img_path = os.path.basename(s3_img_path)
show(Picture(filename=local_img_path))

# Ask LLaVA
information = {
“picture” : s3_img_path,
“query” : immediate,
“temperature” : 0.1,
}
output = predictor.predict(information)
logger.information(f”Picture={s3_img_path}nQuestion: {query}nAnswer: {output}nn”)

Outcomes

The next is a compilation of some person questions and responses generated by our implementation. The Query column captures the person query, and the Reply column is the textual response generated by LLaVA. Picture is the k-nearest (okay=1) slide match returned by the OpenSearch Serverless vector search. This slide is interpreted by LLaVA to supply the reply.

Multimodal RAG outcomes

Query
Reply
Picture

How does Inf2 evaluate in efficiency to comparable EC2 situations? I would like numbers.
In line with the slide deck, Inf2 situations by AWS Inferentia2 supply as much as 4x increased throughput and 10x decrease latency in comparison with comparable EC2 situations.

As per the AI/ML flywheel, what do the AWS AI/ML providers present?
The AWS AI/ML providers present higher $/perfer capabilities, new capabilities, and funding in innovation.

In comparison with GPT-2, what number of extra parameters does GPT-3 have? What’s the numerical distinction between the parameter measurement of GPT-2 and GPT-3?
In line with the slide, GPT-3 has 175 billion parameters, whereas GPT-2 has 1.5 billion parameters. The numerical distinction between the parameter measurement of GPT-2 and GPT-3 is 173.5 billion.

What are quarks in particle physics?
I didn’t discover the reply to this query within the slide deck.

Be at liberty to increase this resolution to your slide decks. Merely replace the SLIDE_DECK variable in globals.py with a URL to your slide deck and run the ingestion steps detailed within the earlier part.

Tip

You need to use OpenSearch Dashboards to work together with the OpenSearch API to run fast checks in your index and ingested information. The next screenshot reveals an OpenSearch dashboard GET instance.

Clear up

To keep away from incurring future expenses, delete the sources you created. You are able to do this by deleting the stack through the CloudFormation console.

Moreover, delete the SageMaker inference endpoint created for LLaVA inferencing. You are able to do this by uncommenting the cleanup step in 3_rag_inference.ipynb and operating the cell, or by deleting the endpoint through the SageMaker console: select Inference and Endpoints within the navigation pane, then choose the endpoint and delete it.

Conclusion

Enterprises generate new content material on a regular basis, and slide decks are a typical mechanism used to share and disseminate data internally with the group and externally with clients or at conferences. Over time, wealthy data can stay buried and hidden in non-text modalities like graphs and tables in these slide decks. You need to use this resolution and the facility of multimodal FMs such because the Titan Multimodal Embeddings mannequin and LLaVA to find new data or uncover new views on content material in slide decks.

We encourage you to study extra by exploring Amazon SageMaker JumpStart, Amazon Titan fashions, Amazon Bedrock, and OpenSearch Service, and constructing an answer utilizing the pattern implementation supplied on this submit.

Look out for 2 further posts as a part of this sequence. Half 2 covers one other method you can take to speak to your slide deck. This method generates and shops LLaVA inferences and makes use of these saved inferences to answer person queries. Half 3 compares the 2 approaches.

In regards to the authors

Amit Arora is an AI and ML Specialist Architect at Amazon Internet Companies, serving to enterprise clients use cloud-based machine studying providers to quickly scale their improvements. He’s additionally an adjunct lecturer within the MS information science and analytics program at Georgetown College in Washington D.C.

Manju Prasad is a Senior Options Architect inside Strategic Accounts at Amazon Internet Companies. She focuses on offering technical steering in a wide range of domains, together with AI/ML to a marquee M&E buyer. Previous to becoming a member of AWS, she designed and constructed options for firms within the monetary providers sector and in addition for a startup.

Archana Inapudi is a Senior Options Architect at AWS supporting strategic clients. She has over a decade of expertise serving to clients design and construct information analytics and database options. She is enthusiastic about utilizing expertise to supply worth to clients and obtain enterprise outcomes.

Antara Raisa is an AI and ML Options Architect at Amazon Internet Companies supporting strategic clients based mostly out of Dallas, Texas. She additionally has earlier expertise working with massive enterprise companions at AWS, the place she labored as a Companion Success Options Architect for digital native clients.