Deploy Meta-Llama-3 models on AWS SageMaker

Lavaraja Padala
Apr 22, 2024


Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes. The Llama 3 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks.

Important note: the Llama 3 license is not a standard open-source license like MIT or GPL. While it allows certain freedoms in using and modifying the materials, it also imposes restrictions and requirements, particularly around attribution and commercial use. It is closer to a proprietary license with some open elements.

If you want to use this model, you should carefully review and comply with all the terms of the license agreement. If you have any doubts or concerns, consider seeking advice from your legal team to fully understand the Meta license agreement and its implications for your organization when assessing the model for your use cases.

Prerequisites:

Step 1:

  1. Request access to the Meta-Llama-3 model repository on Hugging Face. This requires signing up for a Hugging Face account.
  2. Once access is granted, generate a Hugging Face access token to use the model: https://huggingface.co/settings/tokens (you can verify that access was granted with the check below).
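Before proceeding, it helps to confirm that your access request was actually approved; here is a minimal sketch using the huggingface_hub client (the token string is a placeholder):

from huggingface_hub import model_info

# Raises an error for this gated repo if your access request
# has not been approved yet.
info = model_info("meta-llama/Meta-Llama-3-8B", token="<YOUR_HF_TOKEN>")
print(info.id)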

Step 2:

Currently, Llama 3 is available in four variants: Meta-Llama-3-8B, Meta-Llama-3-8B-Instruct, Meta-Llama-3-70B, and Meta-Llama-3-70B-Instruct. The supported transformers version for running inference with these models is 4.40.0.dev0, as declared in the model's config.json:

"transformers_version": "4.40.0.dev0"
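A quick way to confirm which transformers version is installed in your environment:

import transformers

print(transformers.__version__)  # should be 4.40.0.dev0 or later for Llama 3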

Deploying on SageMaker:

As of today, April 22nd, 2024, the most recent Hugging Face Text Generation Inference (TGI) DLC container available on SageMaker uses transformers version 4.39.3. To deploy the model on SageMaker, the existing DLC container therefore needs to be extended with the newer version (4.40.0.dev0) of the transformers library.
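For reference, you can check which TGI DLC image URI the SageMaker Python SDK resolves to with get_huggingface_llm_image_uri; the version argument below is illustrative:

from sagemaker.huggingface import get_huggingface_llm_image_uri

# Resolve a Hugging Face TGI DLC image URI for us-east-1.
print(get_huggingface_llm_image_uri("huggingface", region="us-east-1", version="2.0.0"))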

Steps to extend the DLC container:

Pull the latest DLC container image from the public ECR registry:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi2.0.0-gpu-py310-cu121-ubuntu22.04-v2.0

Create a Dockerfile that updates the image with the latest transformers version:

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi2.0.0-gpu-py310-cu121-ubuntu22.04-v2.0
RUN pip install -U transformers

Build the updated image:

docker build -t huggingface-pytorch-tgi-inference:2.1.1-tgi2.0.0-gpu-py310-cu121-ubuntu22.04-v2.0 .

Push the custom Docker image to an ECR repository in your AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com
docker tag huggingface-pytorch-tgi-inference:2.1.1-tgi2.0.0-gpu-py310-cu121-ubuntu22.04-v2.0 <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:latest
docker push <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:latest
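If the target repository does not exist in your account yet, the push will fail; a minimal boto3 sketch to create it first (the repository name mirrors the image name used above):

import boto3

ecr = boto3.client("ecr", region_name="us-east-1")
try:
    ecr.create_repository(repositoryName="huggingface-pytorch-tgi-inference")
except ecr.exceptions.RepositoryAlreadyExistsException:
    pass  # repository already exists, nothing to do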

Create an endpoint on SageMaker:


import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # fall back to a named role when running outside a SageMaker notebook
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'meta-llama/Meta-Llama-3-8B',
    'SM_NUM_GPUS': json.dumps(4),  # shard across the 4 GPUs of ml.g5.12xlarge
    'HUGGING_FACE_HUB_TOKEN': '<REPLACE WITH YOUR TOKEN>',
    # 'HF_MODEL_QUANTIZE': "bitsandbytes",  # enable quantization to run with lower precision
}

assert hub['HUGGING_FACE_HUB_TOKEN'] != '<REPLACE WITH YOUR TOKEN>', "You have to provide a token."

# create Hugging Face Model Class, pointing at the extended image in your ECR
huggingface_model = HuggingFaceModel(
    image_uri="<AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:latest",
    # image_uri=get_huggingface_llm_image_uri("huggingface", version="1.4.2"),  # stock DLC (older transformers)
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=300,
)

Inference:

# send request
predictor.predict({
    "inputs": "My name is Julien and I like to",
})
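TGI also accepts a parameters object alongside the inputs to control generation; the values below are illustrative:

predictor.predict({
    "inputs": "My name is Julien and I like to",
    "parameters": {
        "max_new_tokens": 128,  # cap the length of the generated continuation
        "temperature": 0.7,     # sampling temperature
        "top_p": 0.9,           # nucleus sampling
        "do_sample": True,
    },
})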

Output:

[{'generated_text': 'My name is Julien and I like to drink craft beers. I’ve also been a photographer since I was 17-year-old. I documented my beer travels and started this website 5 years ago to equip beer drinkers around the globe to discover and enjoy great beers.\nNew wave of Craft Beer in old century buildings and the role of UNESCO\nSarah Sajoo of UNESCO London interviewed most of the organisers of festivals and societies behind historical surroundings in co-operation with local craft breweries.\nUNESCO World Day for Cultural Diversity\nCultural diversity together with'}]
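When you are done testing, delete the model and endpoint so the ml.g5.12xlarge instance stops accruing charges:

# clean up the model and endpoint
predictor.delete_model()
predictor.delete_endpoint()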


Lavaraja Padala

Big Data/AI/ML/SageMaker Support Engineer at AWS. Views or opinions expressed here are completely my own and have no relation to AWS.