Incredibly powerful text-generation and image-generation models such as GPT-2, BERT, BLOOM 176B, and Stable Diffusion are now available to anyone with access to a handful of GPUs, or even a single one, yet their application is still restricted by two critical factors: inference latency and cost.
What is DeepSpeed-MII?
DeepSpeed-MII is a new open-source Python library from DeepSpeed, aimed at making low-latency, low-cost inference of powerful models not only feasible but also easily accessible.
- MII offers access to highly optimized implementations of thousands of DL models.
- MII-supported models achieve significantly lower latency and cost compared to their original implementation.
- To enable low latency/cost inference, MII leverages an extensive set of optimizations from DeepSpeed-Inference such as deep fusion for transformers, automated tensor-slicing for multi-GPU inference, on-the-fly quantization with ZeroQuant, and others.
Native Deployment Options
DeepSpeed MII ships with two deployment options:
- Local Deployment with gRPC server
- Azure ML Endpoints
The local deployment option carries the overhead of running an extra gRPC server for model inference, and the cloud option depends on Azure ML. Being bound to a gRPC or Azure Machine Learning style of deployment makes it difficult to serve the optimized model with other model servers.
Non-Native Deployment Example:
This blog covers how to serve a DeepSpeed MII optimized Stable Diffusion model via TorchServe, bypassing the default deployment options offered by DeepSpeed MII.
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
Refer: https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work
Before using the model, you need to accept the model license in order to download and use the weights.
For access tokens refer: https://huggingface.co/docs/hub/security-tokens
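One common way to make the token available to the scripts below, assuming the huggingface-hub CLI is installed, is to log in once so the token is cached locally:

huggingface-cli login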
Below is a sample Python implementation that optimizes the Stable Diffusion model with DeepSpeed MII without using the native deployment options.
The script requires the pillow, deepspeed-mii, and huggingface-hub packages.
Sample Python script
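The original script is not reproduced here; the sketch below shows the general shape dsmii.py could take. It downloads the pipeline with diffusers, saves the weights to a local model directory (zipped in the packaging step later), and applies DeepSpeed's kernel injection, which is the DeepSpeed-Inference optimization MII builds on. The argument names match the inference query shown below; treat the deepspeed.init_inference call on a diffusers pipeline as an assumption to adapt to your DeepSpeed version.

# dsmii.py -- minimal sketch, not the original script from the blog.
# Assumes: torch, diffusers, deepspeed, pillow, and a Hugging Face token
# with access to the Stable Diffusion weights (e.g. via `huggingface-cli login`).
import argparse
from datetime import datetime

import deepspeed
import torch
from diffusers import StableDiffusionPipeline

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="CompVis/stable-diffusion-v1-4")
parser.add_argument("--prompt", required=True)
args = parser.parse_args()

# Download the pipeline and save it locally; the "model" directory is what
# gets zipped into model.zip in the TorchServe packaging step below.
pipe = StableDiffusionPipeline.from_pretrained(
    args.model_name, torch_dtype=torch.float16, use_auth_token=True
)
pipe.save_pretrained("model")
pipe = pipe.to("cuda")

# DeepSpeed-Inference kernel injection (the optimization MII uses under the hood).
# Passing a diffusers pipeline here is an assumption; adjust for your DeepSpeed version.
pipe = deepspeed.init_inference(
    pipe,
    mp_size=1,
    dtype=torch.half,
    replace_with_kernel_inject=True,
)

image = pipe(args.prompt).images[0]
image.save("output-{}.jpg".format(datetime.now().strftime("%Y%m%d%H%M%S")))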
Inference query
python dsmii.py --model_name CompVis/stable-diffusion-v1-4 --prompt "A photo of a golden retriever puppy wearing a shirt. Background office"
TorchServe Implementation:
Compress Model:
Zip the folder where the model is saved; in this case, the weights are in the model directory:
cd model
zip -r ../model.zip *
Generate MAR File:
torch-model-archiver --model-name stable-diffusion --version 1.0 --handler custom_handler.py --extra-files model.zip -r requirements.txt
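The custom_handler.py referenced above is not shown in full here; the sketch below illustrates the shape such a handler could take, assuming a TorchServe BaseHandler that unpacks model.zip, loads the pipeline, applies the same DeepSpeed optimization as in dsmii.py, and returns the generated image as JPEG bytes. requirements.txt would list the per-model dependencies (e.g. deepspeed, deepspeed-mii, diffusers, transformers) that TorchServe installs when per-model dependency installation is enabled in config.properties.

# custom_handler.py -- minimal sketch of a TorchServe handler, not the original.
import io
import os
import zipfile

import deepspeed
import torch
from diffusers import StableDiffusionPipeline
from ts.torch_handler.base_handler import BaseHandler


class DiffusersHandler(BaseHandler):
    def initialize(self, context):
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        gpu_id = properties.get("gpu_id")
        self.device = torch.device(
            "cuda:" + str(gpu_id) if torch.cuda.is_available() else "cpu"
        )

        # model.zip was packaged via --extra-files; unpack it next to the MAR contents.
        with zipfile.ZipFile(os.path.join(model_dir, "model.zip"), "r") as zf:
            zf.extractall(os.path.join(model_dir, "model"))

        pipe = StableDiffusionPipeline.from_pretrained(
            os.path.join(model_dir, "model"), torch_dtype=torch.float16
        ).to(self.device)
        # Same DeepSpeed-Inference kernel injection as in dsmii.py (assumption:
        # your DeepSpeed version supports injecting into a diffusers pipeline).
        self.pipe = deepspeed.init_inference(
            pipe, mp_size=1, dtype=torch.half, replace_with_kernel_inject=True
        )
        self.initialized = True

    def preprocess(self, requests):
        # Each request carries the prompt as the raw request body.
        prompts = []
        for req in requests:
            data = req.get("data") or req.get("body")
            if isinstance(data, (bytes, bytearray)):
                data = data.decode("utf-8")
            prompts.append(data)
        return prompts

    def inference(self, prompts):
        return [self.pipe(prompt).images[0] for prompt in prompts]

    def postprocess(self, images):
        # TorchServe expects one response entry per request; return JPEG bytes.
        outputs = []
        for image in images:
            buf = io.BytesIO()
            image.save(buf, format="JPEG")
            outputs.append(buf.getvalue())
        return outputs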
Start TorchServe
config.properties:
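A minimal config.properties for this setup could look like the following; install_py_dep_per_model=true is needed so TorchServe installs the packages from the requirements.txt bundled in the MAR file, while the addresses and model_store path are assumptions to adapt to your environment:

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=model_store
load_models=stable-diffusion.mar
install_py_dep_per_model=true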
torchserve --start --ts-config config.properties
Run Inference:
python query.py --url "http://localhost:8080/predictions/stable-diffusion" --prompt "a photo of an astronaut riding a horse on mars"
The generated image will be written to a timestamped file, e.g. output-20221027213010.jpg.
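query.py itself is a thin client; below is a minimal sketch that matches the command and the timestamped output file above, assuming the handler reads the prompt from the raw request body and returns JPEG bytes (the original script may differ).

# query.py -- minimal sketch of the inference client.
import argparse
from datetime import datetime

import requests

parser = argparse.ArgumentParser()
parser.add_argument("--url", required=True)
parser.add_argument("--prompt", required=True)
args = parser.parse_args()

# Send the prompt as the raw request body and save the returned image bytes.
response = requests.post(args.url, data=args.prompt.encode("utf-8"))
response.raise_for_status()

filename = "output-{}.jpg".format(datetime.now().strftime("%Y%m%d%H%M%S"))
with open(filename, "wb") as f:
    f.write(response.content)
print("Image written to", filename)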
Conclusion
This blog describes how DeepSpeed MII can be used without the gRPC or Azure ML deployments. This removes the need to run a gRPC server in a local deployment or to take on an Azure ML dependency in the cloud, enabling DeepSpeed MII optimized models to be served by other serving solutions such as TorchServe.