Description
This n8n workflow demonstrates how to automate image captioning tasks using Gemini 1.5 Pro - a multimodal LLM which can accept and analyse images. This is a really simple example of how easy it is to build and leverage powerful AI models in your repetitive tasks. How it works For this demo, we'll import a public image from a popular stock photography website, Pexel.com , into our workflow using the HTTP request node. With multimodal LLMs, there is little do preprocess other than ensuring the image dimensions fit within the LLMs accepted limits. Though not essential, we'll resize the image using the Edit image node to achieve fast processing. The image is used as an input to the basic LLM node by defining a "user message" entry with the binary (data) type. The LLM node has the Gemini 1.5 Pro language model attached and we'll prompt it to generate a caption title and text appropriate for the image it sees. Once generated, the generated caption text is positioning over the original image to complete the task. We can calculate the positioning relative to the amount of characters produced using the code node. An example of the combined image and caption can be found here: https://res.cloudinary.com/daglih2g8/image/upload/f_auto,q_auto/v1/n8n-workflows/l5xbb4ze4wyxwwefqmnc Requirements Google Gemini API Key. Access to Google Drive. Customising the workflow Not using Google Gemini? n8n's basic LLM node supports the standard syntax for image content for models that support it - try using GPT4o, Claude or LLava (via Ollama). Google Drive is only used for demonstration purposes. Feel free to swap this out for other triggers such as webhooks to fit your use case.
Key Features
1) Imports images from online sources such as Pexels using HTTP requests
2) Utilizes the Google Gemini 1.5 Pro multimodal LLM for AI-driven image caption generation
3) Supports multimodal input for visual tasks
enabling advanced classification or OCR
4) Overlays generated captions on images using user-friendly image editing features
5) Calculates optimal caption positioning dynamically to enhance visual appeal
6) Allows customization for various AI models beyond Google Gemini for flexible integration.
Required Tools
MetaobjectListDrop
