Morphous AI: Building Multi-Modal AI Workflows with Spring Boot and Vaadin

Ankit Agrahari
4 days ago

Multi-modal AI workflows are becoming essential for modern backend systems. They allow applications to process and generate different types of data—text, images, audio—within a unified framework. This capability opens new possibilities for richer user experiences and more intelligent automation. The MorphousAI project, built with Spring Boot, Vaadin, and Spring AI, offers a practical example of how to implement such workflows in a backend-first Java environment. This article explores MorphousAI’s architecture, technology choices, and key implementation details, providing insights for senior Java developers and architects interested in AI integration.

Why Multi-Modal AI Matters for Backend Systems

AI models have traditionally focused on single data modalities, such as text or images. However, real-world applications often require handling multiple data types simultaneously. Multi-modal AI enables systems to understand and generate content across these modalities, such as converting text prompts into images or synthesizing speech from text.

For backend systems, supporting multi-modal AI means managing diverse data pipelines, integrating with AI services, and ensuring smooth communication between components. This complexity demands a robust architecture that can handle binary data (like images and audio), maintain statelessness for scalability, and secure sensitive API keys.

MorphousAI addresses these challenges by demonstrating a backend-first approach that integrates multi-modal AI workflows using Java and Spring technologies. This approach is particularly relevant for Java developers who want to build AI-powered applications without switching to other languages or frameworks.

Spring Boot & Vaadin - A Good Fit for AI-Driven Applications

Spring Boot is a natural choice for backend development in Java due to its simplicity, extensive ecosystem, and support for microservices and cloud-native architectures. It provides a solid foundation for building stateless services that can scale horizontally, which is crucial when dealing with AI workloads that may require significant compute resources.

Vaadin complements Spring Boot by offering a server-driven UI framework that allows developers to build rich web interfaces entirely in Java. Unlike traditional frontend frameworks that require separate JavaScript codebases, Vaadin generates the frontend automatically from Java components. This integration reduces context switching and keeps the development experience consistent.

In MorphousAI, Vaadin handles the user interface for interacting with AI features, such as submitting text prompts, uploading images, and playing audio. The frontend directory is fully generated by Vaadin, meaning developers focus on Java code without manually writing HTML, CSS, or JavaScript. This setup aligns well with the backend-first philosophy and leverages Spring Boot’s strengths.

The Overall Architecture of MorphousAI

MorphousAI’s architecture centers around a Spring Boot backend that exposes AI services and a Vaadin-based UI layer. The backend integrates with OpenAI models through Spring AI, handling requests for text-to-image generation, text-to-speech synthesis, and image uploads.

The system is designed to be stateless, with each request processed independently. This design supports scalability and simplifies deployment in cloud environments. Binary data such as images and audio files are handled efficiently, with appropriate streaming and encoding mechanisms.

The UI components built with Vaadin communicate with backend services via Spring-managed beans. This tight integration allows the UI to remain reactive and server-driven, avoiding the complexity of client-side state management.

Security is managed through environment-based configuration of API keys, ensuring sensitive credentials are not hardcoded. The architecture also supports extensibility, allowing new AI modalities or workflows to be added with minimal disruption.

Using Spring AI to Integrate with OpenAI Models

Spring AI provides a convenient abstraction layer for interacting with AI APIs, including OpenAI’s models. In MorphousAI, Spring AI manages authentication, request formatting, and response parsing, enabling developers to focus on business logic.

The integration involves defining service classes that invoke OpenAI endpoints for different AI tasks. For example, text-to-image generation uses the DALL·E model, while text-to-speech leverages OpenAI’s audio synthesis capabilities.

Spring AI’s configuration supports environment variables for API keys, allowing secure and flexible deployment. The framework also retries failed API calls automatically, and the retry policy can be tuned through its spring.ai.retry.* properties.
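To make the retry idea concrete, here is a minimal plain-Java sketch of retrying a call with exponential backoff. It is not Spring AI's implementation; the RetryingCaller class and its parameters are hypothetical names used only for illustration.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Spring AI: retries a call with exponential backoff.
final class RetryingCaller {

    private final int maxAttempts;
    private final Duration initialDelay;

    RetryingCaller(int maxAttempts, Duration initialDelay) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.maxAttempts = maxAttempts;
        this.initialDelay = initialDelay;
    }

    // Runs the call, doubling the wait after each failure; rethrows the last failure.
    <T> T call(Callable<T> call) throws Exception {
        long delayMillis = initialDelay.toMillis();
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (InterruptedException e) {
                throw e; // never retry an interrupt
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                    delayMillis *= 2;
                }
            }
        }
        throw last;
    }
}
```

A caller could wrap a model invocation as new RetryingCaller(3, Duration.ofMillis(500)).call(() -> service.textToImage(prompt)). In a real Spring AI application the built-in retry support is usually the better choice.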

AI Service for Text-to-Image Generation

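import org.springframework.ai.image.Image;
import org.springframework.ai.image.ImageOptions;
import org.springframework.ai.image.ImagePrompt;
import org.springframework.ai.image.ImageResponse;
import org.springframework.ai.openai.OpenAiImageModel;
import org.springframework.ai.openai.OpenAiImageOptions;
import org.springframework.stereotype.Service;

@Service
public class MorphousTextToImageService {

    private final OpenAiImageModel imageModel;

    public MorphousTextToImageService(OpenAiImageModel imageModel) {
        this.imageModel = imageModel;
    }

    // Sends the prompt to DALL·E 3 and returns the generated image (URL or base64 payload).
    public Image textToImage(String textPrompt) {
        ImageOptions options = OpenAiImageOptions.builder()
                .model("dall-e-3")
                .width(1024)
                .height(1024)
                .style("vivid")
                .build();
        ImagePrompt prompt = new ImagePrompt(textPrompt, options);
        ImageResponse response = imageModel.call(prompt);
        return response.getResult().getOutput();
    }
}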

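This service uses Spring AI’s OpenAiImageModel to request an image from a text prompt and returns the resulting Image object, which carries either a URL to the generated image or its base64-encoded data. The caller decides how to display or process it further.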

Configuration Example (application.yml)

spring:
  application:
    name: MorphousAI
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        base-url: https://api.openai.com
        options:
          # Example OpenAI chat model; "mistral" is not served by api.openai.com
          model: gpt-4o-mini
vaadin:
  launch-browser: true
  allowed-packages: com.vaadin,org.vaadin,com.flowingcode,org.backendbrilliance.morphousai

Using environment variables for API keys ensures that sensitive information is not stored in source code or configuration files directly.

Vaadin UI Components Calling Spring Services

Vaadin components in MorphousAI interact with backend services through Spring-managed beans. This approach keeps UI logic clean and leverages Spring’s dependency injection.

Here is an example of a Vaadin view that allows users to enter a text prompt and display the generated image:

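// Lives in a Vaadin view class that has an injected MorphousTextToImageService field named textToImageService.
// Vaadin imports needed: com.vaadin.flow.component.Component, com.vaadin.flow.component.button.Button,
// com.vaadin.flow.component.html.Image, com.vaadin.flow.component.orderedlayout.VerticalLayout,
// com.vaadin.flow.component.textfield.TextArea
private Component createTextToImage() {
    VerticalLayout ttiLayout = new VerticalLayout();
    TextArea textPrompt = new TextArea();
    textPrompt.setMinLength(20);
    textPrompt.setManualValidation(true);
    textPrompt.setLabel("Image generation prompt");

    Button convert = new Button("Convert");
    convert.addClickListener(click -> {
        String prompt = textPrompt.getValue();
        if (!textPrompt.isEmpty()) {
            textPrompt.setInvalid(false);
            // Call the AI service; the returned Spring AI Image is assumed to carry the generated image's URL.
            org.springframework.ai.image.Image image = textToImageService.textToImage(prompt);
            // The URL is remote, so use it as the image source instead of wrapping it in a File.
            Image generatedImage = new Image(image.getUrl(), "generated image");
            ttiLayout.add(generatedImage);
        } else {
            // With manual validation, the field must be flagged invalid for the message to show.
            textPrompt.setInvalid(true);
            textPrompt.setErrorMessage("Image generation prompt cannot be empty!!");
        }
    });

    ttiLayout.add(textPrompt, convert);
    return ttiLayout;
}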

This component demonstrates how the UI remains simple and declarative while delegating AI logic to the backend service.

Understanding Complex Concepts in MorphousAI

Multi-Modal AI

Multi-modal AI refers to systems that process and generate multiple types of data, such as text, images, and audio. MorphousAI supports workflows where a user can input text to generate images or audio, or upload images as inputs for further AI processing. This capability requires handling different data formats and integrating diverse AI models.

Text-to-Image Generation

Text-to-image generation converts descriptive text prompts into visual content. MorphousAI uses OpenAI’s DALL·E model for this task. The backend sends the prompt to the model, receives a URL to the generated image, and downloads the image bytes for display or further use.
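Depending on the requested response format, a generated image can arrive as a remote URL or as base64 data. The sketch below shows one way to turn either form into a value usable as an image source. GeneratedImage is a hypothetical stand-in for the URL and base64 fields of Spring AI's Image class, ImageSources is an illustrative helper name, and the PNG MIME type is an assumption.

```java
// Hypothetical stand-in for the URL and base64 fields of Spring AI's Image.
record GeneratedImage(String url, String b64Json) {}

// Illustrative helper, not taken from the MorphousAI code base.
final class ImageSources {

    private ImageSources() {}

    // Returns a value usable as an <img> src: the remote URL if present,
    // otherwise a data URI built from the base64 payload (assumed to be PNG).
    static String toSrc(GeneratedImage image) {
        if (image.url() != null && !image.url().isBlank()) {
            return image.url();
        }
        if (image.b64Json() != null && !image.b64Json().isBlank()) {
            return "data:image/png;base64," + image.b64Json();
        }
        throw new IllegalArgumentException("Image has neither a URL nor base64 data");
    }
}
```

The resulting string can be handed to a Vaadin Image component as its source, which avoids treating a remote URL as a local file.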

Text-to-Speech Handling

Text-to-speech (TTS) converts text into spoken audio. MorphousAI integrates TTS by requesting audio data from OpenAI’s models, which return binary audio streams. The backend manages these streams, encoding them as base64 for transmission to the frontend, where Vaadin components play the audio.

Server-Driven UI with Vaadin

Vaadin’s server-driven UI model means that UI components are defined and managed on the server side in Java. The framework automatically generates the frontend code, handling client-server communication transparently. This approach simplifies development by keeping all logic in one language and environment.

Backend Concerns in MorphousAI

Stateless Services

MorphousAI’s services are stateless, meaning they do not store client-specific data between requests. This design supports scalability and fault tolerance, allowing multiple instances to handle requests independently.

Handling Binary Data

The project handles binary data such as images and audio carefully. For example, image bytes are downloaded from URLs and encoded as base64 strings for embedding in the UI. Audio streams are similarly processed to enable playback in browsers.
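As a rough illustration of the encoding step, the following sketch turns raw bytes into a base64 data URI that a browser can use as an image or audio source. The DataUris class is a hypothetical helper, not code from the project, and the caller is assumed to know the correct MIME type.

```java
import java.util.Base64;

// Hypothetical helper for embedding binary payloads in the UI.
final class DataUris {

    private DataUris() {}

    // Encodes the bytes as base64 and prefixes them with the data URI header.
    static String toDataUri(byte[] bytes, String mimeType) {
        String encoded = Base64.getEncoder().encodeToString(bytes);
        return "data:" + mimeType + ";base64," + encoded;
    }
}
```

For example, DataUris.toDataUri(audioBytes, "audio/mpeg") yields a string that can be set as the src of an audio element. Base64 inflates the payload by about a third, so for large files serving the bytes through a streamed download is likely the better option.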

Security and API Key Management

API keys for OpenAI are managed through environment variables and injected into Spring configuration. This practice prevents accidental exposure of credentials and supports different environments (development, staging, production) without code changes.
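Spring resolves the ${OPENAI_API_KEY} placeholder on its own, so no extra code is needed in the application. For code that runs outside Spring, the same fail-fast idea can be sketched in plain Java. The ApiKeys class and its methods are hypothetical names used for illustration.

```java
import java.util.Map;

// Hypothetical helper showing fail-fast lookup and safe logging of an API key.
final class ApiKeys {

    private ApiKeys() {}

    // Looks up a required key in the given environment and fails fast if it is missing.
    static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Missing required environment variable: " + name);
        }
        return value;
    }

    // Masks a key for logging, keeping only the last four characters.
    static String mask(String key) {
        if (key.length() <= 4) {
            return "****";
        }
        return "****" + key.substring(key.length() - 4);
    }
}
```

A typical call would be ApiKeys.require(System.getenv(), "OPENAI_API_KEY"). Passing the environment as a map keeps the helper easy to test without touching real environment variables.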

Extensibility and Scaling

MorphousAI’s modular design allows adding new AI modalities or workflows by creating new service classes and UI components. The stateless nature and Spring Boot’s support for containerization make scaling straightforward.
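One possible way to structure this kind of extensibility is a shared contract per modality plus a small registry. The sketch below is an assumption about how such a design could look, not a description of the project's actual classes. ModalityService and ModalityRegistry are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical contract: one implementation per modality (text-to-image, text-to-speech, ...).
interface ModalityService {
    String name();
    byte[] generate(String prompt);
}

// Hypothetical registry that routes a request to the service registered for a modality.
final class ModalityRegistry {

    private final Map<String, ModalityService> services = new LinkedHashMap<>();

    // Adds a service; rejects a second service with the same name.
    void register(ModalityService service) {
        if (services.putIfAbsent(service.name(), service) != null) {
            throw new IllegalArgumentException("Duplicate modality: " + service.name());
        }
    }

    // Delegates to the matching service or fails for an unknown modality.
    byte[] generate(String modality, String prompt) {
        ModalityService service = services.get(modality);
        if (service == null) {
            throw new IllegalArgumentException("Unknown modality: " + modality);
        }
        return service.generate(prompt);
    }

    Set<String> modalities() {
        return services.keySet();
    }
}
```

With this shape, adding a modality means writing one new implementation and registering it. In a Spring application the registry could receive all implementations through constructor injection of a List<ModalityService>, so existing code would not need to change.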

This project could be a good starting point for exploring Spring AI integration, and there is plenty of room to enhance it further. Running it requires some OpenAI credits, because image and audio generation are paid services. With Ollama, there is no comparably good model for image and audio generation, but describing an image can be done with the LLaVA model.

GitHub project link: https://github.com/ankitagrahari/MorphousAI?tab=readme-ov-file

Do share topic ideas, suggestions, or your experience using Spring AI, Vaadin, and OpenAI models.

Happy learning :)