LLM Evaluation & Comparison

LLM Evaluation Project Types with Multi-Provider Integration

Generative AI Lab provides specialized project types for evaluating and comparing responses from large language models (LLMs):

LLM Evaluation – assess responses from a single LLM based on defined criteria.
LLM Comparison – compare side-by-side responses from two or more LLMs for the same prompt.

These projects enable structured model assessment, supporting both qualitative and quantitative scoring within the same collaborative annotation environment.

Supported LLM Providers

Generative AI Lab integrates with multiple leading LLM providers:

OpenAI
Azure OpenAI
Amazon SageMaker
Anthropic Claude

You can configure one or more providers globally from System Settings → Integration, where credentials for each service can be securely added.
The procedure is the same for every provider: enter the required API keys or access tokens, validate them, and save.
Once configured, these providers are available for selection when creating LLM Evaluation or Comparison projects.

Creating an LLM Evaluation Project

Navigate to the Projects page and click New.
After filling in the project details and assigning the project to a team, proceed to the Configuration page.
Under the Text tab in step 1 (Content Type), select the LLM Evaluation task and click Next.
On the Select LLM Providers page, you can either:
- Click the Add button to create an external provider specific to the project (this provider will only be used within this project), or
- Click Go to External Service Page to be redirected to the Integration page, associate the project with one of the supported external LLM providers, and return to Project → Configuration → Select LLM Response Provider.
Choose the provider you want to use, save the configuration, and click Next.
Customize labels and choices as needed in the Customize Labels section, and save the configuration.

For LLM Evaluation Comparison projects, follow the same steps, but associate the project with two different external providers and select both on the LLM Response Provider page.

Importing Prompts for LLM Evaluation (No Pre-Filled Responses)

To start working with prompts:

Go to the Tasks page and click Import.
Upload your prompts in either .json or .zip format using the structure below.

Sample JSON for LLM Evaluation Project

{
  "data": {
    "prompt": "Give me a diet plan for a diabetic 35 year old with reference links",
    "response1": "",
    "title": "DietPlan"
  }
}

Sample JSON for LLM Evaluation Comparison Project

{
  "data": {
    "prompt": "Give me a diet plan for a diabetic 35 year old with reference links",
    "response1": "",
    "response2": "",
    "title": "DietPlan"
  }
}
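
When there are many prompts, the import files can be generated rather than written by hand. The following is a minimal sketch, not part of the product: the helper names `build_prompt_task` and `pack_tasks` are hypothetical, and the one-task-per-file layout inside the archive is an assumption based on the samples above rather than a documented requirement.

```python
import json
import zipfile


def build_prompt_task(prompt, title, num_responses=1):
    """Build one import task whose response fields are empty placeholders."""
    if num_responses < 1:
        raise ValueError("num_responses must be at least 1")
    data = {"prompt": prompt}
    # response1 for Evaluation; response1 and response2 for Comparison.
    for index in range(1, num_responses + 1):
        data[f"response{index}"] = ""
    data["title"] = title
    return {"data": data}


def pack_tasks(tasks, target):
    """Write each task as its own .json member of a zip archive.

    Assumption: one task object per file, mirroring the samples above.
    `target` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for number, task in enumerate(tasks, start=1):
            archive.writestr(f"task_{number}.json", json.dumps(task, indent=2))


prompt = "Give me a diet plan for a diabetic 35 year old with reference links"
evaluation_task = build_prompt_task(prompt, "DietPlan")
comparison_task = build_prompt_task(prompt, "DietPlan", num_responses=2)

print(json.dumps(comparison_task, indent=2))
```

Calling `pack_tasks([evaluation_task, comparison_task], "prompts.zip")` would produce an archive to upload from the Tasks page. In practice, a single import should contain only tasks that match the project type.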

Once the prompts are imported as tasks, click the Generate Response button to fetch LLM responses directly from the configured providers.

After the responses are generated, users can begin evaluating them directly within the task interface.

Importing Prompts and LLM Responses for Evaluation

Users can also import prompts and LLM-generated responses using a structured JSON format. This feature supports both LLM Evaluation and LLM Evaluation Comparison project types.

Below are example JSON formats:

LLM Evaluation: Includes a prompt and one LLM response mapped to a provider.
LLM Evaluation Comparison: Supports multiple LLM responses to the same prompt, allowing side-by-side evaluation.

Sample JSON for LLM Evaluation Project with Response

{
  "data": {
    "prompt": "Give me a diet plan for a diabetic 35 year old with reference links",
    "response1": "Prompt Response 1 Here",
    "llm_details": [
      { "synthetic_tasks_service_provider_id": 1, "response_key": "response1" }
    ],
    "title": "DietPlan"
  }
}

Sample JSON for LLM Evaluation Comparison Project with Response

{
  "data": {
    "prompt": "Give me a diet plan for a diabetic 35 year old with reference links",
    "response1": "Prompt Response 1 Here",
    "response2": "Prompt Response 2 Here",
    "llm_details": [
      { "synthetic_tasks_service_provider_id": 1, "response_key": "response1" },
      { "synthetic_tasks_service_provider_id": 2, "response_key": "response2" }
    ],
    ],
    "title": "DietPlan"
  }
}
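
Because each entry in `llm_details` has to point at a response field that actually exists, it can help to build and check these files programmatically before importing them. The sketch below is illustrative only: `build_task_with_responses` and `find_mapping_problems` are hypothetical helpers, the provider IDs are placeholders taken from the samples, and the integer-ID check is an assumption drawn from those samples rather than a documented rule.

```python
def build_task_with_responses(prompt, title, responses):
    """Build a task from (provider_id, response_text) pairs, in display order."""
    data = {"prompt": prompt}
    llm_details = []
    for index, (provider_id, text) in enumerate(responses, start=1):
        key = f"response{index}"
        data[key] = text
        llm_details.append(
            {"synthetic_tasks_service_provider_id": provider_id, "response_key": key}
        )
    data["llm_details"] = llm_details
    data["title"] = title
    return {"data": data}


def find_mapping_problems(task):
    """Return human-readable problems in the response-to-provider mapping."""
    data = task.get("data", {})
    problems = []
    mapped = set()
    for entry in data.get("llm_details", []):
        key = entry.get("response_key")
        if key not in data:
            problems.append(f"llm_details references missing field: {key}")
        # Assumption: provider IDs are integers, as in the samples.
        if not isinstance(entry.get("synthetic_tasks_service_provider_id"), int):
            problems.append(f"provider id for {key} is not an integer")
        mapped.add(key)
    # A pre-filled response with no provider mapping is likely a mistake.
    for key, value in data.items():
        if key.startswith("response") and value and key not in mapped:
            problems.append(f"{key} has text but no llm_details entry")
    return problems


task = build_task_with_responses(
    "Give me a diet plan for a diabetic 35 year old with reference links",
    "DietPlan",
    [(1, "Prompt Response 1 Here"), (2, "Prompt Response 2 Here")],
)
print(find_mapping_problems(task))
```

An empty list means every pre-filled response is mapped to a provider and every mapping points at a field that is present.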

Analytics Dashboard for LLM Evaluation Projects

A dedicated analytics tab provides quantitative insights for LLM evaluation projects:

Bar graphs for each evaluation label and choice option
Statistical summaries derived from submitted completions
In multi-annotator scenarios, submissions from the highest-priority users take precedence
Only submitted completions are included in the calculations; drafts are excluded
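
The counting rules above can be illustrated with a small sketch. This is not the application's implementation: the record layout (`task_id`, `annotator`, `status`, `choices`), the `annotator_rank` mapping, and the function name `summarize_choices` are all assumptions made for illustration, with a lower rank number standing for a higher-priority user.

```python
from collections import Counter


def summarize_choices(completions, annotator_rank):
    """Count (label, choice) pairs, one submitted completion per task."""
    best = {}
    for completion in completions:
        # Drafts never contribute to the analytics.
        if completion["status"] != "submitted":
            continue
        rank = annotator_rank.get(completion["annotator"], float("inf"))
        current = best.get(completion["task_id"])
        # Keep the completion from the highest-priority (lowest-rank) user.
        if current is None or rank < current[0]:
            best[completion["task_id"]] = (rank, completion)
    counts = Counter()
    for _, completion in best.values():
        for label, choice in completion["choices"].items():
            counts[(label, choice)] += 1
    return counts


completions = [
    {"task_id": 1, "annotator": "alice", "status": "submitted", "choices": {"Accuracy": "Good"}},
    {"task_id": 1, "annotator": "bob", "status": "submitted", "choices": {"Accuracy": "Poor"}},
    {"task_id": 2, "annotator": "alice", "status": "draft", "choices": {"Accuracy": "Good"}},
    {"task_id": 2, "annotator": "bob", "status": "submitted", "choices": {"Accuracy": "Good"}},
]
counts = summarize_choices(completions, {"alice": 0, "bob": 1})
print(counts)
```

In this example, task 1 is counted from the higher-priority annotator's submission, and task 2 falls back to the only submitted completion because the other one is still a draft. The resulting counts per label and choice are the kind of figures a bar graph would display.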

The general workflow for these projects aligns with the existing annotation flow in Generative AI Lab. The key difference lies in the integration with external LLM providers and the ability to generate model responses directly within the application for evaluation.

These new project types provide teams with a structured approach to assess and compare LLM outputs efficiently, whether for performance tuning, QA validation, or human-in-the-loop benchmarking.
