IBM / unitxt
81%
main: 81%

Repo Added 24 Dec 2024 03:17PM UTC

Files 64

LAST BUILD ON BRANCH fix-mt_bench-style-llm-as-judge-post-processor

Committed 27 May 2026 12:07PM UTC coverage: 80.853% (-0.001%) from 80.854%

Build # 26510105925

Build Type

github

Committed by

Commit Message

Merge 2aeab368b into 398979869

Pull Request #1960: Fix bug in parsing LLM as Judges results

Coverage Stats

1608 of 2009 branches covered (80.04%)

Branch coverage included in aggregate %.

10963 of 13539 relevant lines covered (80.97%)

0.81 hits per line
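
The figures above can be cross-checked with a little arithmetic. The sketch below assumes that "branch coverage included in aggregate" means covered lines and covered branches are summed over relevant lines plus total branches. That formula is an assumption rather than something stated on this page, but it does reproduce the reported 80.853%. The function and variable names are illustrative only.

```python
def pct(covered: int, total: int) -> float:
    """Return covered/total as a percentage."""
    return 100.0 * covered / total


# Counts copied from the coverage stats of this build.
lines_covered, lines_relevant = 10963, 13539
branches_covered, branches_total = 1608, 2009

line_pct = pct(lines_covered, lines_relevant)
branch_pct = pct(branches_covered, branches_total)

# Assumed aggregate: lines and branches pooled into a single ratio.
aggregate_pct = pct(
    lines_covered + branches_covered,
    lines_relevant + branches_total,
)

print(f"line coverage:      {line_pct:.2f}%")       # 80.97%
print(f"branch coverage:    {branch_pct:.2f}%")     # 80.04%
print(f"aggregate coverage: {aggregate_pct:.3f}%")  # 80.853%
```

Under this assumption the aggregate sits between the line and branch figures, weighted toward line coverage because lines far outnumber branches.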

Source Files on main

Recent builds

Build	Branch	Commit	Type	Ran	Committer	Via	Coverage (%)
26510105925	fix-mt_bench-style-llm-as-judge-post-processor	Merge 2aeab368b into 398979869	Pull #1960	27 May 2026 12:12PM UTC	web-flow	github	80.85
26506100559	main	Update version to 1.26.10 (#1970) Signed-off-by: Yoav Katz <katz@il.ibm.com>	push	27 May 2026 10:42AM UTC	web-flow	github	80.85
26499755280	1.26.10	Merge d7904982d into f2424be15	Pull #1970	27 May 2026 08:32AM UTC	web-flow	github	80.84
26499736672	main	fix: Compatibility with huggingface_hub 1.16, numpy 2.0, and pandas 3.0 (#1971) * fix: Use namespaced HF dataset paths for huggingface_hub >= 1.16 compatibility huggingface_hub 1.16+ enforces that dataset repository IDs must use the 'namespace/n...	push	27 May 2026 08:27AM UTC	web-flow	github	80.85
26498246619	fix/hf-namespaced-dataset-paths	Merge e162e8411 into 449712793	Pull #1971	27 May 2026 07:55AM UTC	web-flow	github	80.87
26496877660	fix/hf-namespaced-dataset-paths	Merge 5a60c02bf into 449712793	Pull #1971	27 May 2026 07:24AM UTC	web-flow	github	80.87
26496041440	fix/hf-namespaced-dataset-paths	Merge 28695e1fa into 449712793	Pull #1971	27 May 2026 07:04AM UTC	web-flow	github	80.87
26453488907	fix/hf-namespaced-dataset-paths	Merge b847b8872 into 449712793	Pull #1971	26 May 2026 03:10PM UTC	web-flow	github	80.86
26218221043	security/fix-cwe95-eval-injection	Merge 955cfe51a into 876b22a5f	Pull #1964	21 May 2026 09:52AM UTC	web-flow	github	80.86
26218201046	main	fix: Update inference tests for WatsonX model deprecations and API changes (#1969) - Replace deprecated ibm/granite-3-8b-instruct with ibm/granite-4-h-small - Fix double-encoded tool call arguments in WMLInferenceEngineChat - Remove logprobs test...	push	21 May 2026 09:49AM UTC	web-flow	github	80.86

See All Builds (1895)