IBM / unitxt
81%

Default branch: main
Repo added: 24 Dec 2024 03:17PM UTC
Files: 64
LAST BUILD ON BRANCH main

03 Nov 2025 01:37PM UTC coverage: 80.889% (-0.004%) from 80.893%
Build 19036532707 · push · via github · committer web-flow
Correct reflection based tool calling metrics so valid results will be 1. (#1940)

* Correct reflection based tool calling metrics so valid results will be 1.

Also added example of using reflection based tool calling to correct tool calls and reevaluate.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Fixed the test of the tool calling evaluation metrics accordingly.

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Co-authored-by: Koren Lazar <koren.lazar@ibm.com>

1607 of 2006 branches covered (80.11%)

Branch coverage included in aggregate %.

10947 of 13514 relevant lines covered (81.0%)

0.81 hits per line
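
The figures above can be cross-checked with a short calculation. The sketch below assumes the aggregate percentage is (covered lines + covered branches) / (relevant lines + total branches); that formula is an assumption rather than something this page states, but it reproduces the reported 80.889%. The function and variable names are illustrative only.

```python
def coverage_pct(covered: int, total: int) -> float:
    """Return coverage as a percentage of the total."""
    return 100.0 * covered / total


# Counts reported for build 19036532707
lines_covered, lines_relevant = 10947, 13514
branches_covered, branches_total = 1607, 2006

line_pct = coverage_pct(lines_covered, lines_relevant)
branch_pct = coverage_pct(branches_covered, branches_total)

# Assumed aggregate: lines and branches pooled into a single ratio
aggregate_pct = coverage_pct(
    lines_covered + branches_covered,
    lines_relevant + branches_total,
)

print(f"line coverage:   {line_pct:.2f}%")       # 81.00%
print(f"branch coverage: {branch_pct:.2f}%")     # 80.11%
print(f"aggregate:       {aggregate_pct:.3f}%")  # 80.889%
```

Pooling the two counts explains why the aggregate (80.889%) falls between the line figure (81.0%) and the branch figure (80.11%), and sits closer to the line figure because lines far outnumber branches.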

Source Files on main
  • Tree
  • List 64
  • Changed 2
  • Source Changed 0
  • Coverage Changed 2

Recent builds

Build | Branch | Commit | Type | Ran | Committer | Via | Coverage
19036532707 | main | Correct reflection based tool calling metrics so valid results will be 1. (#1940) * Correct reflection based tool calling metrics so valid results will be 1. Also added example of using reflection based tool calling to correct tool calls and ree... | push | 03 Nov 2025 01:44PM UTC | web-flow | github | 80.89
18444811472 | main | Fixed missing sampling_seed in DiverseLabelsSampler (#1941) * Fixed missing sampling_seed in DiverseLabelsSampler * Downgrade dyndamic in git action to run LLM eval Signed-off-by: Yoav Katz <katz@il.ibm.com> --------- Signed-off-by: Yoav Katz... | push | 12 Oct 2025 01:52PM UTC | web-flow | github | 80.89
17891966098 | main | potential fix for preparation file: prepare/cards/mtrag.py (#1938) Signed-off-by: dafnapension <dafnashein@yahoo.com> | push | 21 Sep 2025 09:44AM UTC | web-flow | github | 80.91
17639482433 | main | Add ReflectionToolCallingMetric and update related metrics (#1931) * Add ReflectionToolCallingMetric and update related metrics - Introduced ReflectionToolCallingMetric for assessing syntactic and semantic validity of tool calls. - Updated Multi... | push | 11 Sep 2025 09:05AM UTC | web-flow | github | 80.91
17464258472 | main | Add more RAG judges (#1934) Signed-off-by: Ariel Gera <ariel.gera1@ibm.com> | push | 04 Sep 2025 12:50PM UTC | web-flow | github | 80.81
17430157886 | main | Improved multi turn evaluation to be self contained and use LLM as judge (#1929) * Improved multi turn evaluation to be self contained and use LLM as judge Also improved printout summary of instance score * Moved to normalized sacrebleu * Chan... | push | 03 Sep 2025 10:14AM UTC | web-flow | github | 80.8
17429961783 | main | Normalize llm judge bench target variable (#1933) Signed-off-by: Martín Santillán Cooper <marsancoo@gmail.com> | push | 03 Sep 2025 10:07AM UTC | web-flow | github | 80.79
17267490019 | main | fix the only 4 erroneous global_mmlu cards that do not pass _source_to_dataset (#1916) * fix the 4 cards Signed-off-by: dafnapension <dafnashein@yahoo.com> * cast rather than filter. passes metrics Signed-off-by: dafnapension <dafnashein@yahoo... | push | 27 Aug 2025 01:08PM UTC | web-flow | github | 80.8
17192082979 | main | Fix erroneous prompts in evaluation tasks (and clean some json-schema-wise) (#1920) * cast None to str, to comply with json schema Signed-off-by: dafnapension <dafnashein@yahoo.com> * fix template and tasks of evaluation Signed-off-by: dafnape... | push | 24 Aug 2025 06:14PM UTC | web-flow | github | 80.81
17191792688 | main | fixed spit names in wiki_bio (#1925) Signed-off-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> | push | 24 Aug 2025 05:46PM UTC | web-flow | github | 80.8
See All Builds (1830)

© 2025 Coveralls, Inc