Add KIE bench (standalone VQA model, full benchmark) #15

ts-kim · 2025-04-30T08:38:45Z

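Description

1. lmms-eval implementation

How the lmms-eval integration is structured
- Built as two tasks:
  1. A full benchmark for evaluating the DocEV VLM
  2. A table benchmark for evaluating the DocEV table captioning engine
- Built as two pipelines:
  1. VLM single-model IE
  2. VLM + LLM (VLM captioning or VLM DP → LLM IE)

The items marked in bold above are done in this PR.
I will open a new PR for the rest as it is completed.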

I built a sample dataset with two items and tested it with the command below.

bash scripts/run_eval.sh --model_path /app/docfm/checkpoints/training/DocVision/SFT/solar-enkoja-10.7b-16K-1.2.1-chat.1-siglip-so400m-patch14-560-LLaVA-NeXT-LDPv2-random-res-560-single-page-SFT-multiple-page/steps_1464 --tasks KIE_bench --gpu_ids 0,3 --port 35001

2. Automatic eos token handling (Issue #10)

An error occurred while testing the model above, so I changed the code to read the eos token from the tokenizer.

Issues

The issues I ran into during implementation are written up on a Notion page; please refer to it.

Issue 1

Predictions currently do not come out in the exact form of the schema.
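One way to make this issue measurable is to compare the keys of each prediction against the schema. The sketch below is only an illustration: it assumes the schema and the prediction are flat JSON objects, and `schema_mismatch` is a hypothetical helper that is not part of this PR or of lmms-eval.

```python
import json


def schema_mismatch(prediction_text: str, schema: dict) -> dict:
    """Report how a raw model output deviates from the schema keys.

    Hypothetical helper; assumes a flat JSON object on both sides.
    """
    try:
        pred = json.loads(prediction_text)
    except json.JSONDecodeError:
        pred = None
    if not isinstance(pred, dict):
        # Output was not a JSON object at all, so every schema key is missing.
        return {"parsed": False, "missing": sorted(schema), "extra": []}
    return {
        "parsed": True,
        "missing": sorted(set(schema) - set(pred)),
        "extra": sorted(set(pred) - set(schema)),
    }


schema = {"total": "", "date": ""}
report_json = schema_mismatch('{"total": "12,000"}', schema)
report_text = schema_mismatch("The total is 12,000.", schema)
print(report_json)
print(report_text)
```

Counting `parsed` failures separately from missing or extra keys would help distinguish free-text answers from JSON answers that only partly follow the schema.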

Issue 2

The prompt may need further refinement.


hancheolcho

Thanks for all the work~
I left comments on the parts I had questions about.

hancheolcho · 2025-04-30T09:32:56Z

lmms_eval/api/task.py

            # using local task in offline environment, need to process the online dataset into local format via
            # `ds = load_datasets("lmms-lab/MMMU")`
-            self.dataset = datasets.load_from_disk(path=self.DATASET_PATH, name=self.DATASET_NAME)
+            self.dataset = datasets.load_from_disk(dataset_path=self.DATASET_PATH)


It looks like you applied the latest lmms-eval code here. Is that right?

https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/93e1ac9c3232812dde5c2e55a786fde6fcb7d6d8/lmms_eval/api/task.py#L1042

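We should rebase onto the latest main at some point.

Yes, that's right.
As you said, the same fix is already applied upstream, but rebasing onto main looked like a big job in itself, so for now I only handled this part.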

hancheolcho · 2025-04-30T09:41:41Z

lmms_eval/models/docvision.py

@@ -74,6 +72,15 @@ def get_config(pretrained):
            test=True,
        )
        self._model.eval()
+
+        try:
+            self._eos_token = self._tokenizer.eos_token


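lmms-eval calls the variable that specifies the generation stop token _eos_token, but strictly speaking it should hold the End-Of-Turn (eot, the end of an instruction turn) token, not the End-Of-Sequence (eos) token.

For example, for Phi models the tokenizer's eos_token is set to <|endoftext|> (attached image 1).
If that token is used, the model can keep generating user-assistant-user-... turns even after the assistant output.
So the <|end|> token (the instruction turn end token of the Phi models) has to be specified instead.

This differs from model to model, so it is hard to automate.
I suggest we keep specifying it in the config as before, and revisit it if a good solution comes up.

Oh, I see. I will revert that part.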
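The difference between the two tokens can be illustrated without loading a model. The sketch below is an assumption-laden toy: `truncate_at_stop` is a hypothetical helper and the `raw` string is a made-up decode in Phi's chat format, not output from the model in this PR.

```python
def truncate_at_stop(generated: str, stop_token: str) -> str:
    """Cut decoded text at the first occurrence of stop_token (hypothetical helper)."""
    idx = generated.find(stop_token)
    return generated if idx == -1 else generated[:idx]


# Made-up raw decode in which the model keeps producing turns after its answer.
raw = (
    '{"total": "12,000"}<|end|>\n'
    "<|user|>\nAnything else?<|end|>\n"
    "<|assistant|>\nNo.<|end|><|endoftext|>"
)

with_eos = truncate_at_stop(raw, "<|endoftext|>")  # tokenizer eos_token
with_eot = truncate_at_stop(raw, "<|end|>")        # end-of-turn token

print("<|user|>" in with_eos)  # extra turns survive when stopping on eos
print(with_eot)                # only the first assistant answer remains
```

Stopping on the eos token leaves the extra user and assistant turns in the prediction, while stopping on the end-of-turn token keeps only the answer. This is why the stop token is left as a per-model config value.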

hancheolcho · 2025-04-30T09:44:26Z

lmms_eval/tasks/KIE_bench/UpScore/__init__.py

Can I assume the files under the UpScore directory were ported from the existing code?
I'd like to know whether anything was modified.

They were ported as-is, with no modifications.

hancheolcho · 2025-04-30T09:47:54Z

lmms_eval/tasks/KIE_bench/KIE_bench_test.yaml

+# Model-specific prompt configurations
+lmms_eval_specific_kwargs:
+  default:
+    pre_prompt: "Extract information from the given image based on this schema: "


Do we need a \n at the end?

Looking at utils.py below, pre_prompt, question, and post_prompt are concatenated with no separator, and post_prompt starts with a \n.
The question is probably stripped text, so I wondered whether pre_prompt should also end with a \n.

Why the \n is there

The original prompt was taken from the main branch of info-extractor-engine.

EXTRACT_USER_PROMPT = """Content to analyze: {content}

1. If you cannot find the information or the value is not mentioned, return nothing.
2. If you can find more than one value for a key, return all the values in an array.
3. Return the value only if the given key’s value exists in the provided content. If it does not exist, return empty string."""

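It is defined as shown above. I modeled the new prompt on the blank line there, which is why it contains a '\n'.

content: the document HTML

Points to consider

How the prompt changes depending on whether the schema is included

The original prompt was written for evaluating an LLM, and the schema was passed in separately as formatting material. In lmms-eval the prompt is for evaluating a VLM, and the schema has to be included in the prompt itself.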

| Model | Input | Separate input |
|---|---|---|
| Existing | HTML, prompt | schema |
| lmms-eval | image, prompt, schema | |

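To handle this, the lmms-eval input is laid out as [image][pre prompt - schema - post prompt].

Should there be a line break after the schema?

In my view it is better to have one, so that the content following the colon is separated from the conditions numbered 1, 2, 3.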
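The layout discussed in this thread can be sketched as follows. This is an illustration, not the code in utils.py: `build_prompt` is a hypothetical name, the schema is assumed to be serialized as JSON, and `POST_PROMPT` reuses an abbreviated version of the conditions from EXTRACT_USER_PROMPT, since the thread only states that post_prompt starts with "\n".

```python
import json

PRE_PROMPT = "Extract information from the given image based on this schema: "

# Assumed content: only the leading "\n" is confirmed by the review thread.
POST_PROMPT = (
    "\n1. If you cannot find the information or the value is not mentioned, return nothing."
    "\n2. If you can find more than one value for a key, return all the values in an array."
)


def build_prompt(schema: dict, pre_prompt: str = PRE_PROMPT, post_prompt: str = POST_PROMPT) -> str:
    """Concatenate pre_prompt, the schema text, and post_prompt with no separator."""
    question = json.dumps(schema, ensure_ascii=False).strip()
    return f"{pre_prompt}{question}{post_prompt}"


prompt = build_prompt({"invoice_no": "", "total": ""})
print(prompt)
```

Because the three parts are joined without a separator, the leading "\n" in post_prompt is what moves the numbered conditions onto their own lines after the schema. Whether pre_prompt also needs a trailing "\n" only affects whether the schema starts on a new line.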

hancheolcho · 2025-04-30T09:55:11Z

lmms_eval/tasks/KIE_bench/utils.py

+    gold_path = item["gold_path"]
+    with open(gold_path, "r") as f:
+        gold = json.load(f)
+    return str(gold)


So the evaluation uses KIE_bench_process_results() and KIE_bench_aggregate_results(), right?

Is KIE_bench_doc_to_target(item) used for computing the token loss?

As you said, KIE_bench_process_results() and KIE_bench_aggregate_results() are the functions used for evaluation.

The doc_to_target function is not used in the current code.
-> It could be used if another metric is added to KIE-bench.

If the function is not implemented at all, the lmms-eval API raises an error, so for now I had it return the ground truth as a string.

On reflection, loading data in a place where it is never used is inefficient, so I will change it to return an empty string "". If we later use a metric other than UpScore, we can add it back then.

I see.
Could you leave a one-line code comment with the history (that it is currently unused)?
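The change agreed on here could look roughly like the sketch below. The function name and the item["gold_path"] field come from the diff above; the comment wording and the type hints are assumptions, not the text that was merged.

```python
def KIE_bench_doc_to_target(item: dict) -> str:
    # History: this function used to load the gold JSON from item["gold_path"]
    # and return str(gold). UpScore does not use the target string, so the
    # load was dropped. The function itself stays because the lmms-eval API
    # raises an error when doc_to_target is missing.
    return ""


print(repr(KIE_bench_doc_to_target({"gold_path": "unused.json"})))
```

Returning a constant avoids opening one file per document during evaluation, at the cost of having to restore the loader if a target-based metric is added later.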

hancheolcho

Thanks for all the work.

There was one place where I think a code comment would help.
Since it does not change behavior, I will approve first.

hancheolcho · 2025-05-02T13:05:45Z

lmms_eval/tasks/KIE_bench/utils.py

+    gold_path = item["gold_path"]
+    with open(gold_path, "r") as f:
+        gold = json.load(f)
+    return str(gold)


I see.
Could you leave a one-line code comment with the history (that it is currently unused)?

ts-kim added 7 commits April 29, 2025 16:20

Add KIE_bench_to_HF_dataset.py for aggregating KIE benchmark datasets as HuggingFace dataset

5ef502e

debug: Fix load_from_disk

d3ad89e

debug: fix dataset merge algorithm & save HF dataset as DatasetDict

dea7bfb

Add KIE_bench task configuration and utility functions

0266ddf

load eos_token automatically

0461777

Implement KIE-bench task

73cc319

Change prompt

9afcc1d

ts-kim self-assigned this Apr 30, 2025

ts-kim requested review from tkdcjf159 and hancheolcho April 30, 2025 08:39

ts-kim added the enhancement New feature or request label Apr 30, 2025

hancheolcho reviewed Apr 30, 2025


ts-kim added 2 commits May 2, 2025 09:04

revert: "load eos_token automatically"

0591aa2

change doc_to_target to return empty string

2057e1a

hancheolcho approved these changes May 2, 2025


docs: add descriptions for empty return of KIE_bench_doc_to_target

0ec3f07

ts-kim merged commit 30ffbff into main May 6, 2025
0 of 2 checks passed

ts-kim deleted the taesung/feat/Add-KIE-Bench branch May 27, 2025 04:47
