AnIma / Ocelot f_1

Update @ 2025.08.04: First release of malpyung_korean_language_rag_sota

This model card corresponds to the 10.8B Instruct version of the Yanolja EEVE model.

Resources and Technical Documentation:

Citation

@misc {ai-AnIma/malpyung_korean_language_rag_sota,
    author       = { {frcp, nebchi, DaKu00, philosokey-M} },
    title        = { malpyung_rag },
    year         = 2025,
    url          = { https://huggingface.co/ai-AnIma/malpyung_korean_language_rag_sota },
    publisher    = { Hugging Face }
}

Model Developers: frcp, nebchi, DaKu00, philosokey-M

Task Overview: Generation Based on the National Institute of Korean Language's Korean Language Norms

  • λ³Έ κ³Όμ œλŠ” ν•œκ΅­μ–΄ μ–΄λ¬Έ κ·œλ²” κ΄€λ ¨ μ§ˆλ¬Έμ— λŒ€ν•΄, κ΅­μ–΄ 지식을 μ°Έμ‘°ν•˜μ—¬ μ •λ‹΅κ³Ό κ·Έ 이유λ₯Ό μƒμ„±ν•˜λŠ” 것을 λͺ©ν‘œλ‘œ ν•©λ‹ˆλ‹€. 이 κ³Όμ œλŠ” κ΅­λ¦½κ΅­μ–΄μ›μ˜ γ€Œ2024λ…„ κΈ€μ“°κΈ° 첨삭 지원을 μœ„ν•œ μ§€μ‹œλ¬Έ 기반 생성 λ§λ­‰μΉ˜ ꡬ좕 연ꡬ」 μ‚¬μ—…μ˜ 결과물인 γ€ŒκΈ€μ“°κΈ° 첨삭 지원을 μœ„ν•œ 기초 μžλ£Œγ€ λ₯Ό 기반으둜 μ„€κ³„λ˜μ—ˆμŠ΅λ‹ˆλ‹€.

🧠 Model Details

Retrieval Architecture: Hybrid Search + Reranker

  • λ³Έ μ‹œμŠ€ν…œμ€Hybrid Search에 Cross-Encoder 기반 Rerankerλ₯Ό κ²°ν•©ν•œ Advanced RAG ꡬ쑰λ₯Ό μ±„νƒν•˜μ˜€μŠ΅λ‹ˆλ‹€.
  • Hybrid SearchλŠ” Reciprocal Rank Fusion (RRF) μ•Œκ³ λ¦¬μ¦˜μ„ 톡해 λ‹€μ–‘ν•œ 검색 결과의 μˆœμœ„λ₯Ό ν†΅ν•©ν•˜μ—¬, 검색 정밀도와 닀양성을 λ™μ‹œμ— ν™•λ³΄ν•˜μ˜€μŠ΅λ‹ˆλ‹€.
  • 이후 Cross-Encoder Rerankerκ°€ μ§ˆμ˜μ™€ 후보 λ¬Έμ„œ κ°„ 의미 정합성을 ν‰κ°€ν•˜μ—¬ μ΅œμ’… 응닡 ν’ˆμ§ˆμ„ κ·ΉλŒ€ν™”ν•©λ‹ˆλ‹€.

Embedding and Reranker Models: Based on the Qwen3 Architecture

  • Qwen3 기반 Embedding λͺ¨λΈκ³Ό Cross-Encoder RerankerλŠ” λͺ¨λ‘ Open Model둜, MTEB벀치마크의 Retrivalκ³Ό STSμ—μ„œ SOTAλ₯Ό κΈ°λ‘ν•œ μ•„ν‚€ν…μ²˜μž…λ‹ˆλ‹€.
  • ν•œκ΅­μ–΄λ₯Ό ν¬ν•¨ν•œ λ‹€μ–‘ν•œ μ–Έμ–΄μ—μ„œ λ¬Έλ§₯ νŒŒμ•…, 의미 ν‘œν˜„, μ •ν™•ν•œ λ¬Έμ„œ μž¬μ •λ ¬μ— 강점을 μ§€λ‹ˆλ©°, μ‹€μ œ 질의 응닡 νλ¦„μ—μ„œ κ³ μ •λ°€ 검색 μ„±λŠ₯을 λ°œνœ˜ν•©λ‹ˆλ‹€.

PDF Document Processing:

  • PDF 기반 λ¬Έμ„œμ˜ μ •μ œλœ 정보 μΆ”μΆœμ„ μœ„ν•΄ Layout 뢄석(layout analyzing) 기법과 Semantic Chunking을 ν™œμš©ν•˜μ—¬ 문단 ꡬ쑰λ₯Ό μž„λ² λ”© λͺ¨λΈμ„ ν™œμš©ν•˜μ—¬ μ²˜λ¦¬ν•˜μ˜€μŠ΅λ‹ˆλ‹€.
  • OCR을 병행 μ μš©ν•˜μ—¬, λ¬Έμ„œ λ‚΄ λͺ¨λ“  정보가 검색 및 RAG에 ν™œμš© κ°€λŠ₯ν•˜λ„λ‘ μ „μ²˜λ¦¬ νŒŒμ΄ν”„λΌμΈμ„ κ΅¬μΆ•ν•˜μ˜€μŠ΅λ‹ˆλ‹€.

ν•΄λ‹Ή νŒŒμ΄ν”„λΌμΈμ€ 의미 λ‹¨μœ„ 청크 ꡬ성 β†’ μž„λ² λ”© β†’ Hybrid κ²€μƒ‰μ˜ 흐름을 기반으둜 정확도 높은 검색 및 응닡 생성을 μ‹€ν˜„ν•©λ‹ˆλ‹€.

πŸ› οΈ λͺ¨λΈ μ‚¬μš© μ˜ˆμ‹œ

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_qdrant import QdrantVectorStore, RetrievalMode

# λͺ¨λΈ 및 ν† ν¬λ‚˜μ΄μ € λ‘œλ“œ
tokenizer = AutoTokenizer.from_pretrained("ai-AnIma/malpyung_korean_language_rag_sota")
model = AutoModelForCausalLM.from_pretrained(
    "ai-AnIma/malpyung_korean_language_rag_sota", device_map="auto"
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=4096)
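
# --- Assumed setup (not shown in the original snippet): the hybrid vector store below
# --- needs a Qdrant client plus dense and sparse embedding models. The model names
# --- here are placeholders for the Qwen3-based embedder and sparse encoder actually used.
from qdrant_client import QdrantClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import FastEmbedSparse

client = QdrantClient(url="http://localhost:6333")                           # local Qdrant instance
embeddings = HuggingFaceEmbeddings(model_name="Qwen/Qwen3-Embedding-0.6B")   # dense embeddings (assumed)
sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25")                 # sparse embeddings for hybrid search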

# Load the Qdrant vector store
qdrant = QdrantVectorStore(
    client=client,
    collection_name="my_documents",
    embedding=embeddings,
    sparse_embedding=sparse_embeddings,
    retrieval_mode=RetrievalMode.HYBRID,
    vector_name="dense",
    sparse_vector_name="sparse",
)

# Query
query = "'가좕을 κΈ°λ₯Ό λ•Œμ—λŠ” {λ¨Ήμ΄λŸ‰/먹이양}을 μ‘°μ ˆν•΄ μ£Όμ–΄μ•Ό ν•œλ‹€.' κ°€μš΄λ° μ˜¬λ°”λ₯Έ 것을 μ„ νƒν•˜κ³ , κ·Έ 이유λ₯Ό μ„€λͺ…ν•˜μ„Έμš”."

# Retrieve documents
found_docs = qdrant.similarity_search(query, k=5)
found_texts = "\n".join([doc.page_content for doc in found_docs])

# Fill the prompt template
prompt_template = """
λ‹€μŒ 정보λ₯Ό λ°”νƒ•μœΌλ‘œ μ§ˆλ¬Έμ— λ‹΅ν•˜μ„Έμš”:
{context}

질문: {question}

μ£Όμ–΄μ§„ μ§ˆλ¬Έμ—λ§Œ λ‹΅λ³€ν•˜μ„Έμš”. λ¬Έμž₯으둜 λ‹΅λ³€ν•΄μ£Όμ„Έμš”. λ‹΅λ³€ν•  λ•Œ 질문의 μ£Όμ–΄λ₯Ό μ¨μ£Όμ„Έμš”.
λ‹΅λ³€:
"""

filled_prompt = prompt_template.format(context=found_texts, question=query)

messages = [
    {"role": "user", "content": filled_prompt}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
outputs = pipe(prompt, do_sample=True, temperature=0.2, top_p=0.9)
print(outputs[0]["generated_text"][len(prompt):])

Results

"가좕을 κΈ°λ₯Ό λ•Œμ—λŠ” 먹이양을 μ‘°μ ˆν•΄ μ£Όμ–΄μ•Ό ν•œλ‹€."κ°€ μ˜³λ‹€. ν•œ 음절의 ν•œμžμ–΄λŠ” μ•žλ§μ΄ κ³ μœ μ–΄λ‚˜ μ™Έλž˜μ–΄μΌ λ•ŒλŠ” 독립적인 ν•œ λ‹¨μ–΄λ‘œ μΈμ‹ν•˜μ—¬ λ‘μŒ 법칙을 μ μš©ν•˜κ³ , μ•žλ§μ΄ ν•œμžμ–΄μΌ λ•ŒλŠ” ν•˜λ‚˜μ˜ λ‹¨μ–΄λ‘œ μΈμ •ν•˜μ§€ μ•Šμ•„ λ‘μŒ 법칙을 μ μš©ν•˜μ§€ μ•ŠλŠ”λ‹€. λ”°λΌμ„œ ν•œμžμ–΄ '量'은 μ•žλ§μ΄ κ³ μœ μ–΄λ‚˜ μ™Έλž˜μ–΄μΌ λ•ŒλŠ” 'μ–‘'이 되고 ν•œμžμ–΄μΌ λ•ŒλŠ” 'λŸ‰'이 λœλ‹€. '먹이'λŠ” κ³ μœ μ–΄μ΄λ―€λ‘œ '먹이양'이 λ§žλŠ” 말이닀.

βœ… Evaluation Results

λ³Έ λͺ¨λΈ AnIma/f_1은 λ‚΄λΆ€ 베이슀라인 λͺ¨λΈκ³Ό λΉ„κ΅ν•΄μ„œ λͺ¨λ“  μ£Όμš” μ§€ν‘œμ—μ„œ 졜고 μ„±λŠ₯을 κΈ°λ‘ν–ˆμŠ΅λ‹ˆλ‹€.

| Model ID | F1 Score | BLEURT | BERTScore | ROUGE-1 | Submitted At |
|---|---|---|---|---|---|
| AnIma/f_1 (Hybrid+Reranker) | 68.52 | 59.06 | 80.01 | 45.56 | 2025.07.31 21:40 |
| try_01 (Not Hybrid) | 57.41 | 57.22 | 55.17 | 40.22 | 2025.07.21 13:41 |
| qwen-8b (competition submission) | 42.19 | 34.54 | 53.13 | 70.36 | 2025.06.16 07:45 |
| hyperclovax-1.5b (competition submission) | 39.27 | 31.93 | 46.73 | 70.69 | 2025.06.16 07:44 |
  • κΈ°μ‘΄ 자체 μ‹€ν—˜ λͺ¨λΈ λŒ€λΉ„ BLEURT, BERTScore, ROUGE-1 λͺ¨λ‘μ—μ„œ +10~30점의 μœ μ˜λ―Έν•œ μ„±λŠ₯ ν–₯상을 κΈ°λ‘ν–ˆμŠ΅λ‹ˆλ‹€.
  • κ²½μ§„λŒ€νšŒ baseline λͺ¨λΈ λŒ€λΉ„ BLEURT κΈ°μ€€ μ΅œλŒ€ +24.5점, BERTScore κΈ°μ€€ +33점 ν–₯μƒλ˜μ—ˆμœΌλ©°, 평균 점수 κΈ°μ€€μœΌλ‘œλ„ μ•½ 20점 이상 ν–₯μƒλœ κ²°κ³Όλ₯Ό λ³΄μ˜€μŠ΅λ‹ˆλ‹€.