Journal of Korean Library and Information Science Society

  • P-ISSN 2466-2542
  • KCI

Embedding Model-Based Approach to Duplicate Verification in MARC Records

Journal of Korean Library and Information Science Society, (P)2466-2542;
2025, v.56 no.4, pp.1-20
https://doi.org/10.16981/kliss.56.4.202512.1
Soon-Young Lee
Min-Geon Song
Soo-Sang Lee

Abstract

This study aimed to improve the performance of duplicate verification algorithms for MARC records by applying AI technology. To overcome the limitations of existing rule-based algorithms, we used AI embedding models, which capture the semantic similarity of text, to vectorize MARC records and verify duplicates through similarity search and semantic similarity analysis. The research proceeded in two phases. First, we implemented a duplicate verification algorithm for MARC records based on vector similarity search with embedding models and evaluated its performance on the same dataset as the prior study. Second, drawing on the results of the first experiment, we implemented an algorithm that plays to the main strength of the embedding approach, namely identifying duplicate records caused by variations in string notation, and evaluated it with newly constructed experimental data and evaluation metrics. The experimental dataset was built by applying eight transformation rules that reflect notational variations likely to occur in actual library settings. In the first experiment, the rate at which identical groups were correctly identified as duplicates improved over the prior study. However, the embedding approach showed limitations where numbers and special characters must match exactly; for example, it incorrectly judged multi-volume materials with different volume information to be similar. The second experiment, designed to validate the advantages of the embedding approach, achieved 100% identification of both duplicate records and transformation rules across the entire experimental dataset.
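
The general idea of duplicate verification by vector similarity search can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the embedding model, the similarity measure, or the threshold. Here `embed` is a toy stand-in that counts character trigrams instead of calling a real embedding model, cosine similarity is an assumed measure, the 0.9 threshold is hypothetical, and the sample records are invented for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def normalize(text):
    """Lowercase and collapse punctuation/whitespace runs into single spaces."""
    return re.sub(r"[^0-9a-z가-힣]+", " ", text.lower()).strip()


def embed(text, n=3):
    """Toy stand-in for an embedding model (assumption): character n-gram counts."""
    padded = f" {normalize(text)} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_duplicates(records, threshold=0.9):
    """Return index pairs whose similarity meets the hypothetical threshold."""
    vectors = [embed(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


# Invented sample records, flattened to strings for illustration only.
records = [
    "Introduction to Library Science / Kim, Cheolsu. Seoul : Hanul, 2020",
    "introduction to library science/Kim Cheolsu.-- Seoul: Hanul, 2020",
    "Principles of database systems / Park, Minho. Busan : Sejong, 2018",
    "Collected works of Korean literature. v.1",
    "Collected works of Korean literature. v.2",
]

print(find_duplicates(records))
```

In this sketch, records 0 and 1 differ only in punctuation and capitalization and are paired as duplicates. Records 3 and 4 are also paired even though their volume numbers differ, which is analogous to the limitation the abstract reports for materials that require exact matching of numbers.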

Keywords
AI, Embedding Models, Vector Similarity Search, MARC Records, Duplicate Verification
Received
2025-11-19
Accepted
2025-12-18
Published
2025-12-30
