Embedding Model-Based Approach to Duplicate Verification in MARC Records

Lee Soon-Young (이순영); Song Min-Geon (송민건); Lee Soo-Sang (이수상)

doi:10.16981/kliss.56.4.202512.1

Abstract

This study aims to improve duplicate verification for MARC records by applying AI technology. To overcome the limitations of existing rule-based algorithms, we used AI embedding models, which capture the semantic similarity of text, to vectorize MARC records and then verified duplicates through similarity search and semantic similarity analysis. The research proceeded in two phases. First, we implemented a duplicate verification algorithm based on vector similarity search over embeddings and evaluated it on the same dataset used in the prior study. Second, drawing on the results of that first experiment, we implemented an algorithm that plays to the main strength of the embedding approach: identifying duplicate records that differ only in string notation. We evaluated this algorithm using newly constructed experimental data and evaluation metrics. The experimental dataset was built by applying eight transformation rules that reflect notational variations likely to occur in actual library settings. In the first experiment, the rate at which identical groups were correctly identified as duplicates improved over the prior study. However, the embedding approach showed limitations where numbers and special characters must match exactly; for example, it incorrectly judged multi-volume materials with different volume information to be similar. In the second experiment, designed to validate the advantages of the embedding approach, the algorithm identified 100% of both the duplicate records and the transformation rules across the entire experimental dataset.

Keywords: AI, Embedding Models, Vector Similarity Search, MARC Records, Duplicate Verification
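
The general idea described in the abstract (vectorize each record, then compare vectors by similarity) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not name the embedding model, so the sketch substitutes a toy character n-gram vectorizer as a stand-in; a real system would call an actual embedding model at that step. The function names, the 0.8 threshold, and the sample records are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation and whitespace into single spaces."""
    return re.sub(r"[^\w]+", " ", text.lower()).strip()


def embed(text: str, n: int = 3) -> Counter:
    """Toy stand-in for an embedding model (assumption): character n-gram counts."""
    padded = f" {normalize(text)} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_duplicates(records: Dict[str, str],
                    threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """Return record-id pairs whose similarity meets the (hypothetical) threshold."""
    vectors = {rid: embed(text) for rid, text in records.items()}
    ids = sorted(vectors)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            score = cosine(vectors[left], vectors[right])
            if score >= threshold:
                pairs.append((left, right, score))
    return pairs


# Hypothetical records flattened from title, author, and imprint fields.
records = {
    "r1": "The Old Man and the Sea / Ernest Hemingway. New York : Scribner, 1952",
    "r2": "The old man and the sea / Hemingway, Ernest. New York: Scribner, 1952.",
    "r3": "Introduction to Information Retrieval / Manning. Cambridge University Press, 2008",
    "r4": "History of Korea. Vol. 1. Seoul, 2001",
    "r5": "History of Korea. Vol. 2. Seoul, 2001",
}

for left, right, score in find_duplicates(records):
    print(left, right, round(score, 3))
```

In this sketch, r1 and r2 differ only in notation (capitalization, punctuation, name order) and are flagged as duplicates. r4 and r5 are also flagged even though they are different volumes, which mirrors the kind of limitation the abstract reports for numbers that must match exactly. A practical system would likely pair the similarity score with an exact comparison of volume and other numeric fields; that combination is a suggestion here, not something the abstract states.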

Journal of Korean Library and Information Science Society

Vol.56 No.4
