This exploratory study applies generative AI-based automated assessment to public library performance evaluation and examines its feasibility for adoption. To this end, we compared the evaluation results produced by a human expert in library and information science with those produced by a generative AI system. The comparison focused on the four domains of the current evaluation indicators that are scored by humans on the basis of submitted documents (space, collaboration, management planning, and best practices) and examined how reliability changed under different prompt-engineering techniques. Using ChatGPT 5.1, we conducted automated evaluations of the documents submitted by 164 public libraries in Seoul for the 2024 public library performance evaluation. The results indicated that for domains with relatively simple content and clearly defined rating scales (space, collaboration, and management planning), agreement between expert and AI scores was high. In contrast, in the best practices domain, which requires qualitative judgment, the discrepancy between expert and AI results was substantial. Furthermore, reliability between expert and AI scores was highest under the condition that combined Task Information (TI) prompts, which present the information required for evaluation in a structured form, with Demonstration Information (DI) prompts, which supply illustrative examples. In particular, in the qualitative assessment domain, reliability improved significantly when DI prompts were added.
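To illustrate the distinction between the two prompt types, the sketch below shows one possible way a combined TI + DI prompt could be assembled. It is only an illustration under assumed conditions: the indicator wording, rating scale, example document, JSON output format, and the function name build_prompt are hypothetical placeholders and are not the prompts or indicators actually used in the study.

```python
# Hypothetical sketch of the two prompt components described above.
# All rubric text, examples, and names are placeholders, not the study's materials.

# Task Information (TI): a structured statement of the evaluation task,
# the rating scale, and the evidence the model may use.
TASK_INFO = """You are scoring a public library's submitted documents
for the 'management planning' indicator.
Rating scale: 1 (inadequate) to 5 (excellent).
Base the score only on the evidence in the document below.
Return JSON: {"score": <1-5>, "rationale": "<one sentence>"}."""

# Demonstration Information (DI): a worked example (few-shot demonstration)
# showing how a document maps to a score and rationale.
DEMONSTRATION = """Example document: "The library set three annual goals,
allocated a budget to each, and reported quarterly progress."
Example answer: {"score": 4, "rationale": "Goals, budget, and monitoring are
documented, but outcomes are not yet evaluated."}"""


def build_prompt(submitted_document: str, use_demonstration: bool = True) -> str:
    """Compose the evaluation prompt: TI only, or TI combined with DI."""
    parts = [TASK_INFO]
    if use_demonstration:
        parts.append(DEMONSTRATION)
    parts.append(f"Document to evaluate:\n{submitted_document}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    doc = "The library drafted a three-year plan but reported no progress reviews."
    # The composed string would then be sent to the chosen language model.
    print(build_prompt(doc, use_demonstration=True))
```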
