ISSN : 1738-6764
In this study, we present a Retrieval-Augmented Generation (RAG) pipeline designed to extract key values from scientific literature on silicon carbide (SiC) crystal growth by the physical vapor transport (PVT) method. To improve the relevance and completeness of the retrieved context, we implemented a hybrid retrieval strategy that combines dense retrieval via FAISS with sparse retrieval using BM25. We employed two distinct prompting approaches for key-value extraction. The first handles interactive user queries by using the retrieved context to generate informed responses. The second, intended for bulk extraction, follows a two-step process: a binary classification prompt first checks whether the retrieved context contains information relevant to the query; if it does, a subsequent prompt extracts the value under strict constraints, requiring exact phrasing without guessing or explanation. This binary pre-check significantly improves the identification of true negative cases, thereby reducing the amount of irrelevant or missing data in the output. For the generative component of the pipeline, we evaluated three large language models (LLMs), Llama 8B, Gemma 7B, and Mistral 7B, all running in a local multi-GPU environment at FP16 precision. The results reveal differences in the efficiency of these models within our customized RAG system, particularly in extracting over 156 targeted technical key-value pairs from 13 benchmark papers.
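
The two-step bulk-extraction procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the function name `extract_value`, the `llm` callable, and the final verbatim check against the context are all assumptions introduced for illustration, and `fake_llm` is a hypothetical stand-in for a real model call operating on made-up example text.

```python
from typing import Callable, Optional

# Hypothetical prompt templates; the actual prompts are not given in the abstract.
PRESENCE_PROMPT = (
    "Context:\n{context}\n\n"
    "Does the context state a value for '{key}'? Answer only YES or NO."
)
EXTRACT_PROMPT = (
    "Context:\n{context}\n\n"
    "Copy the exact phrase from the context that gives '{key}'. "
    "Do not guess and do not explain."
)


def extract_value(key: str, context: str, llm: Callable[[str], str]) -> Optional[str]:
    """Two-step extraction: binary presence check, then constrained extraction."""
    # Step 1: binary pre-check. Anything other than a YES answer is treated
    # as a negative, so no extraction prompt is issued for that key.
    verdict = llm(PRESENCE_PROMPT.format(context=context, key=key))
    if not verdict.strip().upper().startswith("YES"):
        return None

    # Step 2: constrained extraction of the value.
    answer = llm(EXTRACT_PROMPT.format(context=context, key=key)).strip()

    # Assumption: "exact phrasing" is enforced here by accepting the answer
    # only if it appears verbatim in the retrieved context.
    return answer if answer and answer in context else None


def fake_llm(prompt: str) -> str:
    """Hypothetical stub standing in for a local LLM call."""
    if "Answer only YES or NO" in prompt:
        return "YES" if "2300" in prompt else "NO"
    return "2300 C"


if __name__ == "__main__":
    # Made-up example sentences, not taken from the benchmark papers.
    print(extract_value("growth temperature", "The growth temperature was 2300 C.", fake_llm))
    print(extract_value("growth temperature", "No details given.", fake_llm))
```

Returning `None` when the pre-check fails keeps negative cases explicit, so a downstream table can distinguish a value that is absent from the paper from one that was extracted incorrectly.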
