Northeastern Society of Plastic Surgeons


Back to 2025 Abstracts


PlasticsGPT: A Validation Study of Automated Systematic Review Screening with a Large Language Model (LLM) in Plastic & Reconstructive Surgery
Akash Kapoor, Nikhil A. Gangoli*, Myles N. LaValley, Jarrod T. Bogue
Plastic Surgery, Columbia University Vagelos College of Physicians & Surgeons, New York, NY

Background:
Research productivity in plastic and reconstructive surgery has grown exponentially, increasing the importance of systematic reviews to guide clinical practice. However, these reviews are time- and resource-intensive. Large Language Models (LLMs) offer a promising solution for high-volume text screening, though their use in plastic surgery remains largely unexplored. We present an analysis of a custom LLM framework's performance in a plastic surgery systematic review.
Methods:
We designed PlasticsGPT, a multi-step literature screening framework built on GPT-4o. Performance was assessed by comparison to inclusion/exclusion decisions made by two human reviewers in a published review of plastic surgical involvement in orthopedic oncology. A post hoc analysis was conducted to investigate inappropriately excluded articles.
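The abstract does not describe the framework's internals, so the following is only a minimal sketch of how a two-stage (abstract, then full text) screen that records its reasoning could be organized. Every name here (`Record`, `Decision`, `screen`, `keyword_judge`) is hypothetical, and the keyword-based judge is a stand-in for an LLM call, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A judge takes text and returns (include?, reasoning). In a real framework
# this would wrap an LLM call; here it is an assumption-free local stub.
Judge = Callable[[str], Tuple[bool, str]]


@dataclass
class Record:
    record_id: str
    abstract: str
    full_text: Optional[str] = None  # None models an unreadable/missing full text


@dataclass
class Decision:
    record_id: str
    include: bool
    stage: str       # stage at which the final decision was made
    reasoning: str   # reasoning kept for every decision


def screen(records: List[Record], abstract_judge: Judge, fulltext_judge: Judge) -> List[Decision]:
    """Two-stage screen: only abstracts that pass go on to full-text review."""
    decisions = []
    for rec in records:
        ok, why = abstract_judge(rec.abstract)
        if not ok:
            decisions.append(Decision(rec.record_id, False, "abstract", why))
            continue
        if rec.full_text is None:
            # Assumption: unreadable full texts are excluded with a reason,
            # so they can be found and re-checked later.
            decisions.append(Decision(rec.record_id, False, "full_text",
                                      "full text unavailable or unreadable"))
            continue
        ok, why = fulltext_judge(rec.full_text)
        decisions.append(Decision(rec.record_id, ok, "full_text", why))
    return decisions


def keyword_judge(text: str) -> Tuple[bool, str]:
    """Hypothetical stand-in judge: include if the text mentions 'flap'."""
    hit = "flap" in text.lower()
    return hit, "mentions flap reconstruction" if hit else "no reconstruction mentioned"


demo = [
    Record("a1", "Free flap coverage after sarcoma resection.", "Flap outcomes were reported."),
    Record("a2", "Chemotherapy dosing in osteosarcoma."),
    Record("a3", "Pedicled flap for limb salvage.", None),
]
results = screen(demo, keyword_judge, keyword_judge)
for d in results:
    print(d.record_id, d.include, d.stage, "-", d.reasoning)
```

Keeping the stage and reasoning on each decision is what makes a later audit of excluded articles straightforward, since exclusions can be grouped by the stage at which they occurred.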
Results:
PlasticsGPT achieved 90.4% accuracy (95% CI: 88.3%-92.5%) in classifying 778 articles. Specificity was 91.4% (95% CI: 89.4%-93.4%) and sensitivity was 65.4% (95% CI: 42.5%-83.2%), with nine false negatives. The AUROC was 0.784 (95% CI: 0.679-0.889). Abstract screening (n = 778) took 50.3 minutes (3.9 s/abstract) and full-text decisions (n = 175) took 12.3 minutes (4.2 s/article). Results included the model's reasoning at each stage. On post hoc analysis of the nine false negatives, four were unprocessable scanned documents and five had been excluded at the abstract stage. When readable full texts were provided, seven of the nine were correctly included. The remaining two resulted from human reviewer deviations from the model's explicit criteria. With these seven articles reclassified, model accuracy improved to 91.3% (95% CI: 89.3%-93.3%), sensitivity improved to 92.3% (95% CI: 74.5%-99.1%), and AUROC improved to 0.918 (95% CI: 0.845-0.991).
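As a quick arithmetic check, the reported sensitivities and per-item times can be reproduced from the counts given above. The abstract does not state the number of truly included articles; the value of 26 used below is an inference from nine false negatives at 65.4% sensitivity, and the variable names are illustrative only.

```python
# Assumption: 26 human-included articles, inferred from 9 / (1 - 0.654) ≈ 26.
positives = 26
fn_initial = 9   # false negatives in the primary analysis
fn_posthoc = 2   # false negatives left after seven were correctly included

# Sensitivity = true positives / all articles the human reviewers included.
sens_initial = (positives - fn_initial) / positives
sens_posthoc = (positives - fn_posthoc) / positives

# Mean time per item = total minutes * 60 / number of items screened.
sec_per_abstract = 50.3 * 60 / 778
sec_per_fulltext = 12.3 * 60 / 175

print(f"initial sensitivity:  {sens_initial:.1%}")   # 65.4%
print(f"post hoc sensitivity: {sens_posthoc:.1%}")   # 92.3%
print(f"abstract screening:   {sec_per_abstract:.1f} s/abstract")  # 3.9
print(f"full-text screening:  {sec_per_fulltext:.1f} s/article")   # 4.2
```

Both sensitivities are consistent with the same denominator, which supports the inferred count without confirming it.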
Conclusions:
PlasticsGPT is highly accurate and resource-efficient in screening articles involving complex plastic surgery populations, substantially streamlining the systematic review workflow. Full-text review took just 0.3 s/article longer than abstract screening. Future assessment of PlasticsGPT will trial extraction of plastic-surgery-specific data.