Long Document Transformers for Pathology Report Classification
Description: In recent years, deep learning-based models for electronic health records have shown impressive results in many clinical tasks. Deep learning classification models typically require large labeled training datasets and are designed to address specific clinical tasks. Transformers are powerful state-of-the-art language models designed to learn inherent patterns in unstructured text data in an unsupervised manner. This unsupervised training makes the model generalizable and reusable across clinical tasks and removes the need for labeled data in the pre-training phase. The trained transformer can then be fine-tuned towards a specific clinical task using a small, task-curated training dataset. In the current work, we build a transformer model that can effectively accommodate the length of typical cancer pathology reports. We use 5.7 million pathology reports from six Surveillance, Epidemiology, and End Results (SEER) Program cancer registries to train the Big-Bird model “from scratch”. Big-Bird is a transformer model built for long documents (up to 4096 tokens), in contrast to popular models such as BERT (up to 512 tokens). Because the memory requirement of a standard transformer scales quadratically with the sequence length of the input text, Big-Bird uses sparse attention to reduce this cost. In phase one, Big-Bird is trained in an unsupervised manner using the pre-training task called masked language prediction. This phase requires the largest amount of computation, and it leverages the secure CITADEL capability for working with protected health information (PHI) data on the Summit supercomputer at the Oak Ridge Leadership Computing Facility. In phase two, we fine-tune the pre-trained Big-Bird model to handle five information extraction tasks: site, sub-site, histology, laterality, and behavior.
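The quadratic memory scaling mentioned above can be illustrated with a rough count of attention score entries per head. This is a minimal sketch, not the configuration used in this work: the function names and the value of `keys_per_token` are hypothetical, and the sparse case simply assumes each token attends to a fixed number of keys.

```python
# Rough count of attention score entries per head, per layer.
# Assumption: in the sparse case every token attends to a fixed number of keys.
# keys_per_token = 256 is a hypothetical value chosen for illustration only.

def dense_attention_entries(seq_len: int) -> int:
    """Full attention: every token scores against every token."""
    return seq_len * seq_len

def sparse_attention_entries(seq_len: int, keys_per_token: int) -> int:
    """Sparse attention: each token scores against a fixed number of keys."""
    return seq_len * min(keys_per_token, seq_len)

short_dense = dense_attention_entries(512)        # BERT-length input
long_dense = dense_attention_entries(4096)        # Big-Bird-length input
long_sparse = sparse_attention_entries(4096, keys_per_token=256)

print(long_dense // short_dense)   # growth factor when going from 512 to 4096 tokens
print(long_dense // long_sparse)   # saving from the assumed sparse pattern
```

Under these assumptions, an 8-fold increase in sequence length multiplies the dense score matrix by 64, while the sparse count grows only in proportion to the sequence length.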
For fine-tuning, we use data from six SEER registries, restricted to reports within a 10-day window before and after the date of cancer diagnosis; the ground truth for the five tasks comes from the manually coded CTC (Cancer/Tumor/Case) report. One advantage of this two-phase approach is that the phase one model can be reused for any pathology-relevant clinical task in phase two. Our results show that the proposed Big-Bird model, fine-tuned with SEER data on the five information extraction tasks, outperforms the current state-of-the-art deep learning classification model by an average of 2% in micro F1 score and an average of 8% in macro F1 score across all tasks. On the most challenging tasks, sub-site shows a 4% increase in micro F1 score and histology shows a 25% increase in macro F1 score. The results demonstrate the promise of using a single pre-trained model on five related clinical tasks. We plan to further test the generalizability and reusability of the model by extending it to other clinically useful tasks such as bio-marker extraction and identification of malignant and metastatic disease.
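The micro and macro F1 scores reported above can move by very different amounts because they average differently: micro F1 pools every prediction, while macro F1 averages per-class scores so that rare classes count as much as common ones. The following is a minimal sketch of both metrics for single-label classification; the function name and the toy labels are hypothetical and are not drawn from the SEER data.

```python
from collections import Counter

def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denom = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Hypothetical toy data: a classifier that never predicts the rare class.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the micro score stays high while the macro score is roughly halved, which is why an improvement on rare classes tends to show up mainly in macro F1.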
Time: Sunday, 13 November 2022, 9:30am - 9:45am CST