AI-enabled Multi-task Vision–Language Modeling for Pathology from Large-Scale Public Social Network Knowledge

Understanding of heterogeneous pathology images is limited by the scarcity of well-annotated, publicly available image–text datasets. In this study, we collected 208,414 well-annotated pathology image–text pairs, each consisting of an image and its text description. This collection is so far the largest public dataset for pathology images. By jointly learning the visual and linguistic representations of the data, we proposed a multi-task AI framework for pathology that achieves superior performance across multiple benchmarks and can make predictions on previously unseen data. In addition, this framework allows image retrieval from text inputs. Used as an image search engine, this retrieval capability can be a powerful educational tool. In summary, this large-scale, crowdsourced, spontaneous, and interactive public social network knowledge enabled us to establish a generic AI for pathology that is capable of handling multiple tasks. This approach has greatly enhanced our understanding of, and interaction with, the enormous amount of pathology data available.