***Project Information***
* The project name is LibRAG (Retrieval Augmented Generation)
* https://github.com/BU-Spark/ml-bpl-rag/tree/main
* [Google Drive](https://drive.google.com/drive/folders/12_tsVcUgwdfUdXalD67NOgUL3tGeI6ss?usp=sharing)
* This project added natural language querying to the Digital Commonwealth project.
* Client: Boston Public Library
* Contact: Eben English
* Class: DS549
***Dataset Information***
* Our data is stored on the SCC at /projectnb/sparkgrp/ml-bpl-rag-data
  * /vectorstore/final_embeddings/metadata_index - FAISS index for the metadata
  * /vectorstore/final_embeddings/fulltext_index - FAISS index for the OCR text
  * /full_data/bpl_data.json - metadata
  * /full_data/clean_ft.json - fulltext
* We did not have formal datasets; instead, we used the Digital Commonwealth API and created embeddings from its records. No data dictionary is needed beyond the field reference for the [Digital Commonwealth API](https://github.com/boston-library/solr-core-conf/wiki/SolrDocument-field-reference:-public-API).
* What keywords or tags would you attach to the data set?
  * Domain(s) of Application: Natural Language Processing, Library Science
  * Civic tech
*The following questions pertain to the datasets you used in your project.*
*Motivation*
* We needed to create embeddings of the Digital Commonwealth's data in order to perform retrieval.
*Composition*
* Each entry in the Digital Commonwealth API represents an object in their repository; object formats vary.
* There were ~1.3 million total objects when we last checked, about 147,000 of which (roughly 11%) contain full text from OCR'd documents.
* Our data is a comprehensive snapshot taken at collection time; the API continues to be updated, so the live collection may differ.
* Each field returned by the API represents a metadata classification.
* Data is publicly accessible and non-confidential.
*Collection Process*
* We collected data from an API endpoint.
* No sampling was performed.
* This data was collected in October 2024.
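
The collection step above amounts to walking a paginated endpoint until it runs out of records. Below is a minimal sketch of that loop, assuming the endpoint is paged by number and returns an empty page past the end. The names `iter_records` and `fake_fetch_page` are hypothetical, and the stand-in fetcher replaces the real HTTP call so the example runs offline; it is not the project's actual ingestion script.

```python
from typing import Callable, Dict, Iterator, List, Optional


def iter_records(
    fetch_page: Callable[[int], List[Dict]],
    max_pages: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield every record, requesting pages until an empty page comes back."""
    page = 1
    while max_pages is None or page <= max_pages:
        records = fetch_page(page)
        if not records:
            break  # assumption: an empty page marks the end of the collection
        yield from records
        page += 1


def fake_fetch_page(page: int) -> List[Dict]:
    """Hypothetical stand-in for an HTTP request to the API (made-up records)."""
    pages = {1: [{"id": "a"}, {"id": "b"}], 2: [{"id": "c"}]}
    return pages.get(page, [])


# No sampling: every record from every page ends up in the snapshot.
snapshot = list(iter_records(fake_fetch_page))
```

Passing the fetcher in as a parameter keeps the paging logic separate from the network code, so a real request function can be swapped in without touching the loop.
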
*Preprocessing/cleaning/labeling*
* Very limited character correction was performed on the fulltext data.
* No transformations were applied outside of embedding.
* The raw data is saved in ml-bpl-rag-data/full_data/bpl_data.json (metadata) and ml-bpl-rag-data/full_data/clean_ft.json (fulltext).
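
The exact character corrections applied to the fulltext are not listed here. As an illustration only, a light cleanup pass on OCR text might look like the sketch below, which assumes the fixes are Unicode normalization, removal of control characters, and whitespace collapsing. `clean_fulltext` is a hypothetical helper, not the function used to produce clean_ft.json.

```python
import re
import unicodedata


def clean_fulltext(text: str) -> str:
    """Apply a few conservative character-level fixes to OCR text."""
    # Fold compatibility characters (e.g. the "fi" ligature) into plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Drop control characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces or tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "\ufb01rst\x00  page\n\n\n\nend "
cleaned = clean_fulltext(sample)
```
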
*Uses*
* Embedding for retrieval
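
To make "embedding for retrieval" concrete: each record is mapped to a vector, and a query is answered by returning the records whose vectors are closest to the query vector. The sketch below is a brute-force stand-in using cosine similarity in plain Python. The project itself stores its vectors in FAISS indexes, and the three-dimensional vectors and document ids here are made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "index": hypothetical ids mapped to made-up embedding vectors.
toy_index = {
    "map-1": [1.0, 0.0, 0.0],
    "letter-7": [0.0, 1.0, 0.0],
    "photo-3": [0.9, 0.1, 0.0],
}
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
```

A linear scan like this is fine for a handful of vectors; at the scale of ~1.3 million objects, an index structure such as FAISS is what makes the same lookup practical.
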
*Distribution*
* This data is free for subsequent students of our project to access and use.
*Maintenance*
* There is currently no system in place for cleanly updating the data. However, the instructions in WRITEUP.md include a way to ingest your own data from the API and embed it.