***Project Information***
* The project name is LibRAG (Retrieval Augmented Generation)
* https://github.com/BU-Spark/ml-bpl-rag/tree/main
* [Google Drive](https://drive.google.com/drive/folders/12_tsVcUgwdfUdXalD67NOgUL3tGeI6ss?usp=sharing)
* This project involved implementing natural language querying for the Digital Commonwealth collection.
* Client: Boston Public Library
* Contact: Eben English
* Class: DS549
***Dataset Information***
* Our data is stored on the SCC at /projectnb/sparkgrp/ml-bpl-rag-data (a short loading sketch follows this list)
* /vectorstore/final_embeddings/metadata_index - FAISS index for the metadata
* /vectorstore/final_embeddings/fulltext_index - FAISS index for the OCR fulltext
* /full_data/bpl_data.json - metadata
* /full_data/clean_ft.json - fulltext
* We did not have formal datasets; instead, we pulled records from the Digital Commonwealth API and created embeddings from them. No data dictionary is needed beyond the [Digital Commonwealth API field reference](https://github.com/boston-library/solr-core-conf/wiki/SolrDocument-field-reference:-public-API).
* What keywords or tags would you attach to the data set?
* Domain(s) of Application: Natural Language Processing, Library Science
* Civic tech
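
The following is a minimal sketch of opening the prebuilt indices on the SCC, assuming each `*_index` path is a single file written with `faiss.write_index`; if the indices were saved through a wrapper library the layout will differ, so check WRITEUP.md before relying on this.

```python
# Minimal sketch: open the prebuilt FAISS indices and inspect them.
# Assumption: each *_index path is a plain file written with faiss.write_index.
import faiss

BASE = "/projectnb/sparkgrp/ml-bpl-rag-data/vectorstore/final_embeddings"

metadata_index = faiss.read_index(f"{BASE}/metadata_index")
fulltext_index = faiss.read_index(f"{BASE}/fulltext_index")

print("metadata vectors:", metadata_index.ntotal, "dim:", metadata_index.d)
print("fulltext vectors:", fulltext_index.ntotal, "dim:", fulltext_index.d)
```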
*The following questions pertain to the datasets you used in your project.*
*Motivation*
* We needed to create embeddings of the Digital Commonwealth's data in order to perform retrieval
*Composition*
* Each entry in the Digital Commonwealth API represents an object in their repository; objects vary in format.
* There were ~1.3 million total objects last we checked, about 147,000 of which contain full text from OCR'd documents.
* Our data was a comprehensive snapshot; the API itself continues to be updated.
* Each field from the API represents a metadata classification.
* Data is publicly accessible and non-confidential
*Collection Process*
* We collected the data from the Digital Commonwealth public API endpoint (see the fetch sketch after this list).
* No sampling was performed
* This data was collected in October 2024
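
For reference, here is a hedged sketch of pulling records from the public API. The endpoint URL and parameter names are assumptions based on the linked field reference and should be verified against the current API before use.

```python
# Hedged sketch of fetching records from the Digital Commonwealth public API.
# The endpoint and parameter names below are assumptions; adjust to match the
# current API (see the SolrDocument field reference linked above).
import json
import requests

BASE_URL = "https://www.digitalcommonwealth.org/search.json"  # assumed endpoint

def fetch_page(page: int, per_page: int = 100) -> dict:
    """Fetch one page of search results as parsed JSON."""
    resp = requests.get(BASE_URL, params={"page": page, "per_page": per_page}, timeout=30)
    resp.raise_for_status()
    return resp.json()

records = []
for page in range(1, 4):  # first few pages only; the full collection is ~1.3 million objects
    docs = fetch_page(page).get("response", {}).get("docs", [])
    if not docs:
        break
    records.extend(docs)

with open("bpl_data_sample.json", "w") as f:
    json.dump(records, f)
```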
*Preprocessing/cleaning/labeling*
* Very limited character correction was performed on the fulltext data (an illustrative sketch follows this list).
* No transformations were applied beyond embedding.
* The raw data is saved under ml-bpl-rag-data/full_data/ as bpl_data.json (metadata) and clean_ft.json (fulltext).
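
The sketch below illustrates the kind of light cleaning and embedding described above; the cleaning rules and the embedding model named here are assumptions for illustration, not necessarily the exact ones used to build the shipped indices.

```python
# Illustrative sketch of light character correction followed by embedding.
# The cleaning rules and model ("all-MiniLM-L6-v2") are assumptions.
import re
from sentence_transformers import SentenceTransformer

def clean_fulltext(text: str) -> str:
    """Very limited character correction: drop control characters, collapse whitespace."""
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [clean_fulltext(t) for t in ["Example OCR  page\x0c text...", "Another  record"]]
embeddings = model.encode(docs, convert_to_numpy=True)  # shape: (n_docs, dim)
print(embeddings.shape)
```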
*Uses*
* Embedding for retrieval
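
A minimal sketch of the intended retrieval use follows, assuming the query is embedded with the same model that built the fulltext index; the model name below is a placeholder, not the project's confirmed choice.

```python
# Sketch of the intended use: embed a natural-language query and retrieve the
# nearest fulltext vectors. The model must match the one used at index build
# time, otherwise dimensions and results will be wrong.
import faiss
from sentence_transformers import SentenceTransformer

index = faiss.read_index(
    "/projectnb/sparkgrp/ml-bpl-rag-data/vectorstore/final_embeddings/fulltext_index"
)
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name

query = model.encode(["maps of Boston Harbor"], convert_to_numpy=True).astype("float32")
distances, ids = index.search(query, k=5)  # ids map back to positions in the source data
print(list(zip(ids[0].tolist(), distances[0].tolist())))
```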
*Distribution*
* This data is free for subsequent students of our project to use and access.
*Maintenance*
* There is currently no system in place for cleanly updating the data; however, our instructions in WRITEUP.md include a way to ingest your own data from the API and embed it.