***Project Information*** 

* The project name is LibRAG (Retrieval Augmented Generation)
* https://github.com/BU-Spark/ml-bpl-rag/tree/main   
* [Google Drive](https://drive.google.com/drive/folders/12_tsVcUgwdfUdXalD67NOgUL3tGeI6ss?usp=sharing)
* This project involved implementing natural language querying over the Digital Commonwealth collection.
* Client: Boston Public Library
* Contact: Eben English 
* Class: DS549

***Dataset Information***

* Our data is contained on the SCC at /projectnb/sparkgrp/ml-bpl-rag-data (a loading sketch follows this list):
  * /vectorstore/final_embeddings/metadata_index - faiss index for the metadata
  * /vectorstore/final_embeddings/fulltext_index - faiss index for the OCR text
  * /full_data/bpl_data.json - metadata
  * /full_data/clean_ft.json - fulltext
* We did not have formal datasets; instead, we used the Digital Commonwealth API and created embeddings from it. There is no need for a data dictionary beyond the [Digital Commonwealth API field reference](https://github.com/boston-library/solr-core-conf/wiki/SolrDocument-field-reference:-public-API).
* What keywords or tags would you attach to the data set?  
  * Domain(s) of Application: Natural Language Processing, Library Science 
  * Civic tech
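
The indexes above can be queried directly on the SCC. Below is a minimal sketch of loading one of them, assuming they are LangChain FAISS stores; the embedding model named here is a placeholder and must be replaced with the model actually used to build the index.

```python
# Minimal sketch: load the metadata FAISS index and run a similarity search.
# Assumptions (not confirmed here): the index is a LangChain FAISS store, and
# a sentence-transformers model was used for embedding.
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

DATA_ROOT = "/projectnb/sparkgrp/ml-bpl-rag-data"

# Placeholder model: must match the model used to build the index.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

metadata_store = FAISS.load_local(
    f"{DATA_ROOT}/vectorstore/final_embeddings/metadata_index",
    embeddings,
    allow_dangerous_deserialization=True,  # the docstore is pickled on disk
)

# Return the five records most similar to a natural-language query.
for doc in metadata_store.similarity_search("maps of Boston Harbor", k=5):
    print(doc.metadata, doc.page_content[:120])
```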

*The following sections pertain to the datasets used in this project.*   
*Motivation* 

* We needed to create embeddings of the Digital Commonwealth's data in order to perform retrieval

*Composition*

* Each entry in the Digital Commonwealth API represents an object, of varying format, in their repository  
* There were ~1.3 million total objects last we checked, about 147,000 of which contain full text from OCR'd documents. 
* Our data was a comprehensive snapshot at the time of collection; the API continues to be updated.
* Each field from the API represents a metadata classification   
* Data is publicly accessible and non-confidential
  
*Collection Process*

* We collected data from an API endpoint (a fetching sketch follows this list).
* No sampling was performed
* This data was collected in October 2024
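
As a rough illustration of that collection step, the sketch below pulls paged metadata from the Digital Commonwealth search endpoint; the URL and parameter names are assumptions and should be verified against the API field reference linked above.

```python
# Illustrative sketch of pulling records from the Digital Commonwealth search API.
# The endpoint and parameter names are assumptions -- confirm them against the
# SolrDocument field reference linked in the Dataset Information section.
import requests

BASE_URL = "https://www.digitalcommonwealth.org/search.json"

def fetch_page(page: int, per_page: int = 100) -> list[dict]:
    """Return one page of metadata records as a list of dicts."""
    resp = requests.get(BASE_URL, params={"page": page, "per_page": per_page}, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

# Example: collect the first few pages of metadata.
records = []
for page in range(1, 4):
    records.extend(fetch_page(page))
print(f"fetched {len(records)} records")
```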

*Preprocessing/cleaning/labeling* 

* Very limited character correction was performed on the fulltext data (illustrated in the sketch after this list).
* No transformations were applied outside of embedding.
* The raw data is saved in ml-bpl-rag-data/full_data/bpl_data.json (metadata) and clean_ft.json (fulltext)
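
The snippet below is a hypothetical example of the kind of light character correction described above, not the exact logic used in the project: it normalizes Unicode, strips control characters, and collapses whitespace.

```python
# Hypothetical illustration of light character cleanup on OCR fulltext;
# the actual correction applied in this project may differ.
import re
import unicodedata

def clean_fulltext(text: str) -> str:
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C" or ch in "\n\t ")
    return re.sub(r"[ \t]+", " ", text).strip()

print(clean_fulltext("Bost\u00adon  Pub\tlic\x00 Library"))
```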

*Uses* 

* Embedding for retrieval

*Distribution*

* This data is free for subsequent students of our project to use and access.

*Maintenance* 

There is currently no system in place for cleanly updating the data, though our instructions in WRITEUP.md include a way to ingest your own data from the API and embed it (sketched below).
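
As a starting point, here is a minimal sketch of re-embedding newly ingested records into a fresh FAISS store; the embedding model and Solr field name are placeholders, and WRITEUP.md remains the authoritative procedure.

```python
# Minimal sketch of embedding freshly ingested records into a new FAISS store.
# The embedding model and field names below are placeholders; see WRITEUP.md
# for the project's actual ingestion and embedding procedure.
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# `records` would come from the Digital Commonwealth API (see Collection Process).
records = [
    {"id": "commonwealth:abc123", "title": "Map of Boston Harbor"},  # placeholder record
]
texts = [r["title"] for r in records]
metadatas = [{"id": r["id"]} for r in records]

store = FAISS.from_texts(texts, embeddings, metadatas=metadatas)
store.save_local("my_new_metadata_index")
```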