Biblio-glutton database and index (2024-04 Crossref dump)
This repository contains the Biblio-glutton (https://github.com/kermitt2/biblio-glutton) databases and indexes.
Because Hugging Face limits uploads to a maximum of 20 GB per file, we have split the files into compressed chunks.
To decompress them, 7-Zip is required.
The repository contains the following files:
- biblio-glutton-index.7z.*: the zipped Elasticsearch index and engine
- data/db: the LMDB fast storage databases:
- data/db/crossref.zip: the crossref dump (2025/04)
- data/db/hal.7z*: the HAL identifiers
- data/db/pmid.7z*: the PMID identifiers mapping
- data/db/unpayWall.7z*: the Unpaywall OA links (this comes from an old dump of 2018; we plan to replace it with OpenAlex)
Getting started
Assuming you are in /home/user/glutton
- Clone the biblio-glutton application
git clone https://github.com/kermitt2/biblio-glutton
You should have biblio-glutton under a directory of the same name.
- Clone this repository
git lfs install
git clone https://huggingface.co/sciencialab/biblio-glutton-dbs
- Unpack the Index
7z x biblio-glutton-dbs/biblio-glutton-index.7z.001 (extract from the first chunk, making sure the filename matches; 7-Zip picks up the remaining chunks automatically)
You should have a new directory biblio-glutton-index organised as follows:
biblio-glutton-index
├── elastic
│   ├── elastico_singleNode
│   ├── elastico_singleNode.sh
│   ├── config
│   ├── data
│   └── logs
└── elasticsearch-8.15.0
You need to edit elastico_singleNode.sh and the files under elastico_singleNode/config/ to replace the absolute data and logs paths with paths that match your machine.
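The path rewrite can be sketched as below, using a throwaway copy of the config. The file name elasticsearch.yml and the sample old paths are assumptions for illustration; path.data and path.logs are the standard Elasticsearch settings, and you should apply the same substitution to the actual files in your unpacked index.

```shell
# Demo on a throwaway copy: rewrite the absolute data/logs paths.
mkdir -p /tmp/es-demo
cat > /tmp/es-demo/elasticsearch.yml <<'EOF'
path.data: /old/machine/elastic/data
path.logs: /old/machine/elastic/logs
EOF

# Point the paths at your own install location (assumed here).
NEW_ROOT=/home/user/glutton/biblio-glutton-index/elastic
sed -i "s|^path.data:.*|path.data: ${NEW_ROOT}/data|" /tmp/es-demo/elasticsearch.yml
sed -i "s|^path.logs:.*|path.logs: ${NEW_ROOT}/logs|" /tmp/es-demo/elasticsearch.yml
cat /tmp/es-demo/elasticsearch.yml
```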
Then you can run the index by running:
sh elastic/elastico_singleNode.sh
- Unpack the Database
It is better to keep the data outside the biblio-glutton application, so moving it to your root /home/user/glutton/data is one option.
mv biblio-glutton-dbs/data/db .
7z x data/db/*.7z.001 (here also make sure the filenames match; extracting the first chunk of each archive is enough, and the Crossref dump can be unpacked the same way with 7z x data/db/crossref.zip)
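Looping over the archives can be sketched as follows. The chunk naming (.7z.001, .7z.002, …) and the demo directory are assumptions; the loop below is a dry run that only prints the 7z commands it would execute — remove the leading echo to actually extract.

```shell
# Set up a fake data/db with chunk names mimicking the repository layout
# (names are assumptions for the demo).
mkdir -p /tmp/glutton-demo/data/db
touch /tmp/glutton-demo/data/db/hal.7z.001 /tmp/glutton-demo/data/db/hal.7z.002 \
      /tmp/glutton-demo/data/db/pmid.7z.001

# Extract each multi-part archive from its first chunk; 7-Zip locates the
# remaining chunks itself. Dry run: remove `echo` to extract for real.
for first in /tmp/glutton-demo/data/db/*.7z.001; do
  echo 7z x "$first" -o/tmp/glutton-demo/data/db
done
```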
Then you need to update the config file biblio-glutton/config/config.yml in the biblio-glutton application so that the database path matches where you placed the data.
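As an illustration only, the edit amounts to pointing the storage path at the unpacked databases. The key name below is an assumption, not the actual biblio-glutton schema — check the comments in the shipped config.yml for the real setting names:

```yaml
# Illustrative fragment — the key name is an assumption; see the shipped
# biblio-glutton/config/config.yml for the actual settings.
storage: /home/user/glutton/data/db
```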