Biblio-glutton database and index (2024-04 Crossref dump)
This repository contains the Biblio-glutton (https://github.com/kermitt2/biblio-glutton) databases and indexes.
Because Hugging Face limits uploads to a maximum of 20 GB per file, we have split the files into compressed chunks.
To decompress them, 7-Zip is required.
The repository contains the following files:
- biblio-glutton-index.7z.*: the zipped Elasticsearch index and engine
- data/db: the LMDB fast storage databases:
- data/db/crossref.zip: the crossref dump (2025/04)
- data/db/hal.7z*: the HAL identifiers
- data/db/pmid.7z*: the PMID identifiers mapping
- data/db/unpayWall.7z*: the Unpaywall OA links (this comes from an old dump of 2018; we plan to replace it with OpenAlex)
Getting started
Assuming you are in /home/user/glutton
- Clone the biblio-glutton application
git clone https://github.com/kermitt2/biblio-glutton
You should have biblio-glutton under a directory of the same name.
- Clone this repository
git lfs install
git clone https://huggingface.co/sciencialab/biblio-glutton-dbs
- Unpack the Index
7z x biblio-glutton-dbs/biblio-glutton-index.7z.001 (extract from the first chunk, making sure the filename matches; 7-Zip picks up the remaining chunks automatically)
You should have a new directory biblio-glutton-index organised as follows:
biblio-glutton-index
├── elastic
│   ├── elastico_singleNode
│   ├── elastico_singleNode.sh
│   ├── config
│   ├── data
│   └── logs
└── elasticsearch-8.15.0
You need to edit elastico_singleNode.sh and the files under elastico_singleNode/config/ to replace the absolute data and logs paths with paths that match your machine.
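The path rewrite can be sketched as below, using a throwaway copy of the config. The file name elasticsearch.yml and the sample old paths are assumptions for illustration; path.data and path.logs are the standard Elasticsearch settings, and you should apply the same substitution to the actual files in your unpacked index.

```shell
# Demo on a throwaway copy: rewrite the absolute data/logs paths.
mkdir -p /tmp/es-demo
cat > /tmp/es-demo/elasticsearch.yml <<'EOF'
path.data: /old/machine/elastic/data
path.logs: /old/machine/elastic/logs
EOF

# Point the paths at your own install location (assumed here).
NEW_ROOT=/home/user/glutton/biblio-glutton-index/elastic
sed -i "s|^path.data:.*|path.data: ${NEW_ROOT}/data|" /tmp/es-demo/elasticsearch.yml
sed -i "s|^path.logs:.*|path.logs: ${NEW_ROOT}/logs|" /tmp/es-demo/elasticsearch.yml
cat /tmp/es-demo/elasticsearch.yml
```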
Then you can run the index by running:
sh elastic/elastico_singleNode.sh
- Unpack the Database
It is better to keep the data outside the biblio-glutton application, so moving it to your root /home/user/glutton/data is one option.
mv biblio-glutton-dbs/data/db .
7z x data/db/*.7z.001 (here also make sure the filenames match; extracting the first chunk of each archive is enough, and the Crossref dump can be unpacked the same way with 7z x data/db/crossref.zip)
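Looping over the archives can be sketched as follows. The chunk naming (.7z.001, .7z.002, …) and the demo directory are assumptions; the loop below is a dry run that only prints the 7z commands it would execute — remove the leading echo to actually extract.

```shell
# Set up a fake data/db with chunk names mimicking the repository layout
# (names are assumptions for the demo).
mkdir -p /tmp/glutton-demo/data/db
touch /tmp/glutton-demo/data/db/hal.7z.001 /tmp/glutton-demo/data/db/hal.7z.002 \
      /tmp/glutton-demo/data/db/pmid.7z.001

# Extract each multi-part archive from its first chunk; 7-Zip locates the
# remaining chunks itself. Dry run: remove `echo` to extract for real.
for first in /tmp/glutton-demo/data/db/*.7z.001; do
  echo 7z x "$first" -o/tmp/glutton-demo/data/db
done
```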
Then you need to update the config file biblio-glutton/config/config.yml in the biblio-glutton application so that the database path matches where you placed the data.
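As an illustration only, the edit amounts to pointing the storage path at the unpacked databases. The key name below is an assumption, not the actual biblio-glutton schema — check the comments in the shipped config.yml for the real setting names:

```yaml
# Illustrative fragment — the key name is an assumption; see the shipped
# biblio-glutton/config/config.yml for the actual settings.
storage: /home/user/glutton/data/db
```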