Word2Vec in a non-overengineered file format
This is the Word2Vec word embedding collection by Google(TM), converted to a sane binary format that you can read with a trivial program, without resorting to complicated libraries.
This is the format:
- All integers are unsigned and stored in little-endian byte order.
- The components of the embedding vectors are stored as single-precision floating point numbers (float32).
The file starts with a header of 8 bytes:
uint32_t num_entries // Number of entries: 3000000 in this file
uint32_t dimension // Embedding dimension: 300 in this file
Then each entry is stored as:
uint16_t word_len // Length of the UTF-8 encoded word.
word_len bytes // UTF-8 encoded word.
float[300] // 300 float32 numbers, 1200 bytes in total.
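As an illustration of how little code the layout above requires, here is a minimal Python sketch of a reader that uses only the standard `struct` module. The function names `read_exact` and `iter_embeddings` are made up for this example. The sketch also assumes the float32 values are little-endian like the integers, which the format description does not state explicitly.

```python
import struct
from typing import BinaryIO, Iterator, Tuple


def read_exact(f: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes, raising EOFError if the file ends early."""
    data = f.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data


def iter_embeddings(f: BinaryIO) -> Iterator[Tuple[str, Tuple[float, ...]]]:
    """Yield (word, vector) pairs from a file opened in binary mode."""
    # Header: two unsigned 32-bit little-endian integers (8 bytes).
    num_entries, dimension = struct.unpack("<II", read_exact(f, 8))

    # Assumption: float32 values share the little-endian byte order
    # documented for the integers.
    vector_format = f"<{dimension}f"
    vector_size = 4 * dimension  # 1200 bytes when dimension is 300

    for _ in range(num_entries):
        # Entry: uint16 length, then the UTF-8 word, then the vector.
        (word_len,) = struct.unpack("<H", read_exact(f, 2))
        word = read_exact(f, word_len).decode("utf-8")
        vector = struct.unpack(vector_format, read_exact(f, vector_size))
        yield word, vector
```

The generator takes any binary file object, for example one obtained from `open(path, "rb")`, and yields one entry at a time, so the three million vectors never need to be held in memory at once.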
The license is the same as Google's original Word2Vec, Apache 2.0.