Dolphin Brothers Unite

Recent Activity

DolphinBrothersUnite's activity

conceptofmind 
posted an update 9 months ago
Teraflop AI is excited to help support the Caselaw Access Project and Harvard Library Innovation Lab in the release of over 6.6 million state and federal court decisions published throughout U.S. history. It is important to democratize fair access to data for the public, the legal community, and researchers. This is a processed and cleaned version of the original CAP data.

OCR errors were introduced during the digitization of these texts. We post-processed each text for model training, fixing problems with encoding, normalization, repetition, redundancy, parsing, and formatting.
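As a rough illustration of what this kind of post-processing can involve, here is a minimal Python sketch using only the standard library. It is not Teraflop AI's actual pipeline, which the post does not describe in detail; the function name `clean_text` and the specific rules (NFKC normalization, dropping control characters, collapsing whitespace, removing consecutive duplicate lines) are assumptions chosen for illustration.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Hypothetical cleanup pass for OCR'd text (illustrative only)."""
    # Normalization: fold compatibility characters (e.g. ligatures) to plain forms.
    text = unicodedata.normalize("NFKC", raw)
    # Encoding debris: drop control/format characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Formatting: collapse runs of spaces and tabs, trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Repetition: skip a non-empty line that repeats the line right before it.
    deduped = []
    for line in lines:
        if line and deduped and deduped[-1] == line:
            continue
        deduped.append(line)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(deduped))
    return text.strip()


sample = "\ufb01led  in\tcourt\n\ufb01led  in\tcourt\n\n\n\nEnd\x00"
print(clean_text(sample))
```

A real pipeline for millions of court decisions would likely need domain-specific rules (citations, page headers, footnotes) on top of generic steps like these.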

Teraflop AI’s data engine allows for the massively parallel processing of web-scale datasets into cleaned text form.

Link to the processed dataset: https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project

The Caselaw Access Project dataset is licensed under the CC0 License.

We plan to release trillions of commercially licensed text tokens, images, audio, videos, and other datasets spanning numerous domains and modalities over the coming months. If you are interested in contributing commercially licensed data, be sure to reach out: https://twitter.com/EnricoShippole

Follow us for the next collaborative dataset releases: https://twitter.com/TeraflopAI
zpn 
posted an update 10 months ago
ICYMI! Nomic Embed v1.5: Resizable Production Embeddings with Matryoshka Representation Learning

- Variable embedding dimension from 64 to 768
- Outperforms text-embedding-ada-002 while achieving a 3x memory reduction
- Day 1 integrations with Langchain, LlamaIndex, MongoDB, and Sentence Transformers
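
The general idea behind resizable Matryoshka-style embeddings can be sketched as keeping the first k components of a full vector and rescaling the result to unit length. The snippet below is a minimal, standard-library illustration of that idea only; the helper name `truncate_embedding` is hypothetical, the toy vector is made up, and the model's own recommended preprocessing may include additional steps not shown here.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit L2 norm.

    Hypothetical helper for illustration; not part of any Nomic API.
    """
    if not 1 <= dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]


# Toy 3-dimensional "embedding" shortened to 2 dimensions.
full = [3.0, 4.0, 12.0]
short = truncate_embedding(full, 2)
print(len(short), short)
```

Storing the shorter vector uses proportionally less memory, which is the trade-off the variable dimension is meant to expose.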

Check out nomic-ai/nomic-embed-text-v1.5 for the model weights.

Technical report: https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf
Blog Post: https://blog.nomic.ai/posts/nomic-embed-matryoshka
Original Tweet Thread: https://x.com/nomic_ai/status/1757782157374734665?s=20
zpn 
posted an update 11 months ago
ICYMI! Nomic Embed, the first fully open long context text embedder to beat OpenAI

- Open source, open weights, open data
- Beats OpenAI text-embedding-3-small and Ada on short and long context benchmarks
- Day 1 integrations with Langchain, LlamaIndex, MongoDB, and Sentence Transformers

Check out nomic-ai/nomic-embed-text-v1 for the model weights.

Technical report: https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf
Blog Post: https://blog.nomic.ai/posts/nomic-embed-text-v1
Original Tweet Thread: https://x.com/nomic_ai/status/1753082063048040829?s=20
conceptofmind 
posted an update 11 months ago
A 1B dense causal language model begins to "saturate" in terms of accuracy around 5 epochs on 1.2T tokens.