---
title: Inkling
emoji: 🌐
colorFrom: indigo
colorTo: yellow
# python_version: 3.10
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: true
license: agpl-3.0
short_description: Use AI to find obvious research links in unexpected places.
datasets:
- nomadicsynth/arxiv-dataset-abstract-embeddings
models:
- nomadicsynth/research-compass-arxiv-abstracts-embedding-model
---
# Inkling: Bridging the Unconnected in Scientific Literature

**Inkling** is an experimental bridge-finding engine for scientific literature, built to uncover *latent connections* between research papers—relationships that are obvious in hindsight but buried under the sheer volume of modern research. It’s inspired by the work of **Don R. Swanson**, the visionary who discovered the link between *fish oil* and *Raynaud’s syndrome* using nothing but manual literature analysis. Today, we call this approach **Literature-Based Discovery** - and Inkling is our attempt to automate it with modern NLP.

---

## The Problem: Lost in the Literature

The scientific literature is growing exponentially, but human researchers can only read so much. As Sabine Hossenfelder explained in her 2024 YouTube video ["AIs Predict Research Results Without Doing Research"](https://www.youtube.com/watch?v=Qgrl3JSWWDE), even experts miss critical connections because no one has time to read everything. Swanson’s 1986 discovery of the fish oil–Raynaud’s link was a wake-up call: the knowledge existed in plain sight, but the papers were siloed. Inkling is our attempt to fix that.

---
## The Vision: A Bridge-Finding Machine

Inkling isn’t just a search engine. It’s a **hypothesis generator**. It learns to recognize *intermediate concepts* that connect seemingly unrelated papers—like Swanson’s "blood viscosity" bridge. The model is built to:

- **Find indirect links** between papers that don’t cite each other.
- **Surface connections** that feel obvious once explained but are buried in the noise.
- **Scale** to the entire arXiv corpus and beyond.

---
## How It Works

### Model Architecture

- **Base Model**: A `SentenceTransformer` using **Llama-7B** as its base (with frozen weights) and a dense embedding head. A minimal sketch follows this list.
- **Training**:
  - v1: Trained on a synthetic dataset of randomly paired papers, rated for conceptual overlap.
  - v2 (in progress): Focused on *bridge detection*, using prompts to explicitly identify intermediate concepts (e.g., "What connects these two papers?").
- **Embedding Strategy**:
  - Dense vector representations of abstracts.
  - FAISS for fast approximate nearest-neighbor search.
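
Here is a minimal sketch of that architecture, assuming a frozen Hugging Face backbone with a trainable dense head on top. The checkpoint name (`meta-llama/Llama-2-7b-hf`), sequence length, and 1024-dimensional output are placeholders, not Inkling’s exact configuration.

```python
# Minimal sketch: frozen backbone + trainable dense head. The checkpoint name,
# max_seq_length, and output dimension are illustrative placeholders.
from sentence_transformers import SentenceTransformer, models

# Backbone (frozen). "meta-llama/Llama-2-7b-hf" stands in for the actual base model.
backbone = models.Transformer("meta-llama/Llama-2-7b-hf", max_seq_length=512)
for param in backbone.auto_model.parameters():
    param.requires_grad = False

# Mean-pool token embeddings into one vector per abstract.
pooling = models.Pooling(backbone.get_word_embedding_dimension(), pooling_mode="mean")

# Trainable projection head producing the final embedding (1024 dims assumed here).
dense = models.Dense(in_features=pooling.get_sentence_embedding_dimension(), out_features=1024)

model = SentenceTransformer(modules=[backbone, pooling, dense])

# Encode abstracts into dense, unit-length vectors ready for similarity search.
vectors = model.encode(
    ["Fish oil lowers whole-blood viscosity...", "Raynaud's phenomenon involves vasospasm..."],
    normalize_embeddings=True,
)
print(vectors.shape)  # (2, 1024)
```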

### Dataset Philosophy

- v1: Random paper pairs rated for generic "relevance" (too broad, limited bridge detection).
- v2: Focus on **explicit bridge extraction** using LLM-generated triplets (e.g., "Paper A → Bridge Concept → Paper B"); one possible shape for these triplets is sketched below.
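
As a rough illustration (not Inkling’s actual pipeline), the v2 data could be stored as anchor/positive/negative triplets with the LLM-extracted bridge kept as metadata, then fed to a standard triplet objective from `sentence-transformers`. The field names, example texts, placeholder backbone, and loss choice below are all assumptions.

```python
# Illustrative only: one possible shape for v2 bridge triplets and how they
# could feed a triplet loss. Field names, texts, and loss choice are assumptions.
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

bridge_triplets = [
    {
        "anchor": "Dietary fish oil lowers whole-blood viscosity in healthy subjects...",
        "positive": "Elevated blood viscosity is implicated in Raynaud's phenomenon...",
        "negative": "We introduce a benchmark for low-resource machine translation...",
        "bridge": "blood viscosity",  # LLM-extracted concept, kept for inspection/filtering
    },
    # ... many more LLM-generated triplets
]

# The loss only consumes the three text columns (anchor, positive, negative);
# the bridge column documents *why* the pair is linked and is dropped here.
train_dataset = Dataset.from_list(bridge_triplets).remove_columns("bridge")

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder backbone
loss = losses.TripletLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```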

---

## The Inspiration

This project was born from a **nerd-sniping moment** after watching Sabine Hossenfelder’s video on AI’s ability to predict neuroscience results without experiments. That led to three key influences:

### 1. **Swanson’s "Undiscovered Public Knowledge"**

Swanson’s 1986 paper showed that the fish oil–Raynaud’s link existed in the literature for decades—it just took a human to connect the dots. Inkling automates this process.

### 2. **Tshitoyan et al. (2019): Word Embeddings in Materials Science**

Their work demonstrated that unsupervised embeddings could predict future materials discoveries from latent knowledge. Inkling applies this idea to *conceptual bridges* across all scientific fields.

### 3. **Luo et al. (2024): LLMs Beat Human Experts**

This study showed that a 7B LLM (like Mistral) could outperform neuroscientists at predicting experimental outcomes. Inkling leverages this power to find connections even domain experts might miss.

---
## What It Can Do (and What’s Next)

### Current Capabilities

- Embed arXiv abstracts into dense vectors.
- Search for papers with conceptual overlap (50% relevance in top-10/25 queries, per manual testing).
- Visualize results in a Gradio interface with FAISS-powered speed. A rough sketch of this search loop is shown below.
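
The sketch below shows how that loop could fit together, under a few assumptions: abstracts are pre-embedded and stored next to their titles, the index is a plain FAISS inner-product index, and the published embedding model loads directly as a `SentenceTransformer`. File names and UI layout are illustrative, not the actual `app.py`.

```python
# Sketch of the demo loop, not the actual app.py. File names, the index type,
# and the UI layout are assumptions.
import faiss
import gradio as gr
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed artifacts: pre-computed, L2-normalised abstract embeddings and parallel titles.
embeddings = np.load("abstract_embeddings.npy").astype("float32")   # hypothetical file
titles = open("titles.txt", encoding="utf-8").read().splitlines()   # hypothetical file

model = SentenceTransformer("nomadicsynth/research-compass-arxiv-abstracts-embedding-model")

# Inner product on unit-length vectors is cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

def find_links(abstract: str, k: int = 10) -> list[list[str]]:
    """Embed the query abstract and return the k most similar papers."""
    query = model.encode([abstract], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(query, k)
    return [[titles[i], f"{score:.3f}"] for i, score in zip(ids[0], scores[0])]

demo = gr.Interface(
    fn=find_links,
    inputs=gr.Textbox(lines=8, label="Paste an abstract"),
    outputs=gr.Dataframe(headers=["Paper", "Similarity"]),
    title="Inkling",
)

if __name__ == "__main__":
    demo.launch()
```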

### Roadmap

- **v2**: Train on LLM-generated bridge triplets (e.g., "Paper A → Blood Viscosity → Paper B").
- **Gradio Enhancements**:
  - Interactive bridge visualization (D3.js or Plotly).
  - User feedback loop for improving the model.
- **Automated Updates**: Embed new arXiv papers nightly (see the sketch after this list).
- **Domain-Specific Tools**:
  - Drug repurposing mode (e.g., "Find new uses for aspirin").
  - Interdisciplinary connection finder (e.g., "How does physics inform AI research?").
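
The nightly refresh could look roughly like this, assuming the `arxiv` Python package for fetching recent submissions and a FAISS index persisted to disk. The category query, file names, and result count are placeholders, and scheduling is out of scope.

```python
# Sketch of the planned nightly update; the query, file names, and result count
# are placeholders, and scheduling (cron, a Hub job, etc.) is not shown.
import arxiv
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomadicsynth/research-compass-arxiv-abstracts-embedding-model")
index = faiss.read_index("abstracts.faiss")  # hypothetical on-disk index

# Pull the most recently submitted papers (category and count are illustrative).
search = arxiv.Search(
    query="cat:cs.LG",
    max_results=200,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)
new_abstracts = [result.summary for result in arxiv.Client().results(search)]

# Embed the new abstracts and append them to the existing index, then persist it.
vectors = model.encode(new_abstracts, normalize_embeddings=True).astype("float32")
index.add(vectors)
faiss.write_index(index, "abstracts.faiss")
```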

---

## Why This Matters

Inkling is **not** a polished product—it’s a chaotic, ADHD-fueled experiment in democratizing scientific discovery. It’s for:

- Researchers drowning in paper overload.
- Interdisciplinary thinkers who thrive on unexpected connections.
- Anyone who’s ever thought, *"I could’ve thought of that!"* after a breakthrough.

As Sabine Hossenfelder put it: *"The future of research isn’t in doing more experiments—it’s in connecting the dots we already have."* (citation needed)

---
## Status

- **Model**: v1 (proof of concept; 50-50 whether it actually does anything or my brain is just playing tricks).
- **Dataset**: v1 (random pairs, too broad); v2 in planning, focused on bridge detection.
- **Interface**: Gradio-powered demo with FAISS backend.
- **Next Steps**: Refine training data, automate updates, and scale to all of arXiv.

---
## Credits

- **Inspiration**: Sabine Hossenfelder’s ["AIs Predict Research Results" video](https://www.youtube.com/watch?v=Qgrl3JSWWDE).
- **Foundational Work**: Don R. Swanson, V. Tshitoyan, X. Luo.
- **Model Architecture**: Llama-7B + SentenceTransformer.

---
## Try It

[**Live Demo**](https://nomadicsynth-research-compass.hf.space)

*Paste an abstract, find a bridge, and see if the connection feels obvious in hindsight.* 🚀

---

**This is a work in progress. Feedback, ideas, and nerd-sniped collaborators are welcome.**