Commit c138584 · Parent: 63510a9
Revise README to enhance clarity and detail about Inkling's purpose and functionality

README.md CHANGED
@@ -16,54 +16,122 @@ models:
- nomadicsynth/research-compass-arxiv-abstracts-embedding-model
---

# Inkling: Bridging the Unconnected in Scientific Literature

![Inkling banner](inkling-banner.png)

**Inkling** is an experimental bridge-finding engine for scientific literature, built to uncover *latent connections* between research papers—relationships that are obvious in hindsight but buried under the sheer volume of modern research. It’s inspired by the work of **Don R. Swanson**, the visionary who discovered the link between *fish oil* and *Raynaud’s syndrome* using nothing but manual literature analysis. Connections like that are what Swanson called **undiscovered public knowledge**, and Inkling is our attempt to automate finding them with modern NLP.

---

## The Problem: Lost in the Literature

The scientific literature is growing exponentially, but human researchers can only read so much. As Sabine Hossenfelder explained in her 2024 YouTube video ["AIs Predict Research Results Without Doing Research"](https://www.youtube.com/watch?v=Qgrl3JSWWDE), even experts miss critical connections because no one has time to read everything. Swanson’s 1986 discovery of the fish oil–Raynaud’s link was a wake-up call: the knowledge existed in plain sight, but the papers were siloed. Inkling is our attempt to fix that.

---

## The Vision: A Bridge-Finding Machine

Inkling isn’t just a search engine. It’s a **hypothesis generator**. It learns to recognize *intermediate concepts* that connect seemingly unrelated papers—like Swanson’s "blood viscosity" bridge. The model is built to:

- **Find indirect links** between papers that don’t cite each other.
- **Surface connections** that feel obvious once explained but are buried in the noise.
- **Scale** to the entire arXiv corpus and beyond.

---

## How It Works

### Model Architecture

- **Base Model**: A `SentenceTransformer` using **Llama-7B** as its base (with frozen weights) and a dense embedding head (sketched after this list).
- **Training**:
  - v1: Trained on a synthetic dataset of randomly paired papers, rated for conceptual overlap.
  - v2 (in progress): Focused on *bridge detection*, using prompts to explicitly identify intermediate concepts (e.g., "What connects these two papers?").
- **Embedding Strategy**:
  - Dense vector representations of abstracts.
  - FAISS for fast approximate nearest-neighbor search.
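
For the curious, here is a minimal sketch of how a model like the one described above can be assembled with `sentence-transformers`. The checkpoint name, the mean-pooling choice, and the 1024-dimensional output are placeholders for illustration, not the exact configuration behind the published model.

```python
# Rough sketch of the described architecture: a frozen decoder-only base model,
# mean pooling over token embeddings, and a trainable dense projection head.
# The checkpoint name and the 1024-dim output are assumptions for illustration.
from torch import nn
from sentence_transformers import SentenceTransformer, models

base = models.Transformer("meta-llama/Llama-2-7b-hf", max_seq_length=512)

# Freeze the base LLM so only the dense head is trainable.
for param in base.auto_model.parameters():
    param.requires_grad = False

pooling = models.Pooling(base.get_word_embedding_dimension(), pooling_mode="mean")
head = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=1024,
    activation_function=nn.Tanh(),
)

model = SentenceTransformer(modules=[base, pooling, head])
embeddings = model.encode(["An arXiv abstract goes here ..."], normalize_embeddings=True)
print(embeddings.shape)  # (1, 1024)
```

With the base frozen, only the small dense head carries trainable parameters, which keeps fine-tuning lightweight.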

### Dataset Philosophy

- v1: Random paper pairs rated for generic "relevance" (too broad, limited bridge detection).
- v2: Focus on **explicit bridge extraction** using LLM-generated triplets (e.g., "Paper A → Bridge Concept → Paper B"); an example record is shown below.
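
To make the v2 idea concrete, here is what a single bridge-triplet record might look like and one way it could be fed to training. The field names, the Swanson-style example text, and the 0–1 label are hypothetical; the real dataset schema is still being designed.

```python
# Hypothetical v2 "bridge triplet" record. Field names and the labelling scheme
# are assumptions for illustration, not the actual dataset format.
bridge_record = {
    "paper_a": "Dietary fish oil reduces whole-blood viscosity in healthy subjects ...",
    "paper_b": "Raynaud's syndrome is associated with elevated blood viscosity ...",
    "bridge_concept": "blood viscosity",
    "rationale": "Both abstracts hinge on blood viscosity, linking fish oil intake "
                 "to potential symptom relief in Raynaud's syndrome.",
}

# One way to feed such records to SentenceTransformer training: treat the two
# bridged abstracts as a high-similarity pair (the 0-1 score is an assumption).
from sentence_transformers import InputExample

train_example = InputExample(
    texts=[bridge_record["paper_a"], bridge_record["paper_b"]],
    label=0.9,
)
```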
---

## The Inspiration

This project was born from a **nerd-sniping moment** after watching Sabine Hossenfelder’s video on AI’s ability to predict neuroscience results without experiments. That led to three key influences:

### 1. **Swanson’s "Undiscovered Public Knowledge"**

Swanson’s 1986 paper showed that the fish oil–Raynaud’s link existed in the literature for decades—it just took a human to connect the dots. Inkling automates this process.

### 2. **Tshitoyan et al. (2019): Word Embeddings in Materials Science**

Their work demonstrated that unsupervised embeddings could predict future material discoveries from latent knowledge. Inkling applies this idea to *conceptual bridges* in all scientific fields.

### 3. **Luo et al. (2024): LLMs Beat Human Experts**

This study showed that a 7B LLM (like Mistral) could outperform neuroscientists in predicting experimental outcomes. Inkling leverages this power to find connections even domain experts might miss.

---

## What It Can Do (and What’s Next)

### Current Capabilities

- Embed arXiv abstracts into dense vectors.
- Search for papers with conceptual overlap (about 50% relevance in the top 10–25 results, per manual testing); the flow is sketched after this list.
- Visualize results in a Gradio interface with FAISS-powered speed.
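
Here is a minimal sketch of that embed-and-search flow, assuming the published embedding model loads as a standard `SentenceTransformer`. The exact flat FAISS index below is a stand-in for whatever index the Space actually runs.

```python
# Sketch of the embed-then-search flow: encode abstracts, index them with FAISS,
# and retrieve nearest neighbours for a query abstract.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomadicsynth/research-compass-arxiv-abstracts-embedding-model")

abstracts = [
    "Dietary fish oil lowers whole-blood viscosity in healthy volunteers ...",
    "Raynaud's phenomenon is associated with elevated blood viscosity ...",
    # ... the rest of the abstract corpus
]

# L2-normalised vectors make inner product equivalent to cosine similarity.
corpus_emb = model.encode(abstracts, normalize_embeddings=True).astype(np.float32)

index = faiss.IndexFlatIP(corpus_emb.shape[1])  # exact inner-product search
index.add(corpus_emb)

query_emb = model.encode(
    ["Paste a query abstract here ..."], normalize_embeddings=True
).astype(np.float32)

scores, ids = index.search(query_emb, k=10)
for rank, (score, idx) in enumerate(zip(scores[0], ids[0]), start=1):
    print(f"{rank:2d}. {score:.3f}  {abstracts[idx][:80]}")
```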

### Roadmap

- **v2**: Train on LLM-generated bridge triplets (e.g., "Paper A → Blood Viscosity → Paper B").
- **Gradio Enhancements**:
  - Interactive bridge visualization (D3.js or Plotly).
  - User feedback loop for improving the model.
- **Automated Updates**: Embed new arXiv papers nightly (see the sketch after this list).
- **Domain-Specific Tools**:
  - Drug repurposing mode (e.g., "Find new uses for aspirin").
  - Interdisciplinary connection finder (e.g., "How does physics inform AI research?").
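
One possible shape for the planned nightly update job, sketched with the community `arxiv` client. The package choice, category filter, and on-disk index path are assumptions, not a settled design.

```python
# Hypothetical nightly job: fetch recent arXiv abstracts, embed them, and append
# them to the FAISS index. Category, result count, and file paths are placeholders.
import arxiv
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomadicsynth/research-compass-arxiv-abstracts-embedding-model")
index = faiss.read_index("abstracts.faiss")  # assumed on-disk index

search = arxiv.Search(
    query="cat:cs.LG",  # placeholder category; a real job would cover far more
    max_results=200,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

new_abstracts = [result.summary for result in arxiv.Client().results(search)]
if new_abstracts:
    emb = model.encode(new_abstracts, normalize_embeddings=True).astype(np.float32)
    index.add(emb)
    faiss.write_index(index, "abstracts.faiss")
```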
---

## Why This Matters

Inkling is **not** a polished product—it’s a chaotic, ADHD-fueled experiment in democratizing scientific discovery. It’s for:

- Researchers drowning in paper overload.
- Interdisciplinary thinkers who thrive on unexpected connections.
- Anyone who’s ever thought, *"I could’ve thought of that!"* after a breakthrough.

As Sabine Hossenfelder put it: *"The future of research isn’t in doing more experiments—it’s in connecting the dots we already have."* (citation needed)

---

## Status

- **Model**: v1 (proof of concept; 50-50 whether it actually does anything or my brain is just playing tricks on me).
- **Dataset**: v1 (random pairs, too broad). v2 (in planning, focused on bridge detection).
- **Interface**: Gradio-powered demo with FAISS backend.
- **Next Steps**: Refine training data, automate updates, and scale to all of arXiv.

---

## Credits

- **Inspiration**: Sabine Hossenfelder’s ["AIs Predict Research Results" video](https://www.youtube.com/watch?v=Qgrl3JSWWDE).
- **Foundational Work**: Don R. Swanson, V. Tshitoyan, X. Luo.
- **Model Architecture**: Llama-7B + SentenceTransformer.

---

## Try It

[**Live Demo**](https://nomadicsynth-research-compass.hf.space)

*Paste an abstract, find a bridge, and see if the connection feels obvious in hindsight.* 🚀

---

**This is a work in progress. Feedback, ideas, and nerd-sniped collaborators are welcome.**