nomadicsynth committed on
Commit c138584 · 1 Parent(s): 63510a9

Revise README to enhance clarity and detail about Inkling's purpose and functionality

Files changed (1):
  1. README.md +91 -23

README.md CHANGED
@@ -16,54 +16,122 @@ models:
  - nomadicsynth/research-compass-arxiv-abstracts-embedding-model
  ---

- # Inkling: AI-assisted research discovery

- ![Inkling](https://huggingface.co/spaces/nomadicsynth/inkling/resolve/main/inkling-logo.png)

- [**Inkling**](https://nomadicsynth-research-compass.hf.space) is an AI-assisted tool that helps you discover meaningful connections between research papers — the kind of links a domain expert might spot, if they had time to read everything.

- Rather than relying on superficial similarity or shared keywords, Inkling is trained to recognize **reasoning-based relationships** between papers. It evaluates conceptual, methodological, and application-level connections — even across disciplines — and surfaces links that may be overlooked due to the sheer scale of the research landscape.

- This demo uses the first prototype of the model, trained on a dataset of **10,000+ rated abstract pairs**, built from a larger pool of arXiv triplets. The system will continue to improve with feedback and will be released alongside the dataset for public research.

  ---

- ## What it does

- - Accepts a research abstract, idea, or question
- - Searches for papers with **deep, contextual relevance**
- - Highlights key conceptual links and application overlaps
- - Offers reasoning-based analysis between selected papers
- - Gathers user feedback to improve the model over time

  ---

- ## Background and Motivation

- Scientific progress often depends on connecting ideas across papers, fields, and years of literature. But with the volume of research growing exponentially, it's increasingly difficult for any one person — or even a team — to stay on top of it all. As a result, valuable connections between papers often go unnoticed simply because the right expert never read both.

- In 2024, Luo et al. published a landmark study in *Nature Human Behaviour* showing that **large language models (LLMs) can outperform human experts** in predicting the results of neuroscience experiments by integrating knowledge across the scientific literature. Their model, **BrainGPT**, demonstrated how tuning a general-purpose LLM (like Mistral-7B) on domain-specific data could synthesize insights that surpass human forecasting ability. Notably, the authors found that models as small as 7B parameters performed well — an insight that influenced the foundation for this project.

- Inspired by this work — and a YouTube breakdown by physicist and science communicator **Sabine Hossenfelder**, titled *["AIs Predict Research Results Without Doing Research"](https://www.youtube.com/watch?v=Qgrl3JSWWDE)* — this project began as an attempt to explore similar methods of knowledge integration at the level of paper-pair relationships. Her clear explanation and commentary sparked the idea to apply this paradigm not just to forecasting outcomes, but to identifying latent connections between published studies.

- Originally conceived as a perplexity-ranking experiment using LLMs directly (mirroring Luo et al.'s evaluation method), the project gradually evolved into what it is now — **Inkling**, a reasoning-aware embedding model fine-tuned on LLM-rated abstract pairings, and built to help researchers uncover links that would be obvious — *if only someone had the time to read everything*.

  ---

- ## Why Inkling?

- > Because the right connection is often obvious — once someone points it out.

- Researchers today are overwhelmed by volume. Inkling helps restore those missed-but-meaningful links between ideas, methods, and fields: links that could inspire new directions, clarify existing work, or enable cross-pollination across domains.

  ---

- ## Citation

- > Luo, X., Rechardt, A., Sun, G. et al. Large language models surpass human experts in predicting neuroscience results. *Nat Hum Behav* **9**, 305–315 (2025). [https://www.nature.com/articles/s41562-024-02046-9](https://www.nature.com/articles/s41562-024-02046-9)

  ---

- ## Status

- Inkling is in **alpha** and under active development. The current model is hosted via Gradio, with a Hugging Face Space available for live interaction and feedback. Contributions, feedback, and collaboration are welcome.

  - nomadicsynth/research-compass-arxiv-abstracts-embedding-model
  ---

+ # Inkling: Bridging the Unconnected in Scientific Literature
+
+ ![Inkling](https://huggingface.co/spaces/nomadicsynth/inkling/resolve/main/inkling-logo.png)
+
+ **Inkling** is an experimental bridge-finding engine for scientific literature, built to uncover *latent connections* between research papers—relationships that are obvious in hindsight but buried under the sheer volume of modern research. It’s inspired by the work of **Don R. Swanson**, the visionary who discovered the link between *fish oil* and *Raynaud’s syndrome* using nothing but manual literature analysis. Today, we call this approach **undiscovered public knowledge**—and Inkling is our attempt to automate it with modern NLP.
+
+ ---
+
+ ## The Problem: Lost in the Literature
+
+ The scientific literature is growing exponentially, but human researchers can only read so much. As Sabine Hossenfelder explained in her 2024 YouTube video ["AIs Predict Research Results Without Doing Research"](https://www.youtube.com/watch?v=Qgrl3JSWWDE), even experts miss critical connections because no one has time to read everything. Swanson’s 1986 discovery of the fish oil–Raynaud’s link was a wake-up call: the knowledge existed in plain sight, but the papers were siloed. Inkling is our attempt to fix that.
+
+ ---
+
+ ## The Vision: A Bridge-Finding Machine
+
+ Inkling isn’t just a search engine. It’s a **hypothesis generator**. It learns to recognize *intermediate concepts* that connect seemingly unrelated papers—like Swanson’s "blood viscosity" bridge. The model is built to:
+
+ - **Find indirect links** between papers that don’t cite each other.
+ - **Surface connections** that feel obvious once explained but are buried in the noise.
+ - **Scale** to the entire arXiv corpus and beyond.
+
+ ---
+
+ ## How It Works
+
+ ### Model Architecture
+
+ - **Base Model**: A `SentenceTransformer` using **Llama-7B** as its base (with frozen weights) and a dense embedding head.
+ - **Training**:
+   - v1: Trained on a synthetic dataset of randomly paired papers, rated for conceptual overlap.
+   - v2 (in progress): Focused on *bridge detection*, using prompts to explicitly identify intermediate concepts (e.g., "What connects these two papers?").
+ - **Embedding Strategy**:
+   - Dense vector representations of abstracts.
+   - FAISS for fast approximate nearest-neighbor search.
+
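For readers who want to picture the frozen-backbone-plus-trainable-head setup, here is a minimal toy sketch in PyTorch. The small `nn.Linear` "backbone" is a stand-in for Llama-7B (the real model lives inside a `SentenceTransformer`); the class name and all dimensions are illustrative assumptions, not the production config.

```python
import torch
import torch.nn as nn

# Toy illustration of the architecture above: a frozen backbone plus a
# trainable dense embedding head. The nn.Linear "backbone" is a stand-in
# for Llama-7B; all sizes here are illustrative, not the real config.
class FrozenBackboneEmbedder(nn.Module):
    def __init__(self, token_dim=768, backbone_dim=4096, embed_dim=1024):
        super().__init__()
        self.backbone = nn.Linear(token_dim, backbone_dim)  # stand-in for the LLM
        for p in self.backbone.parameters():                # frozen, per the README
            p.requires_grad = False
        self.head = nn.Linear(backbone_dim, embed_dim)      # trainable dense head

    def forward(self, token_states):                        # (batch, tokens, token_dim)
        hidden = self.backbone(token_states)
        pooled = hidden.mean(dim=1)                         # mean-pool over tokens
        return self.head(pooled)                            # (batch, embed_dim)

model = FrozenBackboneEmbedder()
vectors = model(torch.randn(2, 16, 768))  # two "abstracts", 16 tokens each
```

Freezing the backbone keeps training cheap: only the head's parameters receive gradients, which is why a 7B base is feasible on modest hardware.
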
+ ### Dataset Philosophy
+
+ - v1: Random paper pairs rated for generic "relevance" (too broad, limited bridge detection).
+ - v2: Focus on **explicit bridge extraction** using LLM-generated triplets (e.g., "Paper A → Bridge Concept → Paper B").
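
To make the v2 idea concrete, here is a hypothetical sketch of one bridge-triplet record and how it could be flattened into a pair for contrastive training. The field names, threshold, and helper function are assumptions for illustration; they are not the released dataset schema.

```python
# Hypothetical shape of one v2 "bridge triplet" record, echoing Swanson's
# fish oil -> blood viscosity -> Raynaud's example. Field names are
# illustrative assumptions, not the actual dataset schema.
bridge_record = {
    "paper_a": "Dietary fish oil lowers blood viscosity in healthy subjects...",
    "bridge_concept": "blood viscosity",
    "paper_b": "Raynaud's syndrome patients show elevated blood viscosity...",
    "rating": 4,  # LLM-judged strength of the A -> bridge -> B link
}

def to_contrastive_pair(record, min_rating=3):
    """Keep strong bridges and flatten them into (anchor, positive) pairs,
    the input format expected by losses like MultipleNegativesRankingLoss."""
    if record["rating"] < min_rating:
        return None
    return (record["paper_a"], record["paper_b"])

pair = to_contrastive_pair(bridge_record)
```

Filtering on the LLM rating is one way to keep only pairs where the bridge is plausibly real rather than noise.
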

  ---

+ ## The Inspiration
+
+ This project was born from a **nerd-sniping moment** after watching Sabine Hossenfelder’s video on AI’s ability to predict neuroscience results without experiments. That led to three key influences:
+
+ ### 1. **Swanson’s "Undiscovered Public Knowledge"**
+
+ Swanson’s 1986 paper showed that the fish oil–Raynaud’s link existed in the literature for decades—it just took a human to connect the dots. Inkling automates this process.
+
+ ### 2. **Tshitoyan et al. (2019): Word Embeddings in Materials Science**
+
+ Their work demonstrated that unsupervised embeddings could predict future material discoveries from latent knowledge. Inkling applies this idea to *conceptual bridges* in all scientific fields.
+
+ ### 3. **Luo et al. (2024): LLMs Beat Human Experts**
+
+ This study showed that a 7B LLM (like Mistral) could outperform neuroscientists in predicting experimental outcomes. Inkling leverages this power to find connections even domain experts might miss.
+
  ---

+ ## What It Can Do (and What’s Next)
+
+ ### Current Capabilities
+
+ - Embed arXiv abstracts into dense vectors.
+ - Search for papers with conceptual overlap (roughly 50% of results relevant in the top 10–25, per manual testing).
+ - Visualize results in a Gradio interface with FAISS-powered speed.
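
The search step above boils down to a nearest-neighbor lookup in embedding space. The live Space uses FAISS to do this approximately at arXiv scale; the NumPy version below shows the same cosine-similarity logic in a self-contained, brute-force form.

```python
import numpy as np

def top_k(query_vec, corpus_vecs, k=10):
    """Rank corpus embeddings by cosine similarity to a query embedding.
    FAISS does this approximately at scale; this is the exact brute-force form."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                     # cosine similarity per corpus vector
    order = np.argsort(-scores)[:k]    # indices of the k highest scores
    return order, scores[order]

# Tiny demo: three fake abstract embeddings; the query points closest to index 2
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, sims = top_k(np.array([0.6, 0.8]), corpus, k=2)
```

Because the vectors are normalized first, an inner-product index (as FAISS provides) gives the same ranking.
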

+ ### Roadmap
+
+ - **v2**: Train on LLM-generated bridge triplets (e.g., "Paper A → Blood Viscosity → Paper B").
+ - **Gradio Enhancements**:
+   - Interactive bridge visualization (D3.js or Plotly).
+   - User feedback loop for improving the model.
+ - **Automated Updates**: Embed new arXiv papers nightly.
+ - **Domain-Specific Tools**:
+   - Drug repurposing mode (e.g., "Find new uses for aspirin").
+   - Interdisciplinary connection finder (e.g., "How does physics inform AI research?").
+
  ---

+ ## Why This Matters
+
+ Inkling is **not** a polished product—it’s a chaotic, ADHD-fueled experiment in democratizing scientific discovery. It’s for:
+
+ - Researchers drowning in paper overload.
+ - Interdisciplinary thinkers who thrive on unexpected connections.
+ - Anyone who’s ever thought, *"I could’ve thought of that!"* after a breakthrough.
+
+ As Sabine Hossenfelder put it: *"The future of research isn’t in doing more experiments; it’s in connecting the dots we already have."* (citation needed)
+
  ---

+ ## Status
+
+ - **Model**: v1 (proof of concept; 50-50 whether it actually works or my brain is just playing tricks).
+ - **Dataset**: v1 (random pairs, too broad); v2 in planning, focused on bridge detection.
+ - **Interface**: Gradio-powered demo with FAISS backend.
+ - **Next Steps**: Refine training data, automate updates, and scale to all of arXiv.
+
  ---

+ ## Credits
+
+ - **Inspiration**: Sabine Hossenfelder’s ["AIs Predict Research Results" video](https://www.youtube.com/watch?v=Qgrl3JSWWDE).
+ - **Foundational Work**: Don R. Swanson, V. Tshitoyan, X. Luo.
+ - **Model Architecture**: Llama-7B + SentenceTransformer.
+
+ ---
+
+ ## Try It
+
+ [**Live Demo**](https://nomadicsynth-research-compass.hf.space)
+ *Paste an abstract, find a bridge, and see if the connection feels obvious in hindsight.* 🚀
+
+ ---
+
+ **This is a work in progress. Feedback, ideas, and nerd-sniped collaborators are welcome.**