---
title: Inkling
emoji: 🌐
colorFrom: indigo
colorTo: yellow
# python_version: 3.10
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: true
license: agpl-3.0
short_description: Use AI to find obvious research links in unexpected places.
datasets:
- nomadicsynth/arxiv-dataset-abstract-embeddings
models:
- nomadicsynth/research-compass-arxiv-abstracts-embedding-model
---
# Inkling: Bridging the Unconnected in Scientific Literature
![Inkling logo: a smiling cartoon squid with its human-like brain visible.](https://huggingface.co/spaces/nomadicsynth/inkling/resolve/main/inkling-logo.png)
**Inkling** is an experimental bridge-finding engine for scientific literature, built to uncover *latent connections* between research papers—relationships that are obvious in hindsight but buried under the sheer volume of modern research. It’s inspired by the work of **Don R. Swanson**, the visionary who discovered the link between *fish oil* and *Raynaud’s syndrome* using nothing but manual literature analysis. Today, we call this approach **Literature-Based Discovery** - and Inkling is our attempt to automate it with modern NLP.
---
## The Problem: Lost in the Literature
The scientific literature is growing exponentially, but human researchers can only read so much. As Sabine Hossenfelder explained in her 2024 YouTube video ["AIs Predict Research Results Without Doing Research"](https://www.youtube.com/watch?v=Qgrl3JSWWDE), even experts miss critical connections because no one has time to read everything. Swanson’s 1986 discovery of the fish oil–Raynaud’s link was a wake-up call: the knowledge existed in plain sight, but the papers were siloed. Inkling is our attempt to fix that.
---
## The Vision: A Bridge-Finding Machine
Inkling isn’t just a search engine. It’s a **hypothesis generator**. It learns to recognize *intermediate concepts* that connect seemingly unrelated papers—like Swanson’s "blood viscosity" bridge. The model is built to:
- **Find indirect links** between papers that don’t cite each other.
- **Surface connections** that feel obvious once explained but are buried in the noise.
- **Scale** to the entire arXiv corpus and beyond.
---
## How It Works
### Model Architecture
- **Base Model**: A `SentenceTransformer` using **Llama-7B** as its base (with frozen weights) and a dense embedding head.
- **Training**:
- v1: Trained on a synthetic dataset of randomly paired papers, rated for conceptual overlap.
- v2 (in progress): Focused on *bridge detection*, using prompts to explicitly identify intermediate concepts (e.g., "What connects these two papers?").
- **Embedding Strategy**:
- Dense vector representations of abstracts.
- FAISS for fast approximate nearest-neighbor search (a rough sketch of this setup follows below).
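
The authoritative implementation lives in `app.py` and the linked model repo. Purely as a non-authoritative sketch of the setup described above (the Llama identifier, sequence length, and 1024-dimensional head are assumptions, and a flat FAISS index stands in here for the approximate one), the pieces could be wired together like this:

```python
# Sketch only: frozen LLM backbone + dense embedding head as a SentenceTransformer,
# with a FAISS index over abstract embeddings. The model identifier, max length,
# and output dimension below are illustrative assumptions, not the project's values.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer, models

backbone = models.Transformer("meta-llama/Llama-2-7b-hf", max_seq_length=512)  # stand-in for "Llama-7B"
for param in backbone.auto_model.parameters():
    param.requires_grad = False  # freeze the base weights; only the dense head would train

pooling = models.Pooling(backbone.get_word_embedding_dimension(), pooling_mode="mean")
dense = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=1024,  # assumed embedding size
)
model = SentenceTransformer(modules=[backbone, pooling, dense])

# Embed abstracts and index them. A flat inner-product index is used for brevity;
# the Space describes approximate nearest-neighbor search.
abstracts = [
    "Dietary fish oil reduces blood viscosity in healthy volunteers ...",
    "Raynaud's syndrome severity correlates with peripheral blood flow ...",
]
embeddings = model.encode(abstracts, convert_to_numpy=True).astype(np.float32)
faiss.normalize_L2(embeddings)                   # normalize so inner product = cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Query with a new abstract and retrieve its nearest neighbours.
query = model.encode(["Cold-induced vasospasm and blood rheology ..."], convert_to_numpy=True).astype(np.float32)
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)
print(ids, scores)
```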
### Dataset Philosophy
- v1: Random paper pairs rated for generic "relevance" (too broad, limited bridge detection).
- v2: Focus on **explicit bridge extraction** using LLM-generated triplets (e.g., "Paper A → Bridge Concept → Paper B"); a rough data-format sketch follows below.
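
How the v2 triplets will be stored and consumed is still open. Purely as an illustration (the field names, the example record, and the choice of `MultipleNegativesRankingLoss` are assumptions, not the project's final schema), one record and the training wiring could look roughly like this:

```python
# Hypothetical v2 record and training wiring, for illustration only.
import json

from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# One LLM-generated bridge triplet: Paper A -> Bridge Concept -> Paper B.
record = json.loads("""
{
  "paper_a": "Dietary fish oil reduces blood viscosity in healthy volunteers ...",
  "bridge_concept": "blood viscosity",
  "paper_b": "Raynaud's syndrome severity correlates with elevated blood viscosity ...",
  "llm_rationale": "Both abstracts hinge on blood viscosity as the mediating variable."
}
""")

# Treat the two bridged abstracts as a positive pair; other pairs in the batch act
# as in-batch negatives. The bridge concept itself could also be paired with each abstract.
train_examples = [InputExample(texts=[record["paper_a"], record["paper_b"]])]

# Assumes the published embedding model loads as a SentenceTransformer.
model = SentenceTransformer("nomadicsynth/research-compass-arxiv-abstracts-embedding-model")
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```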
---
## The Inspiration
This project was born from a **nerd-sniping moment** after watching Sabine Hossenfelder’s video on AI’s ability to predict neuroscience results without experiments. That led to three key influences:
### 1. **Swanson’s "Undiscovered Public Knowledge"**
Swanson’s 1986 paper showed that the fish oil–Raynaud’s link existed in the literature for decades—it just took a human to connect the dots. Inkling automates this process.
### 2. **Tshitoyan et al. (2019): Word Embeddings in Materials Science**
Their work demonstrated that unsupervised embeddings could predict future material discoveries from latent knowledge. Inkling applies this idea to *conceptual bridges* in all scientific fields.
### 3. **Luo et al. (2024): LLMs Beat Human Experts**
This study showed that a 7B LLM (like Mistral) could outperform neuroscientists in predicting experimental outcomes. Inkling leverages this power to find connections even domain experts might miss.
---
## What It Can Do (and What’s Next)
### Current Capabilities
- Embed arXiv abstracts into dense vectors.
- Search for papers with conceptual overlap (50% relevance in top-10/25 queries, per manual testing).
- Visualize results in a Gradio interface with FAISS-powered speed (a stripped-down interface sketch follows this list).
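
For context, a bare-bones version of that interface could look like the sketch below. The stubbed `find_bridges` function is a placeholder standing in for the embedding-plus-FAISS lookup in the real `app.py`, not the Space's actual code.

```python
# Bare-bones Gradio sketch of the demo flow. `find_bridges` is stubbed so the
# example stays self-contained; the real app wires in the embedding model and
# the FAISS index over arXiv abstracts.
import gradio as gr

def find_bridges(abstract: str, top_k: int) -> list[list]:
    """Placeholder: the real function embeds the abstract, queries FAISS,
    and returns the nearest candidate papers with similarity scores."""
    return [["(candidate paper title)", 0.0] for _ in range(int(top_k))]

demo = gr.Interface(
    fn=find_bridges,
    inputs=[
        gr.Textbox(lines=8, label="Paste an abstract"),
        gr.Slider(1, 25, value=10, step=1, label="Results to show"),
    ],
    outputs=gr.Dataframe(headers=["Candidate paper", "Similarity"], label="Possible bridges"),
    title="Inkling",
    description="Use AI to find obvious research links in unexpected places.",
)

if __name__ == "__main__":
    demo.launch()
```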
### Roadmap
- **v2**: Train on LLM-generated bridge triplets (e.g., "Paper A → Blood Viscosity → Paper B").
- **Gradio Enhancements**:
- Interactive bridge visualization (D3.js or Plotly).
- User feedback loop for improving the model.
- **Automated Updates**: Embed new arXiv papers nightly.
- **Domain-Specific Tools**:
- Drug repurposing mode (e.g., "Find new uses for aspirin").
- Interdisciplinary connection finder (e.g., "How does physics inform AI research?").
---
## Why This Matters
Inkling is **not** a polished product—it’s a chaotic, ADHD-fueled experiment in democratizing scientific discovery. It’s for:
- Researchers drowning in paper overload.
- Interdisciplinary thinkers who thrive on unexpected connections.
- Anyone who’s ever thought, *"I could’ve thought of that!"* after a breakthrough.
As Sabine Hossenfelder put it: *"The future of research isn’t in doing more experiments—it’s in connecting the dots we already have."* *(citation needed)*
---
## Status
- **Model**: v1 (proof of concept; 50-50 whether it actually does anything or my brain is just playing tricks).
- **Dataset**: v1 (random pairs, too broad). v2 (in planning, focused on bridge detection).
- **Interface**: Gradio-powered demo with FAISS backend.
- **Next Steps**: Refine training data, automate updates, and scale to all of arXiv.
---
## Credits
- **Inspiration**: Sabine Hossenfelder’s ["AIs Predict Research Results" video](https://www.youtube.com/watch?v=Qgrl3JSWWDE).
- **Foundational Work**: Don R. Swanson, V. Tshitoyan, X. Luo.
- **Model Architecture**: Llama-7B + SentenceTransformer.
---
## Try It
[**Live Demo**](https://nomadicsynth-research-compass.hf.space)
*Paste an abstract, find a bridge, and see if the connection feels obvious in hindsight.* 🚀
---
**This is a work in progress. Feedback, ideas, and nerd-sniped collaborators are welcome.**