# synthesist / app.py
import time

import streamlit as st
from fns import *  # expected to provide EmbeddingRetrievalSystem and load_dataset
st.set_page_config(
page_title="Synthesist",
page_icon="👋",
)
# st.write("# Welcome to Pathfinder! 👋")
st.image('local_files/synth_logo.png')
st.sidebar.success("Select a function above.")
st.sidebar.markdown("Current functions include visualizing papers in the arXiv embedding, searching for papers similar to an input paper or prompt phrase, and answering quick questions.")
st.markdown("")
st.markdown(
"""
**Synthesist** (from Peter Watts's [Blindsight](https://scalar.usc.edu/works/network-ecologies/on-peter-watts-blindsight)) is a framework for searching and visualizing papers on the [arXiv](https://arxiv.org/) using the context
sensitivity of modern large language models (LLMs) to better parse patterns in paper contexts.
This tool was built during the [JSALT workshop](https://www.clsp.jhu.edu/2024-jelinek-summer-workshop-on-speech-and-language-technology/) to explore LLM-based search and visualization of the astrophysics literature.
**👈 Select a tool from the sidebar** to see some examples
of what this framework can do!
### Tool summary:
- Please wait while the initial data loads and compiles; this takes about a minute.
- `Paper search` looks for relevant papers given an arXiv ID or a question.
This is not meant to be a replacement for existing tools like the
[ADS](https://ui.adsabs.harvard.edu/),
[arxivsorter](https://www.arxivsorter.org/), semantic search, or Google Scholar, but rather a supplement for finding papers
that might otherwise be missed during a literature survey.
It is trained on astro-ph.GA (astrophysics of galaxies) papers mined from arXiv up to roughly a year ago and supplemented with ADS metadata;
if you are interested in extending it, please reach out!
Planned additions include more pages, actual answer generation, separate toggles for retrieval and generation, a feedback form, social links, a literature page, contact information, copyright details, and collaboration info.
The image below shows a representation of all the astro-ph.GA papers that can be explored in more detail
using the `Arxiv embedding` page. The papers tend to cluster together by similarity, forming an
atlas that highlights well-studied regions (forests) and currently uncharted areas (water).
"""
)
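# The markdown above refers to an "image below" of the astro-ph.GA embedding atlas,
# but no corresponding display call appears in this file. A minimal sketch of how it
# could be shown is left commented out; the file path is a hypothetical placeholder,
# not part of the original codebase.
# st.image('local_files/arxiv_embedding_atlas.png')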
s = time.time()
st.markdown('Loading data for the retrieval system, please wait before jumping to one of the pages...')
st.session_state.retrieval_system = EmbeddingRetrievalSystem()
st.session_state.dataset = load_dataset('arxiv_corpus/', split = "train")
st.markdown(f'Loaded retrieval system, time taken: {time.time() - s:.1f} sec')
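
# Optional sketch (not part of the original app): the loads above run on every
# Streamlit rerun. Wrapping them in a function decorated with st.cache_resource
# keeps a single copy per server process, so the ~1 minute setup only happens once.
# EmbeddingRetrievalSystem and load_dataset are assumed to come from fns.
@st.cache_resource
def load_retrieval_resources():
    """Build the retrieval system and load the arXiv corpus once per process."""
    retrieval_system = EmbeddingRetrievalSystem()
    dataset = load_dataset('arxiv_corpus/', split="train")
    return retrieval_system, dataset

# Hypothetical usage, replacing the direct session_state assignments above:
# st.session_state.retrieval_system, st.session_state.dataset = load_retrieval_resources()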