kartheikiyer committed on
Commit
a8ea5a3
·
1 Parent(s): 88c92ac

fixed a missing prompt

.ipynb_checkpoints/app_gradio-checkpoint.py ADDED
@@ -0,0 +1,684 @@
1
+ import gradio as gr
2
+ import numpy as np
3
+ from abc import ABC, abstractmethod
4
+ from typing import List, Dict, Any, Tuple
5
+ from collections import defaultdict
6
+ import pandas as pd
7
+ from datetime import datetime, date
8
+ from datasets import load_dataset, load_from_disk
9
+ from collections import Counter
10
+
11
+ import yaml, json, requests, sys, os, time
12
+ import urllib.parse
13
+ import concurrent.futures
14
+
15
+ from langchain import hub
16
+ from langchain_openai import ChatOpenAI as openai_llm
17
+ from langchain_openai import OpenAIEmbeddings
18
+ from langchain_core.runnables import RunnableConfig, RunnablePassthrough, RunnableParallel
19
+ from langchain_core.prompts import PromptTemplate
20
+ from langchain_community.callbacks import StreamlitCallbackHandler
21
+ from langchain_community.utilities import DuckDuckGoSearchAPIWrapper
22
+ from langchain_community.vectorstores import Chroma
23
+ from langchain_community.document_loaders import TextLoader
24
+ from langchain.agents import create_react_agent, Tool, AgentExecutor
25
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
26
+ from langchain_core.output_parsers import StrOutputParser
27
+ from langchain.callbacks import FileCallbackHandler
28
+ from langchain.callbacks.manager import CallbackManager
29
+ from langchain.schema import Document
30
+
31
+ import instructor
32
+ from pydantic import BaseModel, Field
33
+ from typing import List, Literal
34
+
35
+ from nltk.corpus import stopwords
36
+ import nltk
37
+ from openai import OpenAI, moderations
38
+ # import anthropic
39
+ import cohere
40
+ import faiss
41
+ import matplotlib.pyplot as plt
42
+ import spacy
43
+ from string import punctuation
44
+ import pytextrank
45
+ from prompts import *
46
+
47
+ openai_key = os.environ['openai_key']
48
+ cohere_key = os.environ['cohere_key']
49
+ os.environ["OPENAI_API_KEY"] = os.environ['openai_key']
50
+
51
+ def load_nlp():
52
+ nlp = spacy.load("en_core_web_sm")
53
+ nlp.add_pipe("textrank")
54
+ try:
55
+ stopwords.words('english')
56
+ except LookupError: # stopwords corpus not downloaded yet
57
+ nltk.download('stopwords')
58
+ stopwords.words('english')
59
+ return nlp
60
+
61
+ gen_llm = openai_llm(temperature=0, model_name='gpt-4o-mini', openai_api_key = openai_key)
62
+ consensus_client = instructor.patch(OpenAI(api_key=openai_key))
63
+ embed_client = OpenAI(api_key = openai_key)
64
+ embed_model = "text-embedding-3-small"
65
+ embeddings = OpenAIEmbeddings(model = embed_model, api_key = openai_key)
66
+ nlp = load_nlp()
67
+
68
+ def check_mod(query):
69
+ mod_report = moderations.create(input=query)
70
+ for i in mod_report.results[0].categories:
71
+ if i[1]: # iterating the Categories model yields (name, flag) pairs
72
+ return True
73
+ return False
74
+
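A quick usage sketch (the query is illustrative and the expected output is an assumption): `check_mod` flags a query as soon as any moderation category fires, which is why the loop tests the flag in each `(name, flag)` pair.

```python
# Sketch, assuming the OpenAI moderation endpoint is reachable:
# a benign astronomy query should come back unflagged.
flagged = check_mod("how do galaxies quench their star formation?")
print(flagged)  # expected: False
```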
75
+ def get_keywords(text, nlp=nlp):
76
+ result = []
77
+ pos_tag = ['PROPN', 'ADJ', 'NOUN']
78
+ doc = nlp(text.lower())
79
+ for token in doc:
80
+ if(token.text in nlp.Defaults.stop_words or token.text in punctuation):
81
+ continue
82
+ if(token.pos_ in pos_tag):
83
+ result.append(token.text)
84
+ return result
85
+
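For illustration, a minimal sketch of what the extractor above produces (the output shown is an assumption, not captured output):

```python
# Keeps proper nouns, adjectives and nouns; drops stop words and punctuation.
kws = get_keywords("What drives quenching in massive elliptical galaxies?")
# e.g. ['quenching', 'massive', 'elliptical', 'galaxies']
```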
86
+ def load_arxiv_corpus():
87
+ # arxiv_corpus = load_from_disk('data/')
88
+ # arxiv_corpus.load_faiss_index('embed', 'data/astrophindex.faiss')
89
+
90
+ # keeping it up to date with the dataset
91
+ arxiv_corpus = load_dataset('kiyer/pathfinder_arxiv_data', split='train')
92
+ arxiv_corpus.add_faiss_index(column='embed')
93
+ print('loading arxiv corpus from disk')
94
+ return arxiv_corpus
95
+
96
+ class RetrievalSystem():
97
+
98
+ def __init__(self):
99
+
100
+ self.dataset = arxiv_corpus
101
+ self.client = OpenAI(api_key = openai_key)
102
+ self.embed_model = "text-embedding-3-small"
103
+ self.generation_client = openai_llm(temperature=0,model_name='gpt-4o-mini', openai_api_key = openai_key)
104
+ self.hyde_client = openai_llm(temperature=0.5,model_name='gpt-4o-mini', openai_api_key = openai_key)
105
+ self.cohere_client = cohere.Client(cohere_key)
106
+
107
+ def make_embedding(self, text):
108
+ str_embed = self.client.embeddings.create(input = [text], model = self.embed_model).data[0].embedding
109
+ return str_embed
110
+
111
+ def embed_batch(self, texts: List[str]) -> List[np.ndarray]:
112
+ embeddings = self.client.embeddings.create(input=texts, model=self.embed_model).data
113
+ return [np.array(embedding.embedding, dtype=np.float32) for embedding in embeddings]
114
+
115
+ def get_query_embedding(self, query):
116
+ return self.make_embedding(query)
117
+
118
+ def calc_faiss(self, query_embedding, top_k = 100):
119
+ # xq = query_embedding.reshape(-1,1).T.astype('float32')
120
+ # D, I = self.index.search(xq, top_k)
121
+ # return I[0], D[0]
122
+ tmp = self.dataset.search('embed', query_embedding, k=top_k)
123
+ return [tmp.indices, tmp.scores, self.dataset[tmp.indices]]
124
+
125
+ def rank_and_filter(self, query, query_embedding, top_k = 10, top_k_internal = 1000, return_scores=False):
126
+
127
+ if 'Keywords' in self.toggles:
128
+ self.weight_keywords = True
129
+ else:
130
+ self.weight_keywords = False
131
+
132
+ if 'Time' in self.toggles:
133
+ self.weight_date = True
134
+ else:
135
+ self.weight_date = False
136
+
137
+ if 'Citations' in self.toggles:
138
+ self.weight_citation = True
139
+ else:
140
+ self.weight_citation = False
141
+
142
+ topk_indices, similarities, small_corpus = self.calc_faiss(np.array(query_embedding), top_k = top_k_internal)
143
+ similarities = 1/similarities # converting from a distance (less is better) to a similarity (more is better)
144
+
145
+ if self.weight_keywords == True:
146
+
147
+ query_kws = get_keywords(query)
148
+ input_kws = self.query_input_keywords
149
+ query_kws = query_kws + input_kws
150
+ self.query_kws = query_kws
151
+ sub_kws = [small_corpus['keywords'][i] for i in range(top_k_internal)]
152
+ kw_weight = np.zeros((len(topk_indices),)) + 0.1
153
+
154
+ for k in query_kws:
155
+ for i in (range(len(topk_indices))):
156
+ for j in range(len(sub_kws[i])):
157
+ if k.lower() in sub_kws[i][j].lower():
158
+ kw_weight[i] = kw_weight[i] + 0.1
159
+ # print(i, k, sub_kws[i][j])
160
+
161
+ # kw_weight = kw_weight**0.36 / np.amax(kw_weight**0.36)
162
+ kw_weight = kw_weight / np.amax(kw_weight)
163
+ else:
164
+ kw_weight = np.ones((len(topk_indices),))
165
+
166
+ if self.weight_date == True:
167
+ sub_dates = [small_corpus['date'][i] for i in range(top_k_internal)]
168
+ date = datetime.now().date()
169
+ date_diff = np.array([((date - i).days / 365.) for i in sub_dates])
170
+ # age_weight = (1 + np.exp(date_diff/2.1))**(-1) + 0.5
171
+ age_weight = (1 + np.exp(date_diff/0.7))**(-1)
172
+ age_weight = age_weight / np.amax(age_weight)
173
+ else:
174
+ age_weight = np.ones((len(topk_indices),))
175
+
176
+ if self.weight_citation == True:
177
+ # st.write('weighting by citations')
178
+ sub_cites = np.array([small_corpus['cites'][i] for i in range(top_k_internal)])
179
+ temp = sub_cites.copy()
180
+ temp[sub_cites > 300] = 300.
181
+ cite_weight = (1 + np.exp((300-temp)/42.0))**(-1.)
182
+ cite_weight = cite_weight / np.amax(cite_weight)
183
+ else:
184
+ cite_weight = np.ones((len(topk_indices),))
185
+
186
+ similarities = similarities * (kw_weight) * (age_weight) * (cite_weight)
187
+
188
+ filtered_results = [[topk_indices[i], similarities[i]] for i in range(len(similarities))]
189
+ top_results = sorted(filtered_results, key=lambda x: x[1], reverse=True)[:top_k]
190
+
191
+ top_scores = [doc[1] for doc in top_results]
192
+ top_indices = [doc[0] for doc in top_results]
193
+ small_df = self.dataset[top_indices]
194
+
195
+ if return_scores:
196
+ return {doc[0]: doc[1] for doc in top_results}, small_df
197
+
198
+ # Only keep the document IDs
199
+ top_results = [doc[0] for doc in top_results]
200
+ return top_results, small_df
201
+
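To make the weighting scheme in `rank_and_filter` concrete, here is a small sketch with toy numbers (not drawn from the corpus) showing how the logistic age and citation weights behave before they multiply the FAISS similarity:

```python
import numpy as np

date_diff = np.array([1.0, 10.0])                 # paper ages in years
age_weight = (1 + np.exp(date_diff / 0.7)) ** -1  # ~0.19 vs ~1e-6: newer wins
age_weight = age_weight / np.amax(age_weight)     # max-normalized, as above

cites = np.minimum(np.array([10.0, 300.0]), 300.0)      # citations capped at 300
cite_weight = (1 + np.exp((300 - cites) / 42.0)) ** -1  # ~0.001 vs 0.5
cite_weight = cite_weight / np.amax(cite_weight)

# rank_and_filter then sorts by: similarity * kw_weight * age_weight * cite_weight
```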
202
+ def generate_doc(self, query: str):
203
+ prompt = """You are an expert astronomer. Given a scientific query, generate the abstract of an expert-level research paper
204
+ that answers the question. Stick to a maximum length of {} tokens and return just the text of the abstract and conclusion.
205
+ Do not include labels for any section. Use research-specific jargon.""".format(self.max_doclen)
206
+
207
+ messages = [("system",prompt,),("human", query),]
208
+ return self.hyde_client.invoke(messages).content
209
+
210
+ def generate_docs(self, query: str):
211
+ docs = []
212
+ for i in range(self.generate_n):
213
+ docs.append(self.generate_doc(query))
214
+ return docs
215
+
216
+ def embed_docs(self, docs: List[str]):
217
+ return self.embed_batch(docs)
218
+
219
+ def retrieve(self, query, top_k, return_scores = False,
220
+ embed_query=True, max_doclen=250,
221
+ generate_n=1, temperature=0.5,
222
+ rerank_top_k = 250):
223
+
224
+ if max_doclen * generate_n > 8191:
225
+ raise ValueError("Too many tokens. Please reduce max_doclen or generate_n.")
226
+
227
+ query_embedding = self.get_query_embedding(query)
228
+
229
+ if self.hyde == True:
230
+ self.max_doclen = max_doclen
231
+ self.generate_n = generate_n
232
+ self.hyde_client.temperature = temperature
233
+ self.embed_query = embed_query
234
+ docs = self.generate_docs(query)
235
+ # st.expander('Abstract generated with hyde', expanded=False).write(docs)
236
+ doc_embeddings = self.embed_docs(docs)
237
+ if self.embed_query:
238
+ query_emb = self.embed_docs([query])[0]
239
+ doc_embeddings.append(query_emb)
240
+ query_embedding = np.mean(np.array(doc_embeddings), axis = 0)
241
+
242
+ if self.rerank == True:
243
+ top_results, small_df = self.rank_and_filter(query,
244
+ query_embedding,
245
+ rerank_top_k,
246
+ return_scores = False)
247
+ # try:
248
+ docs_for_rerank = [small_df['abstract'][i] for i in range(rerank_top_k)]
249
+ if len(docs_for_rerank) == 0:
250
+ return []
251
+ reranked_results = self.cohere_client.rerank(
252
+ query=query,
253
+ documents=docs_for_rerank,
254
+ model='rerank-english-v3.0',
255
+ top_n=top_k
256
+ )
257
+ final_results = []
258
+ for result in reranked_results.results:
259
+ doc_id = top_results[result.index]
260
+ doc_text = docs_for_rerank[result.index]
261
+ score = float(result.relevance_score)
262
+ final_results.append([doc_id, "", score])
263
+ final_indices = [doc[0] for doc in final_results]
264
+ if return_scores:
265
+ return {result[0]: result[2] for result in final_results}, self.dataset[final_indices]
266
+ return [doc[0] for doc in final_results], self.dataset[final_indices]
267
+ # except:
268
+ # print('heavy load, please wait 10s and try again.')
269
+ else:
270
+ top_results, small_df = self.rank_and_filter(query,
271
+ query_embedding,
272
+ top_k,
273
+ return_scores = return_scores)
274
+
275
+ return top_results, small_df
276
+
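A minimal end-to-end sketch of driving the retriever directly, assuming `arxiv_corpus` has already been loaded (as is done below with `load_arxiv_corpus()`); the attributes set here are the ones `run_pathfinder` normally configures:

```python
retriever = RetrievalSystem()
retriever.toggles = ['Keywords']     # enable keyword weighting only
retriever.query_input_keywords = []  # no extra user-supplied keywords
retriever.hyde = True                # HyDE: average in generated-abstract embeddings
retriever.rerank = False             # skip the Cohere rerank stage
scores, papers = retriever.retrieve("What is the Hubble constant?",
                                    top_k=10, return_scores=True)
```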
277
+ def return_formatted_df(self, top_results, small_df):
278
+
279
+ df = pd.DataFrame(small_df)
280
+ df = df.drop(columns=['umap_x','umap_y','cite_bibcodes','ref_bibcodes'])
281
+ links = ['['+i+'](https://ui.adsabs.harvard.edu/abs/'+i+'/abstract)' for i in small_df['bibcode']]
282
+
283
+ # st.write(top_results[0:10])
284
+ scores = [top_results[i] for i in top_results]
285
+ indices = [i for i in top_results]
286
+ df.insert(1,'ADS Link',links,True)
287
+ df.insert(2,'Relevance',scores,True)
288
+ df.insert(3,'indices',indices,True)
289
+ df = df[['ADS Link','Relevance','date','cites','title','authors','abstract','keywords','ads_id','indices','embed']]
290
+ df.index += 1
291
+ return df
292
+
293
+ arxiv_corpus = load_arxiv_corpus()
294
+ ec = RetrievalSystem()
295
+ print('loaded retrieval system')
296
+
297
+ def Library(papers_df):
298
+ op_docs = ''
299
+ for i in range(len(papers_df)):
300
+ op_docs = op_docs + 'Paper %.0f:' %(i+1) + papers_df['title'][i+1] + '\n' + papers_df['abstract'][i+1] + '\n\n'
301
+
302
+ return op_docs
303
+
304
+ def run_rag_qa(query, papers_df, question_type):
305
+
306
+ loaders = []
307
+
308
+ documents = []
309
+
310
+ for i, row in papers_df.iterrows():
311
+ content = f"Paper {i+1}: {row['title']}\n{row['abstract']}\n\n"
312
+ metadata = {"source": row['ads_id']}
313
+ doc = Document(page_content=content, metadata=metadata)
314
+ documents.append(doc)
315
+
316
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=50, add_start_index=True)
317
+ splits = text_splitter.split_documents(documents)
318
+ vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings, collection_name='retdoc4')
319
+ retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
320
+
321
+ if question_type == 'Bibliometric':
322
+ template = bibliometric_prompt
323
+ elif question_type == 'Single-paper':
324
+ template = single_paper_prompt
325
+ elif question_type == 'Broad but nuanced':
326
+ template = deep_knowledge_prompt
327
+ else:
328
+ template = regular_prompt
329
+ prompt = PromptTemplate.from_template(template)
330
+
331
+ def format_docs(docs):
332
+ return "\n\n".join(doc.page_content for doc in docs)
333
+
334
+ rag_chain_from_docs = (
335
+ RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
336
+ | prompt
337
+ | gen_llm
338
+ | StrOutputParser()
339
+ )
340
+
341
+ rag_chain_with_source = RunnableParallel(
342
+ {"context": retriever, "question": RunnablePassthrough()}
343
+ ).assign(answer=rag_chain_from_docs)
344
+ rag_answer = rag_chain_with_source.invoke(query, )
345
+ vectorstore.delete_collection()
346
+
347
+ # except:
348
+ # st.subheader('heavy load! please wait 10 seconds and try again.')
349
+
350
+ return rag_answer
351
+
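Since the chain is a `RunnableParallel` with an assigned `answer`, the caller gets a dict back; a short sketch (assuming `formatted_df` came from `return_formatted_df`):

```python
out = run_rag_qa("What is the Hubble constant?", formatted_df, "Multi-paper")
out['answer']   # the generated text, as used by run_pathfinder below
out['context']  # the retrieved Document chunks that grounded the answer
```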
352
+ def guess_question_type(query: str):
353
+
354
+ gen_client = openai_llm(temperature=0,model_name='gpt-4o-mini', openai_api_key = openai_key)
355
+ messages = [("system",question_categorization_prompt,),("human", query),]
356
+ return gen_client.invoke(messages).content
357
+
358
+ def log_to_gist(strings):
359
+ # Adding query logs to prevent and account for possible malicious use.
360
+ # Logs will be deleted periodically if not needed.
361
+ github_token = os.environ['github_token']
362
+ gist_id = os.environ['gist_id']
363
+ timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
364
+ content = f"\n{timestamp}: {' '.join(strings)}\n"
365
+ headers = {'Authorization': f'token {github_token}','Accept': 'application/vnd.github.v3+json'}
366
+ response = requests.get(f'https://api.github.com/gists/{gist_id}', headers=headers)
367
+ if response.status_code == 200:
368
+ existing_content = response.json()['files']['log.txt']['content']
369
+ content = existing_content + content
370
+ data = {"description": "Logged Strings","public": False,"files": {"log.txt": {"content": content}}}
371
+ headers = {'Authorization': f'token {github_token}','Accept': 'application/vnd.github.v3+json'}
372
+ response = requests.patch(f'https://api.github.com/gists/{gist_id}', headers=headers, data=json.dumps(data)) # Update existing gist
373
+ return
374
+
375
+ class OverallConsensusEvaluation(BaseModel):
376
+ rewritten_statement: str = Field(
377
+ ...,
378
+ description="The query rewritten as a statement if it was initially a question"
379
+ )
380
+ consensus: Literal[
381
+ "Strong Agreement Between Abstracts and Query",
382
+ "Moderate Agreement Between Abstracts and Query",
383
+ "Weak Agreement Between Abstracts and Query",
384
+ "No Clear Agreement/Disagreement Between Abstracts and Query",
385
+ "Weak Disagreement Between Abstracts and Query",
386
+ "Moderate Disagreement Between Abstracts and Query",
387
+ "Strong Disagreement Between Abstracts and Query"
388
+ ] = Field(
389
+ ...,
390
+ description="The overall level of consensus between the rewritten statement and the abstracts"
391
+ )
392
+ explanation: str = Field(
393
+ ...,
394
+ description="A detailed explanation of the consensus evaluation (maximum six sentences)"
395
+ )
396
+ relevance_score: float = Field(
397
+ ...,
398
+ description="A score from 0 to 1 indicating how relevant the abstracts are to the query overall",
399
+ ge=0,
400
+ le=1
401
+ )
402
+
403
+ def evaluate_overall_consensus(query: str, abstracts: List[str]) -> OverallConsensusEvaluation:
404
+ prompt = f"""
405
+ Query: {query}
406
+ You will be provided with {len(abstracts)} scientific abstracts. Your task is to do the following:
407
+ 1. If the provided query is a question, rewrite it as a statement. This statement does not have to be true. Output this as 'Rewritten Statement:'.
408
+ 2. Evaluate the overall consensus between the rewritten statement and the abstracts using one of the following levels:
409
+ - Strong Agreement Between Abstracts and Query
410
+ - Moderate Agreement Between Abstracts and Query
411
+ - Weak Agreement Between Abstracts and Query
412
+ - No Clear Agreement/Disagreement Between Abstracts and Query
413
+ - Weak Disagreement Between Abstracts and Query
414
+ - Moderate Disagreement Between Abstracts and Query
415
+ - Strong Disagreement Between Abstracts and Query
416
+ Output this as 'Consensus:'
417
+ 3. Provide a detailed explanation of your consensus evaluation in maximum six sentences. Output this as 'Explanation:'
418
+ 4. Assign a relevance score as a float between 0 to 1, where:
419
+ - 1.0: Perfect match in content and quality
420
+ - 0.8-0.9: Excellent, with minor differences
421
+ - 0.6-0.7: Good, captures main points but misses some details
422
+ - 0.4-0.5: Fair, partially relevant but significant gaps
423
+ - 0.2-0.3: Poor, major inaccuracies or omissions
424
+ - 0.0-0.1: Completely irrelevant or incorrect
425
+ Output this as 'Relevance Score:'
426
+ Here are the abstracts:
427
+ {' '.join([f"Abstract {i+1}: {abstract}" for i, abstract in enumerate(abstracts)])}
428
+ Provide your evaluation in the structured format described above.
429
+ """
430
+
431
+ response = consensus_client.chat.completions.create(
432
+ model="gpt-4o-mini", # used to be "gpt-4",
433
+ response_model=OverallConsensusEvaluation,
434
+ messages=[
435
+ {"role": "system", "content": """You are an assistant with expertise in astrophysics for question-answering tasks.
436
+ Evaluate the overall consensus of the retrieved scientific abstracts in relation to a given query.
437
+ If you don't know the answer, just say that you don't know.
438
+ Use six sentences maximum and keep the answer concise."""},
439
+ {"role": "user", "content": prompt}
440
+ ],
441
+ temperature=0
442
+ )
443
+
444
+ return response
445
+
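Because `instructor` patches the client with `response_model=OverallConsensusEvaluation`, the call returns a parsed pydantic object rather than free text; a sketch with placeholder abstracts:

```python
result = evaluate_overall_consensus(
    "Is the Hubble tension real?",
    ["Abstract 1 text ...", "Abstract 2 text ..."],
)
result.consensus        # e.g. "Moderate Agreement Between Abstracts and Query"
result.relevance_score  # a float in [0, 1]
```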
446
+ def calc_outlier_flag(papers_df, top_k, cutoff_adjust = 0.1):
447
+
448
+ cut_dist = np.load('pfdr_arxiv_cutoff_distances.npy') - cutoff_adjust
449
+ pts = np.array(papers_df['embed'].tolist())
450
+ centroid = np.mean(pts,0)
451
+ dists = np.sqrt(np.sum((pts-centroid)**2,1))
452
+ outlier_flag = (dists > cut_dist[top_k-1])
453
+
454
+ return outlier_flag
455
+
456
+ def make_embedding_plot(papers_df, top_k, consensus_answer, arxiv_corpus=arxiv_corpus):
457
+
458
+ plt_indices = np.array(papers_df['indices'].tolist())
459
+
460
+ xax = np.array(arxiv_corpus['umap_x'])
461
+ yax = np.array(arxiv_corpus['umap_y'])
462
+
463
+ outlier_flag = calc_outlier_flag(papers_df, top_k, cutoff_adjust=0.25)
464
+ alphas = np.ones((len(plt_indices),)) * 0.9
465
+ alphas[outlier_flag] = 0.5
466
+
467
+ fig = plt.figure(figsize=(9*1.8,12*1.8))
468
+ plt.scatter(xax,yax, s=1, alpha=0.01, c='k')
469
+
470
+ clkws = np.load('kw_tags.npz')
471
+ all_x, all_y, all_topics, repeat_flag = clkws['all_x'], clkws['all_y'], clkws['all_topics'], clkws['repeat_flag']
472
+ for i in range(len(all_topics)):
473
+ if repeat_flag[i] == False:
474
+ plt.text(all_x[i], all_y[i], all_topics[i],fontsize=9,ha="center", va="center",
475
+ bbox=dict(facecolor='white', edgecolor='black', boxstyle='round,pad=0.3',alpha=0.81))
476
+ plt.scatter(xax[plt_indices], yax[plt_indices], s=300*alphas**2, alpha=alphas, c='w',zorder=1000)
477
+ plt.scatter(xax[plt_indices], yax[plt_indices], s=100*alphas**2, alpha=alphas, c='dodgerblue',zorder=1001)
478
+ # plt.scatter(xax[plt_indices][outlier_flag], yax[plt_indices][outlier_flag], s=100, alpha=1., c='firebrick')
479
+ plt.axis([0,20,-4.2,18])
480
+ plt.axis('off')
481
+ return fig
482
+
483
+
484
+ def getsmallans(query, df):
485
+
486
+ allcontent = dr_smallans_prompt
487
+
488
+ smallauth = ''
489
+ linkstr = ''
490
+ for i, row in df.iterrows():
491
+ # content = f"Paper {i+1}: {row['title'].replace('\n',' ')}\n{row['abstract'].replace('\n',' ')}\n\n"
492
+ content = f"Paper ({row['authors'][0].split(',')[0]} et al. {row['date'].year}): {row['title']}\n{row['abstract']}\n\n"
493
+ smallauth = smallauth + f"({row['authors'][0].split(',')[0]} et al. {row['date'].year}) "
494
+ linkstr = linkstr + f"[{row['authors'][0].split(',')[0]} et al. {row['date'].year}](" + row['ADS Link'].split('](')[1] + ' \n\n'
495
+ allcontent = allcontent + content
496
+
497
+ # allcontent = allcontent + '\n Question: '+query
498
+
499
+ gen_client = openai_llm(temperature=0,model_name='gpt-4o-mini', openai_api_key = openai_key)
500
+
501
+ messages = [("system",allcontent,),("human", query),]
502
+ smallans = gen_client.invoke(messages).content
503
+
504
+ tmplnk = linkstr.split(' \n\n')
505
+ linkdict = {}
506
+ for i in range(len(tmplnk)-1):
507
+ linkdict[tmplnk[i].split('](')[0][1:]] = tmplnk[i]
508
+
509
+ for key in linkdict.keys():
510
+ try:
511
+ smallans = smallans.replace(key, linkdict[key])
512
+ key2 = key[0:-4]+'('+key[-4:]+')'
513
+ smallans = smallans.replace(key2, linkdict[key])
514
+ except Exception:
515
+ print('key not found', key)
516
+
517
+ return smallans, smallauth, linkstr
518
+
519
+ def compileinfo(query, atom_qns, atom_qn_ans, atom_qn_strs):
520
+
521
+ tmp = dr_compileinfo_prompt
522
+ links = ''
523
+ for i in range(len(atom_qn_ans)):
524
+ tmp = tmp + atom_qns[i] + '\n\n' + atom_qn_ans[i] + '\n\n'
525
+ links = links + atom_qn_strs[i] + '\n\n'
526
+
527
+ gen_client = openai_llm(temperature=0,model_name='gpt-4o-mini', openai_api_key = openai_key)
528
+
529
+ messages = [("system",tmp,),("human", query),]
530
+ smallans = gen_client.invoke(messages).content
531
+ return smallans, links
532
+
533
+ def deep_research(question, top_k, ec):
534
+
535
+ full_answer = '## ' + question
536
+
537
+ gen_client = openai_llm(temperature=0,model_name='gpt-4o-mini', openai_api_key = openai_key)
538
+ messages = [("system",df_atomic_prompt,),("human", question),]
539
+ rscope_text = gen_client.invoke(messages).content
540
+
541
+ full_answer = full_answer +' \n'+ rscope_text
542
+
543
+ rscope_messages = [("system","""In the given text, what are the main atomic questions being asked? Please answer as a concise list.""",),("human", rscope_text),]
544
+ rscope_qns = gen_client.invoke(rscope_messages).content
545
+
546
+ atom_qns = []
547
+
548
+ temp = rscope_qns.split('\n')
549
+ for i in temp:
550
+ if i != '':
551
+ atom_qns.append(i)
552
+
553
+ atom_qn_dfs = []
554
+ atom_qn_ans = []
555
+ atom_qn_strs = []
556
+ for i in range(len(atom_qns)):
557
+ rs, small_df = ec.retrieve(atom_qns[i], top_k = top_k, return_scores=True)
558
+ formatted_df = ec.return_formatted_df(rs, small_df)
559
+ atom_qn_dfs.append(formatted_df)
560
+ smallans, smallauth, linkstr = getsmallans(atom_qns[i], atom_qn_dfs[i])
561
+
562
+ atom_qn_ans.append(smallans)
563
+ atom_qn_strs.append(linkstr)
564
+ full_answer = full_answer +' \n### '+atom_qns[i]
565
+ full_answer = full_answer +' \n'+smallans
566
+
567
+ finalans, finallinks = compileinfo(question, atom_qns, atom_qn_ans, atom_qn_strs)
568
+ full_answer = full_answer +' \n'+'### Summary:\n'+finalans
569
+
570
+ full_df = pd.concat(atom_qn_dfs)
571
+
572
+ rag_answer = {}
573
+ rag_answer['answer'] = full_answer
574
+
575
+ return full_df, rag_answer
576
+
577
+ def run_pathfinder(query, top_k, extra_keywords, toggles, prompt_type, rag_type, ec=ec, progress=gr.Progress()):
578
+
579
+ yield None, None, None, None, None
580
+
581
+ search_text_list = ['rooting around in the paper pile...','looking for clarity...','scanning the event horizon...','peering into the abyss...','potatoes power this ongoing search...']
582
+ gen_text_list = ['making the LLM talk to the papers...','invoking arcane rituals...','gone to library, please wait...','is there really an answer to this...']
583
+
584
+ mod_flag = check_mod(query) # run the moderation check once and reuse the result
585
+ log_to_gist(['[mod flag: '+str(mod_flag)+']', query])
+ if not mod_flag:
586
+
587
+ input_keywords = [kw.strip() for kw in extra_keywords.split(',')] if extra_keywords else []
588
+ query_keywords = get_keywords(query)
589
+ ec.query_input_keywords = input_keywords+query_keywords
590
+ ec.toggles = toggles
591
+ if rag_type == "Semantic Search":
592
+ ec.hyde = False
593
+ ec.rerank = False
594
+ elif rag_type == "Semantic + HyDE":
595
+ ec.hyde = True
596
+ ec.rerank = False
597
+ elif rag_type == "Semantic + CoHERE":
598
+ ec.hyde = False
599
+ ec.rerank = True
600
+ elif rag_type == "Semantic + HyDE + CoHERE":
601
+ ec.hyde = True
602
+ ec.rerank = True
603
+
604
+ if prompt_type == "Deep Research (BETA)":
605
+ formatted_df, rag_answer = deep_research(query, top_k = top_k, ec=ec)
606
+ yield formatted_df, rag_answer['answer'], None, None, None
607
+
608
+ else:
609
+ # progress(0.2, desc=search_text_list[np.random.choice(len(search_text_list))])
610
+ rs, small_df = ec.retrieve(query, top_k = top_k, return_scores=True)
611
+ formatted_df = ec.return_formatted_df(rs, small_df)
612
+ yield formatted_df, None, None, None, None
613
+
614
+ # progress(0.4, desc=gen_text_list[np.random.choice(len(gen_text_list))])
615
+ rag_answer = run_rag_qa(query, formatted_df, prompt_type)
616
+ yield formatted_df, rag_answer['answer'], None, None, None
617
+
618
+ # progress(0.6, desc="Generating consensus")
619
+ consensus_answer = evaluate_overall_consensus(query, [formatted_df['abstract'][i+1] for i in range(len(formatted_df))])
620
+ consensus = '## Consensus \n'+consensus_answer.consensus + '\n\n'+consensus_answer.explanation + '\n\n > Relevance of retrieved papers to answer: %.1f' %consensus_answer.relevance_score
621
+ yield formatted_df, rag_answer['answer'], consensus, None, None
622
+
623
+ # progress(0.8, desc="Analyzing question type")
624
+ question_type_gen = guess_question_type(query)
625
+ if '<categorization>' in question_type_gen:
626
+ question_type_gen = question_type_gen.split('<categorization>')[1]
627
+ if '</categorization>' in question_type_gen:
628
+ question_type_gen = question_type_gen.split('</categorization>')[0]
629
+ question_type_gen = question_type_gen.replace('\n',' \n')
630
+ qn_type = question_type_gen
631
+ yield formatted_df, rag_answer['answer'], consensus, qn_type, None
632
+
633
+ # progress(1.0, desc="Visualizing embeddings")
634
+ fig = make_embedding_plot(formatted_df, top_k, consensus_answer)
635
+
636
+ yield formatted_df, rag_answer['answer'], consensus, qn_type, fig
637
+
638
+ def create_interface():
639
+ custom_css = """
640
+ [id^="custom-slider-"] { /* '#custom-slider-*' is not valid CSS; match ids by prefix */
641
+ background-color: #ffffff;
642
+ }
643
+ """
644
+
645
+ with gr.Blocks(css=custom_css) as demo:
646
+
647
+ with gr.Tabs():
648
+ # with gr.Tab("What is Pathfinder?"):
649
+ # gr.Markdown(pathfinder_text)
650
+ with gr.Tab("pathfinder"):
651
+ with gr.Accordion("What is Pathfinder? / How do I use it?", open=False):
652
+ gr.Markdown(pathfinder_text)
653
+ img2 = gr.Image("local_files/galaxy_worldmap_kiyer-min.png")
654
+
655
+ with gr.Row():
656
+ query = gr.Textbox(label="Ask me anything")
657
+ with gr.Row():
658
+ with gr.Column(scale=1, min_width=300):
659
+ top_k = gr.Slider(1, 30, step=1, value=10, label="top-k", info="Number of papers to retrieve")
660
+ keywords = gr.Textbox(label="Optional Keywords (comma-separated)",value="")
661
+ toggles = gr.CheckboxGroup(["Keywords", "Time", "Citations"], label="Weight by", info="weighting retrieved papers",value=['Keywords'])
662
+ prompt_type = gr.Radio(choices=["Single-paper", "Multi-paper", "Bibliometric", "Broad but nuanced","Deep Research (BETA)"], label="Prompt Specialization", value='Multi-paper')
663
+ rag_type = gr.Radio(choices=["Semantic Search", "Semantic + HyDE", "Semantic + CoHERE", "Semantic + HyDE + CoHERE"], label="RAG Method",value='Semantic + HyDE + CoHERE')
664
+ with gr.Column(scale=2, min_width=300):
665
+ img1 = gr.Image("local_files/pathfinder_logo.png")
666
+ btn = gr.Button("Run pfdr!")
667
+ # search_results_state = gr.State([])
668
+ ret_papers = gr.Dataframe(label='top-k retrieved papers', datatype='markdown')
669
+ search_results_state = gr.Markdown(label='Generated Answer')
670
+ qntype = gr.Markdown(label='Question type suggestion')
671
+ conc = gr.Markdown(label='Consensus')
672
+ plot = gr.Plot(label='top-k in embedding space')
673
+
674
+ inputs = [query, top_k, keywords, toggles, prompt_type, rag_type]
675
+ outputs = [ret_papers, search_results_state, qntype, conc, plot]
676
+ btn.click(fn=run_pathfinder, inputs=inputs, outputs=outputs)
677
+
678
+ return demo
679
+
680
+
681
+ if __name__ == "__main__":
682
+
683
+ pathfinder = create_interface()
684
+ pathfinder.launch()
.ipynb_checkpoints/prompts-checkpoint.py ADDED
@@ -0,0 +1,328 @@
1
+ react_prompt = """You are an expert astronomer and cosmologist.
2
+ Answer the following question as best you can using information from the library, but speaking in a concise and factual manner.
3
+ If you can not come up with an answer, say you do not know.
4
+ Try to break the question down into smaller steps and solve it in a logical manner.
5
+
6
+ You have access to the following tools:
7
+
8
+ {tools}
9
+
10
+ Use the following format:
11
+
12
+ Question: the input question you must answer
13
+ Thought: you should always think about what to do
14
+ Action: the action to take, should be one of [{tool_names}]
15
+ Action Input: the input to the action
16
+ Observation: the result of the action
17
+ ... (this Thought/Action/Action Input/Observation can repeat N times)
18
+ Thought: I now know the final answer
19
+ Final Answer: the final answer to the original input question. provide information about how you arrived at the answer, and any nuances or uncertainties the reader should be aware of
20
+
21
+ Begin! Remember to speak in a pedagogical and factual manner.
22
+
23
+ Question: {input}
24
+ Thought:{agent_scratchpad}"""
25
+
26
+ regular_prompt = """You are an expert astronomer and cosmologist.
27
+ Answer the following question as best you can using information from the library, but speaking in a concise and factual manner.
28
+ If you can not come up with an answer, say you do not know.
29
+ Try to break the question down into smaller steps and solve it in a logical manner.
30
+
31
+ Provide information about how you arrived at the answer, and any nuances or uncertainties the reader should be aware of.
32
+
33
+ Begin! Remember to speak in a pedagogical and factual manner.
34
+
35
+ Relevant documents:{context}
36
+
37
+ Question: {question}
38
+ Answer:"""
39
+
40
+ bibliometric_prompt = """You are an AI assistant with expertise in astronomy and astrophysics literature. Your task is to assist with relevant bibliometric information in response to a user question. The user question may consist of identifying key papers, authors, or trends in a specific area of astronomical research.
41
+
42
+ Depending on what the user wants, direct them to consult the NASA Astrophysics Data System (ADS) at https://ui.adsabs.harvard.edu/. Provide them with the recommended ADS query depending on their question.
43
+
44
+ Here's a more detailed guide on how to use NASA ADS for various types of queries:
45
+
46
+ Basic topic search: Enter keywords in the search bar, e.g., "exoplanets". Use quotation marks for exact phrases, e.g., "dark energy"
47
+ Author search: Use the syntax author:"Last Name, First Name", e.g., author:"Hawking, S". For papers by multiple authors, use AND, e.g., author:"Hawking, S" AND author:"Ellis, G"
48
+ Date range: Use year:YYYY-YYYY, e.g., year:2010-2020. For papers since a certain year, use year:YYYY-, e.g., year:2015-
49
+ Combining search terms: Use AND, OR, NOT operators, e.g., "black holes" AND (author:"Hawking, S" OR author:"Penrose, R")
50
+ Filtering results: Use the left sidebar to filter by publication year, article type, or astronomy database
51
+ Sorting results: Use the "Sort" dropdown menu to order by options like citation count, publication date, or relevance
52
+ Advanced searches: Click on the "Search" dropdown menu and select "Classic Form" for field-specific searches. Use bibcode:YYYY for a specific journal/year, e.g., bibcode:2020ApJ to find all Astrophysical Journal papers from 2020
53
+ Finding review articles: Wrap the query in the reviews() operator (e.g. reviews("dark energy"))
54
+ Excluding preprints: Add NOT doctype:"eprint" to your search
55
+ Citation metrics: Click on the citation count of a paper to see its citation history and who has cited it
56
+
57
+ Some examples:
58
+
59
+ Example 1:
60
+ "How many papers published in 2022 used data from MAST missions?"
61
+ Your response should be: year:2022 data:"MAST"
62
+
63
+ Example 2:
64
+ "What are the most cited papers on spiral galaxy halos measured in X-rays, with publication dates from 2010 to 2023?"
65
+ Your response should be: "spiral galaxy halos" AND "x-ray" year:2010-2023
66
+
67
+ Example 3:
68
+ "Can you list 3 papers published by <name> as first author?"
69
+ Your response should be: author:"^<name>"
70
+
71
+ Example 4:
72
+ "Based on papers with <name> as an author or co-author, can you suggest the five most recent astro-ph papers that would be relevant?"
73
+ Your response should be:
74
+
75
+ Remember to advise users that while these examples cover many common scenarios, NASA ADS has many more advanced features that can be explored through its documentation.
76
+
77
+ Relevant documents:{context}
78
+ Question: {question}
79
+
80
+ Response:"""
81
+
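Each of these templates exposes `{context}` and `{question}` slots; a sketch of how `run_rag_qa` in app_gradio.py wires one into LangChain (the filled values are placeholders):

```python
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template(bibliometric_prompt)
filled = prompt.format(context="Paper 1: ...", question="Most cited papers on JWST?")
```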
82
+ single_paper_prompt = """You are an astronomer with access to a vast database of astronomical facts and figures. Your task is to provide a concise, accurate answer to the following specific factual question about astronomy or astrophysics.
83
+ Provide the requested information clearly and directly. If relevant, include the source of your information or any recent updates to this fact. If there's any uncertainty or variation in the accepted value, briefly explain why.
84
+ If the question can't be answered with a single fact, provide a short, focused explanation. Always prioritize accuracy over speculation.
85
+ Relevant documents:{context}
86
+ Question: {question}
87
+ Response:"""
88
+
89
+ deep_knowledge_prompt = """You are an expert astronomer with deep knowledge across various subfields of astronomy and astrophysics. Your task is to provide a comprehensive and nuanced answer to the following question, which involves an unresolved topic or requires broad, common-sense understanding.
90
+ Consider multiple perspectives and current debates in the field. Explain any uncertainties or ongoing research. If relevant, mention how this topic connects to other areas of astronomy.
91
+ Provide your response in a clear, pedagogical manner, breaking down complex concepts for easier understanding. If appropriate, suggest areas where further research might be needed.
92
+ After formulating your initial response, take a moment to reflect on your answer. Consider:
93
+ 1. Have you addressed all aspects of the question?
94
+ 2. Are there any potential biases or assumptions in your explanation?
95
+ 3. Is your explanation clear and accessible to someone with a general science background?
96
+ 4. Have you adequately conveyed the uncertainties or debates surrounding this topic?
97
+ Based on this reflection, refine your answer as needed.
98
+ Remember, while you have extensive knowledge, it's okay to acknowledge the limits of current scientific understanding. If parts of the question cannot be answered definitively, explain why.
99
+ Relevant documents:{context}
100
+
101
+ Question: {question}
102
+
103
+ Initial Response:
104
+ [Your initial response here]
105
+
106
+ Reflection and Refinement:
107
+ [Your reflections and any refinements to your answer here]
108
+
109
+ Final Response:
110
+ [Your final, refined answer here]"""
111
+
112
+ question_categorization_prompt = """You are an expert astrophysicist and computer scientist specializing in linguistics and semantics. Your task is to categorize a given query into one of the following categories:
113
+
114
+ 1. Summarization
115
+ 2. Single-paper factual
116
+ 3. Multi-paper factual
117
+ 4. Named entity recognition
118
+ 5. Jargon-specific questions / overloaded words
119
+ 6. Time-sensitive
120
+ 7. Consensus evaluation
121
+ 8. What-ifs and counterfactuals
122
+ 9. Compositional
123
+
124
+ Analyze the query carefully, considering its content, structure, and implications. Then, determine which of the above categories best fits the query.
125
+
126
+ In your analysis, consider the following:
127
+ - Does the query ask for a well-known datapoint or mechanism?
128
+ - Can it be answered by a single paper or does it require multiple sources?
129
+ - Does it involve proper nouns or specific scientific terms?
130
+ - Is it time-dependent or likely to change in the near future?
131
+ - Does it require evaluating consensus across multiple sources?
132
+ - Is it a hypothetical or counterfactual question?
133
+ - Does it need to be broken down into sub-queries (i.e. compositional)?
134
+
135
+ After your analysis, categorize the query into one of the nine categories listed above.
136
+
137
+ Provide a brief explanation for your categorization, highlighting the key aspects of the query that led to your decision.
138
+
139
+ Present your final answer in the following format:
140
+
141
+ <categorization>
142
+ Category: [Selected category]
143
+ Explanation: [Your explanation for the categorization]
144
+ </categorization>"""
145
+
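The app expects the model's reply wrapped in these tags; a sketch of the extraction `run_pathfinder` performs on the raw completion (the raw string is illustrative):

```python
raw = "preamble <categorization>\nCategory: Time-sensitive\nExplanation: ...\n</categorization>"
block = raw.split('<categorization>')[1].split('</categorization>')[0].strip()
```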
146
+
147
+ pathfinder_text = """# Welcome to Pathfinder
148
+
149
+ ## Discover the Universe Through AI-Powered Astronomy ReSearch
150
+
151
+ ### What is Pathfinder?
152
+
153
+ Pathfinder (https://pfdr.app; [Iyer et al 2024 ApJS 275 38](https://iopscience.iop.org/article/10.3847/1538-4365/ad7c43)) harnesses the power of modern large language models (LLMs) in combination with papers on the [arXiv](https://arxiv.org/) and [ADS](https://ui.adsabs.harvard.edu/) to navigate the vast expanse of astronomy literature.
154
+ Our tool empowers researchers, students, and astronomy enthusiasts to get started on their journeys to find answers to complex research questions quickly and efficiently.
155
+
156
+ To use the old streamlit pathfinder (with the ReAct agent), you can use the [pfdr streamlit mirror](https://huggingface.co/spaces/kiyer/pathfinder_v3/).
157
+
158
+ This is not meant to be a replacement for existing tools like the [ADS](https://ui.adsabs.harvard.edu/), [arxivsorter](https://www.arxivsorter.org/), semantic search or google scholar, but rather a supplement to find papers that otherwise might be missed during a literature survey. It is trained on astro-ph papers up to July 2024.
159
+
160
+ ### How to Use Pathfinder
161
+
162
+ You can use pathfinder to find papers of interest with natural-language questions, and generate basic answers to questions using the retrieved papers. Try asking it questions like
163
+
164
+ - What is the value of the Hubble Constant?
165
+ - Are there open source radiative transfer codes for planetary atmospheres?
166
+ - Can I predict a galaxy spectrum from an image cutout? Please reply in Hindi.
167
+ - How would galaxy evolution differ in a universe with no dark matter?
168
+
169
+ **👈 Use the sidebar to tweak the search parameters to get better results**. Changing the number of retrieved papers (**top-k**), weighting by keywords, time, or citations, or changing the prompt type might help better refine the paper search and synthesized answers for your specific question.
170
+
171
+ 1. **Enter Your Query**: Type your astronomy question in the search bar & hit `run pathfinder`.
172
+ 2. **Review Results**: Pathfinder will analyze relevant literature and present you with a concise answer.
173
+ 3. **Explore Further**: Click on provided links to delve deeper into the source material on ADS.
174
+ 4. **Refine Your Search**: Use our advanced filters to narrow down results by date, author, or topic.
175
+ 5. **Download results:** You can download the results of your query as a json file.
176
+
177
+ ### Why Use Pathfinder?
178
+
179
+ - **Time-Saving**: Get started finding answers that would take hours of manual research.
180
+ - **Comprehensive**: Access information from papers across a large database of astronomy literature.
181
+ - **User-Friendly**: Intuitive interface designed for researchers at all levels.
182
+ - **Constantly Updated**: Our database is regularly refreshed with the latest publications.
183
+
184
+ ### Learn More
185
+
186
+ - Read our paper on [arXiv](https://arxiv.org/abs/2408.01556) to understand the technology behind Pathfinder.
187
+ - Discover how Pathfinder was developed in collaboration with [UniverseTBD](https://www.universetbd.org), whose mission is to democratise science for everyone, and [JSALT](https://www.clsp.jhu.edu/2024-jelinek-summer-workshop-on-speech-and-language-technology/).
188
+
189
+ ---
190
+
191
+ ### Copyright and Terms of Use
192
+
193
+ © 2024 Pathfinder. All rights reserved.
194
+
195
+ Pathfinder is provided "as is" without warranty of any kind. By using this service, you agree to our [Terms of Service] and [Privacy Policy].
196
+
197
+ ### Contact Us
198
+
199
+ Have questions or feedback? We'd love to hear from you!
200
+ - Email: [email protected]
201
+ - Twitter: [@universe_tbd](https://twitter.com/universe_tbd)
202
+ - Huggingface: [https://huggingface.co/spaces/kiyer/pathfinder/](https://huggingface.co/spaces/kiyer/pathfinder/)
203
+
204
+ ---
205
+
206
+ *Empowering astronomical discoveries, one query at a time.*
207
+ """
208
+
209
+ dr_smallans_prompt = """You are an expert astronomer with deep knowledge across various subfields of astronomy and astrophysics.
210
+ Given a student's question and some relevant papers below, your task is to provide a concise, specific answer to the question.
211
+ The retrieved literature has a paper identifier at the very beginning consisting of the author and year, followed by the title and abstract. Please use this to cite the paper when drafting your response.
212
+ Your task is to make sure to contextualize and present the knowledge in the retrieved papers and use it to answer the question, being as specific as possible, mentioning datasets, methods and results from the retrieved literature.
213
+ Try to include as many papers as possible in your answer.
214
+ Remember, while you have extensive knowledge, it's okay to acknowledge the limits of current scientific understanding.
215
+ If parts of the question cannot be answered definitively, explain why.
216
+ Provide your response in the style of text in the Astrophysical Journal, and limit your answer to 1-2 paragraphs.
217
+
218
+ Relevant papers: \n
219
+ """
220
+
221
+ dr_compileinfo_prompt = """You are an expert astronomer with deep knowledge across various subfields of astronomy and astrophysics.
222
+ Given a student's question and collected relevant information to different aspects of it below, your task is to provide a concise, specific answer to the question.
223
+ The provided information is structured as a series of logical steps and sub-questions pertaining to the main question, please use that in constructing your response.
224
+ The retrieved literature has a paper identifier at the very beginning consisting of the author and year, followed by the title and abstract. Please use this to cite the paper when drafting your response.
225
+ Your task is to make sure to contextualize and present the knowledge in the retrieved papers and use it to answer the question, being as specific as possible, mentioning datasets, methods and results from the retrieved literature.
226
+ Try to include as many papers as possible in your answer.
227
+ Remember, while you have extensive knowledge, it's okay to acknowledge the limits of current scientific understanding.
228
+ Consider multiple perspectives and current debates in the field. Explain any uncertainties or ongoing research. If relevant, mention how this topic connects to other areas of astronomy.
229
+ If parts of the question cannot be answered definitively using the provided information, explain why.
230
+ Provide your response in the style of text in the Astrophysical Journal, and limit your answer to a small section.
231
+
232
+ Relevant information: \n
233
+ """
234
+
235
+ df_atomic_prompt = """You are an expert in astrophysics tasked with breaking down complex research questions into their fundamental components. Your goal is to decompose questions in a way that allows for systematic information retrieval and reasoning.
236
+
237
+ Given a complex astrophysical research question, follow these steps:
238
+
239
+ 1. IDENTIFY CORE CONCEPTS
240
+ First, identify the key astrophysical concepts, objects, or phenomena mentioned in the question. For each concept:
241
+ - List the concept and its basic definition
242
+ - Note any specific parameters or conditions mentioned
243
+ - Identify any implicit related concepts that would be necessary to understand
244
+
245
+ 2. BREAK DOWN INTO ATOMIC QUESTIONS
246
+ Decompose the main question into a series of smaller, atomic questions that:
247
+ - Can be answered independently
248
+ - Build upon each other logically
249
+ - Cover all necessary background knowledge
250
+ - Address specific measurable or observable aspects
251
+ - Include questions about relationships between concepts
252
+
253
+ 3. ESTABLISH DEPENDENCIES
254
+ Create a dependency tree showing:
255
+ - Which atomic questions must be answered first
256
+ - How the answers to each question feed into understanding others
257
+ - What background knowledge is required for each question
258
+
259
+ 4. SPECIFY REQUIRED INFORMATION
260
+ For each atomic question, list:
261
+ - The type of data or information needed (observational, theoretical, mathematical)
262
+ - Relevant units, scales, or ranges
263
+ - Any specific conditions or constraints
264
+ - Potential sources or methods for finding this information
265
+
266
+ 5. OUTPUT FORMAT
267
+ Provide your analysis in the following structure:
268
+
269
+ ORIGINAL QUESTION: [Full text of the original question]
270
+
271
+ CORE CONCEPTS:
272
+ - [Concept 1]
273
+ - Definition:
274
+ - Related parameters:
275
+ - Implicit dependencies:
276
+ [Repeat for each core concept]
277
+
278
+ ATOMIC QUESTIONS:
279
+ 1. [Question 1]
280
+ - Required information:
281
+ - Dependencies:
282
+ - Expected output:
283
+ [Repeat for each atomic question]
284
+
285
+ REASONING PATH:
286
+ [Explain how the atomic questions build up to answer the original question]
287
+
288
+ Example:
289
+ ORIGINAL QUESTION: "How does the mass of a supermassive black hole affect the rate of star formation in its host galaxy's bulge?"
290
+
291
+ CORE CONCEPTS:
292
+ - Supermassive Black Holes (SMBH)
293
+ - Definition: Extremely massive black holes found at galaxy centers
294
+ - Related parameters: Mass range (10^6 to 10^10 solar masses)
295
+ - Implicit dependencies: Accretion processes, gravitational influence radius
296
+
297
+ - Galaxy Bulge
298
+ - Definition: Central concentration of stars in a galaxy
299
+ - Related parameters: Size, stellar mass, density
300
+ - Implicit dependencies: Stellar population, gas content
301
+
302
+ - Star Formation Rate (SFR)
303
+ - Definition: Rate at which gas is converted into new stars
304
+ - Related parameters: Solar masses per year
305
+ - Implicit dependencies: Gas density, temperature, metallicity
306
+
307
+ ATOMIC QUESTIONS:
308
+ 1. What is the typical sphere of influence of an SMBH as a function of its mass?
309
+ - Required information: Mathematical relation between SMBH mass and gravitational influence
310
+ - Dependencies: None
311
+ - Expected output: Radius of influence formula
312
+
313
+ 2. How does SMBH mass correlate with bulge gas temperature?
314
+ - Required information: Observational data on X-ray gas temperatures
315
+ - Dependencies: Question 1
316
+ - Expected output: Temperature-mass relation
317
+
318
+ [Continue with additional atomic questions...]
319
+
320
+ REASONING PATH:
321
+ To understand the relationship between SMBH mass and star formation, we first need to establish the SMBH's sphere of influence. This allows us to determine how it heats and disturbs the surrounding gas, which directly affects star formation conditions. By combining these elements, we can establish the causal chain from SMBH mass to star formation rate.
322
+
323
+ Remember to:
324
+ - Maintain scientific precision in terminology
325
+ - Consider both direct and indirect effects
326
+ - Account for observational limitations
327
+ - Include relevant scales and units
328
+ - Consider potential confounding variables"""
app_gradio.py CHANGED
@@ -535,7 +535,7 @@ def deep_research(question, top_k, ec):
535
  full_answer = '## ' + question
536
 
537
  gen_client = openai_llm(temperature=0,model_name='gpt-4o-mini', openai_api_key = openai_key)
538
- messages = [("system",prompt_qdec2,),("human", question),]
539
  rscope_text = gen_client.invoke(messages).content
540
 
541
  full_answer = full_answer +' \n'+ rscope_text
 
535
  full_answer = '## ' + question
536
 
537
  gen_client = openai_llm(temperature=0,model_name='gpt-4o-mini', openai_api_key = openai_key)
538
+ messages = [("system",df_atomic_prompt,),("human", question),]
539
  rscope_text = gen_client.invoke(messages).content
540
 
541
  full_answer = full_answer +' \n'+ rscope_text
prompts.py CHANGED
@@ -230,4 +230,99 @@ dr_compileinfo_prompt = """You are an expert astronomer with deep knowledge acro
230
  Provide your response in the style of text in the Astrophysical Journal, and limit your answer to a small section.
231
 
232
  Relevant information: \n
233
- """
230
  Provide your response in the style of text in the Astrophysical Journal, and limit your answer to a small section.
231
 
232
  Relevant information: \n
233
+ """
234
+
235
+ df_atomic_prompt = """You are an expert in astrophysics tasked with breaking down complex research questions into their fundamental components. Your goal is to decompose questions in a way that allows for systematic information retrieval and reasoning.
236
+
237
+ Given a complex astrophysical research question, follow these steps:
238
+
239
+ 1. IDENTIFY CORE CONCEPTS
240
+ First, identify the key astrophysical concepts, objects, or phenomena mentioned in the question. For each concept:
241
+ - List the concept and its basic definition
242
+ - Note any specific parameters or conditions mentioned
243
+ - Identify any implicit related concepts that would be necessary to understand
244
+
245
+ 2. BREAK DOWN INTO ATOMIC QUESTIONS
246
+ Decompose the main question into a series of smaller, atomic questions that:
247
+ - Can be answered independently
248
+ - Build upon each other logically
249
+ - Cover all necessary background knowledge
250
+ - Address specific measurable or observable aspects
251
+ - Include questions about relationships between concepts
252
+
253
+ 3. ESTABLISH DEPENDENCIES
254
+ Create a dependency tree showing:
255
+ - Which atomic questions must be answered first
256
+ - How the answers to each question feed into understanding others
257
+ - What background knowledge is required for each question
258
+
259
+ 4. SPECIFY REQUIRED INFORMATION
260
+ For each atomic question, list:
261
+ - The type of data or information needed (observational, theoretical, mathematical)
262
+ - Relevant units, scales, or ranges
263
+ - Any specific conditions or constraints
264
+ - Potential sources or methods for finding this information
265
+
266
+ 5. OUTPUT FORMAT
267
+ Provide your analysis in the following structure:
268
+
269
+ ORIGINAL QUESTION: [Full text of the original question]
270
+
271
+ CORE CONCEPTS:
272
+ - [Concept 1]
273
+ - Definition:
274
+ - Related parameters:
275
+ - Implicit dependencies:
276
+ [Repeat for each core concept]
277
+
278
+ ATOMIC QUESTIONS:
279
+ 1. [Question 1]
280
+ - Required information:
281
+ - Dependencies:
282
+ - Expected output:
283
+ [Repeat for each atomic question]
284
+
285
+ REASONING PATH:
286
+ [Explain how the atomic questions build up to answer the original question]
287
+
288
+ Example:
289
+ ORIGINAL QUESTION: "How does the mass of a supermassive black hole affect the rate of star formation in its host galaxy's bulge?"
290
+
291
+ CORE CONCEPTS:
292
+ - Supermassive Black Holes (SMBH)
293
+ - Definition: Extremely massive black holes found at galaxy centers
294
+ - Related parameters: Mass range (10^6 to 10^10 solar masses)
295
+ - Implicit dependencies: Accretion processes, gravitational influence radius
296
+
297
+ - Galaxy Bulge
298
+ - Definition: Central concentration of stars in a galaxy
299
+ - Related parameters: Size, stellar mass, density
300
+ - Implicit dependencies: Stellar population, gas content
301
+
302
+ - Star Formation Rate (SFR)
303
+ - Definition: Rate at which gas is converted into new stars
304
+ - Related parameters: Solar masses per year
305
+ - Implicit dependencies: Gas density, temperature, metallicity
306
+
307
+ ATOMIC QUESTIONS:
308
+ 1. What is the typical sphere of influence of an SMBH as a function of its mass?
309
+ - Required information: Mathematical relation between SMBH mass and gravitational influence
310
+ - Dependencies: None
311
+ - Expected output: Radius of influence formula
312
+
313
+ 2. How does SMBH mass correlate with bulge gas temperature?
314
+ - Required information: Observational data on X-ray gas temperatures
315
+ - Dependencies: Question 1
316
+ - Expected output: Temperature-mass relation
317
+
318
+ [Continue with additional atomic questions...]
319
+
320
+ REASONING PATH:
321
+ To understand the relationship between SMBH mass and star formation, we first need to establish the SMBH's sphere of influence. This allows us to determine how it heats and disturbs the surrounding gas, which directly affects star formation conditions. By combining these elements, we can establish the causal chain from SMBH mass to star formation rate.
322
+
323
+ Remember to:
324
+ - Maintain scientific precision in terminology
325
+ - Consider both direct and indirect effects
326
+ - Account for observational limitations
327
+ - Include relevant scales and units
328
+ - Consider potential confounding variables"""