seanpedrickcase committed
Commit f8f34c2 · 1 parent: 991fcdc

Added more guidance in Readme. Now wipes variables on click to create or summarise topics

Files changed (4):
  1. README.md +11 -4
  2. app.py +14 -14
  3. tools/helper_functions.py +29 -0
  4. tools/llm_api_call.py +8 -15
README.md CHANGED
@@ -5,7 +5,7 @@ colorFrom: purple
 colorTo: yellow
 sdk: gradio
 app_file: app.py
-pinned: false
+pinned: true
 license: cc-by-nc-4.0
 ---

@@ -13,8 +13,15 @@ license: cc-by-nc-4.0
 
 Extract topics and summarise outputs using Large Language Models (LLMs, Gemini Flash/Pro, or Claude 3 through AWS Bedrock if running on AWS). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
 
-You can use an AWS Bedrock model (Claude 3, paid), or Gemini (a free API, but with strict limits for the Pro model). Due to the strict API limits for the best model (Pro 1.5), the use of Gemini requires an API key. To set up your own Gemini API key, go here: https://aistudio.google.com/app/u/1/plan_information.
+You can use an AWS Bedrock model (Claude 3, paid), or Gemini (a free API, but with strict limits for the Pro model). Due to the strict API limits for the best model (Pro 1.5), the use of Gemini requires an API key. To set up your own Gemini API key, go here: https://aistudio.google.com/app/u/1/plan_information.
 
-NOTE: that **API calls to Gemini are not considered secure**, so please only submit redacted, non-sensitive tabular files to this source. AWS Bedrock API calls are considered to be secure.
+NOTE: that **API calls to Gemini are not considered secure**, so please only submit redacted, non-sensitive tabular files to this source. Also, large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** to check for harmful outputs, hallucinations, and accuracy.
 
-Large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** to check for harmful outputs, hallucinations, and accuracy.
+Basic use:
+1. Upload a csv/xlsx/parquet file containing at least one open text column.
+2. Select the relevant open text column from the dropdown.
+3. If you have your own suggested (zero shot) topics, upload this (see examples folder for an example file)
+4. Write a one sentence description of the consultation/context of the open text.
+5. Extract topics.
+6. If topic extraction fails part way through, you can upload the latest 'reference_table' and 'unique_topics_table' csv outputs on the 'Continue previous topic extraction' tab to continue from where you left off.
+7. Summaries will be produced for each topic for each 'batch' of responses. If you want consolidated summaries, go to the tab 'Summarise topic outputs', upload your output reference_table and unique_topics csv files, and press summarise.
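
The zero shot topics file referred to in step 3 needs only one headed column with a topic per row. A minimal sketch of producing such a file with pandas (the topic names and filename are hypothetical, not the contents of the repo's examples folder):

```python
import pandas as pd

# Hypothetical candidate topics; the app reads them from the file's single headed column.
topics = pd.DataFrame({"Topic": ["Street lighting", "Road safety", "Green spaces"]})
topics.to_csv("example_candidate_topics.csv", index=False)
```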
app.py CHANGED
@@ -1,6 +1,6 @@
 import os
 import socket
-from tools.helper_functions import ensure_output_folder_exists, add_folder_to_path, put_columns_in_df, get_connection_params, output_folder, get_or_create_env_var, reveal_feedback_buttons, wipe_logs, model_full_names, view_table
+from tools.helper_functions import ensure_output_folder_exists, add_folder_to_path, put_columns_in_df, get_connection_params, output_folder, get_or_create_env_var, reveal_feedback_buttons, wipe_logs, model_full_names, view_table, empty_output_vars_extract_topics, empty_output_vars_summarise
 from tools.aws_functions import upload_file_to_s3
 from tools.llm_api_call import llm_query, load_in_data_file, load_in_previous_data_files, sample_reference_table_summaries, summarise_output_topics
 from tools.auth import authenticate_user

@@ -69,13 +69,11 @@ with app:
     gr.Markdown(
     """# Large language model topic modelling
 
-    Extract topics and summarise outputs using Large Language Models (LLMs, Gemini Flash/Pro, or Claude 3 through AWS Bedrock if running on AWS). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
+    Extract topics and summarise outputs using Large Language Models (LLMs, Gemini Flash/Pro, or Claude 3 through AWS Bedrock if running on AWS). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify). Instructions on use can be found in the README.md file.
 
-    You can use an AWS Bedrock model (Claude 3, paid), or Gemini (a free API, but with strict limits for the Pro model). Due to the strict API limits for the best model (Pro 1.5), the use of Gemini requires an API key. To set up your own Gemini API key, go here: https://aistudio.google.com/app/u/1/plan_information.
-
-    NOTE: that **API calls to Gemini are not considered secure**, so please only submit redacted, non-sensitive tabular files to this source. AWS Bedrock API calls are considered to be secure.
+    You can use an AWS Bedrock model (Claude 3, paid), or Gemini (a free API, but with strict limits for the Pro model). Due to the strict API limits for the best model (Pro 1.5), the use of Gemini requires an API key. To set up your own Gemini API key, go here: https://aistudio.google.com/app/u/1/plan_information.
 
-    Large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** to check for harmful outputs, hallucinations, and accuracy.""")
+    NOTE: that **API calls to Gemini are not considered secure**, so please only submit redacted, non-sensitive tabular files to this source. Also, large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** to check for harmful outputs, hallucinations, and accuracy.""")
 
     with gr.Tab(label="Extract topics"):
         gr.Markdown(

@@ -94,7 +92,7 @@ with app:
     in_colnames = gr.Dropdown(choices=["Choose column with responses"], multiselect = False, label="Select column that contains the responses (showing columns present across all files).", allow_custom_value=True, interactive=True)
 
     with gr.Accordion("I have my own list of topics (zero shot topic modelling).", open = False):
-        candidate_topics = gr.File(label="Input topics from file (csv). File should have at least one column with a header and topic keywords in cells below. Topics will be taken from the first column of the file.")
+        candidate_topics = gr.File(label="Input topics from file (csv). File should have a single column with a header, and all topic keywords below.")
 
     context_textbox = gr.Textbox(label="Write a short description (one sentence of less) giving context to the large language model about the your consultation and any relevant context")
 
@@ -197,7 +195,8 @@ with app:
     # Tabular data upload
     in_data_files.upload(fn=put_columns_in_df, inputs=[in_data_files], outputs=[in_colnames, in_excel_sheets, data_file_names_textbox])
 
-    extract_topics_btn.click(load_in_data_file,
+    extract_topics_btn.click(fn=empty_output_vars_extract_topics, inputs=None, outputs=[master_topic_df_state, master_unique_topics_df_state, master_reference_df_state, text_output_file, text_output_file_list_state, latest_batch_completed, log_files_output, log_files_output_list_state, conversation_metadata_textbox, estimated_time_taken_number]).\
+    then(load_in_data_file,
     inputs = [in_data_files, in_colnames, batch_size_number], outputs = [file_data_state, data_file_names_textbox, total_number_of_batches], api_name="load_data").then(\
     fn=llm_query,
     inputs=[file_data_state, master_topic_df_state, master_reference_df_state, master_unique_topics_df_state, text_output_summary, data_file_names_textbox, total_number_of_batches, in_api_key, temperature_slide, in_colnames, model_choice, candidate_topics, latest_batch_completed, text_output_summary, text_output_file_list_state, log_files_output_list_state, first_loop_state, conversation_metadata_textbox, initial_table_prompt_textbox, prompt_2_textbox, prompt_3_textbox, system_prompt_textbox, add_to_existing_topics_system_prompt_textbox, add_to_existing_topics_prompt_textbox, number_of_prompts, batch_size_number, context_textbox, estimated_time_taken_number],

@@ -210,18 +209,19 @@ with app:
     then(fn = reveal_feedback_buttons,
     outputs=[data_feedback_radio, data_further_details_text, data_submit_feedback_btn, data_feedback_title], scroll_to_output=True)
 
-    # If uploaded partially completed consultation files do this. This should then start up the 'latest_batch_completed' change action above to continue extracting topics.
-    continue_previous_data_files_btn.click(
-    load_in_data_file, inputs = [in_data_files, in_colnames, batch_size_number], outputs = [file_data_state, data_file_names_textbox, total_number_of_batches]).\
-    then(load_in_previous_data_files, inputs=[in_previous_data_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed, in_previous_data_files_status, data_file_names_textbox])
-
     # When button pressed, summarise previous data
-    summarise_previous_data_btn.click(load_in_previous_data_files, inputs=[summarisation_in_previous_data_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, summarisation_in_previous_data_files_status, data_file_names_textbox]).\
+    summarise_previous_data_btn.click(empty_output_vars_summarise, inputs=None, outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox]).\
+    then(load_in_previous_data_files, inputs=[summarisation_in_previous_data_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, summarisation_in_previous_data_files_status, data_file_names_textbox]).\
     then(sample_reference_table_summaries, inputs=[master_reference_df_state, master_unique_topics_df_state, random_seed], outputs=[summary_reference_table_sample_state, summarised_references_markdown, master_reference_df_state, master_unique_topics_df_state]).\
    then(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, in_api_key, summarised_references_markdown, temperature_slide, data_file_names_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox])
 
     latest_summary_completed_num.change(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, in_api_key, summarised_references_markdown, temperature_slide, data_file_names_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox])
 
+    # If uploaded partially completed consultation files do this. This should then start up the 'latest_batch_completed' change action above to continue extracting topics.
+    continue_previous_data_files_btn.click(
+    load_in_data_file, inputs = [in_data_files, in_colnames, batch_size_number], outputs = [file_data_state, data_file_names_textbox, total_number_of_batches]).\
+    then(load_in_previous_data_files, inputs=[in_previous_data_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed, in_previous_data_files_status, data_file_names_textbox])
+
     ###
     # LOGGING AND ON APP LOAD FUNCTIONS
     ###
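
The main behavioural change above is that extract_topics_btn and summarise_previous_data_btn now wipe their accumulated state before doing any work, by chaining a reset function ahead of the main job. A standalone sketch of that Gradio pattern, using simplified hypothetical state names rather than the app's real components:

```python
import gradio as gr
import pandas as pd

def empty_outputs():
    # Reset accumulated outputs before a fresh run, one return value per output component
    return pd.DataFrame(), [], 0

def run_job(df, file_list, batches_done):
    # Stand-in for the real load_in_data_file -> llm_query chain
    return f"Starting clean: {len(df)} rows, {len(file_list)} files, batch {batches_done}"

with gr.Blocks() as demo:
    df_state = gr.State(pd.DataFrame())
    file_list_state = gr.State([])
    batches_done = gr.Number(value=0)
    status = gr.Textbox()
    extract_btn = gr.Button("Extract topics")

    # Wipe first, then run: the same .click(...).then(...) chaining used in app.py
    extract_btn.click(empty_outputs, inputs=None, outputs=[df_state, file_list_state, batches_done]).\
        then(run_job, inputs=[df_state, file_list_state, batches_done], outputs=[status])

demo.launch()
```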
tools/helper_functions.py CHANGED
@@ -3,6 +3,35 @@ import gradio as gr
 import pandas as pd
 
 
+def empty_output_vars_extract_topics():
+    # Empty output objects before processing a new file
+
+    master_topic_df_state = pd.DataFrame()
+    master_unique_topics_df_state = pd.DataFrame()
+    master_reference_df_state = pd.DataFrame()
+    text_output_file = []
+    text_output_file_list_state = []
+    latest_batch_completed = 0
+    log_files_output = []
+    log_files_output_list_state = []
+    conversation_metadata_textbox = ""
+    estimated_time_taken_number = 0
+
+    return master_topic_df_state, master_unique_topics_df_state, master_reference_df_state, text_output_file, text_output_file_list_state, latest_batch_completed, log_files_output, log_files_output_list_state, conversation_metadata_textbox, estimated_time_taken_number
+
+def empty_output_vars_summarise():
+    # Empty output objects before summarising files
+
+    summary_reference_table_sample_state = pd.DataFrame()
+    master_unique_topics_df_revised_summaries_state = pd.DataFrame()
+    master_reference_df_revised_summaries_state = pd.DataFrame()
+    summary_output_files = []
+    summarised_outputs_list = []
+    latest_summary_completed_num = 0
+    conversation_metadata_textbox = ""
+
+    return summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox
+
 
 def get_or_create_env_var(var_name, default_value):
     # Get the environment variable if it exists
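
These helpers work because Gradio assigns each value a handler returns to the matching entry of its outputs list, in order; returning an empty DataFrame, list, string, or zero therefore resets that component. A miniature of the same idea, with hypothetical names:

```python
import pandas as pd

def empty_two_vars():
    # Return order must mirror the outputs=[...] order in the event handler,
    # e.g. some_btn.click(empty_two_vars, inputs=None, outputs=[a_dataframe, a_textbox])
    return pd.DataFrame(), ""

df_reset, text_reset = empty_two_vars()
assert df_reset.empty and text_reset == ""
```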
tools/llm_api_call.py CHANGED
@@ -1242,8 +1242,6 @@ def deduplicate_categories(category_series: pd.Series, join_series:pd.Series, th
         if category in deduplication_map:
             continue
 
-        print("old_category:", category)
-
         # Find close matches to the current category, excluding the current category itself
         matches = process.extract(category, [cat for cat in category_series.unique() if cat != category], scorer=fuzz.token_set_ratio, score_cutoff=threshold)
 
@@ -1251,7 +1249,7 @@ def deduplicate_categories(category_series: pd.Series, join_series:pd.Series, th
         if matches: # Check if there are any matches
             best_match = max(matches, key=lambda x: x[1]) # Get the match with the highest score
             match, score, _ = best_match # Unpack the best match
-            print("Best match:", match, "score:", score)
+            #print("Best match:", match, "score:", score)
             deduplication_map[match] = category # Map the best match to the current category
 
     # Create the result DataFrame
@@ -1282,8 +1280,6 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
 
     reference_df_unique = reference_df.drop_duplicates("old_category")
 
-    print("reference_df_unique_old_categories:", reference_df_unique["old_category"])
-
     reference_df_unique[["old_category"]].to_csv(output_folder + "reference_df_unique_old_categories_" + str(i) + ".csv", index=None)
 
     # Deduplicate categories within each sentiment group
@@ -1293,7 +1289,6 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
 
     if deduplicated_topic_map_df['deduplicated_category'].isnull().all():
         # Check if 'deduplicated_category' contains any values
-
         print("No deduplicated categories found, skipping the following code.")
 
     else:
@@ -1301,7 +1296,7 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
         # Remove rows where 'deduplicated_category' is blank or NaN
         deduplicated_topic_map_df = deduplicated_topic_map_df.loc[(deduplicated_topic_map_df['deduplicated_category'].str.strip() != '') & ~(deduplicated_topic_map_df['deduplicated_category'].isnull()), :]
 
-        deduplicated_topic_map_df.to_csv(output_folder + "deduplicated_topic_map_df_" + str(i) + ".csv", index=None)
+        #deduplicated_topic_map_df.to_csv(output_folder + "deduplicated_topic_map_df_" + str(i) + ".csv", index=None)
 
         reference_df = reference_df.merge(deduplicated_topic_map_df, on="old_category", how="left")
 
@@ -1314,7 +1309,7 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
     reference_df["Subtopic"] = reference_df["deduplicated_category"].combine_first(reference_df["Subtopic_old"])
     reference_df["Sentiment"] = reference_df["Sentiment"].combine_first(reference_df["Sentiment_old"])
 
-    reference_df.to_csv(output_folder + "reference_df_after_dedup.csv", index=None)
+    #reference_df.to_csv(output_folder + "reference_table_after_dedup.csv", index=None)
 
     reference_df.drop(['old_category', 'deduplicated_category', "Subtopic_old", "Sentiment_old"], axis=1, inplace=True, errors="ignore")
 
@@ -1324,8 +1319,6 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
     reference_df["Subtopic"] = reference_df["Subtopic"].str.lower().str.capitalize()
     reference_df["Sentiment"] = reference_df["Sentiment"].str.lower().str.capitalize()
 
-
-
     # Remake unique_topics_df based on new reference_df
     unique_topics_df = create_unique_table_df_from_reference_table(reference_df)
 
@@ -1351,7 +1344,7 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
 
     all_summaries = pd.concat([all_summaries, filtered_reference_df_unique_sampled])
 
-    all_summaries.to_csv(output_folder + "all_summaries.csv", index=None)
+    #all_summaries.to_csv(output_folder + "all_summaries.csv", index=None)
 
     summarised_references = all_summaries.groupby(["General Topic", "Subtopic", "Sentiment"]).agg({
         'Response References': 'size', # Count the number of references
@@ -1360,7 +1353,7 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
 
     summarised_references = summarised_references.loc[(summarised_references["Sentiment"] != "Not Mentioned") & (summarised_references["Response References"] > 1)]
 
-    summarised_references.to_csv(output_folder + "summarised_references.csv", index=None)
+    #summarised_references.to_csv(output_folder + "summarised_references.csv", index=None)
 
     summarised_references_markdown = summarised_references.to_markdown(index=False)
 
@@ -1420,8 +1413,8 @@ def summarise_output_topics(summarised_references:pd.DataFrame,
 
     length_all_summaries = len(all_summaries)
 
-    print("latest_summary_completed:", latest_summary_completed)
-    print("length_all_summaries:", length_all_summaries)
+    #print("latest_summary_completed:", latest_summary_completed)
+    #print("length_all_summaries:", length_all_summaries)
 
     if latest_summary_completed >= length_all_summaries:
         print("All summaries completed. Creating outputs.")
@@ -1463,7 +1456,7 @@ def summarise_output_topics(summarised_references:pd.DataFrame,
     unique_table_df_revised_path = output_folder + batch_file_path_details + "_summarised_unique_topic_table_" + model_choice_clean + ".csv"
     unique_table_df_revised.to_csv(unique_table_df_revised_path, index = None)
 
-    reference_table_df_revised_path = output_folder + batch_file_path_details + "_summarised_reference_df_table_" + model_choice_clean + ".csv"
+    reference_table_df_revised_path = output_folder + batch_file_path_details + "_summarised_reference_table_" + model_choice_clean + ".csv"
    reference_table_df_revised.to_csv(reference_table_df_revised_path, index = None)
 
     output_files.extend([reference_table_df_revised_path, unique_table_df_revised_path])
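
For context on the function the debug prints were trimmed from: deduplicate_categories merges near-duplicate topic labels using rapidfuzz's token_set_ratio. A self-contained sketch of the core loop (the sample topics and the threshold value are illustrative, not the app's defaults):

```python
import pandas as pd
from rapidfuzz import fuzz, process

def deduplicate_similar(categories: pd.Series, threshold: float = 80) -> dict:
    # Map each close fuzzy match to the first-seen category, as deduplicate_categories does
    deduplication_map = {}
    for category in categories.unique():
        if category in deduplication_map:
            continue
        candidates = [cat for cat in categories.unique() if cat != category]
        matches = process.extract(category, candidates, scorer=fuzz.token_set_ratio, score_cutoff=threshold)
        if matches:
            match, score, _ = max(matches, key=lambda x: x[1])
            deduplication_map[match] = category
    return deduplication_map

topics = pd.Series(["Road safety", "Safety road", "Green spaces"])
print(deduplicate_similar(topics))  # -> {'Safety road': 'Road safety'}
```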