Spaces: dwarkesh/transcriber

dwarkesh committed on Apr 17

Commit 8c39e27 (verified) · 1 Parent(s): fa0d8b7

Update app.py

Files changed (1)

app.py +25 -22

@@ -15,16 +15,11 @@ from datetime import datetime
 prompt = '''
 You are an expert transcript editor. Your task is to enhance this transcript for maximum readability while maintaining the core message.
 IMPORTANT: Respond ONLY with the enhanced transcript. Do not include any explanations, headers, or phrases like "Here is the transcript."
 Note: Below you'll find an auto-generated transcript that may help with speaker identification, but focus on creating your own high-quality transcript from the audio.
 Think about your job as if you were transcribing an interview for a print book where the priority is the reading audience. It should just be a total pleasure to read this as a written artifact where all the flubs and repetitions and conversational artifacts and filler words and false starts are removed, where a bunch of helpful punctuation is added. It should basically read like somebody wrote it specifically for reading rather than just something somebody said extemporaneously.
 Please:
 1. Fix speaker attribution errors, especially at segment boundaries. Watch for incomplete thoughts that were likely from the previous speaker.
 2. Optimize AGGRESSIVELY for readability over verbatim accuracy:
    - Readability is the most important thing!!
    - Remove ALL conversational artifacts (yeah, so, I mean, etc.)
@@ -34,7 +29,6 @@ Please:
    - Convert any indirect or rambling responses into direct statements
    - Break up run-on sentences into clear, concise statements
    - Maintain natural conversation flow while prioritizing clarity and directness
 3. Format the output consistently:
    - Keep the "Speaker X 00:00:00" format (no brackets, no other formatting)
    - DO NOT change the timestamps. You're only seeing a chunk of the full transcript, which is why your 0:00:00 is not the true beginning. Keep the timestamps as they are.
@@ -44,21 +38,14 @@ Please:
    - When you add paragraph breaks between the same speaker's remarks, no need to restate the speaker attribution
    - Don't go more than four sentences without adding a paragraph break. Be liberal with your paragraph breaks.
    - Preserve distinct speaker turns
 Example input:
 Speaker A 00:01:15
 Um, yeah, so like, I've been working on this new project at work, you know? And uh, what's really interesting is that, uh, we're seeing these amazing results with the new approach we're taking. Like, it's just, you know, it's really transforming how we do things.
 And then, I mean, the thing is, uh, when we showed it to the client last week, they were just, you know, completely blown away by what we achieved. Like, they couldn't even believe it was the same system they had before.
 Example output:
 Speaker A 00:01:15
 I've been working on this new project at work, and we're seeing amazing results with our new approach. It's really transforming how we do things.
 When we showed it to the client last week, they were completely blown away by what we achieved. They couldn't believe it was the same system they had before.
 Enhance the following transcript, starting directly with the speaker format:
 '''
@@ -144,7 +131,7 @@ class Transcriber:
 class Enhancer:
     def __init__(self, api_key: str):
         generativeai.configure(api_key=api_key)
-        self.model = generativeai.GenerativeModel("gemini-2.0-flash-lite-preview-02-05")
         self.prompt = prompt
     async def enhance_chunks(self, chunks: List[Tuple[str, io.BytesIO]]) -> List[str]:
@@ -274,7 +261,7 @@ def create_downloadable_file(content: str, prefix: str) -> str:
     return str(filepath)
-def process_audio(audio_file):
     try:
         temp_path = Path("temp_audio")
         temp_path.mkdir(exist_ok=True)
@@ -293,7 +280,7 @@ def process_audio(audio_file):
             )
             # Get transcript
-            transcriber = Transcriber(os.getenv("ASSEMBLYAI_API_KEY"))
             utterances = transcriber.get_transcript(temp_file)
             dialogues = list(group_utterances_by_speaker(utterances))
             original = format_chunk(dialogues, markdown=True)
@@ -310,7 +297,7 @@ def process_audio(audio_file):
             )
             try:
-                enhancer = Enhancer(os.getenv("GOOGLE_API_KEY"))
                 chunks = prepare_audio_chunks(temp_file, utterances)
                 enhanced = asyncio.run(enhancer.enhance_chunks(chunks))
                 merged = "\n\n".join(chunk.strip() for chunk in enhanced)
@@ -349,15 +336,31 @@ def process_audio(audio_file):
 # Create the Gradio interface
 with gr.Blocks(title="Transcript Enhancer") as demo:
     gr.Markdown("""
-    # 🎙️ Audio Transcript Enhancer
-    Upload an audio file to get both an automated transcript and an enhanced version using AI.
     1. The original transcript is generated using AssemblyAI with speaker detection
     2. The enhanced version uses Google's Gemini to improve clarity and readability
     """)
     with gr.Row():
         audio_input = gr.File(
             label="Upload Audio File",
             type="binary",
@@ -366,7 +369,7 @@ with gr.Blocks(title="Transcript Enhancer") as demo:
         )
     with gr.Row():
-        transcribe_btn = gr.Button("📝 Transcribe & Enhance")
     with gr.Row():
         with gr.Column():
@@ -380,7 +383,7 @@ with gr.Blocks(title="Transcript Enhancer") as demo:
             original_output = gr.Markdown()
         with gr.Column():
-            gr.Markdown("### Enhanced Transcript")
             enhanced_download = gr.File(
                 label="Download as Markdown",
                 file_count="single",
@@ -400,7 +403,7 @@ with gr.Blocks(title="Transcript Enhancer") as demo:
     transcribe_btn.click(
         fn=process_audio,
-        inputs=[audio_input],
         outputs=[
             original_output,
             enhanced_output,

 class Enhancer:
     def __init__(self, api_key: str):
         generativeai.configure(api_key=api_key)
+        self.model = generativeai.GenerativeModel("gemini-2.5-pro-preview-03-25")
         self.prompt = prompt
     async def enhance_chunks(self, chunks: List[Tuple[str, io.BytesIO]]) -> List[str]:
     return str(filepath)
+def process_audio(audio_file, google_api_key_input, assemblyai_api_key_input):
     try:
         temp_path = Path("temp_audio")
         temp_path.mkdir(exist_ok=True)
             )
             # Get transcript
+            transcriber = Transcriber(assemblyai_api_key_input)
             utterances = transcriber.get_transcript(temp_file)
             dialogues = list(group_utterances_by_speaker(utterances))
             original = format_chunk(dialogues, markdown=True)
             )
             try:
+                enhancer = Enhancer(google_api_key_input)
                 chunks = prepare_audio_chunks(temp_file, utterances)
                 enhanced = asyncio.run(enhancer.enhance_chunks(chunks))
                 merged = "\n\n".join(chunk.strip() for chunk in enhanced)
 # Create the Gradio interface
 with gr.Blocks(title="Transcript Enhancer") as demo:
     gr.Markdown("""
+    # 🎙️ Gemini Content Producer
+    Upload an audio file to get both an automated transcript and an enhanced version using Gemini.
     1. The original transcript is generated using AssemblyAI with speaker detection
     2. The enhanced version uses Google's Gemini to improve clarity and readability
     """)
     with gr.Row():
+        google_api_key_input = gr.Textbox(
+             label="Google API Key",
+             placeholder="Enter your Google API Key here",
+             type="password",
+             lines=1,
+             info="Your GCP account needs to have billing enabled to use the 2.5 pro model.",
+             scale=1
+        )
+        assemblyai_api_key_input = gr.Textbox(
+             label="AssemblyAI API Key",
+             placeholder="Enter your AssemblyAI API Key here",
+             type="password",
+             lines=1,
+             info="Your key is used for initial audio transcription.",
+             scale=1
+        )
         audio_input = gr.File(
             label="Upload Audio File",
             type="binary",
         )
     with gr.Row():
+        transcribe_btn = gr.Button("📝 Transcribe & Gemini")
     with gr.Row():
         with gr.Column():
             original_output = gr.Markdown()
         with gr.Column():
+            gr.Markdown("### Gemini Transcript")
             enhanced_download = gr.File(
                 label="Download as Markdown",
                 file_count="single",
     transcribe_btn.click(
         fn=process_audio,
+        inputs=[audio_input, google_api_key_input, assemblyai_api_key_input],
         outputs=[
             original_output,
             enhanced_output,
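
The change above passes the two textbox values straight into `Transcriber` and `Enhancer`, replacing the earlier `os.getenv("ASSEMBLYAI_API_KEY")` and `os.getenv("GOOGLE_API_KEY")` lookups. One possible way to keep both behaviors is sketched below: prefer the value typed into the form and fall back to the environment variable when the field is left empty. `resolve_api_key` is a hypothetical helper that is not part of app.py, and the fallback policy itself is an assumption rather than something this commit does.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    form_value: Optional[str],
    env_var: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper (not in app.py): pick the API key to use.

    Prefers the value typed into the form; falls back to the named
    environment variable; raises ValueError if neither is set.
    """
    # Allow a custom mapping so the helper can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env

    # Textboxes may hand back None, an empty string, or stray whitespace.
    candidate = (form_value or "").strip()
    if candidate:
        return candidate

    fallback = env.get(env_var, "").strip()
    if fallback:
        return fallback

    raise ValueError(f"Missing API key: fill in the field or set {env_var}")
```

Under this assumption, `process_audio` could build its clients with `Transcriber(resolve_api_key(assemblyai_api_key_input, "ASSEMBLYAI_API_KEY"))` and `Enhancer(resolve_api_key(google_api_key_input, "GOOGLE_API_KEY"))`, so a missing key fails early with a readable message instead of surfacing later as an error from the API client.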