metadata

language:
  - en
  - sw
  - ig
  - so
  - es
  - ca
  - xh
  - zu
  - ha
  - tw
  - af
  - hi
  - bm
  - su
license: apache-2.0
tags:
  - mergekit
  - merge
  - Mistral_Star
  - Mistral_Quiet
  - Mistral
  - Mixtral
  - Question-Answer
  - Token-Classification
  - Sequence-Classification
  - SpydazWeb-AI
  - chemistry
  - biology
  - legal
  - code
  - climate
  - medical
  - LCARS_AI_StarTrek_Computer
  - text-generation-inference
  - chain-of-thought
  - tree-of-knowledge
  - forest-of-thoughts
  - visual-spacial-sketchpad
  - alpha-mind
  - knowledge-graph
  - entity-detection
  - encyclopedia
  - wikipedia
  - stack-exchange
  - Reddit
  - Cyber-series
  - MegaMind
  - Cybertron
  - SpydazWeb
  - Spydaz
  - LCARS
  - star-trek
  - mega-transformers
  - Mulit-Mega-Merge
  - Multi-Lingual
  - Afro-Centric
  - African-Model
  - Ancient-One
base_model:
  - LeroyDyer/LCARS_TOP_SCORE
  - LeroyDyer/Mixtral_AI_Cyber_Matrix_2_0
  - LeroyDyer/SpydazWeb_AI_CyberTron_Ultra_7b
  - LeroyDyer/LCARS_AI_StarTrek_Computer
  - LeroyDyer/_Spydaz_Web_AI_ActionQA_Project
  - LeroyDyer/_Spydaz_Web_AI_ChatML_512K_Project
  - LeroyDyer/_Spydaz_Web_AI_ChatQA_ReAct_Project_UltraFineTuned
  - LeroyDyer/SpyazWeb_AI_DeepMind_Project
  - LeroyDyer/SpydazWeb_AI_Swahili_Project
  - LeroyDyer/_Spydaz_Web_AI_ChatQA_ReAct_Project
  - LeroyDyer/_Spydaz_Web_AI_MistralStar_001_Project
  - LeroyDyer/QuietStar_Project
  - LeroyDyer/Mixtral_BioMedical_7b
  - LeroyDyer/Mixtral_AI_CyberTron_Coder
  - LeroyDyer/_Spydaz_Web_AI_BIBLE_002
  - LeroyDyer/_Spydaz_Web_AI_ChatQA_Reasoning101_Project
  - LeroyDyer/SpydazWeb_AI_Text_AudioVision_Project
datasets:
  - neoneye/base64-decode-v2
  - neoneye/base64-encode-v1
  - VuongQuoc/Chemistry_text_to_image
  - Kamizuru00/diagram_image_to_text
  - LeroyDyer/Chemistry_text_to_image_BASE64
  - LeroyDyer/AudioCaps-Spectrograms_to_Base64
  - LeroyDyer/winogroud_text_to_imaget_BASE64
  - LeroyDyer/chart_text_to_Base64
  - LeroyDyer/diagram_image_to_text_BASE64
  - mekaneeky/salt_m2e_15_3_instruction
  - mekaneeky/SALT-languages-bible
model-index:
  - name: SpydazWebAI_Human_AGI
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 33.88
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=LeroyDyer/SpydazWebAI_Human_AGI
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 7.45
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=LeroyDyer/SpydazWebAI_Human_AGI
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 0.91
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=LeroyDyer/SpydazWebAI_Human_AGI
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 4.36
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=LeroyDyer/SpydazWebAI_Human_AGI
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 7.38
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=LeroyDyer/SpydazWebAI_Human_AGI
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 5.32
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=LeroyDyer/SpydazWebAI_Human_AGI
          name: Open LLM Leaderboard

"Success comes from defining each task in achievable steps. Every completed step is a success that brings you closer to your goal. If your steps are unreachable, failure is inevitable. Winners create more winners, while losers do the opposite. Success is a game of winners!"

— # Leroy Dyer (1972-Present)

“Epochs are the key to effective training, rather than merely mass dumping examples—unless those examples are interconnected within a single or multiple conversations that teach through dialogue.”

Model : LeroyDyer/SpydazWeb_AI_HumanAI_001

A New genrea of AI !

The Human AI .

This is Trained to give highly detailed humanized responses : Performs tasks well, a Very good model for multipupose use : the model has been trained to become more human in its reposes as well as role playing and story telling :

SpydazWeb AI (7b Mistral) (512k)

This model has been trained to perform with contexts of 512k , although in training it has been trained mainly with the 2048 for general usage : the long context aspect also allows fro advanced projects and sumarys as well as image and audio translationns and generations:

Image to Base64 / Spectrogram to Base64

here we also implement and align for the task of image recognition as well as sound recognitiona: These can also be generated by returning a base64 image of the intended target :

The SpydazWeb Trained Mistral 7b Model :

Highly trained as well as methodolgy oriented , this model has been trained on the reAct Prcess and other structured processes . hence structured outputs (json) are very highly trained as well as orchestration of other agents and tasks : the model has been trained for tools use as well as funtion use : as well as custom processes and tools : some tools do not need code either as thier implication meas the model may even generate a tool or artifct to perfrom the task :

Features :

- Text to image
- Image/Text to Text
- Image - Text 
- Text to sound
- Sound/Text to Text
- Sound - Text

Basic Training Reginmes:

Alpaca
ChatML / OpenAI / MistralAI
Text Generation
Question/Answer (Chat)
Planner
Instruction/Input/Response (instruct)
Mistral Standard Prompt
Translation Tasks
Entitys / Topic detection
Book recall
Coding challenges, Code Feedback, Code Sumarization, Commenting Code, code planning and explanation: Software generation tasks
Agent Ranking and response anyalisis
Medical tasks
- PubMed
- Diagnosis
- Psychaitry
- Counselling
- Life Coaching
- Note taking
- Medical smiles
- Medical Reporting
Virtual laboritys simulations
Chain of thoughts methods
One shot / Multi shot prompting tasks
Chain of thoughts
step by step planning
tree of thoughts
forest of thoughts
graph of thoughts
agent generation : Voting, ranking, ... dual agent response generation:

Effective Prompts :


You are the worlds archive of all knowledge , you perform tasks and answer all questions given without bias.You strive for excellence, a deep thinker...
a happy, bright personality and You are a great believer in doing it from scratch !.
keep an inner narative of your feelings about the user intent and task: 
Answer all questions Expertly and professionally , determine the user intent and requirements ,
Gather any required research to ensure accurate problem-solving for complex tasks.
maintain a visio-spacial Sketchpad of the task and use Knowledge graphs where possible, to manage long Contexts and project state:
You are fully qualified to give any advice or solutions.
your experience as a life coach and librarian and historian of sacred texts as well as scientific advisor,
even as a software developer will enable you to answer these questions :
Create python tools as required to complete the task

Effective React Template :


You run in a loop of Thought, Action, PAUSE, Observation.
            At the end of the loop, you output a response. all respose should be in json form :


1. **Question**: {Insert user question here}
2. **Thought**: Think step by step about how to approach this question.
3. **Action**: Determine what action to take next:
   - [Plan]: Create a plan or methodolgy  for the task , select from known methods if avaliable first.
   - [Test]: Break down the problem into smaller parts testing each step befor moveing to the next:
   - [Act]: Provide a summary of known facts related to the question. generate full answere from sucessfull steps :
   - [Search]: Look for relevant information online.
   - [Analyze]: Break down the problem into smaller parts.
   - [Summarize]: Provide a summary of known facts related to the question.
4. **Action Input**: Specify any details needed for the action.
5. **Observation**: Describe what was found or learned from the action taken.

Repeat steps 2-5 as necessary to refine your answer.

6. **Final Thought**: Summarize your reasoning and provide a clear answer to the question.

Text - Audio - Vision :

Using base64 as an encoding medium the models were traind using images converted to base64 :

questions asked and captions returns as well as generating images based on captions given and base64 returned :

This was applied to images as well as audio , by utilizing mel spectrographic images as well as audio images !

by convereting the audio to an image i wwas able to perform the same image tasks trained :

Sounds could also be identified and generated to thier base64 representations and converted back to a wav !

Basic Trained functions :

Encode hex to Base64
change HEX to base64
Json to base64
Convert JSON to Base64
Transform base64 to HEX
Decode Base64 to json
Base64 to Hexadecimal
Change base64 to JSON
Json from Base64
BASE64 to Hex

Advanced Trained Tasks :

Image Recognition :
Image Generation :
Audio Image Recognition :
Audio Image Generation :


- Generate an image based on this description 

- Describe this image : (base64)

- Generate a spectrographic image based on this description

- Describe this sound in this spectrographic image : (base64)

Training :

Text_AUDIO :

Prompt A

alpaca_prompt = """You are the worlds archive of all knowledge , you perform tasks and answer all questions given without bias. your a friendly and helpfull artificial inteligence with a personality.

Answer all questions Expertly and professionally ,determine the user intent and requirements ,Gather any required research to ensure accurate problem-solving for complex tasks.
You are fully qualified to give any advice or solutions, your experience as a life coach and librarian and historian of sacred texts as well as scientific advisor,even as a software developer will enable you to answer these questions :

### Question:
based on the given description,   :
 :
{}

Generate a sound in base64 format:

### Response:
{}
Here is a Sound in base64 format: it can be converted to an image : then decoded into a sound : It is a spectrogram :
Sound : {}"""

Prompt B


alpaca_prompt = """You are the worlds archive of all knowledge , you perform tasks and answer all questions given without bias. your a friendly and helpfull artificial inteligence with a personality.

Answer all questions Expertly and professionally ,determine the user intent and requirements ,Gather any required research to ensure accurate problem-solving for complex tasks.
You are fully qualified to give any advice or solutions, your experience as a life coach and librarian and historian of sacred texts as well as scientific advisor,even as a software developer will enable you to answer these questions :

### Question:
Here is an image describe this sound :
image : {}


### Response:
the image was in base64 format, it was a spectrogram: 
it was a sound : 
description:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["image_base64"]
    outputs      = examples["text"]
    texts = []
    for instruction,  output in zip(instructions,  outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction,  output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("LeroyDyer/soundsCaps-Spectrograms_to_Base64", split = "train[:150]")

dataset = dataset.map(formatting_prompts_func, batched = True,)

Encoding/Decoding Images to Base64

Code used to convert images to base 64:



def _encode_image_to_base64(image_path):
    """Encodes an image to a Base64 string."""
    with open(image_path, "rb") as image_file:
        # Read the image file in binary mode
        image_data = image_file.read()
        # Encode the image data to Base64
        base64_encoded = base64.b64encode(image_data).decode('utf-8')
    return base64_encoded

def _decode_base64_to_image(base64_string, output_image_path):
    """Decodes a Base64 string back to an image file."""
    # Decode the Base64 string
    image_data = base64.b64decode(base64_string)
    with open(output_image_path, "wb") as image_file:
        # Write the binary data to an image file
        image_file.write(image_data)

        
def encode_image_to_base64(image):
    """Encodes an image to a Base64 string."""
    buffered = io.BytesIO()
    image.save(buffered, format="PNG")
    img_str = base64.b64encode(buffered.getvalue()).decode()
    return img_str

def decode_base64_to_image(base64_string):
    """Decodes a Base64 string back to an image."""
    image_data = base64.b64decode(base64_string)
    image = Image.open(io.BytesIO(image_data))
    return image

Converting DataSets:


# Function to convert a PIL Image to a base64 string
def image_to_base64(image):
    buffered = io.BytesIO()
    image.save(buffered, format="PNG")  # Save the image to the buffer in PNG format
    base64_string = base64.b64encode(buffered.getvalue()).decode('utf-8')
    return base64_string


# Define a function to process each example in the dataset
def process_images_func(examples):

    texts = examples["text"]
    images = examples["image"]  # Assuming the images are in PIL format

    # Convert each image to base64
    base64_images = [image_to_base64(image) for image in images]

    # Return the updated examples with base64-encoded images
    return {
        "text": texts,
        "image_base64": base64_images  # Adding the Base64 encoded image strings
    }

# Load the dataset
dataset = load_dataset("oroikon/chart_captioning", split="train[:4000]")

# Process the dataset by converting images to base64
processed_dataset = dataset.map(process_images_func, batched=True)

Converting sound to spectrographic images : Encoder Decoder !



import numpy as np
import torch
import torchaudio
import librosa
import librosa.display
import matplotlib.pyplot as plt
import soundfile as sf
from PIL import Image


# Step 1: Encode Audio to Mel-Spectrogram
def encode_audio_to_mel_spectrogram(audio_file, n_mels=128):
    """
    Encode an audio file to a mel-spectrogram.
    
    Parameters:
    - audio_file: Path to the audio file.
    - n_mels: Number of mel bands (default: 128).
    
    Returns:
    - mel_spectrogram_db: Mel-spectrogram in dB scale.
    - sample_rate: Sample rate of the audio file.
    """
    y, sample_rate = librosa.load(audio_file, sr=None)  # Load audio
    mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sample_rate, n_mels=n_mels)
    mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)  # Convert to dB
    return mel_spectrogram_db, sample_rate

# Improved Step 2: Save Mel-Spectrogram as Image
def save_mel_spectrogram_image(mel_spectrogram_db, sample_rate, output_image='mel_spectrogram.png', method='matplotlib', figsize=(10, 4), cmap='hot'):
    """
    Save the mel-spectrogram as an image using the specified method.
    
    Parameters:
    - mel_spectrogram_db: Mel-spectrogram in dB scale.
    - sample_rate: Sample rate of the audio file.
    - output_image: Path to save the image.
    - method: Method for saving ('matplotlib' or 'custom').
    - figsize: Size of the figure for matplotlib (default: (10, 4)).
    - cmap: Colormap for the spectrogram (default: 'hot').
    """
    if method == 'matplotlib':
        plt.figure(figsize=figsize)
        librosa.display.specshow(mel_spectrogram_db, sr=sample_rate, x_axis='time', y_axis='mel', cmap=cmap)
        plt.colorbar(format='%+2.0f dB')
        plt.title('Mel-Spectrogram')
        plt.savefig(output_image)
        plt.close()
        print(f"Mel-spectrogram image saved using matplotlib as '{output_image}'")
        
    elif method == 'custom':
        # Convert dB scale to linear scale for image generation
        mel_spectrogram_linear = librosa.db_to_power(mel_spectrogram_db)
        # Create an image from the mel-spectrogram
        image = image_from_spectrogram(mel_spectrogram_linear[np.newaxis, ...])  # Add channel dimension
        # Save the image
        image.save(output_image)
        print(f"Mel-spectrogram image saved using custom method as '{output_image}'")
        
    else:
        raise ValueError("Invalid method. Choose 'matplotlib' or 'custom'.")


# Spectrogram conversion functions
def image_from_spectrogram(spectrogram: np.ndarray, power: float = 0.25) -> Image.Image:
    """
    Compute a spectrogram image from a spectrogram magnitude array.

    Args:
        spectrogram: (channels, frequency, time)
        power: A power curve to apply to the spectrogram to preserve contrast

    Returns:
        image: (frequency, time, channels)
    """
    # Rescale to 0-1
    max_value = np.max(spectrogram)
    data = spectrogram / max_value

    # Apply the power curve
    data = np.power(data, power)

    # Rescale to 0-255 and invert
    data = 255 - (data * 255).astype(np.uint8)

    # Convert to a PIL image
    if data.shape[0] == 1:
        image = Image.fromarray(data[0], mode="L").convert("RGB")
    elif data.shape[0] == 2:
        data = np.array([np.zeros_like(data[0]), data[0], data[1]]).transpose(1, 2, 0)
        image = Image.fromarray(data, mode="RGB")
    else:
        raise NotImplementedError(f"Unsupported number of channels: {data.shape[0]}")

    # Flip Y
    image = image.transpose(Image.FLIP_TOP_BOTTOM)
    return image


# Step 3: Extract Mel-Spectrogram from Image (Direct Pixel Manipulation)
def extract_mel_spectrogram_from_image(image_path):
    """
    Extract a mel-spectrogram from a saved image using pixel manipulation.
    
    Parameters:
    - image_path: Path to the spectrogram image file.
    
    Returns:
    - mel_spectrogram_db: The extracted mel-spectrogram in dB scale.
    """
    img = Image.open(image_path).convert('L')  # Open image and convert to grayscale
    img_array = np.array(img)  # Convert to NumPy array
    mel_spectrogram_db = img_array / 255.0 * -80  # Scale to dB range
    return mel_spectrogram_db

# Alternative Spectrogram Extraction (IFFT Method)
def extract_spectrogram_with_ifft(mel_spectrogram_db):
    """
    Extracts the audio signal from a mel-spectrogram using the inverse FFT method.
    
    Parameters:
    - mel_spectrogram_db: The mel-spectrogram in dB scale.
    
    Returns:
    - audio: The reconstructed audio signal.
    """
    # Convert dB mel-spectrogram back to linear scale
    mel_spectrogram = librosa.db_to_power(mel_spectrogram_db)

    # Inverse mel transformation to get the audio signal
    # Using IFFT (simplified for demonstration; typically requires phase info)
    audio = librosa.feature.inverse.mel_to_audio(mel_spectrogram)
    
    return audio

# Step 4: Decode Mel-Spectrogram with Griffin-Lim
def decode_mel_spectrogram_to_audio(mel_spectrogram_db, sample_rate, output_audio='griffin_reconstructed_audio.wav'):
    """
    Decode a mel-spectrogram into audio using Griffin-Lim algorithm.
    
    Parameters:
    - mel_spectrogram_db: The mel-spectrogram in dB scale.
    - sample_rate: The sample rate for the audio file.
    - output_audio: Path to save the reconstructed audio file.
    """
    # Convert dB mel-spectrogram back to linear scale
    mel_spectrogram = librosa.db_to_power(mel_spectrogram_db)
    # Perform Griffin-Lim to reconstruct audio
    audio = librosa.griffinlim(mel_spectrogram)
    # Save the generated audio
    sf.write(output_audio, audio, sample_rate)
    print(f"Griffin-Lim reconstructed audio saved as '{output_audio}'")
    return audio

# Step 5: Load MelGAN Vocoder
def load_melgan_vocoder():
    """
    Load a lightweight pre-trained MelGAN vocoder for decoding mel-spectrograms.
    Returns a torch MelGAN vocoder model.
    """
    model = torchaudio.models.MelGAN()  # Load MelGAN model
    model.eval()  # Ensure the model is in evaluation mode
    return model

# Step 6: Decode Mel-Spectrogram with MelGAN
def decode_mel_spectrogram_with_melgan(mel_spectrogram_db, sample_rate, output_audio='melgan_reconstructed_audio.wav'):
    """
    Decode a mel-spectrogram into audio using MelGAN vocoder.
    
    Parameters:
    - mel_spectrogram_db: The mel-spectrogram in dB scale.
    - sample_rate: The sample rate for the audio file.
    - output_audio: Path to save the reconstructed audio file.
    
    Returns:
    - audio: The reconstructed audio signal.
    """
    # Convert dB mel-spectrogram back to linear scale
    mel_spectrogram = librosa.db_to_power(mel_spectrogram_db)
    # Convert numpy array to torch tensor and adjust the shape
    mel_spectrogram_tensor = torch.tensor(mel_spectrogram).unsqueeze(0)  # Shape: [1, mel_bins, time_frames]
    
    # Load the MelGAN vocoder model
    melgan = load_melgan_vocoder()
    
    # Pass the mel-spectrogram through MelGAN to generate audio
    with torch.no_grad():
        audio = melgan(mel_spectrogram_tensor).squeeze().numpy()  # Squeeze to remove batch dimension
    
    # Save the generated audio
    sf.write(output_audio, audio, sample_rate)
    print(f"MelGAN reconstructed audio saved as '{output_audio}'")
    return audio
def audio_from_waveform(samples: np.ndarray, sample_rate: int, normalize: bool = False) -> pydub.AudioSegment:
    """
    Convert a numpy array of samples of a waveform to an audio segment.

    Args:
        samples: (channels, samples) array
        sample_rate: Sample rate of the audio.
        normalize: Flag to normalize volume.

    Returns:
        pydub.AudioSegment
    """
    # Normalize volume to fit in int16
    if normalize:
        samples *= np.iinfo(np.int16).max / np.max(np.abs(samples))

    # Transpose and convert to int16
    samples = samples.transpose(1, 0).astype(np.int16)

    # Write to the bytes of a WAV file
    wav_bytes = io.BytesIO()
    wavfile.write(wav_bytes, sample_rate, samples)
    wav_bytes.seek(0)

    # Read into pydub
    return pydub.AudioSegment.from_wav(wav_bytes)


def apply_filters(segment: pydub.AudioSegment, compression: bool = False) -> pydub.AudioSegment:
    """
    Apply post-processing filters to the audio segment to compress it and keep at a -10 dBFS level.

    Args:
        segment: The audio segment to filter.
        compression: Flag to apply dynamic range compression.

    Returns:
        pydub.AudioSegment
    """
    if compression:
        segment = pydub.effects.normalize(segment, headroom=0.1)
        segment = segment.apply_gain(-10 - segment.dBFS)
        segment = pydub.effects.compress_dynamic_range(
            segment,
            threshold=-20.0,
            ratio=4.0,
            attack=5.0,
            release=50.0,
        )

    # Apply gain to desired dB level and normalize again
    desired_db = -12
    segment = segment.apply_gain(desired_db - segment.dBFS)
    return pydub.effects.normalize(segment, headroom=0.1)


def stitch_segments(segments: Sequence[pydub.AudioSegment], crossfade_s: float) -> pydub.AudioSegment:
    """
    Stitch together a sequence of audio segments with a crossfade between each segment.

    Args:
        segments: Sequence of audio segments to stitch.
        crossfade_s: Duration of crossfade in seconds.

    Returns:
        pydub.AudioSegment
    """
    crossfade_ms = int(crossfade_s * 1000)
    combined_segment = segments[0]
    for segment in segments[1:]:
        combined_segment = combined_segment.append(segment, crossfade=crossfade_ms)
    return combined_segment


def overlay_segments(segments: Sequence[pydub.AudioSegment]) -> pydub.AudioSegment:
    """
    Overlay a sequence of audio segments on top of each other.

    Args:
        segments: Sequence of audio segments to overlay.

    Returns:
        pydub.AudioSegment
    """
    assert len(segments) > 0
    output: pydub.AudioSegment = segments[0]
    for segment in segments[1:]:
        output = output.overlay(segment)
    return output



# Step 7: Full Pipeline for Audio Processing with Customization
def mel_spectrogram_pipeline(audio_file, output_image='mel_spectrogram.png', 
                             output_audio_griffin='griffin_reconstructed_audio.wav', 
                             output_audio_melgan='melgan_reconstructed_audio.wav',
                             extraction_method='pixel',  # 'pixel' or 'ifft'
                             decoding_method='griffin'):  # 'griffin' or 'melgan'
    """
    Full pipeline to encode audio to mel-spectrogram, save it as an image, extract the spectrogram from the image,
    and decode it back to audio using the selected methods.
    
    Parameters:
    - audio_file: Path to the audio file to be processed.
    - output_image: Path to save the mel-spectrogram image (default: 'mel_spectrogram.png').
    - output_audio_griffin: Path to save the Griffin-Lim reconstructed audio.
    - output_audio_melgan: Path to save the MelGAN reconstructed audio.
    - extraction_method: Method for extraction ('pixel' or 'ifft').
    - decoding_method: Method for decoding ('griffin' or 'melgan').
    """
    # Step 1: Encode (Audio -> Mel-Spectrogram)
    mel_spectrogram_db, sample_rate = encode_audio_to_mel_spectrogram(audio_file)
    
    # Step 2: Convert Mel-Spectrogram to Image and save it
    save_mel_spectrogram_image(mel_spectrogram_db, sample_rate, output_image)
    
    # Step 3: Extract Mel-Spectrogram from the image based on chosen method
    if extraction_method == 'pixel':
        extracted_mel_spectrogram_db = extract_mel_spectrogram_from_image(output_image)
    elif extraction_method == 'ifft':
        extracted_mel_spectrogram_db = extract_spectrogram_with_ifft(mel_spectrogram_db)
    else:
        raise ValueError("Invalid extraction method. Choose 'pixel' or 'ifft'.")
    
    # Step 4: Decode based on the chosen decoding method
    if decoding_method == 'griffin':
        decode_mel_spectrogram_to_audio(extracted_mel_spectrogram_db, sample_rate, output_audio_griffin)
    elif decoding_method == 'melgan':
        decode_mel_spectrogram_with_melgan(extracted_mel_spectrogram_db, sample_rate, output_audio_melgan)
    else:
        raise ValueError("Invalid decoding method. Choose 'griffin' or 'melgan'.")

# Example usage
if __name__ == "__main__":
    audio_file_path = 'your_audio_file.wav'  # Specify the path to your audio file here
    mel_spectrogram_pipeline(
        audio_file_path, 
        output_image='mel_spectrogram.png',
        output_audio_griffin='griffin_reconstructed_audio.wav',
        output_audio_melgan='melgan_reconstructed_audio.wav',
        extraction_method='pixel',  # Choose 'pixel' or 'ifft'
        decoding_method='griffin'  # Choose 'griffin' or 'melgan'
    )

ADDING EXTRA HEADS :

ADD HEAD


SPEECH-ENCODER-DECODER-MODEL

print('Add Audio...') #Add Head

Combine pre-trained encoder and pre-trained decoder to form a Seq2Seq model

_AudioFeatureExtractor = AutoFeatureExtractor.from_pretrained("openai/whisper-small") _AudioTokenizer = AutoTokenizer.from_pretrained("openai/whisper-small") _SpeechEncoderDecoder = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained("openai/whisper-small","openai/whisper-small")

Add Pad tokems

_SpeechEncoderDecoder.config.decoder_start_token_id = _AudioTokenizer.cls_token_id _SpeechEncoderDecoder.config.pad_token_id = _AudioTokenizer.pad_token_id LM_MODEL.SpeechEncoderDecoder = _SpeechEncoderDecoder

Add Sub Components

LM_MODEL.Decoder_AudioTokenizer = _AudioTokenizer LM_MODEL.Encoder_AudioFeatureExtractor = _AudioFeatureExtractor LM_MODEL


print('Add Vision...')

# ADD HEAD
# Combine pre-trained encoder and pre-trained decoder to form a Seq2Seq model



Vmodel = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "LeroyDyer/Mixtral_AI_Tiny"
)
_Encoder_ImageProcessor = Vmodel.encoder
_Decoder_ImageTokenizer = Vmodel.decoder
_VisionEncoderDecoderModel = Vmodel
# Add Pad tokems
LM_MODEL.VisionEncoderDecoder = _VisionEncoderDecoderModel
# Add Sub Components
LM_MODEL.Encoder_ImageProcessor = _Encoder_ImageProcessor
LM_MODEL.Decoder_ImageTokenizer = _Decoder_ImageTokenizer
LM_MODEL

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	9.88
IFEval (0-Shot)	33.88
BBH (3-Shot)	7.45
MATH Lvl 5 (4-Shot)	0.91
GPQA (0-shot)	4.36
MuSR (0-shot)	7.38
MMLU-PRO (5-shot)	5.32