license: apache-2.0
language:
  - ks
tags:
  - Text

Kashmiri Text Generation Model

Model Overview

This is a transformer-based model for generating text in the Kashmiri language. It uses a decoder-only architecture with sinusoidal positional encoding and self-attention, and it tokenizes text at the word level.

TRY LIVE DEMO ON SPACES

VIEW HERE (Click)

Intended Use

  • Primary Use: Generating coherent Kashmiri text continuations from given prompts
  • Intended Users: Researchers and developers working with Kashmiri language processing
  • Out-of-Scope Uses: Not intended for production deployment without further evaluation

Model Architecture

  • Type: Decoder-only Transformer
  • Components:
    • Positional Encoding
    • Embedding Layer
    • Transformer Decoder Layers
    • Linear Output Layer
  • Implementation: PyTorch

This is a custom transformer-based text generation model for the Kashmiri language.
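
The Positional Encoding component listed above is sinusoidal in the usage code further down. As a rough illustration, the following pure-Python sketch computes the same values for a single position without PyTorch; the function name sinusoidal_encoding is hypothetical and not part of the repository.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    encoding = [0.0] * d_model
    for i in range(0, d_model, 2):
        # Same frequency term as the PyTorch code: exp(i * -ln(10000) / d_model)
        angle = position * math.exp(i * (-math.log(10000.0) / d_model))
        encoding[i] = math.sin(angle)          # even indices use sine
        if i + 1 < d_model:
            encoding[i + 1] = math.cos(angle)  # odd indices use cosine
    return encoding

print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

At position 0 every sine term is 0 and every cosine term is 1; later positions rotate each pair at a different frequency, which is what lets the model tell positions apart.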

Model Details

  • Architecture: Custom Transformer Decoder
  • Vocabulary Size: 36100
  • Embedding Dimension: 256
  • Number of Layers: 4
  • Number of Attention Heads: 8
  • Sequence Length: 64
  • Training Data: Kashmiri text corpus
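
As a quick sanity check on these numbers, the sketch below works out the size of the embedding table, the size of the output layer, and the per-head dimension. It assumes the values listed above match model_config.json, and that the output layer is a standard linear layer with a bias term, as in the usage code further down.

```python
# Values copied from the Model Details list (assumed to match model_config.json)
vocab_size = 36100
embed_dim = 256
num_heads = 8

embedding_params = vocab_size * embed_dim            # one 256-d vector per word
output_params = embed_dim * vocab_size + vocab_size  # linear weights plus bias
head_dim = embed_dim // num_heads                    # dimension handled by each head

print(embedding_params)  # 9241600
print(output_params)     # 9277700
print(head_dim)          # 32
```

With a word-level vocabulary this large, the embedding and output layers alone account for roughly 18.5 million parameters.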

Technical Specifications

  • Framework: PyTorch
  • Input: Text prompts in Kashmiri
  • Output: Generated text continuation
  • Model Parameters:
    • Embedding Dimension: Specified in model_config.json
    • Number of Layers: Specified in model_config.json
    • Number of Attention Heads: Specified in model_config.json
    • Sequence Length: Specified in model_config.json
    • Dropout Rate: 0.2
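
Because the architecture parameters are read from model_config.json at load time, it can help to validate that file before building the model. The following is a minimal sketch: the key names are the ones the loading code in the Usage section reads, while the check_config helper itself is hypothetical.

```python
REQUIRED_KEYS = ("vocab_size", "embed_dim", "num_layers", "num_heads", "sequence_length")

def check_config(config):
    """Return a list of problems found in a model_config.json dictionary."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    if problems:
        return problems
    for key in REQUIRED_KEYS:
        if not isinstance(config[key], int) or config[key] <= 0:
            problems.append(f"{key} must be a positive integer")
    # Multi-head attention splits embed_dim evenly across the heads
    if not problems and config["embed_dim"] % config["num_heads"] != 0:
        problems.append("embed_dim must be divisible by num_heads")
    return problems

# Example values taken from the Model Details section
example = {"vocab_size": 36100, "embed_dim": 256, "num_layers": 4,
           "num_heads": 8, "sequence_length": 64}
print(check_config(example))  # []
```

In practice you would pass the dictionary returned by json.load on model_config.json instead of the example dictionary.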

Files Structure

root/
├── model.pt              # Trained model weights
├── word_to_int.json      # Word to integer mapping
├── int_to_word.json      # Integer to word mapping
└── model_config.json     # Model configuration

NOTE

  1. Ensure all required files are present in the root directory

Setup in Google Colab

  1. Create a new Google Colab notebook
  2. Copy and paste the following code into a code cell:
    !git clone https://huggingface.co/Omarrran/Kashur_gpt_version_1
    

Required Files

The model requires the following files which will be downloaded from the HuggingFace repository:

  • model.pt: The trained model weights
  • model_config.json: Model configuration parameters
  • word_to_int.json: Vocabulary mapping from words to integers
  • int_to_word.json: Vocabulary mapping from integers to words
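
A short check like the one below can confirm that the four files are in place before loading the model. This is a sketch only: the missing_files helper is hypothetical, and it assumes the files should sit in the current working directory.

```python
import os

REQUIRED_FILES = ["model.pt", "model_config.json", "word_to_int.json", "int_to_word.json"]

def missing_files(directory="."):
    """Return the required files that are not present in `directory`."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(directory, name))]

missing = missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```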

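The loading code in the Usage section opens these files by relative path, so they must sit in the working directory. In Colab, move them from the cloned folder into /content/:
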
import os
import shutil

# Define the source and destination paths
source_path = "/content/Kashur_gpt_version_1/"
destination_path = "/content/"

# Loop through all files in the source directory and move them to the destination
for filename in os.listdir(source_path):
    file_path = os.path.join(source_path, filename)
    if os.path.isfile(file_path):
        shutil.move(file_path, destination_path)

print(f"All files from {source_path} moved to {destination_path}")

Usage

1. Import Required Libraries

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import json
import os

2. Define and Load the Model

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

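def generate_square_subsequent_mask(sz):
    # Causal mask: 0.0 where a position may attend (itself and earlier positions), -inf for later positions
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask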

class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

class TextGen(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, sequence_length):
        super(TextGen, self).__init__()
        self.pos_encoder = PositionalEncoding(max_len=sequence_length, d_model=embed_dim)
        self.emb = nn.Embedding(vocab_size, embed_dim)
        self.decoder_layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer=self.decoder_layer, num_layers=num_layers)
        self.linear = nn.Linear(embed_dim, vocab_size)
        self.dropout = nn.Dropout(0.2)

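    def forward(self, x):
        emb = self.emb(x)
        # Causal mask so each position attends only to itself and earlier positions
        input_mask = generate_square_subsequent_mask(x.size(1)).to(x.device)
        x = self.pos_encoder(emb)
        # There is no encoder: the sequence itself is passed as memory, with the same causal mask
        x = self.decoder(x, memory=x, tgt_mask=input_mask, memory_mask=input_mask)
        x = self.dropout(x)
        out = self.linear(x)
        return out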

def load_model():
    # Load configuration
    with open('model_config.json', 'r') as f:
        config = json.load(f)

    # Load vocabularies
    with open('word_to_int.json', 'r', encoding='utf-8') as f:
        word_to_int = json.load(f)
    with open('int_to_word.json', 'r', encoding='utf-8') as f:
        int_to_word = json.load(f)

    # Initialize model
    model = TextGen(
        vocab_size=config['vocab_size'],
        embed_dim=config['embed_dim'],
        num_layers=config['num_layers'],
        num_heads=config['num_heads'],
        sequence_length=config['sequence_length']
    ).to(device)

    # Load model weights
    model.load_state_dict(torch.load('model.pt', map_location=device))
    model.eval()

    return model, word_to_int, int_to_word, config['sequence_length']

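@torch.no_grad()
def generate_text(model, prompt, word_to_int, int_to_word, sequence_length, max_length=100, temperature=1.0):
    model.eval()
    # Word-level tokenization: split on whitespace; out-of-vocabulary words map to index 0
    words = prompt.split()
    if not words:
        raise ValueError("prompt must contain at least one word")
    current_seq = torch.LongTensor([word_to_int.get(w, 0) for w in words]).unsqueeze(0).to(device)

    for _ in range(max_length):
        # Keep only the most recent sequence_length tokens as context
        if current_seq.size(1) > sequence_length:
            current_seq = current_seq[:, -sequence_length:]

        output = model(current_seq)
        # Scale the last position's logits by temperature, then sample the next token
        next_token_logits = output[:, -1, :] / temperature
        next_token = torch.multinomial(F.softmax(next_token_logits, dim=-1), num_samples=1)

        current_seq = torch.cat([current_seq, next_token], dim=1)
        # JSON object keys are strings, so look up the token id as a string
        next_word = int_to_word.get(str(next_token.item()), "<UNK>")
        words.append(next_word)

        # Stop early once the model emits a standalone "." token
        if next_word == ".":
            break

    return " ".join(words)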

if __name__ == "__main__":
    # Load model and required files
    model, word_to_int, int_to_word, sequence_length = load_model()

Load the Model

Running the code above loads the model weights, the configuration, and both vocabularies. The model runs on a CUDA GPU if one is available and falls back to the CPU otherwise.
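
To see what the generate_square_subsequent_mask helper in the code above produces, here is a pure-Python sketch of the same pattern. The causal_mask function is a hypothetical stand-in that returns nested lists instead of a tensor.

```python
def causal_mask(size):
    """Row i holds 0.0 for positions <= i (visible) and -inf for positions > i (hidden)."""
    return [[0.0 if col <= row else float("-inf") for col in range(size)]
            for row in range(size)]

for row in causal_mask(3):
    print(row)
# [0.0, -inf, -inf]
# [0.0, 0.0, -inf]
# [0.0, 0.0, 0.0]
```

The mask is added to the attention scores before the softmax, so the -inf entries receive zero attention weight and each word can only look at itself and the words before it.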

3. Generate Text

To generate text, use the following format:

# Example prompt (in Kashmiri)
prompt = " دِتم مصمت۔یم بگُل غلام چھُ آں تس اکھ حمزہ گویی"   # Replace with your Kashmiri text prompt

generated_text = generate_text(
    model, 
    prompt, 
    word_to_int, 
    int_to_word,
    sequence_length, 
    max_length=100  # Adjust this value to control output length
)
print(f"Generated text: {generated_text}")

Generation Parameters

You can adjust the following parameters for text generation:

  • max_length: Maximum number of words (tokens) to generate (default: 100)
  • temperature: Controls randomness in generation (default: 1.0)
    • Higher values (>1.0) make the output more random and diverse
    • Lower values (<1.0) make the output more focused and deterministic
  • sequence_length: Maximum context window size, read from model_config.json
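
The effect of temperature can be illustrated with a small pure-Python sketch of the scaled softmax that the generation loop applies before sampling. The logits here are made-up numbers rather than model output, and the helper name softmax_with_temperature is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by temperature, then normalize them into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate words
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring word, while higher temperatures flatten the distribution so that less likely words are sampled more often.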

Limitations

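  • The model operates at word-level tokenization: prompts are split on whitespace, and words outside the vocabulary are mapped to index 0
  • Only the most recent sequence_length words (set in the configuration) are used as context; earlier words are dropped
  • Generation stops when the model emits a standalone "." token or after max_length words, whichever comes first
  • Performance may vary based on input prompt quality and length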
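
The tokenization and context limits above can be previewed before calling the model. This sketch mirrors how the generation code encodes a prompt, using a toy vocabulary. The encode_prompt helper is hypothetical, and the real word_to_int mapping comes from word_to_int.json.

```python
def encode_prompt(prompt, word_to_int, sequence_length):
    """Return (token ids kept as context, words not found in the vocabulary)."""
    words = prompt.split()
    unknown = [w for w in words if w not in word_to_int]
    ids = [word_to_int.get(w, 0) for w in words]  # unknown words fall back to index 0
    return ids[-sequence_length:], unknown        # keep only the most recent tokens

toy_vocab = {"a": 1, "b": 2, "c": 3}  # stand-in for word_to_int.json
ids, unknown = encode_prompt("a b x c", toy_vocab, sequence_length=3)
print(ids)      # [2, 0, 3]
print(unknown)  # ['x']
```

A long list of unknown words suggests the prompt's spelling or spacing differs from the training corpus, which is likely to reduce output quality.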

Performance Considerations

  • Runs on both CPU and CUDA-enabled GPUs
  • Memory usage scales with sequence length and batch size
  • Inference speed depends on hardware capabilities and generation parameters
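
For a rough sense of how memory scales, the sketch below estimates the size of a single float32 attention-score matrix. It is a back-of-the-envelope figure only: the helper name is hypothetical, and actual memory use also includes the weights, the activations, and framework overhead.

```python
def attention_scores_bytes(batch_size, num_heads, sequence_length, bytes_per_value=4):
    """Bytes for one (batch, heads, seq, seq) attention-score tensor in float32."""
    return batch_size * num_heads * sequence_length * sequence_length * bytes_per_value

print(attention_scores_bytes(1, 8, 64))  # 131072
print(attention_scores_bytes(4, 8, 64))  # 524288
```

The estimate grows linearly with batch size and quadratically with sequence length.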

Dependencies

  • Python 3.6+
  • PyTorch
  • math, json, and os (Python standard library modules; no separate installation needed)

License

Apache 2.0 (see the license field in the metadata at the top of this card)

Citation

If you use this model in your research, please cite:

@misc{kashmiri_text_gen,
  author = {Haq Nawaz Malik},
  title = {Kashmiri Text Generation Model},
  year = {2024},
  journal = {for Preprint},
  howpublished = {\url{https://huggingface.co/Omarrran/kashmiri_text_gen_model}}
}

Contact

[Add contact information for model maintainers]

Updates and Maintenance

  • Version: 1.0
  • Last Updated: 26-10-2024
  • Status: An updated version is in progress