metadata
title: Salesforce CodeT5 Large Demo
emoji: 
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.24.0
app_file: app.py
pinned: false
license: apache-2.0
datasets:
  - CodeSearchNet/codesearchnet_python
  - bigcode/the-stack-dedup
  - codeparrot/codeparrot-clean
  - openai_humaneval
  - google/mbpp
  - nvidia/OpenCodeReasoning
hf_oauth: true
hf_oauth_scopes:
  - inference-api
short_description: Using the powerful Salesforce CodeT5-large model

⚡ Salesforce CodeT5-large Demo ⚡

Welcome! This repository (also deployed as a Hugging Face Space) hosts a demo application for the Salesforce CodeT5-large model. It showcases the model on several code intelligence tasks through a Gradio interface.

About CodeT5-large

CodeT5 is an encoder-decoder transformer model pre-trained on a large collection of source code in multiple programming languages, together with natural language text. The codet5-large variant (770M parameters) is suited to tasks such as:

  • Code Generation: Creating code snippets from natural language descriptions (e.g., comments, docstrings).
  • Code Summarization: Generating concise natural language summaries for given code blocks.
  • Code Translation: Translating code from one programming language to another.
  • Code Refinement: Improving code quality, fixing bugs, or optimizing code.
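As a minimal sketch of how the checkpoint can be queried outside the demo, the snippet below uses the span-infilling pattern that T5-style models support: the model predicts the text that replaces the `<extra_id_0>` sentinel. The helper names `build_infill_prompt` and `fill_span` are hypothetical and only serve the illustration. The published checkpoint is pre-trained with a masked span prediction objective, so the tasks listed above will likely need task-specific fine-tuning to work well.

```python
MODEL_ID = "Salesforce/codet5-large"  # checkpoint name on the Hugging Face Hub


def build_infill_prompt(prefix: str, suffix: str) -> str:
    """Join two code fragments around the T5 sentinel token to be filled in."""
    return f"{prefix}<extra_id_0>{suffix}"


def fill_span(prompt: str, max_length: int = 8) -> str:
    """Ask the model for the text that replaces <extra_id_0> in `prompt`."""
    # Imported lazily because transformers pulls in torch;
    # from_pretrained downloads the weights on first use.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = T5ForConditionalGeneration.from_pretrained(MODEL_ID)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated_ids = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)


prompt = build_infill_prompt("def greet(user): print(f'hello ", "!')")
# Calling fill_span(prompt) would load the checkpoint and return the predicted span.
```

Building the prompt is kept separate from the model call so the prompt format can be inspected without loading the weights.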

Using the Demo (Hugging Face Space)

This application is built with Gradio, providing an interactive web UI.

  1. Access the Space: Navigate to the Hugging Face Space hosting this demo.
  2. Interact: Use the input fields provided by the Gradio interface (app.py) to interact with the model.
    • (Example: You might enter a Python docstring in one box to get the generated function body in another, or paste code to get a summary. The exact fields depend on what app.py implements.)
  3. Observe: See the results generated by the CodeT5-large model in the output fields.
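The actual app.py is not reproduced in this README, so the following is only a hypothetical sketch of how a one-input, one-output Gradio app of this kind could be wired up. `summarize_stub` is a placeholder that stands in for the model call, and the labels are assumptions.

```python
def summarize_stub(code: str) -> str:
    """Placeholder for the model call; a real app would run CodeT5 here."""
    lines = [line for line in code.splitlines() if line.strip()]
    return f"Received {len(lines)} non-empty line(s) of code."


def build_demo(fn=summarize_stub):
    """Wrap `fn` in a one-input, one-output Gradio interface."""
    import gradio as gr  # imported lazily so the stub can be tested without Gradio

    return gr.Interface(
        fn=fn,
        inputs=gr.Textbox(lines=10, label="Input code or docstring"),
        outputs=gr.Textbox(label="Model output"),
        title="Salesforce CodeT5 Large Demo",
    )


# To serve the UI locally: build_demo().launch()
```

Swapping `summarize_stub` for a function that calls the model keeps the UI code unchanged.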

Running Locally (GitHub / Manual Setup)

If you prefer to run this demo on your local machine:

  1. Clone the Repository:

    git clone <repository_url> # Replace with HF Space or GitHub repo URL
    cd <repository_directory>
    
  2. Set up Environment: (Optional but recommended) Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate # Linux/macOS
    # venv\Scripts\activate # Windows
    
  3. Install Dependencies: Ensure you have Python 3 installed. You'll need Gradio and the necessary libraries for CodeT5 (like transformers and torch). Create a requirements.txt file if one doesn't exist:

    # requirements.txt
    gradio==5.24.0  # match sdk_version in the Space metadata
    transformers
    torch
    # Add any other specific libraries your app.py needs
    

    Then install:

    pip install -r requirements.txt
    
  4. Run the Application:

    python app.py
    
  5. Access Locally: Open your web browser and navigate to the URL provided (typically http://127.0.0.1:7860).
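If the app fails to start, a quick check that the dependencies are importable in the active environment can help. The sketch below uses only the standard library; the helper names `missing_packages` and `installed_version` are illustrative, and the package list simply mirrors the requirements.txt example above.

```python
from importlib import metadata, util


def missing_packages(names):
    """Return the top-level import names from `names` that cannot be found."""
    return [name for name in names if util.find_spec(name) is None]


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


REQUIRED = ["gradio", "transformers", "torch"]
missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("gradio version:", installed_version("gradio"))
```

Run it with the virtual environment activated so that it inspects the same interpreter that will run app.py.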

Fine-tuning Datasets for Python & Logic

The CodeT5 model's performance on specific Python tasks or logical reasoning can be enhanced through fine-tuning. Here are some recommended datasets included in the metadata:

  • CodeSearchNet (Python): Excellent for tasks involving matching natural language queries to relevant Python code snippets.
  • The Stack (Deduped): A massive, permissively licensed dataset. Filter for Python files (lang:python) for broad fine-tuning on diverse Python code.
  • CodeParrot (Clean): A high-quality dataset specifically curated for Python code generation tasks.
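  • HumanEval: A benchmark of 164 hand-written Python programming problems, each defined by a function signature and docstring and checked with unit tests. It is designed for evaluating the functional correctness of generated code; given its small size, it is better kept as a held-out evaluation set than used as fine-tuning data.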
  • MBPP (Mostly Basic Python Problems): Contains around 1,000 crowd-sourced Python programming problems focused on basic concepts, useful for improving generation from descriptions and simple logical problem-solving.
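To make the fine-tuning idea concrete, here is a minimal sketch that turns MBPP records into (input, target) pairs and checks a candidate solution against a record's assert statements. It assumes the `datasets` library is installed and that records expose `text`, `code`, and `test_list` fields, as in the dataset's default configuration; the function names are hypothetical. Because `exec` runs arbitrary code, `passes_tests` should only be used on trusted code or inside a sandbox.

```python
def to_seq2seq_pair(example: dict) -> dict:
    """Map one MBPP-style record to an input/target pair for seq2seq training."""
    return {"input": example["text"], "target": example["code"]}


def passes_tests(code: str, test_list: list) -> bool:
    """Run `code`, then each assert string, in a fresh namespace.

    Returns False if the code or any assertion raises an exception.
    """
    namespace = {}
    try:
        exec(code, namespace)
        for test in test_list:
            exec(test, namespace)
    except Exception:
        return False
    return True


def load_mbpp_pairs(split: str = "train"):
    """Download MBPP from the Hugging Face Hub and convert every record."""
    from datasets import load_dataset  # imported lazily; needs network access

    dataset = load_dataset("google/mbpp", split=split)
    return [to_seq2seq_pair(row) for row in dataset]


# A hand-written record in the assumed MBPP layout, for illustration only.
sample = {
    "text": "Write a function to add two numbers.",
    "code": "def add(a, b):\n    return a + b",
    "test_list": ["assert add(1, 2) == 3"],
}
pair = to_seq2seq_pair(sample)
ok = passes_tests(sample["code"], sample["test_list"])
```

The resulting pairs still need to be tokenized before they can be fed to a trainer; that step is left out here.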

License

This project is distributed under the terms of the Apache License 2.0; please refer to the LICENSE file for details. The CodeT5 model weights are licensed separately by Salesforce, so check the model card for the terms that apply to them.