# Introduction

<div>
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
  <img alt="Discord" src="https://img.shields.io/discord/1214047546020728892?color=%23738ADB&label=Discord&logo=discord&logoColor=white&style=flat-square"/>
</a>
<a target="_blank" href="http://qm.qq.com/cgi-bin/qm/qr?_wv=1027&k=jCKlUP7QgSm9kh95UlBoYv6s1I-Apl1M&authKey=xI5ttVAp3do68IpEYEalwXSYZFdfxZSkah%2BctF5FIMyN2NqAa003vFtLqJyAVRfF&noverify=0&group_code=593946093">
  <img alt="QQ" src="https://img.shields.io/badge/QQ Group-%2312B7F5?logo=tencent-qq&logoColor=white&style=flat-square"/>
</a>
<a target="_blank" href="https://hub.docker.com/r/fishaudio/fish-speech">
  <img alt="Docker" src="https://img.shields.io/docker/pulls/fishaudio/fish-speech?style=flat-square&logo=docker"/>
</a>
</div>

!!! warning
    We assume no responsibility for any illegal use of the codebase. Please refer to the DMCA (Digital Millennium Copyright Act) and other relevant laws in your area. <br/>
    This codebase and all models are released under the CC-BY-NC-SA-4.0 license.

<p align="center">
  <img src="../assets/figs/diagram.png" width="75%">
</p>

## Requirements

- GPU Memory: 4GB (for inference), 8GB (for fine-tuning)
- System: Linux, Windows

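If you are unsure whether your GPU meets these requirements, you can check its memory before installing (a quick check, assuming an NVIDIA GPU with drivers installed):

```bash
# Print total GPU memory in MiB; inference needs roughly 4096 MiB,
# fine-tuning roughly 8192 MiB
nvidia-smi --query-gpu=name,memory.total --format=csv
```
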
## Windows Setup

Professional Windows users may consider using WSL2 or Docker to run the codebase.

```bash
# Create a Python 3.10 virtual environment (you can also use virtualenv)
conda create -n fish-speech python=3.10
conda activate fish-speech

# Install PyTorch
pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121

# Install fish-speech
pip3 install -e .

# (To enable acceleration) Install triton-windows
pip install https://github.com/AnyaCoder/fish-speech/releases/download/v0.1.0/triton_windows-0.1.0-py3-none-any.whl
```

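After installation, a quick sanity check (not part of the official steps) confirms that PyTorch was installed with CUDA support:

```bash
# Should print the torch version and "True" if CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```
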
Non-professional Windows users can use the following basic steps to run the project without a Linux environment (including `torch.compile` model-compilation support):

1. Extract the project package.
2. Click `install_env.bat` to install the environment.
3. If you want to enable compilation acceleration, follow these steps (see the verification sketch after this list):
    1. Download the LLVM compiler from one of the following links:
        - [LLVM-17.0.6 (Official Site Download)](https://huggingface.co/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
        - [LLVM-17.0.6 (Mirror Site Download)](https://hf-mirror.com/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
        - After downloading `LLVM-17.0.6-win64.exe`, double-click it to install, select an appropriate installation location, and, most importantly, check the `Add Path to Current User` option to set the environment variable.
        - Confirm that the installation is complete.
    2. Download and install the Microsoft Visual C++ Redistributable to resolve potential missing-.dll issues:
        - [MSVC++ 14.40.33810.0 Download](https://aka.ms/vs/17/release/vc_redist.x64.exe)
    3. Download and install Visual Studio Community Edition to get the MSVC++ build tools and resolve LLVM's header file dependencies:
        - [Visual Studio Download](https://visualstudio.microsoft.com/zh-hans/downloads/)
        - After installing Visual Studio Installer, download Visual Studio Community 2022.
        - Click the `Modify` button, find the `Desktop development with C++` option, select it, and download it.
    4. Download and install [CUDA Toolkit 12.x](https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Windows&target_arch=x86_64).
4. Double-click `start.bat` to open the training/inference WebUI management interface. If needed, you can modify `API_FLAGS` as described below.

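If you enabled compilation acceleration in step 3, you can verify from the command line (for example, via `run_cmd.bat`) that the toolchain is reachable. This is a hypothetical sanity check; it only assumes the installers above added the tools to `PATH`:

```bash
# Each command should print a version string if the tool is on PATH
clang --version   # LLVM from step 3.1
nvcc --version    # CUDA Toolkit from step 3.4
```
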
!!! info "Optional"
    Want to start the inference WebUI?

    Edit the `API_FLAGS.txt` file in the project root directory and modify the first three lines as follows:

    ```
    --infer
    # --api
    # --listen ...
    ...
    ```

!!! info "Optional"
    Want to start the API server?

    Edit the `API_FLAGS.txt` file in the project root directory and modify the first three lines as follows:

    ```
    # --infer
    --api
    --listen ...
    ...
    ```

!!! info "Optional"
    Double-click `run_cmd.bat` to enter the conda/python command line environment of this project.

## Linux Setup

See [pyproject.toml](../../pyproject.toml) for details.

```bash
# Create a Python 3.10 virtual environment (you can also use virtualenv)
conda create -n fish-speech python=3.10
conda activate fish-speech

# Install PyTorch
pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1

# (Ubuntu / Debian users) Install sox + ffmpeg
apt install libsox-dev ffmpeg

# (Ubuntu / Debian users) Install pyaudio build dependencies
apt install build-essential \
    cmake \
    libasound-dev \
    portaudio19-dev \
    libportaudio2 \
    libportaudiocpp0

# Install fish-speech
pip3 install -e .[stable]
```

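Once the install finishes, a quick sanity check (not part of the official instructions) confirms that PyTorch and ffmpeg are available:

```bash
# Should print the torch version and whether CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

# Should print the ffmpeg version line
ffmpeg -version | head -n 1
```
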
## macOS Setup

If you want to perform inference on MPS, please add the `--device mps` flag.

Please refer to [this PR](https://github.com/fishaudio/fish-speech/pull/461#issuecomment-2284277772) for a comparison of inference speeds.

!!! warning
    The `compile` option is not officially supported on Apple Silicon devices, so there is no guarantee that inference speed will improve.

```bash
# Create a Python 3.10 virtual environment (you can also use virtualenv)
conda create -n fish-speech python=3.10
conda activate fish-speech

# Install PyTorch
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1

# Install fish-speech
pip install -e .[stable]
```

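To confirm that MPS acceleration is available before passing `--device mps`, you can query PyTorch directly (a quick check, assuming an Apple Silicon device with a recent macOS):

```bash
# Prints "True" if the MPS backend can be used for inference
python -c "import torch; print(torch.backends.mps.is_available())"
```
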
## Docker Setup

1. Install the NVIDIA Container Toolkit:

    To use a GPU for model training and inference inside Docker, you need to install the NVIDIA Container Toolkit.

    For Ubuntu users:

    ```bash
    # Add the repository
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

    # Install nvidia-container-toolkit
    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit

    # Restart the Docker service
    sudo systemctl restart docker
    ```

    For users of other Linux distributions, please refer to the [NVIDIA Container Toolkit install guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).

2. Pull and run the fish-speech image

    ```shell
    # Pull the image
    docker pull fishaudio/fish-speech:latest-dev

    # Run the image
    docker run -it \
        --name fish-speech \
        --gpus all \
        -p 7860:7860 \
        fishaudio/fish-speech:latest-dev \
        zsh

    # If you need to use a different port, change the -p parameter to YourPort:7860
    ```

3. Download model dependencies

    Make sure you are in a terminal inside the docker container, then download the required `vqgan` and `llama` models from our huggingface repository.

    ```bash
    huggingface-cli download fishaudio/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4
    ```

4. Configure environment variables and access the WebUI

    In a terminal inside the docker container, run `export GRADIO_SERVER_NAME="0.0.0.0"` to allow external access to the gradio service inside docker.

    Then, still inside the container, run `python tools/webui.py` to start the WebUI service (the sketch after this list combines these commands).

    If you're using WSL or macOS, visit [http://localhost:7860](http://localhost:7860) to open the WebUI interface.

    If it's deployed on a server, replace localhost with your server's IP.

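Putting steps 1 and 4 together, the following commands, run inside the container, first confirm that the GPU is visible and then launch the WebUI (a minimal sketch; it assumes the models from step 3 are already in `checkpoints/`):

```bash
# Confirm the GPU is visible inside the container (requires the toolkit from step 1)
nvidia-smi

# Allow access from outside the container, then start the WebUI on port 7860
export GRADIO_SERVER_NAME="0.0.0.0"
python tools/webui.py
```
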
## Changelog

- 2024/09/10: Updated Fish-Speech to version 1.4, with an increased dataset size and a change in the quantizer's n_groups from 4 to 8.
- 2024/07/02: Updated Fish-Speech to version 1.2, removed the VITS decoder, and greatly enhanced zero-shot ability.
- 2024/05/10: Updated Fish-Speech to version 1.1, implemented a VITS decoder to reduce WER and improve timbre similarity.
- 2024/04/22: Finished Fish-Speech version 1.0, significantly modified the VQGAN and LLAMA models.
- 2023/12/28: Added `lora` fine-tuning support.
- 2023/12/27: Added `gradient checkpointing`, `causal sampling`, and `flash-attn` support.
- 2023/12/19: Updated webui and HTTP API.
- 2023/12/18: Updated fine-tuning documentation and related examples.
- 2023/12/17: Updated `text2semantic` model, supporting phoneme-free mode.
- 2023/12/13: Beta version released, including the VQGAN model and a language model based on LLAMA (with phoneme support only).

## Acknowledgements

- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [Transformers](https://github.com/huggingface/transformers)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)