mvreddy13
commited on
Commit
·
f0c7f08
1
Parent(s):
ed77e1a
Adding new Folders
Browse filesThis view is limited to 50 files because it contains too many changes.
See raw diff
- README.md +159 -13
- __init__.py +0 -0
- config/LS3DCG.json +64 -0
- config/body_pixel.json +63 -0
- config/body_vq.json +62 -0
- config/face.json +59 -0
- data_utils/__init__.py +3 -0
- data_utils/__pycache__/__init__.cpython-37.pyc +0 -0
- data_utils/__pycache__/consts.cpython-37.pyc +0 -0
- data_utils/__pycache__/dataloader_torch.cpython-37.pyc +0 -0
- data_utils/__pycache__/lower_body.cpython-37.pyc +0 -0
- data_utils/__pycache__/mesh_dataset.cpython-37.pyc +0 -0
- data_utils/__pycache__/rotation_conversion.cpython-37.pyc +0 -0
- data_utils/__pycache__/utils.cpython-37.pyc +0 -0
- data_utils/apply_split.py +51 -0
- data_utils/axis2matrix.py +29 -0
- data_utils/consts.py +0 -0
- data_utils/dataloader_torch.py +279 -0
- data_utils/dataset_preprocess.py +170 -0
- data_utils/get_j.py +51 -0
- data_utils/hand_component.json +0 -0
- data_utils/lower_body.py +143 -0
- data_utils/mesh_dataset.py +348 -0
- data_utils/rotation_conversion.py +551 -0
- data_utils/split_more_than_2s.pkl +3 -0
- data_utils/split_train_val_test.py +27 -0
- data_utils/train_val_test.json +0 -0
- data_utils/utils.py +318 -0
- evaluation/FGD.py +199 -0
- evaluation/__init__.py +0 -0
- evaluation/__pycache__/__init__.cpython-37.pyc +0 -0
- evaluation/__pycache__/metrics.cpython-37.pyc +0 -0
- evaluation/diversity_LVD.py +64 -0
- evaluation/get_quality_samples.py +62 -0
- evaluation/metrics.py +109 -0
- evaluation/mode_transition.py +60 -0
- evaluation/peak_velocity.py +65 -0
- evaluation/util.py +148 -0
- losses/__init__.py +1 -0
- losses/__pycache__/__init__.cpython-37.pyc +0 -0
- losses/__pycache__/losses.cpython-37.pyc +0 -0
- losses/losses.py +91 -0
- nets/LS3DCG.py +414 -0
- nets/__init__.py +8 -0
- nets/__pycache__/__init__.cpython-37.pyc +0 -0
- nets/__pycache__/base.cpython-37.pyc +0 -0
- nets/__pycache__/init_model.cpython-37.pyc +0 -0
- nets/__pycache__/layers.cpython-37.pyc +0 -0
- nets/__pycache__/smplx_body_pixel.cpython-37.pyc +0 -0
- nets/__pycache__/smplx_body_vq.cpython-37.pyc +0 -0
README.md
CHANGED
@@ -1,13 +1,159 @@
|
|
1 |
-
|
2 |
-
|
3 |
-
|
4 |
-
|
5 |
-
|
6 |
-
|
7 |
-
|
8 |
-
|
9 |
-
|
10 |
-
|
11 |
-
|
12 |
-
|
13 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# TalkSHOW: Generating Holistic 3D Human Motion from Speech [CVPR2023]
|
2 |
+
|
3 |
+
The official PyTorch implementation of the **CVPR2023** paper [**"Generating Holistic 3D Human Motion from Speech"**](https://arxiv.org/abs/2212.04420).
|
4 |
+
|
5 |
+
Please visit our [**webpage**](https://talkshow.is.tue.mpg.de/) for more details.
|
6 |
+
|
7 |
+

|
8 |
+
|
9 |
+
## HighLight
|
10 |
+
|
11 |
+
We directly provide the input and our output for the demo data, you can find them in `/demo/` and `/demo_audio/`. TalkSHOW can generalize well on English, French, Songs so far. Looking forward to more demos.
|
12 |
+
|
13 |
+
You can directly use the generated motion to animate your 3D character or your own digital avatar. We will provide more demos, please stay tuned. And we are quite looking forward to your pull request.
|
14 |
+
|
15 |
+
## Notes
|
16 |
+
|
17 |
+
We are using 100 dimension parameters for SMPL-X facial expression, if you need other dimensions parameters, you can use this code to convert.
|
18 |
+
|
19 |
+
```
|
20 |
+
https://github.com/yhw-yhw/SHOW/blob/main/cvt_exp_dim_tool.py
|
21 |
+
```
|
22 |
+
|
23 |
+
## TODO
|
24 |
+
|
25 |
+
- [x] [🤗Hugging Face Demo](https://huggingface.co/spaces/feifeifeiliu/TalkSHOW)
|
26 |
+
- [ ] Animated 2D videos by the generated motion from TalkSHOW.
|
27 |
+
|
28 |
+
|
29 |
+
## Getting started
|
30 |
+
|
31 |
+
The training code was tested on `Ubuntu 18.04.5 LTS` and the visualization code was test on `Windows 10`, and it requires:
|
32 |
+
|
33 |
+
* Python 3.7
|
34 |
+
* conda3 or miniconda3
|
35 |
+
* CUDA capable GPU (one is enough)
|
36 |
+
|
37 |
+
|
38 |
+
|
39 |
+
### 1. Setup environment
|
40 |
+
|
41 |
+
Clone the repo:
|
42 |
+
```bash
|
43 |
+
git clone https://github.com/yhw-yhw/TalkSHOW
|
44 |
+
cd TalkSHOW
|
45 |
+
```
|
46 |
+
Create conda environment:
|
47 |
+
```bash
|
48 |
+
conda create --name talkshow python=3.7
|
49 |
+
conda activate talkshow
|
50 |
+
```
|
51 |
+
Please install pytorch (v1.10.1).
|
52 |
+
|
53 |
+
pip install -r requirements.txt
|
54 |
+
|
55 |
+
Please install [**MPI-Mesh**](https://github.com/MPI-IS/mesh).
|
56 |
+
|
57 |
+
### 2. Get data
|
58 |
+
|
59 |
+
Please note that if you only want to generate demo videos, you can skip this step and directly download the pretrained models.
|
60 |
+
|
61 |
+
Download [**SHOW_dataset_v1.0.zip**](https://download.is.tue.mpg.de/download.php?domain=talkshow&resume=1&sfile=SHOW_dataset_v1.0.zip) from [**TalkSHOW download webpage**](https://talkshow.is.tue.mpg.de/download.php),
|
62 |
+
unzip using ``for i in $(ls *.tar.gz);do tar xvf $i;done``.
|
63 |
+
|
64 |
+
~~Run ``python data_utils/dataset_preprocess.py`` to check and split dataset.
|
65 |
+
Modify ``data_root`` in ``config/*.json`` to the dataset-path.~~
|
66 |
+
|
67 |
+
Modify ``data_root`` in ``data_utils/apply_split.py`` to the dataset path and run it to apply ``data_utils/split_more_than_2s.pkl`` to the dataset.
|
68 |
+
|
69 |
+
We will update the benchmark soon.
|
70 |
+
|
71 |
+
### 3. Download the pretrained models (Optional)
|
72 |
+
|
73 |
+
Download [**pretrained models**](https://drive.google.com/file/d/1bC0ZTza8HOhLB46WOJ05sBywFvcotDZG/view?usp=sharing),
|
74 |
+
unzip and place it in the TalkSHOW folder, i.e. ``path-to-TalkSHOW/experiments``.
|
75 |
+
|
76 |
+
### 4. Training
|
77 |
+
Please note that the process of loading data for the first time can be quite slow. If you have already completed the loading process, setting ``dataset_load_mode`` to ``pickle`` in ``config/[config_name].json`` will make the loading process much faster.
|
78 |
+
|
79 |
+
# 1. Train VQ-VAEs.
|
80 |
+
bash train_body_vq.sh
|
81 |
+
# 2. Train PixelCNN. Please modify "Model:vq_path" in config/body_pixel.json to the path of VQ-VAEs.
|
82 |
+
bash train_body_pixel.sh
|
83 |
+
# 3. Train face generator.
|
84 |
+
bash train_face.sh
|
85 |
+
|
86 |
+
### 5. Testing
|
87 |
+
|
88 |
+
Modify the arguments in ``test_face.sh`` and ``test_body.sh``. Then
|
89 |
+
|
90 |
+
bash test_face.sh
|
91 |
+
bash test_body.sh
|
92 |
+
|
93 |
+
### 5. Visualization
|
94 |
+
|
95 |
+
If you ssh into the linux machine, NotImplementedError might occur. In this case, please refer to [**issue**](https://github.com/MPI-IS/mesh/issues/66) for solving the error.
|
96 |
+
|
97 |
+
Download [**smplx model**](https://drive.google.com/file/d/1Ly_hQNLQcZ89KG0Nj4jYZwccQiimSUVn/view?usp=share_link) (Please register in the official [**SMPLX webpage**](https://smpl-x.is.tue.mpg.de) before you use it.)
|
98 |
+
and place it in ``path-to-TalkSHOW/visualise/smplx_model``.
|
99 |
+
To visualise the test set and generated result (in each video, left: generated result | right: ground truth).
|
100 |
+
The videos and generated motion data are saved in ``./visualise/video/body-pixel``:
|
101 |
+
|
102 |
+
bash visualise.sh
|
103 |
+
|
104 |
+
If you ssh into the linux machine, there might be an error about OffscreenRenderer. In this case, please refer to [**issue**](https://github.com/MPI-IS/mesh/issues/66) for solving the error.
|
105 |
+
|
106 |
+
To reproduce the demo videos, run
|
107 |
+
```bash
|
108 |
+
# the whole body demo
|
109 |
+
python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file ./demo_audio/1st-page.wav --id 0 --whole_body
|
110 |
+
# the face demo
|
111 |
+
python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file ./demo_audio/style.wav --id 0 --only_face
|
112 |
+
# the identity-specific demo
|
113 |
+
python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file ./demo_audio/style.wav --id 0
|
114 |
+
python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file ./demo_audio/style.wav --id 1
|
115 |
+
python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file ./demo_audio/style.wav --id 2
|
116 |
+
python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file ./demo_audio/style.wav --id 3 --stand
|
117 |
+
# the diversity demo
|
118 |
+
python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file ./demo_audio/style.wav --id 0 --num_samples 12
|
119 |
+
# the french demo
|
120 |
+
python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file ./demo_audio/french.wav --id 0
|
121 |
+
# the synthetic speech demo
|
122 |
+
python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file ./demo_audio/rich.wav --id 0
|
123 |
+
# the song demo
|
124 |
+
python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file ./demo_audio/song.wav --id 0
|
125 |
+
````
|
126 |
+
### 6. Baseline
|
127 |
+
|
128 |
+
For training the reproducted "Learning Speech-driven 3D Conversational Gestures from Video" (Habibie et al.), you could run
|
129 |
+
```bash
|
130 |
+
python -W ignore scripts/train.py --speakers oliver seth conan chemistry --config_file ./config/LS3DCG.json
|
131 |
+
```
|
132 |
+
|
133 |
+
For visualization with the pretrained model, download the above [pretrained models](#3-download-the-pretrained-models--optional-) and run
|
134 |
+
```bash
|
135 |
+
python scripts/demo.py --config_file ./config/LS3DCG.json --infer --audio_file ./demo_audio/style.wav --body_model_name s2g_LS3DCG --body_model_path experiments/2022-10-19-smplx_S2G-LS3DCG/ckpt-99.pth --id 0
|
136 |
+
```
|
137 |
+
|
138 |
+
## Citation
|
139 |
+
If you find our work useful to your research, please consider citing:
|
140 |
+
```
|
141 |
+
@inproceedings{yi2022generating,
|
142 |
+
title={Generating Holistic 3D Human Motion from Speech},
|
143 |
+
author={Yi, Hongwei and Liang, Hualin and Liu, Yifei and Cao, Qiong and Wen, Yandong and Bolkart, Timo and Tao, Dacheng and Black, Michael J},
|
144 |
+
booktitle={CVPR},
|
145 |
+
year={2023}
|
146 |
+
}
|
147 |
+
```
|
148 |
+
|
149 |
+
## Acknowledgements
|
150 |
+
For functions or scripts that are based on external sources, we acknowledge the origin individually in each file.
|
151 |
+
Here are some great resources we benefit:
|
152 |
+
- [Freeform](https://github.com/TheTempAccount/Co-Speech-Motion-Generation) for training pipeline
|
153 |
+
- [MPI-Mesh](https://github.com/MPI-IS/mesh), [Pyrender](https://github.com/mmatl/pyrender), [Smplx](https://github.com/vchoutas/smplx), [VOCA](https://github.com/TimoBolkart/voca) for rendering
|
154 |
+
- [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) and [Faceformer](https://github.com/EvelynFan/FaceFormer) for audio encoder
|
155 |
+
|
156 |
+
## Contact
|
157 |
+
For questions, please contact [email protected] or [email protected] or [email protected] or [email protected]
|
158 |
+
|
159 |
+
For commercial licensing, please contact [email protected]
|
__init__.py
ADDED
File without changes
|
config/LS3DCG.json
ADDED
@@ -0,0 +1,64 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"config_root_path": "/is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts",
|
3 |
+
"dataset_load_mode": "pickle",
|
4 |
+
"store_file_path": "store.pkl",
|
5 |
+
"smplx_npz_path": "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz",
|
6 |
+
"extra_joint_path": "visualise/smplx_model/smplx_extra_joints.yaml",
|
7 |
+
"j14_regressor_path": "visualise/smplx_model/SMPLX_to_J14.pkl",
|
8 |
+
"param": {
|
9 |
+
"w_j": 1,
|
10 |
+
"w_b": 1,
|
11 |
+
"w_h": 1
|
12 |
+
},
|
13 |
+
"Data": {
|
14 |
+
"data_root": "../ExpressiveWholeBodyDatasetv1.0/",
|
15 |
+
"pklname": "_3d_mfcc.pkl",
|
16 |
+
"whole_video": false,
|
17 |
+
"pose": {
|
18 |
+
"normalization": false,
|
19 |
+
"convert_to_6d": false,
|
20 |
+
"norm_method": "all",
|
21 |
+
"augmentation": false,
|
22 |
+
"generate_length": 88,
|
23 |
+
"pre_pose_length": 0,
|
24 |
+
"pose_dim": 99,
|
25 |
+
"expression": true
|
26 |
+
},
|
27 |
+
"aud": {
|
28 |
+
"feat_method": "mfcc",
|
29 |
+
"aud_feat_dim": 64,
|
30 |
+
"aud_feat_win_size": null,
|
31 |
+
"context_info": false
|
32 |
+
}
|
33 |
+
},
|
34 |
+
"Model": {
|
35 |
+
"model_type": "body",
|
36 |
+
"model_name": "s2g_LS3DCG",
|
37 |
+
"code_num": 2048,
|
38 |
+
"AudioOpt": "Adam",
|
39 |
+
"encoder_choice": "mfcc",
|
40 |
+
"gan": false
|
41 |
+
},
|
42 |
+
"DataLoader": {
|
43 |
+
"batch_size": 128,
|
44 |
+
"num_workers": 0
|
45 |
+
},
|
46 |
+
"Train": {
|
47 |
+
"epochs": 100,
|
48 |
+
"max_gradient_norm": 5,
|
49 |
+
"learning_rate": {
|
50 |
+
"generator_learning_rate": 1e-4,
|
51 |
+
"discriminator_learning_rate": 1e-4
|
52 |
+
},
|
53 |
+
"weights": {
|
54 |
+
"keypoint_loss_weight": 1.0,
|
55 |
+
"gan_loss_weight": 1.0
|
56 |
+
}
|
57 |
+
},
|
58 |
+
"Log": {
|
59 |
+
"save_every": 50,
|
60 |
+
"print_every": 200,
|
61 |
+
"name": "LS3DCG"
|
62 |
+
}
|
63 |
+
}
|
64 |
+
|
config/body_pixel.json
ADDED
@@ -0,0 +1,63 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"config_root_path": "/is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts",
|
3 |
+
"dataset_load_mode": "json",
|
4 |
+
"store_file_path": "store.pkl",
|
5 |
+
"smplx_npz_path": "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz",
|
6 |
+
"extra_joint_path": "visualise/smplx_model/smplx_extra_joints.yaml",
|
7 |
+
"j14_regressor_path": "visualise/smplx_model/SMPLX_to_J14.pkl",
|
8 |
+
"param": {
|
9 |
+
"w_j": 1,
|
10 |
+
"w_b": 1,
|
11 |
+
"w_h": 1
|
12 |
+
},
|
13 |
+
"Data": {
|
14 |
+
"data_root": "../ExpressiveWholeBodyDatasetv1.0/",
|
15 |
+
"pklname": "_3d_mfcc.pkl",
|
16 |
+
"whole_video": false,
|
17 |
+
"pose": {
|
18 |
+
"normalization": false,
|
19 |
+
"convert_to_6d": false,
|
20 |
+
"norm_method": "all",
|
21 |
+
"augmentation": false,
|
22 |
+
"generate_length": 88,
|
23 |
+
"pre_pose_length": 0,
|
24 |
+
"pose_dim": 99,
|
25 |
+
"expression": true
|
26 |
+
},
|
27 |
+
"aud": {
|
28 |
+
"feat_method": "mfcc",
|
29 |
+
"aud_feat_dim": 64,
|
30 |
+
"aud_feat_win_size": null,
|
31 |
+
"context_info": false
|
32 |
+
}
|
33 |
+
},
|
34 |
+
"Model": {
|
35 |
+
"model_type": "body",
|
36 |
+
"model_name": "s2g_body_pixel",
|
37 |
+
"composition": true,
|
38 |
+
"code_num": 2048,
|
39 |
+
"bh_model": true,
|
40 |
+
"AudioOpt": "Adam",
|
41 |
+
"encoder_choice": "mfcc",
|
42 |
+
"gan": false,
|
43 |
+
"vq_path": "./experiments/2022-10-31-smplx_S2G-body-vq-3d/ckpt-99.pth"
|
44 |
+
},
|
45 |
+
"DataLoader": {
|
46 |
+
"batch_size": 128,
|
47 |
+
"num_workers": 0
|
48 |
+
},
|
49 |
+
"Train": {
|
50 |
+
"epochs": 100,
|
51 |
+
"max_gradient_norm": 5,
|
52 |
+
"learning_rate": {
|
53 |
+
"generator_learning_rate": 1e-4,
|
54 |
+
"discriminator_learning_rate": 1e-4
|
55 |
+
}
|
56 |
+
},
|
57 |
+
"Log": {
|
58 |
+
"save_every": 50,
|
59 |
+
"print_every": 200,
|
60 |
+
"name": "body-pixel2"
|
61 |
+
}
|
62 |
+
}
|
63 |
+
|
config/body_vq.json
ADDED
@@ -0,0 +1,62 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"config_root_path": "/is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts",
|
3 |
+
"dataset_load_mode": "json",
|
4 |
+
"store_file_path": "store.pkl",
|
5 |
+
"smplx_npz_path": "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz",
|
6 |
+
"extra_joint_path": "visualise/smplx_model/smplx_extra_joints.yaml",
|
7 |
+
"j14_regressor_path": "visualise/smplx_model/SMPLX_to_J14.pkl",
|
8 |
+
"param": {
|
9 |
+
"w_j": 1,
|
10 |
+
"w_b": 1,
|
11 |
+
"w_h": 1
|
12 |
+
},
|
13 |
+
"Data": {
|
14 |
+
"data_root": "../ExpressiveWholeBodyDatasetv1.0/",
|
15 |
+
"pklname": "_3d_mfcc.pkl",
|
16 |
+
"whole_video": false,
|
17 |
+
"pose": {
|
18 |
+
"normalization": false,
|
19 |
+
"convert_to_6d": false,
|
20 |
+
"norm_method": "all",
|
21 |
+
"augmentation": false,
|
22 |
+
"generate_length": 88,
|
23 |
+
"pre_pose_length": 0,
|
24 |
+
"pose_dim": 99,
|
25 |
+
"expression": true
|
26 |
+
},
|
27 |
+
"aud": {
|
28 |
+
"feat_method": "mfcc",
|
29 |
+
"aud_feat_dim": 64,
|
30 |
+
"aud_feat_win_size": null,
|
31 |
+
"context_info": false
|
32 |
+
}
|
33 |
+
},
|
34 |
+
"Model": {
|
35 |
+
"model_type": "body",
|
36 |
+
"model_name": "s2g_body_vq",
|
37 |
+
"composition": true,
|
38 |
+
"code_num": 2048,
|
39 |
+
"bh_model": true,
|
40 |
+
"AudioOpt": "Adam",
|
41 |
+
"encoder_choice": "mfcc",
|
42 |
+
"gan": false
|
43 |
+
},
|
44 |
+
"DataLoader": {
|
45 |
+
"batch_size": 128,
|
46 |
+
"num_workers": 0
|
47 |
+
},
|
48 |
+
"Train": {
|
49 |
+
"epochs": 100,
|
50 |
+
"max_gradient_norm": 5,
|
51 |
+
"learning_rate": {
|
52 |
+
"generator_learning_rate": 1e-4,
|
53 |
+
"discriminator_learning_rate": 1e-4
|
54 |
+
}
|
55 |
+
},
|
56 |
+
"Log": {
|
57 |
+
"save_every": 50,
|
58 |
+
"print_every": 200,
|
59 |
+
"name": "body-vq"
|
60 |
+
}
|
61 |
+
}
|
62 |
+
|
config/face.json
ADDED
@@ -0,0 +1,59 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"config_root_path": "/is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts",
|
3 |
+
"dataset_load_mode": "json",
|
4 |
+
"store_file_path": "store.pkl",
|
5 |
+
"smplx_npz_path": "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz",
|
6 |
+
"extra_joint_path": "visualise/smplx_model/smplx_extra_joints.yaml",
|
7 |
+
"j14_regressor_path": "visualise/smplx_model/SMPLX_to_J14.pkl",
|
8 |
+
"param": {
|
9 |
+
"w_j": 1,
|
10 |
+
"w_b": 1,
|
11 |
+
"w_h": 1
|
12 |
+
},
|
13 |
+
"Data": {
|
14 |
+
"data_root": "../ExpressiveWholeBodyDatasetv1.0/",
|
15 |
+
"pklname": "_3d_wv2.pkl",
|
16 |
+
"whole_video": true,
|
17 |
+
"pose": {
|
18 |
+
"normalization": false,
|
19 |
+
"convert_to_6d": false,
|
20 |
+
"norm_method": "all",
|
21 |
+
"augmentation": false,
|
22 |
+
"generate_length": 88,
|
23 |
+
"pre_pose_length": 0,
|
24 |
+
"pose_dim": 99,
|
25 |
+
"expression": true
|
26 |
+
},
|
27 |
+
"aud": {
|
28 |
+
"feat_method": "mfcc",
|
29 |
+
"aud_feat_dim": 64,
|
30 |
+
"aud_feat_win_size": null,
|
31 |
+
"context_info": false
|
32 |
+
}
|
33 |
+
},
|
34 |
+
"Model": {
|
35 |
+
"model_type": "face",
|
36 |
+
"model_name": "s2g_face",
|
37 |
+
"AudioOpt": "SGD",
|
38 |
+
"encoder_choice": "faceformer",
|
39 |
+
"gan": false
|
40 |
+
},
|
41 |
+
"DataLoader": {
|
42 |
+
"batch_size": 1,
|
43 |
+
"num_workers": 0
|
44 |
+
},
|
45 |
+
"Train": {
|
46 |
+
"epochs": 100,
|
47 |
+
"max_gradient_norm": 5,
|
48 |
+
"learning_rate": {
|
49 |
+
"generator_learning_rate": 1e-4,
|
50 |
+
"discriminator_learning_rate": 1e-4
|
51 |
+
}
|
52 |
+
},
|
53 |
+
"Log": {
|
54 |
+
"save_every": 50,
|
55 |
+
"print_every": 1000,
|
56 |
+
"name": "face"
|
57 |
+
}
|
58 |
+
}
|
59 |
+
|
data_utils/__init__.py
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
# from .dataloader_csv import MultiVidData as csv_data
|
2 |
+
from .dataloader_torch import MultiVidData as torch_data
|
3 |
+
from .utils import get_melspec, get_mfcc, get_mfcc_old, get_mfcc_psf, get_mfcc_psf_min, get_mfcc_ta
|
data_utils/__pycache__/__init__.cpython-37.pyc
ADDED
Binary file (375 Bytes). View file
|
|
data_utils/__pycache__/consts.cpython-37.pyc
ADDED
Binary file (92.7 kB). View file
|
|
data_utils/__pycache__/dataloader_torch.cpython-37.pyc
ADDED
Binary file (5.31 kB). View file
|
|
data_utils/__pycache__/lower_body.cpython-37.pyc
ADDED
Binary file (3.91 kB). View file
|
|
data_utils/__pycache__/mesh_dataset.cpython-37.pyc
ADDED
Binary file (7.9 kB). View file
|
|
data_utils/__pycache__/rotation_conversion.cpython-37.pyc
ADDED
Binary file (16.4 kB). View file
|
|
data_utils/__pycache__/utils.cpython-37.pyc
ADDED
Binary file (7.42 kB). View file
|
|
data_utils/apply_split.py
ADDED
@@ -0,0 +1,51 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
from tqdm import tqdm
|
3 |
+
import pickle
|
4 |
+
import shutil
|
5 |
+
|
6 |
+
speakers = ['seth', 'oliver', 'conan', 'chemistry']
|
7 |
+
source_data_root = "../expressive_body-V0.7"
|
8 |
+
data_root = "D:/Downloads/SHOW_dataset_v1.0/ExpressiveWholeBodyDatasetReleaseV1.0"
|
9 |
+
|
10 |
+
f_read = open('split_more_than_2s.pkl', 'rb')
|
11 |
+
f_save = open('none.pkl', 'wb')
|
12 |
+
data_split = pickle.load(f_read)
|
13 |
+
none_split = []
|
14 |
+
|
15 |
+
train = val = test = 0
|
16 |
+
|
17 |
+
for speaker_name in speakers:
|
18 |
+
speaker_root = os.path.join(data_root, speaker_name)
|
19 |
+
|
20 |
+
videos = [v for v in data_split[speaker_name]]
|
21 |
+
|
22 |
+
for vid in tqdm(videos, desc="Processing training data of {}......".format(speaker_name)):
|
23 |
+
for split in data_split[speaker_name][vid]:
|
24 |
+
for seq in data_split[speaker_name][vid][split]:
|
25 |
+
|
26 |
+
seq = seq.replace('\\', '/')
|
27 |
+
old_file_path = os.path.join(data_root, speaker_name, vid, seq.split('/')[-1])
|
28 |
+
old_file_path = old_file_path.replace('\\', '/')
|
29 |
+
new_file_path = seq.replace(source_data_root.split('/')[-1], data_root.split('/')[-1])
|
30 |
+
try:
|
31 |
+
shutil.move(old_file_path, new_file_path)
|
32 |
+
if split == 'train':
|
33 |
+
train = train + 1
|
34 |
+
elif split == 'test':
|
35 |
+
test = test + 1
|
36 |
+
elif split == 'val':
|
37 |
+
val = val + 1
|
38 |
+
except FileNotFoundError:
|
39 |
+
none_split.append(old_file_path)
|
40 |
+
print(f"The file {old_file_path} does not exists.")
|
41 |
+
except shutil.Error:
|
42 |
+
none_split.append(old_file_path)
|
43 |
+
print(f"The file {old_file_path} does not exists.")
|
44 |
+
|
45 |
+
print(none_split.__len__())
|
46 |
+
pickle.dump(none_split, f_save)
|
47 |
+
f_save.close()
|
48 |
+
|
49 |
+
print(train, val, test)
|
50 |
+
|
51 |
+
|
data_utils/axis2matrix.py
ADDED
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import numpy as np
|
2 |
+
import math
|
3 |
+
import scipy.linalg as linalg
|
4 |
+
|
5 |
+
|
6 |
+
def rotate_mat(axis, radian):
|
7 |
+
|
8 |
+
a = np.cross(np.eye(3), axis / linalg.norm(axis) * radian)
|
9 |
+
|
10 |
+
rot_matrix = linalg.expm(a)
|
11 |
+
|
12 |
+
return rot_matrix
|
13 |
+
|
14 |
+
def aaa2mat(axis, sin, cos):
|
15 |
+
i = np.eye(3)
|
16 |
+
nnt = np.dot(axis.T, axis)
|
17 |
+
s = np.asarray([[0, -axis[0,2], axis[0,1]],
|
18 |
+
[axis[0,2], 0, -axis[0,0]],
|
19 |
+
[-axis[0,1], axis[0,0], 0]])
|
20 |
+
r = cos * i + (1-cos)*nnt +sin * s
|
21 |
+
return r
|
22 |
+
|
23 |
+
rand_axis = np.asarray([[1,0,0]])
|
24 |
+
#旋转角度
|
25 |
+
r = math.pi/2
|
26 |
+
#返回旋转矩阵
|
27 |
+
rot_matrix = rotate_mat(rand_axis, r)
|
28 |
+
r2 = aaa2mat(rand_axis, np.sin(r), np.cos(r))
|
29 |
+
print(rot_matrix)
|
data_utils/consts.py
ADDED
The diff for this file is too large to render.
See raw diff
|
|
data_utils/dataloader_torch.py
ADDED
@@ -0,0 +1,279 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import sys
|
2 |
+
import os
|
3 |
+
sys.path.append(os.getcwd())
|
4 |
+
import os
|
5 |
+
from tqdm import tqdm
|
6 |
+
from data_utils.utils import *
|
7 |
+
import torch.utils.data as data
|
8 |
+
from data_utils.mesh_dataset import SmplxDataset
|
9 |
+
from transformers import Wav2Vec2Processor
|
10 |
+
|
11 |
+
|
12 |
+
class MultiVidData():
|
13 |
+
def __init__(self,
|
14 |
+
data_root,
|
15 |
+
speakers,
|
16 |
+
split='train',
|
17 |
+
limbscaling=False,
|
18 |
+
normalization=False,
|
19 |
+
norm_method='new',
|
20 |
+
split_trans_zero=False,
|
21 |
+
num_frames=25,
|
22 |
+
num_pre_frames=25,
|
23 |
+
num_generate_length=None,
|
24 |
+
aud_feat_win_size=None,
|
25 |
+
aud_feat_dim=64,
|
26 |
+
feat_method='mel_spec',
|
27 |
+
context_info=False,
|
28 |
+
smplx=False,
|
29 |
+
audio_sr=16000,
|
30 |
+
convert_to_6d=False,
|
31 |
+
expression=False,
|
32 |
+
config=None
|
33 |
+
):
|
34 |
+
self.data_root = data_root
|
35 |
+
self.speakers = speakers
|
36 |
+
self.split = split
|
37 |
+
if split == 'pre':
|
38 |
+
self.split = 'train'
|
39 |
+
self.norm_method=norm_method
|
40 |
+
self.normalization = normalization
|
41 |
+
self.limbscaling = limbscaling
|
42 |
+
self.convert_to_6d = convert_to_6d
|
43 |
+
self.num_frames=num_frames
|
44 |
+
self.num_pre_frames=num_pre_frames
|
45 |
+
if num_generate_length is None:
|
46 |
+
self.num_generate_length = num_frames
|
47 |
+
else:
|
48 |
+
self.num_generate_length = num_generate_length
|
49 |
+
self.split_trans_zero=split_trans_zero
|
50 |
+
|
51 |
+
dataset = SmplxDataset
|
52 |
+
|
53 |
+
if self.split_trans_zero:
|
54 |
+
self.trans_dataset_list = []
|
55 |
+
self.zero_dataset_list = []
|
56 |
+
else:
|
57 |
+
self.all_dataset_list = []
|
58 |
+
self.dataset={}
|
59 |
+
self.complete_data=[]
|
60 |
+
self.config=config
|
61 |
+
load_mode=self.config.dataset_load_mode
|
62 |
+
|
63 |
+
######################load with pickle file
|
64 |
+
if load_mode=='pickle':
|
65 |
+
import pickle
|
66 |
+
import subprocess
|
67 |
+
|
68 |
+
# store_file_path='/tmp/store.pkl'
|
69 |
+
# cp /is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts/store.pkl /tmp/store.pkl
|
70 |
+
# subprocess.run(f'cp /is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts/store.pkl {store_file_path}',shell=True)
|
71 |
+
|
72 |
+
# f = open(self.config.store_file_path, 'rb+')
|
73 |
+
f = open(self.split+config.Data.pklname, 'rb+')
|
74 |
+
self.dataset=pickle.load(f)
|
75 |
+
f.close()
|
76 |
+
for key in self.dataset:
|
77 |
+
self.complete_data.append(self.dataset[key].complete_data)
|
78 |
+
######################load with pickle file
|
79 |
+
|
80 |
+
######################load with a csv file
|
81 |
+
elif load_mode=='csv':
|
82 |
+
|
83 |
+
# 这里从我的一个code文件夹导入的,后续再完善进来
|
84 |
+
try:
|
85 |
+
sys.path.append(self.config.config_root_path)
|
86 |
+
from config import config_path
|
87 |
+
from csv_parser import csv_parse
|
88 |
+
|
89 |
+
except ImportError as e:
|
90 |
+
print(f'err: {e}')
|
91 |
+
raise ImportError('config root path error...')
|
92 |
+
|
93 |
+
|
94 |
+
for speaker_name in self.speakers:
|
95 |
+
# df_intervals=pd.read_csv(self.config.voca_csv_file_path)
|
96 |
+
df_intervals=None
|
97 |
+
df_intervals=df_intervals[df_intervals['speaker']==speaker_name]
|
98 |
+
df_intervals = df_intervals[df_intervals['dataset'] == self.split]
|
99 |
+
|
100 |
+
print(f'speaker {speaker_name} train interval length: {len(df_intervals)}')
|
101 |
+
for iter_index, (_, interval) in tqdm(
|
102 |
+
(enumerate(df_intervals.iterrows())),desc=f'load {speaker_name}'
|
103 |
+
):
|
104 |
+
|
105 |
+
(
|
106 |
+
interval_index,
|
107 |
+
interval_speaker,
|
108 |
+
interval_video_fn,
|
109 |
+
interval_id,
|
110 |
+
|
111 |
+
start_time,
|
112 |
+
end_time,
|
113 |
+
duration_time,
|
114 |
+
start_time_10,
|
115 |
+
over_flow_flag,
|
116 |
+
short_dur_flag,
|
117 |
+
|
118 |
+
big_video_dir,
|
119 |
+
small_video_dir_name,
|
120 |
+
speaker_video_path,
|
121 |
+
|
122 |
+
voca_basename,
|
123 |
+
json_basename,
|
124 |
+
wav_basename,
|
125 |
+
voca_top_clip_path,
|
126 |
+
voca_json_clip_path,
|
127 |
+
voca_wav_clip_path,
|
128 |
+
|
129 |
+
audio_output_fn,
|
130 |
+
image_output_path,
|
131 |
+
pifpaf_output_path,
|
132 |
+
mp_output_path,
|
133 |
+
op_output_path,
|
134 |
+
deca_output_path,
|
135 |
+
pixie_output_path,
|
136 |
+
cam_output_path,
|
137 |
+
ours_output_path,
|
138 |
+
merge_output_path,
|
139 |
+
multi_output_path,
|
140 |
+
gt_output_path,
|
141 |
+
ours_images_path,
|
142 |
+
pkl_fil_path,
|
143 |
+
)=csv_parse(interval)
|
144 |
+
|
145 |
+
if not os.path.exists(pkl_fil_path) or not os.path.exists(audio_output_fn):
|
146 |
+
continue
|
147 |
+
|
148 |
+
key=f'{interval_video_fn}/{small_video_dir_name}'
|
149 |
+
self.dataset[key] = dataset(
|
150 |
+
data_root=pkl_fil_path,
|
151 |
+
speaker=speaker_name,
|
152 |
+
audio_fn=audio_output_fn,
|
153 |
+
audio_sr=audio_sr,
|
154 |
+
fps=num_frames,
|
155 |
+
feat_method=feat_method,
|
156 |
+
audio_feat_dim=aud_feat_dim,
|
157 |
+
train=(self.split == 'train'),
|
158 |
+
load_all=True,
|
159 |
+
split_trans_zero=self.split_trans_zero,
|
160 |
+
limbscaling=self.limbscaling,
|
161 |
+
num_frames=self.num_frames,
|
162 |
+
num_pre_frames=self.num_pre_frames,
|
163 |
+
num_generate_length=self.num_generate_length,
|
164 |
+
audio_feat_win_size=aud_feat_win_size,
|
165 |
+
context_info=context_info,
|
166 |
+
convert_to_6d=convert_to_6d,
|
167 |
+
expression=expression,
|
168 |
+
config=self.config
|
169 |
+
)
|
170 |
+
self.complete_data.append(self.dataset[key].complete_data)
|
171 |
+
######################load with a csv file
|
172 |
+
|
173 |
+
######################origin load method
|
174 |
+
elif load_mode=='json':
|
175 |
+
|
176 |
+
# if self.split == 'train':
|
177 |
+
# import pickle
|
178 |
+
# f = open('store.pkl', 'rb+')
|
179 |
+
# self.dataset=pickle.load(f)
|
180 |
+
# f.close()
|
181 |
+
# for key in self.dataset:
|
182 |
+
# self.complete_data.append(self.dataset[key].complete_data)
|
183 |
+
# else:https://pytorch-tutorial-assets.s3.amazonaws.com/VOiCES_devkit/source-16k/train/sp0307/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav
|
184 |
+
# if config.Model.model_type == 'face':
|
185 |
+
am = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
|
186 |
+
am_sr = 16000
|
187 |
+
# else:
|
188 |
+
# am, am_sr = None, None
|
189 |
+
for speaker_name in self.speakers:
|
190 |
+
speaker_root = os.path.join(self.data_root, speaker_name)
|
191 |
+
|
192 |
+
videos=[v for v in os.listdir(speaker_root) ]
|
193 |
+
print(videos)
|
194 |
+
|
195 |
+
haode = huaide = 0
|
196 |
+
|
197 |
+
for vid in tqdm(videos, desc="Processing training data of {}......".format(speaker_name)):
|
198 |
+
source_vid=vid
|
199 |
+
# vid_pth=os.path.join(speaker_root, source_vid, 'images/half', self.split)
|
200 |
+
vid_pth = os.path.join(speaker_root, source_vid, self.split)
|
201 |
+
if smplx == 'pose':
|
202 |
+
seqs = [s for s in os.listdir(vid_pth) if (s.startswith('clip'))]
|
203 |
+
else:
|
204 |
+
try:
|
205 |
+
seqs = [s for s in os.listdir(vid_pth)]
|
206 |
+
except:
|
207 |
+
continue
|
208 |
+
|
209 |
+
for s in seqs:
|
210 |
+
seq_root=os.path.join(vid_pth, s)
|
211 |
+
key = seq_root # correspond to clip******
|
212 |
+
audio_fname = os.path.join(speaker_root, source_vid, self.split, s, '%s.wav' % (s))
|
213 |
+
motion_fname = os.path.join(speaker_root, source_vid, self.split, s, '%s.pkl' % (s))
|
214 |
+
if not os.path.isfile(audio_fname) or not os.path.isfile(motion_fname):
|
215 |
+
huaide = huaide + 1
|
216 |
+
continue
|
217 |
+
|
218 |
+
self.dataset[key]=dataset(
|
219 |
+
data_root=seq_root,
|
220 |
+
speaker=speaker_name,
|
221 |
+
motion_fn=motion_fname,
|
222 |
+
audio_fn=audio_fname,
|
223 |
+
audio_sr=audio_sr,
|
224 |
+
fps=num_frames,
|
225 |
+
feat_method=feat_method,
|
226 |
+
audio_feat_dim=aud_feat_dim,
|
227 |
+
train=(self.split=='train'),
|
228 |
+
load_all=True,
|
229 |
+
split_trans_zero=self.split_trans_zero,
|
230 |
+
limbscaling=self.limbscaling,
|
231 |
+
num_frames=self.num_frames,
|
232 |
+
num_pre_frames=self.num_pre_frames,
|
233 |
+
num_generate_length=self.num_generate_length,
|
234 |
+
audio_feat_win_size=aud_feat_win_size,
|
235 |
+
context_info=context_info,
|
236 |
+
convert_to_6d=convert_to_6d,
|
237 |
+
expression=expression,
|
238 |
+
config=self.config,
|
239 |
+
am=am,
|
240 |
+
am_sr=am_sr,
|
241 |
+
whole_video=config.Data.whole_video
|
242 |
+
)
|
243 |
+
self.complete_data.append(self.dataset[key].complete_data)
|
244 |
+
haode = haode + 1
|
245 |
+
print("huaide:{}, haode:{}".format(huaide, haode))
|
246 |
+
import pickle
|
247 |
+
|
248 |
+
f = open(self.split+config.Data.pklname, 'wb')
|
249 |
+
pickle.dump(self.dataset, f)
|
250 |
+
f.close()
|
251 |
+
######################origin load method
|
252 |
+
|
253 |
+
self.complete_data=np.concatenate(self.complete_data, axis=0)
|
254 |
+
|
255 |
+
# assert self.complete_data.shape[-1] == (12+21+21)*2
|
256 |
+
self.normalize_stats = {}
|
257 |
+
|
258 |
+
self.data_mean = None
|
259 |
+
self.data_std = None
|
260 |
+
|
261 |
+
def get_dataset(self):
|
262 |
+
self.normalize_stats['mean'] = self.data_mean
|
263 |
+
self.normalize_stats['std'] = self.data_std
|
264 |
+
|
265 |
+
for key in list(self.dataset.keys()):
|
266 |
+
if self.dataset[key].complete_data.shape[0] < self.num_generate_length:
|
267 |
+
continue
|
268 |
+
self.dataset[key].num_generate_length = self.num_generate_length
|
269 |
+
self.dataset[key].get_dataset(self.normalization, self.normalize_stats, self.split)
|
270 |
+
self.all_dataset_list.append(self.dataset[key].all_dataset)
|
271 |
+
|
272 |
+
if self.split_trans_zero:
|
273 |
+
self.trans_dataset = data.ConcatDataset(self.trans_dataset_list)
|
274 |
+
self.zero_dataset = data.ConcatDataset(self.zero_dataset_list)
|
275 |
+
else:
|
276 |
+
self.all_dataset = data.ConcatDataset(self.all_dataset_list)
|
277 |
+
|
278 |
+
|
279 |
+
|
data_utils/dataset_preprocess.py
ADDED
@@ -0,0 +1,170 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
import pickle
|
3 |
+
from tqdm import tqdm
|
4 |
+
import shutil
|
5 |
+
import torch
|
6 |
+
import numpy as np
|
7 |
+
import librosa
|
8 |
+
import random
|
9 |
+
|
10 |
+
speakers = ['seth', 'conan', 'oliver', 'chemistry']
|
11 |
+
data_root = "../ExpressiveWholeBodyDatasetv1.0/"
|
12 |
+
split = 'train'
|
13 |
+
|
14 |
+
|
15 |
+
|
16 |
+
def split_list(full_list,shuffle=False,ratio=0.2):
|
17 |
+
n_total = len(full_list)
|
18 |
+
offset_0 = int(n_total * ratio)
|
19 |
+
offset_1 = int(n_total * ratio * 2)
|
20 |
+
if n_total==0 or offset_1<1:
|
21 |
+
return [],full_list
|
22 |
+
if shuffle:
|
23 |
+
random.shuffle(full_list)
|
24 |
+
sublist_0 = full_list[:offset_0]
|
25 |
+
sublist_1 = full_list[offset_0:offset_1]
|
26 |
+
sublist_2 = full_list[offset_1:]
|
27 |
+
return sublist_0, sublist_1, sublist_2
|
28 |
+
|
29 |
+
|
30 |
+
def moveto(list, file):
|
31 |
+
for f in list:
|
32 |
+
before, after = '/'.join(f.split('/')[:-1]), f.split('/')[-1]
|
33 |
+
new_path = os.path.join(before, file)
|
34 |
+
new_path = os.path.join(new_path, after)
|
35 |
+
# os.makedirs(new_path)
|
36 |
+
# os.path.isdir(new_path)
|
37 |
+
# shutil.move(f, new_path)
|
38 |
+
|
39 |
+
#转移到新目录
|
40 |
+
shutil.copytree(f, new_path)
|
41 |
+
#删除原train里的文件
|
42 |
+
shutil.rmtree(f)
|
43 |
+
return None
|
44 |
+
|
45 |
+
|
46 |
+
def read_pkl(data):
|
47 |
+
betas = np.array(data['betas'])
|
48 |
+
|
49 |
+
jaw_pose = np.array(data['jaw_pose'])
|
50 |
+
leye_pose = np.array(data['leye_pose'])
|
51 |
+
reye_pose = np.array(data['reye_pose'])
|
52 |
+
global_orient = np.array(data['global_orient']).squeeze()
|
53 |
+
body_pose = np.array(data['body_pose_axis'])
|
54 |
+
left_hand_pose = np.array(data['left_hand_pose'])
|
55 |
+
right_hand_pose = np.array(data['right_hand_pose'])
|
56 |
+
|
57 |
+
full_body = np.concatenate(
|
58 |
+
(jaw_pose, leye_pose, reye_pose, global_orient, body_pose, left_hand_pose, right_hand_pose), axis=1)
|
59 |
+
|
60 |
+
expression = np.array(data['expression'])
|
61 |
+
full_body = np.concatenate((full_body, expression), axis=1)
|
62 |
+
|
63 |
+
if (full_body.shape[0] < 90) or (torch.isnan(torch.from_numpy(full_body)).sum() > 0):
|
64 |
+
return 1
|
65 |
+
else:
|
66 |
+
return 0
|
67 |
+
|
68 |
+
|
69 |
+
for speaker_name in speakers:
|
70 |
+
speaker_root = os.path.join(data_root, speaker_name)
|
71 |
+
|
72 |
+
videos = [v for v in os.listdir(speaker_root)]
|
73 |
+
print(videos)
|
74 |
+
|
75 |
+
haode = huaide = 0
|
76 |
+
total_seqs = []
|
77 |
+
|
78 |
+
for vid in tqdm(videos, desc="Processing training data of {}......".format(speaker_name)):
|
79 |
+
# for vid in videos:
|
80 |
+
source_vid = vid
|
81 |
+
vid_pth = os.path.join(speaker_root, source_vid)
|
82 |
+
# vid_pth = os.path.join(speaker_root, source_vid, 'images/half', split)
|
83 |
+
t = os.path.join(speaker_root, source_vid, 'test')
|
84 |
+
v = os.path.join(speaker_root, source_vid, 'val')
|
85 |
+
|
86 |
+
# if os.path.exists(t):
|
87 |
+
# shutil.rmtree(t)
|
88 |
+
# if os.path.exists(v):
|
89 |
+
# shutil.rmtree(v)
|
90 |
+
try:
|
91 |
+
seqs = [s for s in os.listdir(vid_pth)]
|
92 |
+
except:
|
93 |
+
continue
|
94 |
+
# if len(seqs) == 0:
|
95 |
+
# shutil.rmtree(os.path.join(speaker_root, source_vid))
|
96 |
+
# None
|
97 |
+
for s in seqs:
|
98 |
+
quality = 0
|
99 |
+
total_seqs.append(os.path.join(vid_pth,s))
|
100 |
+
seq_root = os.path.join(vid_pth, s)
|
101 |
+
key = seq_root # correspond to clip******
|
102 |
+
audio_fname = os.path.join(speaker_root, source_vid, s, '%s.wav' % (s))
|
103 |
+
|
104 |
+
# delete the data without audio or the audio file could not be read
|
105 |
+
if os.path.isfile(audio_fname):
|
106 |
+
try:
|
107 |
+
audio = librosa.load(audio_fname)
|
108 |
+
except:
|
109 |
+
# print(key)
|
110 |
+
shutil.rmtree(key)
|
111 |
+
huaide = huaide + 1
|
112 |
+
continue
|
113 |
+
else:
|
114 |
+
huaide = huaide + 1
|
115 |
+
# print(key)
|
116 |
+
shutil.rmtree(key)
|
117 |
+
continue
|
118 |
+
|
119 |
+
# check motion file
|
120 |
+
motion_fname = os.path.join(speaker_root, source_vid, s, '%s.pkl' % (s))
|
121 |
+
try:
|
122 |
+
f = open(motion_fname, 'rb+')
|
123 |
+
except:
|
124 |
+
shutil.rmtree(key)
|
125 |
+
huaide = huaide + 1
|
126 |
+
continue
|
127 |
+
|
128 |
+
data = pickle.load(f)
|
129 |
+
w = read_pkl(data)
|
130 |
+
f.close()
|
131 |
+
quality = quality + w
|
132 |
+
|
133 |
+
if w == 1:
|
134 |
+
shutil.rmtree(key)
|
135 |
+
# print(key)
|
136 |
+
huaide = huaide + 1
|
137 |
+
continue
|
138 |
+
|
139 |
+
haode = haode + 1
|
140 |
+
|
141 |
+
print("huaide:{}, haode:{}, total_seqs:{}".format(huaide, haode, total_seqs.__len__()))
|
142 |
+
|
143 |
+
for speaker_name in speakers:
|
144 |
+
speaker_root = os.path.join(data_root, speaker_name)
|
145 |
+
|
146 |
+
videos = [v for v in os.listdir(speaker_root)]
|
147 |
+
print(videos)
|
148 |
+
|
149 |
+
haode = huaide = 0
|
150 |
+
total_seqs = []
|
151 |
+
|
152 |
+
for vid in tqdm(videos, desc="Processing training data of {}......".format(speaker_name)):
|
153 |
+
# for vid in videos:
|
154 |
+
source_vid = vid
|
155 |
+
vid_pth = os.path.join(speaker_root, source_vid)
|
156 |
+
try:
|
157 |
+
seqs = [s for s in os.listdir(vid_pth)]
|
158 |
+
except:
|
159 |
+
continue
|
160 |
+
for s in seqs:
|
161 |
+
quality = 0
|
162 |
+
total_seqs.append(os.path.join(vid_pth, s))
|
163 |
+
print("total_seqs:{}".format(total_seqs.__len__()))
|
164 |
+
# split the dataset
|
165 |
+
test_list, val_list, train_list = split_list(total_seqs, True, 0.1)
|
166 |
+
print(len(test_list), len(val_list), len(train_list))
|
167 |
+
moveto(train_list, 'train')
|
168 |
+
moveto(test_list, 'test')
|
169 |
+
moveto(val_list, 'val')
|
170 |
+
|
data_utils/get_j.py
ADDED
@@ -0,0 +1,51 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import torch
|
2 |
+
|
3 |
+
|
4 |
+
def to3d(poses, config):
|
5 |
+
if config.Data.pose.convert_to_6d:
|
6 |
+
if config.Data.pose.expression:
|
7 |
+
poses_exp = poses[:, -100:]
|
8 |
+
poses = poses[:, :-100]
|
9 |
+
|
10 |
+
poses = poses.reshape(poses.shape[0], -1, 5)
|
11 |
+
sin, cos = poses[:, :, 3], poses[:, :, 4]
|
12 |
+
pose_angle = torch.atan2(sin, cos)
|
13 |
+
poses = (poses[:, :, :3] * pose_angle.unsqueeze(dim=-1)).reshape(poses.shape[0], -1)
|
14 |
+
|
15 |
+
if config.Data.pose.expression:
|
16 |
+
poses = torch.cat([poses, poses_exp], dim=-1)
|
17 |
+
return poses
|
18 |
+
|
19 |
+
|
20 |
+
def get_joint(smplx_model, betas, pred):
|
21 |
+
joint = smplx_model(betas=betas.repeat(pred.shape[0], 1),
|
22 |
+
expression=pred[:, 165:265],
|
23 |
+
jaw_pose=pred[:, 0:3],
|
24 |
+
leye_pose=pred[:, 3:6],
|
25 |
+
reye_pose=pred[:, 6:9],
|
26 |
+
global_orient=pred[:, 9:12],
|
27 |
+
body_pose=pred[:, 12:75],
|
28 |
+
left_hand_pose=pred[:, 75:120],
|
29 |
+
right_hand_pose=pred[:, 120:165],
|
30 |
+
return_verts=True)['joints']
|
31 |
+
return joint
|
32 |
+
|
33 |
+
|
34 |
+
def get_joints(smplx_model, betas, pred):
|
35 |
+
if len(pred.shape) == 3:
|
36 |
+
B = pred.shape[0]
|
37 |
+
x = 4 if B>= 4 else B
|
38 |
+
T = pred.shape[1]
|
39 |
+
pred = pred.reshape(-1, 265)
|
40 |
+
smplx_model.batch_size = L = T * x
|
41 |
+
|
42 |
+
times = pred.shape[0] // smplx_model.batch_size
|
43 |
+
joints = []
|
44 |
+
for i in range(times):
|
45 |
+
joints.append(get_joint(smplx_model, betas, pred[i*L:(i+1)*L]))
|
46 |
+
joints = torch.cat(joints, dim=0)
|
47 |
+
joints = joints.reshape(B, T, -1, 3)
|
48 |
+
else:
|
49 |
+
smplx_model.batch_size = pred.shape[0]
|
50 |
+
joints = get_joint(smplx_model, betas, pred)
|
51 |
+
return joints
|
data_utils/hand_component.json
ADDED
The diff for this file is too large to render.
See raw diff
|
|
data_utils/lower_body.py
ADDED
@@ -0,0 +1,143 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import numpy as np
|
2 |
+
import torch
|
3 |
+
|
4 |
+
lower_pose = torch.tensor(
|
5 |
+
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0747, -0.0158, -0.0152, -1.1826512813568115, 0.23866955935955048,
|
6 |
+
0.15146760642528534, -1.2604516744613647, -0.3160211145877838,
|
7 |
+
-0.1603458970785141, 1.1654603481292725, 0.0, 0.0, 1.2521806955337524, 0.041598282754421234, -0.06312154978513718,
|
8 |
+
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
|
9 |
+
lower_pose_stand = torch.tensor([
|
10 |
+
8.9759e-04, 7.1074e-04, -5.9163e-06, 8.9759e-04, 7.1074e-04, -5.9163e-06,
|
11 |
+
3.0747, -0.0158, -0.0152,
|
12 |
+
-3.6665e-01, -8.8455e-03, 1.6113e-01, -3.6665e-01, -8.8455e-03, 1.6113e-01,
|
13 |
+
-3.9716e-01, -4.0229e-02, -1.2637e-01,
|
14 |
+
7.9163e-01, 6.8519e-02, -1.5091e-01, 7.9163e-01, 6.8519e-02, -1.5091e-01,
|
15 |
+
7.8632e-01, -4.3810e-02, 1.4375e-02,
|
16 |
+
-1.0675e-01, 1.2635e-01, 1.6711e-02, -1.0675e-01, 1.2635e-01, 1.6711e-02, ])
|
17 |
+
# lower_pose_stand = torch.tensor(
|
18 |
+
# [6.4919e-02, 3.3018e-02, 1.7485e-02, 8.9759e-04, 7.1074e-04, -5.9163e-06,
|
19 |
+
# 3.0747, -0.0158, -0.0152,
|
20 |
+
# -3.3633e+00, -9.3915e-02, 3.0996e-01, -3.6665e-01, -8.8455e-03, 1.6113e-01,
|
21 |
+
# 1.1654603481292725, 0.0, 0.0,
|
22 |
+
# 4.4167e-01, 6.7183e-03, -3.6379e-03, 7.9163e-01, 6.8519e-02, -1.5091e-01,
|
23 |
+
# 0.0, 0.0, 0.0,
|
24 |
+
# 2.2910e-02, -2.4797e-02, -5.5657e-03, -1.0675e-01, 1.2635e-01, 1.6711e-02,])
|
25 |
+
lower_body = [0, 1, 3, 4, 6, 7, 9, 10]
|
26 |
+
count_part = [6, 9, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
|
27 |
+
29, 30, 31, 32, 33, 34, 35, 36, 37,
|
28 |
+
38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54]
|
29 |
+
fix_index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
|
30 |
+
29,
|
31 |
+
35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
|
32 |
+
50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
|
33 |
+
65, 66, 67, 68, 69, 70, 71, 72, 73, 74]
|
34 |
+
all_index = np.ones(275)
|
35 |
+
all_index[fix_index] = 0
|
36 |
+
c_index = []
|
37 |
+
i = 0
|
38 |
+
for num in all_index:
|
39 |
+
if num == 1:
|
40 |
+
c_index.append(i)
|
41 |
+
i = i + 1
|
42 |
+
c_index = np.asarray(c_index)
|
43 |
+
|
44 |
+
fix_index_3d = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
|
45 |
+
21, 22, 23, 24, 25, 26,
|
46 |
+
30, 31, 32, 33, 34, 35,
|
47 |
+
45, 46, 47, 48, 49, 50]
|
48 |
+
all_index_3d = np.ones(165)
|
49 |
+
all_index_3d[fix_index_3d] = 0
|
50 |
+
c_index_3d = []
|
51 |
+
i = 0
|
52 |
+
for num in all_index_3d:
|
53 |
+
if num == 1:
|
54 |
+
c_index_3d.append(i)
|
55 |
+
i = i + 1
|
56 |
+
c_index_3d = np.asarray(c_index_3d)
|
57 |
+
|
58 |
+
c_index_6d = []
|
59 |
+
i = 0
|
60 |
+
for num in all_index_3d:
|
61 |
+
if num == 1:
|
62 |
+
c_index_6d.append(2*i)
|
63 |
+
c_index_6d.append(2 * i + 1)
|
64 |
+
i = i + 1
|
65 |
+
c_index_6d = np.asarray(c_index_6d)
|
66 |
+
|
67 |
+
|
68 |
+
def part2full(input, stand=False):
|
69 |
+
if stand:
|
70 |
+
# lp = lower_pose_stand.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
|
71 |
+
lp = torch.zeros_like(lower_pose)
|
72 |
+
lp[6:9] = torch.tensor([3.0747, -0.0158, -0.0152])
|
73 |
+
lp = lp.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
|
74 |
+
else:
|
75 |
+
lp = lower_pose.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
|
76 |
+
|
77 |
+
input = torch.cat([input[:, :3],
|
78 |
+
lp[:, :15],
|
79 |
+
input[:, 3:6],
|
80 |
+
lp[:, 15:21],
|
81 |
+
input[:, 6:9],
|
82 |
+
lp[:, 21:27],
|
83 |
+
input[:, 9:12],
|
84 |
+
lp[:, 27:],
|
85 |
+
input[:, 12:]]
|
86 |
+
, dim=1)
|
87 |
+
return input
|
88 |
+
|
89 |
+
|
90 |
+
def pred2poses(input, gt):
|
91 |
+
input = torch.cat([input[:, :3],
|
92 |
+
gt[0:1, 3:18].repeat(input.shape[0], 1),
|
93 |
+
input[:, 3:6],
|
94 |
+
gt[0:1, 21:27].repeat(input.shape[0], 1),
|
95 |
+
input[:, 6:9],
|
96 |
+
gt[0:1, 30:36].repeat(input.shape[0], 1),
|
97 |
+
input[:, 9:12],
|
98 |
+
gt[0:1, 39:45].repeat(input.shape[0], 1),
|
99 |
+
input[:, 12:]]
|
100 |
+
, dim=1)
|
101 |
+
return input
|
102 |
+
|
103 |
+
|
104 |
+
def poses2poses(input, gt):
|
105 |
+
input = torch.cat([input[:, :3],
|
106 |
+
gt[0:1, 3:18].repeat(input.shape[0], 1),
|
107 |
+
input[:, 18:21],
|
108 |
+
gt[0:1, 21:27].repeat(input.shape[0], 1),
|
109 |
+
input[:, 27:30],
|
110 |
+
gt[0:1, 30:36].repeat(input.shape[0], 1),
|
111 |
+
input[:, 36:39],
|
112 |
+
gt[0:1, 39:45].repeat(input.shape[0], 1),
|
113 |
+
input[:, 45:]]
|
114 |
+
, dim=1)
|
115 |
+
return input
|
116 |
+
|
117 |
+
def poses2pred(input, stand=False):
|
118 |
+
if stand:
|
119 |
+
lp = lower_pose_stand.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
|
120 |
+
# lp = torch.zeros_like(lower_pose).unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
|
121 |
+
else:
|
122 |
+
lp = lower_pose.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
|
123 |
+
input = torch.cat([input[:, :3],
|
124 |
+
lp[:, :15],
|
125 |
+
input[:, 18:21],
|
126 |
+
lp[:, 15:21],
|
127 |
+
input[:, 27:30],
|
128 |
+
lp[:, 21:27],
|
129 |
+
input[:, 36:39],
|
130 |
+
lp[:, 27:],
|
131 |
+
input[:, 45:]]
|
132 |
+
, dim=1)
|
133 |
+
return input
|
134 |
+
|
135 |
+
|
136 |
+
rearrange = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]\
|
137 |
+
# ,22, 23, 24, 25, 40, 26, 41,
|
138 |
+
# 27, 42, 28, 43, 29, 44, 30, 45, 31, 46, 32, 47, 33, 48, 34, 49, 35, 50, 36, 51, 37, 52, 38, 53, 39, 54, 55,
|
139 |
+
# 57, 56, 59, 58, 60, 63, 61, 64, 62, 65, 66, 71, 67, 72, 68, 73, 69, 74, 70, 75]
|
140 |
+
|
141 |
+
symmetry = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]#, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
142 |
+
# 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
143 |
+
# 1, 1, 1, 1, 1, 1]
|
data_utils/mesh_dataset.py
ADDED
@@ -0,0 +1,348 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import pickle
|
2 |
+
import sys
|
3 |
+
import os
|
4 |
+
|
5 |
+
sys.path.append(os.getcwd())
|
6 |
+
|
7 |
+
import json
|
8 |
+
from glob import glob
|
9 |
+
from data_utils.utils import *
|
10 |
+
import torch.utils.data as data
|
11 |
+
from data_utils.consts import speaker_id
|
12 |
+
from data_utils.lower_body import count_part
|
13 |
+
import random
|
14 |
+
from data_utils.rotation_conversion import axis_angle_to_matrix, matrix_to_rotation_6d
|
15 |
+
|
16 |
+
with open('data_utils/hand_component.json') as file_obj:
|
17 |
+
comp = json.load(file_obj)
|
18 |
+
left_hand_c = np.asarray(comp['left'])
|
19 |
+
right_hand_c = np.asarray(comp['right'])
|
20 |
+
|
21 |
+
|
22 |
+
def to3d(data):
|
23 |
+
left_hand_pose = np.einsum('bi,ij->bj', data[:, 75:87], left_hand_c[:12, :])
|
24 |
+
right_hand_pose = np.einsum('bi,ij->bj', data[:, 87:99], right_hand_c[:12, :])
|
25 |
+
data = np.concatenate((data[:, :75], left_hand_pose, right_hand_pose), axis=-1)
|
26 |
+
return data
|
27 |
+
|
28 |
+
|
29 |
+
class SmplxDataset():
|
30 |
+
'''
|
31 |
+
creat a dataset for every segment and concat.
|
32 |
+
'''
|
33 |
+
|
34 |
+
def __init__(self,
|
35 |
+
data_root,
|
36 |
+
speaker,
|
37 |
+
motion_fn,
|
38 |
+
audio_fn,
|
39 |
+
audio_sr,
|
40 |
+
fps,
|
41 |
+
feat_method='mel_spec',
|
42 |
+
audio_feat_dim=64,
|
43 |
+
audio_feat_win_size=None,
|
44 |
+
|
45 |
+
train=True,
|
46 |
+
load_all=False,
|
47 |
+
split_trans_zero=False,
|
48 |
+
limbscaling=False,
|
49 |
+
num_frames=25,
|
50 |
+
num_pre_frames=25,
|
51 |
+
num_generate_length=25,
|
52 |
+
context_info=False,
|
53 |
+
convert_to_6d=False,
|
54 |
+
expression=False,
|
55 |
+
config=None,
|
56 |
+
am=None,
|
57 |
+
am_sr=None,
|
58 |
+
whole_video=False
|
59 |
+
):
|
60 |
+
|
61 |
+
self.data_root = data_root
|
62 |
+
self.speaker = speaker
|
63 |
+
|
64 |
+
self.feat_method = feat_method
|
65 |
+
self.audio_fn = audio_fn
|
66 |
+
self.audio_sr = audio_sr
|
67 |
+
self.fps = fps
|
68 |
+
self.audio_feat_dim = audio_feat_dim
|
69 |
+
self.audio_feat_win_size = audio_feat_win_size
|
70 |
+
self.context_info = context_info # for aud feat
|
71 |
+
self.convert_to_6d = convert_to_6d
|
72 |
+
self.expression = expression
|
73 |
+
|
74 |
+
self.train = train
|
75 |
+
self.load_all = load_all
|
76 |
+
self.split_trans_zero = split_trans_zero
|
77 |
+
self.limbscaling = limbscaling
|
78 |
+
self.num_frames = num_frames
|
79 |
+
self.num_pre_frames = num_pre_frames
|
80 |
+
self.num_generate_length = num_generate_length
|
81 |
+
# print('num_generate_length ', self.num_generate_length)
|
82 |
+
|
83 |
+
self.config = config
|
84 |
+
self.am_sr = am_sr
|
85 |
+
self.whole_video = whole_video
|
86 |
+
load_mode = self.config.dataset_load_mode
|
87 |
+
|
88 |
+
if load_mode == 'pickle':
|
89 |
+
raise NotImplementedError
|
90 |
+
|
91 |
+
elif load_mode == 'csv':
|
92 |
+
import pickle
|
93 |
+
with open(data_root, 'rb') as f:
|
94 |
+
u = pickle._Unpickler(f)
|
95 |
+
data = u.load()
|
96 |
+
self.data = data[0]
|
97 |
+
if self.load_all:
|
98 |
+
self._load_npz_all()
|
99 |
+
|
100 |
+
elif load_mode == 'json':
|
101 |
+
self.annotations = glob(data_root + '/*pkl')
|
102 |
+
if len(self.annotations) == 0:
|
103 |
+
raise FileNotFoundError(data_root + ' are empty')
|
104 |
+
self.annotations = sorted(self.annotations)
|
105 |
+
self.img_name_list = self.annotations
|
106 |
+
|
107 |
+
if self.load_all:
|
108 |
+
self._load_them_all(am, am_sr, motion_fn)
|
109 |
+
|
110 |
+
def _load_npz_all(self):
|
111 |
+
self.loaded_data = {}
|
112 |
+
self.complete_data = []
|
113 |
+
data = self.data
|
114 |
+
shape = data['body_pose_axis'].shape[0]
|
115 |
+
self.betas = data['betas']
|
116 |
+
self.img_name_list = []
|
117 |
+
for index in range(shape):
|
118 |
+
img_name = f'{index:6d}'
|
119 |
+
self.img_name_list.append(img_name)
|
120 |
+
|
121 |
+
jaw_pose = data['jaw_pose'][index]
|
122 |
+
leye_pose = data['leye_pose'][index]
|
123 |
+
reye_pose = data['reye_pose'][index]
|
124 |
+
global_orient = data['global_orient'][index]
|
125 |
+
body_pose = data['body_pose_axis'][index]
|
126 |
+
left_hand_pose = data['left_hand_pose'][index]
|
127 |
+
right_hand_pose = data['right_hand_pose'][index]
|
128 |
+
|
129 |
+
full_body = np.concatenate(
|
130 |
+
(jaw_pose, leye_pose, reye_pose, global_orient, body_pose, left_hand_pose, right_hand_pose))
|
131 |
+
assert full_body.shape[0] == 99
|
132 |
+
if self.convert_to_6d:
|
133 |
+
full_body = to3d(full_body)
|
134 |
+
full_body = torch.from_numpy(full_body)
|
135 |
+
full_body = matrix_to_rotation_6d(axis_angle_to_matrix(full_body))
|
136 |
+
full_body = np.asarray(full_body)
|
137 |
+
if self.expression:
|
138 |
+
expression = data['expression'][index]
|
139 |
+
full_body = np.concatenate((full_body, expression))
|
140 |
+
# full_body = np.concatenate((full_body, non_zero))
|
141 |
+
else:
|
142 |
+
full_body = to3d(full_body)
|
143 |
+
if self.expression:
|
144 |
+
expression = data['expression'][index]
|
145 |
+
full_body = np.concatenate((full_body, expression))
|
146 |
+
|
147 |
+
self.loaded_data[img_name] = full_body.reshape(-1)
|
148 |
+
self.complete_data.append(full_body.reshape(-1))
|
149 |
+
|
150 |
+
self.complete_data = np.array(self.complete_data)
|
151 |
+
|
152 |
+
if self.audio_feat_win_size is not None:
|
153 |
+
self.audio_feat = get_mfcc_old(self.audio_fn).transpose(1, 0)
|
154 |
+
# print(self.audio_feat.shape)
|
155 |
+
else:
|
156 |
+
if self.feat_method == 'mel_spec':
|
157 |
+
self.audio_feat = get_melspec(self.audio_fn, fps=self.fps, sr=self.audio_sr, n_mels=self.audio_feat_dim)
|
158 |
+
elif self.feat_method == 'mfcc':
|
159 |
+
self.audio_feat = get_mfcc(self.audio_fn,
|
160 |
+
smlpx=True,
|
161 |
+
sr=self.audio_sr,
|
162 |
+
n_mfcc=self.audio_feat_dim,
|
163 |
+
win_size=self.audio_feat_win_size
|
164 |
+
)
|
165 |
+
|
166 |
+
def _load_them_all(self, am, am_sr, motion_fn):
|
167 |
+
self.loaded_data = {}
|
168 |
+
self.complete_data = []
|
169 |
+
f = open(motion_fn, 'rb+')
|
170 |
+
data = pickle.load(f)
|
171 |
+
|
172 |
+
self.betas = np.array(data['betas'])
|
173 |
+
|
174 |
+
jaw_pose = np.array(data['jaw_pose'])
|
175 |
+
leye_pose = np.array(data['leye_pose'])
|
176 |
+
reye_pose = np.array(data['reye_pose'])
|
177 |
+
global_orient = np.array(data['global_orient']).squeeze()
|
178 |
+
body_pose = np.array(data['body_pose_axis'])
|
179 |
+
left_hand_pose = np.array(data['left_hand_pose'])
|
180 |
+
right_hand_pose = np.array(data['right_hand_pose'])
|
181 |
+
|
182 |
+
full_body = np.concatenate(
|
183 |
+
(jaw_pose, leye_pose, reye_pose, global_orient, body_pose, left_hand_pose, right_hand_pose), axis=1)
|
184 |
+
assert full_body.shape[1] == 99
|
185 |
+
|
186 |
+
|
187 |
+
if self.convert_to_6d:
|
188 |
+
full_body = to3d(full_body)
|
189 |
+
full_body = torch.from_numpy(full_body)
|
190 |
+
full_body = matrix_to_rotation_6d(axis_angle_to_matrix(full_body.reshape(-1, 55, 3))).reshape(-1, 330)
|
191 |
+
full_body = np.asarray(full_body)
|
192 |
+
if self.expression:
|
193 |
+
expression = np.array(data['expression'])
|
194 |
+
full_body = np.concatenate((full_body, expression), axis=1)
|
195 |
+
|
196 |
+
else:
|
197 |
+
full_body = to3d(full_body)
|
198 |
+
expression = np.array(data['expression'])
|
199 |
+
full_body = np.concatenate((full_body, expression), axis=1)
|
200 |
+
|
201 |
+
self.complete_data = full_body
|
202 |
+
self.complete_data = np.array(self.complete_data)
|
203 |
+
|
204 |
+
if self.audio_feat_win_size is not None:
|
205 |
+
self.audio_feat = get_mfcc_old(self.audio_fn).transpose(1, 0)
|
206 |
+
else:
|
207 |
+
# if self.feat_method == 'mel_spec':
|
208 |
+
# self.audio_feat = get_melspec(self.audio_fn, fps=self.fps, sr=self.audio_sr, n_mels=self.audio_feat_dim)
|
209 |
+
# elif self.feat_method == 'mfcc':
|
210 |
+
self.audio_feat = get_mfcc_ta(self.audio_fn,
|
211 |
+
smlpx=True,
|
212 |
+
fps=30,
|
213 |
+
sr=self.audio_sr,
|
214 |
+
n_mfcc=self.audio_feat_dim,
|
215 |
+
win_size=self.audio_feat_win_size,
|
216 |
+
type=self.feat_method,
|
217 |
+
am=am,
|
218 |
+
am_sr=am_sr,
|
219 |
+
encoder_choice=self.config.Model.encoder_choice,
|
220 |
+
)
|
221 |
+
# with open(audio_file, 'w', encoding='utf-8') as file:
|
222 |
+
# file.write(json.dumps(self.audio_feat.__array__().tolist(), indent=0, ensure_ascii=False))
|
223 |
+
|
224 |
+
def get_dataset(self, normalization=False, normalize_stats=None, split='train'):
|
225 |
+
|
226 |
+
class __Worker__(data.Dataset):
|
227 |
+
def __init__(child, index_list, normalization, normalize_stats, split='train') -> None:
|
228 |
+
super().__init__()
|
229 |
+
child.index_list = index_list
|
230 |
+
child.normalization = normalization
|
231 |
+
child.normalize_stats = normalize_stats
|
232 |
+
child.split = split
|
233 |
+
|
234 |
+
def __getitem__(child, index):
|
235 |
+
num_generate_length = self.num_generate_length
|
236 |
+
num_pre_frames = self.num_pre_frames
|
237 |
+
seq_len = num_generate_length + num_pre_frames
|
238 |
+
# print(num_generate_length)
|
239 |
+
|
240 |
+
index = child.index_list[index]
|
241 |
+
index_new = index + random.randrange(0, 5, 3)
|
242 |
+
if index_new + seq_len > self.complete_data.shape[0]:
|
243 |
+
index_new = index
|
244 |
+
index = index_new
|
245 |
+
|
246 |
+
if child.split in ['val', 'pre', 'test'] or self.whole_video:
|
247 |
+
index = 0
|
248 |
+
seq_len = self.complete_data.shape[0]
|
249 |
+
seq_data = []
|
250 |
+
assert index + seq_len <= self.complete_data.shape[0]
|
251 |
+
# print(seq_len)
|
252 |
+
seq_data = self.complete_data[index:(index + seq_len), :]
|
253 |
+
seq_data = np.array(seq_data)
|
254 |
+
|
255 |
+
'''
|
256 |
+
audio feature,
|
257 |
+
'''
|
258 |
+
if not self.context_info:
|
259 |
+
if not self.whole_video:
|
260 |
+
audio_feat = self.audio_feat[index:index + seq_len, ...]
|
261 |
+
if audio_feat.shape[0] < seq_len:
|
262 |
+
audio_feat = np.pad(audio_feat, [[0, seq_len - audio_feat.shape[0]], [0, 0]],
|
263 |
+
mode='reflect')
|
264 |
+
|
265 |
+
assert audio_feat.shape[0] == seq_len and audio_feat.shape[1] == self.audio_feat_dim
|
266 |
+
else:
|
267 |
+
audio_feat = self.audio_feat
|
268 |
+
|
269 |
+
else: # including feature and history
|
270 |
+
if self.audio_feat_win_size is None:
|
271 |
+
audio_feat = self.audio_feat[index:index + seq_len + num_pre_frames, ...]
|
272 |
+
if audio_feat.shape[0] < seq_len + num_pre_frames:
|
273 |
+
audio_feat = np.pad(audio_feat,
|
274 |
+
[[0, seq_len + self.num_frames - audio_feat.shape[0]], [0, 0]],
|
275 |
+
mode='constant')
|
276 |
+
|
277 |
+
assert audio_feat.shape[0] == self.num_frames + seq_len and audio_feat.shape[
|
278 |
+
1] == self.audio_feat_dim
|
279 |
+
|
280 |
+
if child.normalization:
|
281 |
+
data_mean = child.normalize_stats['mean'].reshape(1, -1)
|
282 |
+
data_std = child.normalize_stats['std'].reshape(1, -1)
|
283 |
+
seq_data[:, :330] = (seq_data[:, :330] - data_mean) / data_std
|
284 |
+
if child.split in ['train', 'test']:
|
285 |
+
if self.convert_to_6d:
|
286 |
+
if self.expression:
|
287 |
+
data_sample = {
|
288 |
+
'poses': seq_data[:, :330].astype(np.float).transpose(1, 0),
|
289 |
+
'expression': seq_data[:, 330:].astype(np.float).transpose(1, 0),
|
290 |
+
# 'nzero': seq_data[:, 375:].astype(np.float).transpose(1, 0),
|
291 |
+
'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
|
292 |
+
'speaker': speaker_id[self.speaker],
|
293 |
+
'betas': self.betas,
|
294 |
+
'aud_file': self.audio_fn,
|
295 |
+
}
|
296 |
+
else:
|
297 |
+
data_sample = {
|
298 |
+
'poses': seq_data[:, :330].astype(np.float).transpose(1, 0),
|
299 |
+
'nzero': seq_data[:, 330:].astype(np.float).transpose(1, 0),
|
300 |
+
'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
|
301 |
+
'speaker': speaker_id[self.speaker],
|
302 |
+
'betas': self.betas
|
303 |
+
}
|
304 |
+
else:
|
305 |
+
if self.expression:
|
306 |
+
data_sample = {
|
307 |
+
'poses': seq_data[:, :165].astype(np.float).transpose(1, 0),
|
308 |
+
'expression': seq_data[:, 165:].astype(np.float).transpose(1, 0),
|
309 |
+
'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
|
310 |
+
# 'wv2_feat': wv2_feat.astype(np.float).transpose(1, 0),
|
311 |
+
'speaker': speaker_id[self.speaker],
|
312 |
+
'aud_file': self.audio_fn,
|
313 |
+
'betas': self.betas
|
314 |
+
}
|
315 |
+
else:
|
316 |
+
data_sample = {
|
317 |
+
'poses': seq_data.astype(np.float).transpose(1, 0),
|
318 |
+
'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
|
319 |
+
'speaker': speaker_id[self.speaker],
|
320 |
+
'betas': self.betas
|
321 |
+
}
|
322 |
+
return data_sample
|
323 |
+
else:
|
324 |
+
data_sample = {
|
325 |
+
'poses': seq_data[:, :330].astype(np.float).transpose(1, 0),
|
326 |
+
'expression': seq_data[:, 330:].astype(np.float).transpose(1, 0),
|
327 |
+
# 'nzero': seq_data[:, 325:].astype(np.float).transpose(1, 0),
|
328 |
+
'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
|
329 |
+
'aud_file': self.audio_fn,
|
330 |
+
'speaker': speaker_id[self.speaker],
|
331 |
+
'betas': self.betas
|
332 |
+
}
|
333 |
+
return data_sample
|
334 |
+
def __len__(child):
|
335 |
+
return len(child.index_list)
|
336 |
+
|
337 |
+
if split == 'train':
|
338 |
+
index_list = list(
|
339 |
+
range(0, min(self.complete_data.shape[0], self.audio_feat.shape[0]) - self.num_generate_length - self.num_pre_frames,
|
340 |
+
6))
|
341 |
+
elif split in ['val', 'test']:
|
342 |
+
index_list = list([0])
|
343 |
+
if self.whole_video:
|
344 |
+
index_list = list([0])
|
345 |
+
self.all_dataset = __Worker__(index_list, normalization, normalize_stats, split)
|
346 |
+
|
347 |
+
def __len__(self):
|
348 |
+
return len(self.img_name_list)
|
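# --- Hedged usage sketch (not part of this commit) ---
# How the per-sequence dataset produced by get_dataset() above can be wrapped in a
# standard PyTorch DataLoader. `dataset` stands for an already-constructed instance
# of the dataset class defined in this file; its constructor arguments follow the
# signature shown in the diff, and the batch keys ('poses', 'aud_feat', ...) are the
# ones returned by __Worker__.__getitem__.
import torch.utils.data as data

dataset.get_dataset(normalization=False, normalize_stats=None, split='train')
loader = data.DataLoader(dataset.all_dataset, batch_size=32, shuffle=True, drop_last=True)
for batch in loader:
    poses = batch['poses']        # (B, 330 or 165, seq_len): 6D or axis-angle pose sequence
    aud_feat = batch['aud_feat']  # (B, audio_feat_dim, seq_len): aligned audio features
    break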
data_utils/rotation_conversion.py
ADDED
@@ -0,0 +1,551 @@
1 |
+
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
|
2 |
+
# Check PYTORCH3D_LICENCE before use
|
3 |
+
|
4 |
+
import functools
|
5 |
+
from typing import Optional
|
6 |
+
|
7 |
+
import torch
|
8 |
+
import torch.nn.functional as F
|
9 |
+
|
10 |
+
|
11 |
+
"""
|
12 |
+
The transformation matrices returned from the functions in this file assume
|
13 |
+
the points on which the transformation will be applied are column vectors.
|
14 |
+
i.e. the R matrix is structured as
|
15 |
+
|
16 |
+
R = [
|
17 |
+
[Rxx, Rxy, Rxz],
|
18 |
+
[Ryx, Ryy, Ryz],
|
19 |
+
[Rzx, Rzy, Rzz],
|
20 |
+
] # (3, 3)
|
21 |
+
|
22 |
+
This matrix can be applied to column vectors by post multiplication
|
23 |
+
by the points e.g.
|
24 |
+
|
25 |
+
points = [[0], [1], [2]] # (3 x 1) xyz coordinates of a point
|
26 |
+
transformed_points = R * points
|
27 |
+
|
28 |
+
To apply the same matrix to points which are row vectors, the R matrix
|
29 |
+
can be transposed and pre multiplied by the points:
|
30 |
+
|
31 |
+
e.g.
|
32 |
+
points = [[0, 1, 2]] # (1 x 3) xyz coordinates of a point
|
33 |
+
transformed_points = points * R.transpose(1, 0)
|
34 |
+
"""
|
35 |
+
|
36 |
+
|
37 |
+
def quaternion_to_matrix(quaternions):
|
38 |
+
"""
|
39 |
+
Convert rotations given as quaternions to rotation matrices.
|
40 |
+
|
41 |
+
Args:
|
42 |
+
quaternions: quaternions with real part first,
|
43 |
+
as tensor of shape (..., 4).
|
44 |
+
|
45 |
+
Returns:
|
46 |
+
Rotation matrices as tensor of shape (..., 3, 3).
|
47 |
+
"""
|
48 |
+
r, i, j, k = torch.unbind(quaternions, -1)
|
49 |
+
two_s = 2.0 / (quaternions * quaternions).sum(-1)
|
50 |
+
|
51 |
+
o = torch.stack(
|
52 |
+
(
|
53 |
+
1 - two_s * (j * j + k * k),
|
54 |
+
two_s * (i * j - k * r),
|
55 |
+
two_s * (i * k + j * r),
|
56 |
+
two_s * (i * j + k * r),
|
57 |
+
1 - two_s * (i * i + k * k),
|
58 |
+
two_s * (j * k - i * r),
|
59 |
+
two_s * (i * k - j * r),
|
60 |
+
two_s * (j * k + i * r),
|
61 |
+
1 - two_s * (i * i + j * j),
|
62 |
+
),
|
63 |
+
-1,
|
64 |
+
)
|
65 |
+
return o.reshape(quaternions.shape[:-1] + (3, 3))
|
66 |
+
|
67 |
+
|
68 |
+
def _copysign(a, b):
|
69 |
+
"""
|
70 |
+
Return a tensor where each element has the absolute value taken from the,
|
71 |
+
corresponding element of a, with sign taken from the corresponding
|
72 |
+
element of b. This is like the standard copysign floating-point operation,
|
73 |
+
but is not careful about negative 0 and NaN.
|
74 |
+
|
75 |
+
Args:
|
76 |
+
a: source tensor.
|
77 |
+
b: tensor whose signs will be used, of the same shape as a.
|
78 |
+
|
79 |
+
Returns:
|
80 |
+
Tensor of the same shape as a with the signs of b.
|
81 |
+
"""
|
82 |
+
signs_differ = (a < 0) != (b < 0)
|
83 |
+
return torch.where(signs_differ, -a, a)
|
84 |
+
|
85 |
+
|
86 |
+
def _sqrt_positive_part(x):
|
87 |
+
"""
|
88 |
+
Returns torch.sqrt(torch.max(0, x))
|
89 |
+
but with a zero subgradient where x is 0.
|
90 |
+
"""
|
91 |
+
ret = torch.zeros_like(x)
|
92 |
+
positive_mask = x > 0
|
93 |
+
ret[positive_mask] = torch.sqrt(x[positive_mask])
|
94 |
+
return ret
|
95 |
+
|
96 |
+
|
97 |
+
def matrix_to_quaternion(matrix):
|
98 |
+
"""
|
99 |
+
Convert rotations given as rotation matrices to quaternions.
|
100 |
+
|
101 |
+
Args:
|
102 |
+
matrix: Rotation matrices as tensor of shape (..., 3, 3).
|
103 |
+
|
104 |
+
Returns:
|
105 |
+
quaternions with real part first, as tensor of shape (..., 4).
|
106 |
+
"""
|
107 |
+
if matrix.size(-1) != 3 or matrix.size(-2) != 3:
|
108 |
+
raise ValueError(f"Invalid rotation matrix shape f{matrix.shape}.")
|
109 |
+
m00 = matrix[..., 0, 0]
|
110 |
+
m11 = matrix[..., 1, 1]
|
111 |
+
m22 = matrix[..., 2, 2]
|
112 |
+
o0 = 0.5 * _sqrt_positive_part(1 + m00 + m11 + m22)
|
113 |
+
x = 0.5 * _sqrt_positive_part(1 + m00 - m11 - m22)
|
114 |
+
y = 0.5 * _sqrt_positive_part(1 - m00 + m11 - m22)
|
115 |
+
z = 0.5 * _sqrt_positive_part(1 - m00 - m11 + m22)
|
116 |
+
o1 = _copysign(x, matrix[..., 2, 1] - matrix[..., 1, 2])
|
117 |
+
o2 = _copysign(y, matrix[..., 0, 2] - matrix[..., 2, 0])
|
118 |
+
o3 = _copysign(z, matrix[..., 1, 0] - matrix[..., 0, 1])
|
119 |
+
return torch.stack((o0, o1, o2, o3), -1)
|
120 |
+
|
121 |
+
|
122 |
+
def _axis_angle_rotation(axis: str, angle):
|
123 |
+
"""
|
124 |
+
Return the rotation matrices for one of the rotations about an axis
|
125 |
+
of which Euler angles describe, for each value of the angle given.
|
126 |
+
|
127 |
+
Args:
|
128 |
+
axis: Axis label "X" or "Y or "Z".
|
129 |
+
angle: any shape tensor of Euler angles in radians
|
130 |
+
|
131 |
+
Returns:
|
132 |
+
Rotation matrices as tensor of shape (..., 3, 3).
|
133 |
+
"""
|
134 |
+
|
135 |
+
cos = torch.cos(angle)
|
136 |
+
sin = torch.sin(angle)
|
137 |
+
one = torch.ones_like(angle)
|
138 |
+
zero = torch.zeros_like(angle)
|
139 |
+
|
140 |
+
if axis == "X":
|
141 |
+
R_flat = (one, zero, zero, zero, cos, -sin, zero, sin, cos)
|
142 |
+
if axis == "Y":
|
143 |
+
R_flat = (cos, zero, sin, zero, one, zero, -sin, zero, cos)
|
144 |
+
if axis == "Z":
|
145 |
+
R_flat = (cos, -sin, zero, sin, cos, zero, zero, zero, one)
|
146 |
+
|
147 |
+
return torch.stack(R_flat, -1).reshape(angle.shape + (3, 3))
|
148 |
+
|
149 |
+
|
150 |
+
def euler_angles_to_matrix(euler_angles, convention: str):
|
151 |
+
"""
|
152 |
+
Convert rotations given as Euler angles in radians to rotation matrices.
|
153 |
+
|
154 |
+
Args:
|
155 |
+
euler_angles: Euler angles in radians as tensor of shape (..., 3).
|
156 |
+
convention: Convention string of three uppercase letters from
|
157 |
+
{"X", "Y", and "Z"}.
|
158 |
+
|
159 |
+
Returns:
|
160 |
+
Rotation matrices as tensor of shape (..., 3, 3).
|
161 |
+
"""
|
162 |
+
if euler_angles.dim() == 0 or euler_angles.shape[-1] != 3:
|
163 |
+
raise ValueError("Invalid input euler angles.")
|
164 |
+
if len(convention) != 3:
|
165 |
+
raise ValueError("Convention must have 3 letters.")
|
166 |
+
if convention[1] in (convention[0], convention[2]):
|
167 |
+
raise ValueError(f"Invalid convention {convention}.")
|
168 |
+
for letter in convention:
|
169 |
+
if letter not in ("X", "Y", "Z"):
|
170 |
+
raise ValueError(f"Invalid letter {letter} in convention string.")
|
171 |
+
matrices = map(_axis_angle_rotation, convention, torch.unbind(euler_angles, -1))
|
172 |
+
return functools.reduce(torch.matmul, matrices)
|
173 |
+
|
174 |
+
|
175 |
+
def _angle_from_tan(
|
176 |
+
axis: str, other_axis: str, data, horizontal: bool, tait_bryan: bool
|
177 |
+
):
|
178 |
+
"""
|
179 |
+
Extract the first or third Euler angle from the two members of
|
180 |
+
the matrix which are positive constant times its sine and cosine.
|
181 |
+
|
182 |
+
Args:
|
183 |
+
axis: Axis label "X" or "Y or "Z" for the angle we are finding.
|
184 |
+
other_axis: Axis label "X" or "Y or "Z" for the middle axis in the
|
185 |
+
convention.
|
186 |
+
data: Rotation matrices as tensor of shape (..., 3, 3).
|
187 |
+
horizontal: Whether we are looking for the angle for the third axis,
|
188 |
+
which means the relevant entries are in the same row of the
|
189 |
+
rotation matrix. If not, they are in the same column.
|
190 |
+
tait_bryan: Whether the first and third axes in the convention differ.
|
191 |
+
|
192 |
+
Returns:
|
193 |
+
Euler Angles in radians for each matrix in data as a tensor
|
194 |
+
of shape (...).
|
195 |
+
"""
|
196 |
+
|
197 |
+
i1, i2 = {"X": (2, 1), "Y": (0, 2), "Z": (1, 0)}[axis]
|
198 |
+
if horizontal:
|
199 |
+
i2, i1 = i1, i2
|
200 |
+
even = (axis + other_axis) in ["XY", "YZ", "ZX"]
|
201 |
+
if horizontal == even:
|
202 |
+
return torch.atan2(data[..., i1], data[..., i2])
|
203 |
+
if tait_bryan:
|
204 |
+
return torch.atan2(-data[..., i2], data[..., i1])
|
205 |
+
return torch.atan2(data[..., i2], -data[..., i1])
|
206 |
+
|
207 |
+
|
208 |
+
def _index_from_letter(letter: str):
|
209 |
+
if letter == "X":
|
210 |
+
return 0
|
211 |
+
if letter == "Y":
|
212 |
+
return 1
|
213 |
+
if letter == "Z":
|
214 |
+
return 2
|
215 |
+
|
216 |
+
|
217 |
+
def matrix_to_euler_angles(matrix, convention: str):
|
218 |
+
"""
|
219 |
+
Convert rotations given as rotation matrices to Euler angles in radians.
|
220 |
+
|
221 |
+
Args:
|
222 |
+
matrix: Rotation matrices as tensor of shape (..., 3, 3).
|
223 |
+
convention: Convention string of three uppercase letters.
|
224 |
+
|
225 |
+
Returns:
|
226 |
+
Euler angles in radians as tensor of shape (..., 3).
|
227 |
+
"""
|
228 |
+
if len(convention) != 3:
|
229 |
+
raise ValueError("Convention must have 3 letters.")
|
230 |
+
if convention[1] in (convention[0], convention[2]):
|
231 |
+
raise ValueError(f"Invalid convention {convention}.")
|
232 |
+
for letter in convention:
|
233 |
+
if letter not in ("X", "Y", "Z"):
|
234 |
+
raise ValueError(f"Invalid letter {letter} in convention string.")
|
235 |
+
if matrix.size(-1) != 3 or matrix.size(-2) != 3:
|
236 |
+
raise ValueError(f"Invalid rotation matrix shape f{matrix.shape}.")
|
237 |
+
i0 = _index_from_letter(convention[0])
|
238 |
+
i2 = _index_from_letter(convention[2])
|
239 |
+
tait_bryan = i0 != i2
|
240 |
+
if tait_bryan:
|
241 |
+
central_angle = torch.asin(
|
242 |
+
matrix[..., i0, i2] * (-1.0 if i0 - i2 in [-1, 2] else 1.0)
|
243 |
+
)
|
244 |
+
else:
|
245 |
+
central_angle = torch.acos(matrix[..., i0, i0])
|
246 |
+
|
247 |
+
o = (
|
248 |
+
_angle_from_tan(
|
249 |
+
convention[0], convention[1], matrix[..., i2], False, tait_bryan
|
250 |
+
),
|
251 |
+
central_angle,
|
252 |
+
_angle_from_tan(
|
253 |
+
convention[2], convention[1], matrix[..., i0, :], True, tait_bryan
|
254 |
+
),
|
255 |
+
)
|
256 |
+
return torch.stack(o, -1)
|
257 |
+
|
258 |
+
|
259 |
+
def random_quaternions(
|
260 |
+
n: int, dtype: Optional[torch.dtype] = None, device=None, requires_grad=False
|
261 |
+
):
|
262 |
+
"""
|
263 |
+
Generate random quaternions representing rotations,
|
264 |
+
i.e. versors with nonnegative real part.
|
265 |
+
|
266 |
+
Args:
|
267 |
+
n: Number of quaternions in a batch to return.
|
268 |
+
dtype: Type to return.
|
269 |
+
device: Desired device of returned tensor. Default:
|
270 |
+
uses the current device for the default tensor type.
|
271 |
+
requires_grad: Whether the resulting tensor should have the gradient
|
272 |
+
flag set.
|
273 |
+
|
274 |
+
Returns:
|
275 |
+
Quaternions as tensor of shape (N, 4).
|
276 |
+
"""
|
277 |
+
o = torch.randn((n, 4), dtype=dtype, device=device, requires_grad=requires_grad)
|
278 |
+
s = (o * o).sum(1)
|
279 |
+
o = o / _copysign(torch.sqrt(s), o[:, 0])[:, None]
|
280 |
+
return o
|
281 |
+
|
282 |
+
|
283 |
+
def random_rotations(
|
284 |
+
n: int, dtype: Optional[torch.dtype] = None, device=None, requires_grad=False
|
285 |
+
):
|
286 |
+
"""
|
287 |
+
Generate random rotations as 3x3 rotation matrices.
|
288 |
+
|
289 |
+
Args:
|
290 |
+
n: Number of rotation matrices in a batch to return.
|
291 |
+
dtype: Type to return.
|
292 |
+
device: Device of returned tensor. Default: if None,
|
293 |
+
uses the current device for the default tensor type.
|
294 |
+
requires_grad: Whether the resulting tensor should have the gradient
|
295 |
+
flag set.
|
296 |
+
|
297 |
+
Returns:
|
298 |
+
Rotation matrices as tensor of shape (n, 3, 3).
|
299 |
+
"""
|
300 |
+
quaternions = random_quaternions(
|
301 |
+
n, dtype=dtype, device=device, requires_grad=requires_grad
|
302 |
+
)
|
303 |
+
return quaternion_to_matrix(quaternions)
|
304 |
+
|
305 |
+
|
306 |
+
def random_rotation(
|
307 |
+
dtype: Optional[torch.dtype] = None, device=None, requires_grad=False
|
308 |
+
):
|
309 |
+
"""
|
310 |
+
Generate a single random 3x3 rotation matrix.
|
311 |
+
|
312 |
+
Args:
|
313 |
+
dtype: Type to return
|
314 |
+
device: Device of returned tensor. Default: if None,
|
315 |
+
uses the current device for the default tensor type
|
316 |
+
requires_grad: Whether the resulting tensor should have the gradient
|
317 |
+
flag set
|
318 |
+
|
319 |
+
Returns:
|
320 |
+
Rotation matrix as tensor of shape (3, 3).
|
321 |
+
"""
|
322 |
+
return random_rotations(1, dtype, device, requires_grad)[0]
|
323 |
+
|
324 |
+
|
325 |
+
def standardize_quaternion(quaternions):
|
326 |
+
"""
|
327 |
+
Convert a unit quaternion to a standard form: one in which the real
|
328 |
+
part is non negative.
|
329 |
+
|
330 |
+
Args:
|
331 |
+
quaternions: Quaternions with real part first,
|
332 |
+
as tensor of shape (..., 4).
|
333 |
+
|
334 |
+
Returns:
|
335 |
+
Standardized quaternions as tensor of shape (..., 4).
|
336 |
+
"""
|
337 |
+
return torch.where(quaternions[..., 0:1] < 0, -quaternions, quaternions)
|
338 |
+
|
339 |
+
|
340 |
+
def quaternion_raw_multiply(a, b):
|
341 |
+
"""
|
342 |
+
Multiply two quaternions.
|
343 |
+
Usual torch rules for broadcasting apply.
|
344 |
+
|
345 |
+
Args:
|
346 |
+
a: Quaternions as tensor of shape (..., 4), real part first.
|
347 |
+
b: Quaternions as tensor of shape (..., 4), real part first.
|
348 |
+
|
349 |
+
Returns:
|
350 |
+
The product of a and b, a tensor of quaternions shape (..., 4).
|
351 |
+
"""
|
352 |
+
aw, ax, ay, az = torch.unbind(a, -1)
|
353 |
+
bw, bx, by, bz = torch.unbind(b, -1)
|
354 |
+
ow = aw * bw - ax * bx - ay * by - az * bz
|
355 |
+
ox = aw * bx + ax * bw + ay * bz - az * by
|
356 |
+
oy = aw * by - ax * bz + ay * bw + az * bx
|
357 |
+
oz = aw * bz + ax * by - ay * bx + az * bw
|
358 |
+
return torch.stack((ow, ox, oy, oz), -1)
|
359 |
+
|
360 |
+
|
361 |
+
def quaternion_multiply(a, b):
|
362 |
+
"""
|
363 |
+
Multiply two quaternions representing rotations, returning the quaternion
|
364 |
+
representing their composition, i.e. the versor with nonnegative real part.
|
365 |
+
Usual torch rules for broadcasting apply.
|
366 |
+
|
367 |
+
Args:
|
368 |
+
a: Quaternions as tensor of shape (..., 4), real part first.
|
369 |
+
b: Quaternions as tensor of shape (..., 4), real part first.
|
370 |
+
|
371 |
+
Returns:
|
372 |
+
The product of a and b, a tensor of quaternions of shape (..., 4).
|
373 |
+
"""
|
374 |
+
ab = quaternion_raw_multiply(a, b)
|
375 |
+
return standardize_quaternion(ab)
|
376 |
+
|
377 |
+
|
378 |
+
def quaternion_invert(quaternion):
|
379 |
+
"""
|
380 |
+
Given a quaternion representing rotation, get the quaternion representing
|
381 |
+
its inverse.
|
382 |
+
|
383 |
+
Args:
|
384 |
+
quaternion: Quaternions as tensor of shape (..., 4), with real part
|
385 |
+
first, which must be versors (unit quaternions).
|
386 |
+
|
387 |
+
Returns:
|
388 |
+
The inverse, a tensor of quaternions of shape (..., 4).
|
389 |
+
"""
|
390 |
+
|
391 |
+
return quaternion * quaternion.new_tensor([1, -1, -1, -1])
|
392 |
+
|
393 |
+
|
394 |
+
def quaternion_apply(quaternion, point):
|
395 |
+
"""
|
396 |
+
Apply the rotation given by a quaternion to a 3D point.
|
397 |
+
Usual torch rules for broadcasting apply.
|
398 |
+
|
399 |
+
Args:
|
400 |
+
quaternion: Tensor of quaternions, real part first, of shape (..., 4).
|
401 |
+
point: Tensor of 3D points of shape (..., 3).
|
402 |
+
|
403 |
+
Returns:
|
404 |
+
Tensor of rotated points of shape (..., 3).
|
405 |
+
"""
|
406 |
+
if point.size(-1) != 3:
|
407 |
+
raise ValueError(f"Points are not in 3D, f{point.shape}.")
|
408 |
+
real_parts = point.new_zeros(point.shape[:-1] + (1,))
|
409 |
+
point_as_quaternion = torch.cat((real_parts, point), -1)
|
410 |
+
out = quaternion_raw_multiply(
|
411 |
+
quaternion_raw_multiply(quaternion, point_as_quaternion),
|
412 |
+
quaternion_invert(quaternion),
|
413 |
+
)
|
414 |
+
return out[..., 1:]
|
415 |
+
|
416 |
+
|
417 |
+
def axis_angle_to_matrix(axis_angle):
|
418 |
+
"""
|
419 |
+
Convert rotations given as axis/angle to rotation matrices.
|
420 |
+
|
421 |
+
Args:
|
422 |
+
axis_angle: Rotations given as a vector in axis angle form,
|
423 |
+
as a tensor of shape (..., 3), where the magnitude is
|
424 |
+
the angle turned anticlockwise in radians around the
|
425 |
+
vector's direction.
|
426 |
+
|
427 |
+
Returns:
|
428 |
+
Rotation matrices as tensor of shape (..., 3, 3).
|
429 |
+
"""
|
430 |
+
return quaternion_to_matrix(axis_angle_to_quaternion(axis_angle))
|
431 |
+
|
432 |
+
|
433 |
+
def matrix_to_axis_angle(matrix):
|
434 |
+
"""
|
435 |
+
Convert rotations given as rotation matrices to axis/angle.
|
436 |
+
|
437 |
+
Args:
|
438 |
+
matrix: Rotation matrices as tensor of shape (..., 3, 3).
|
439 |
+
|
440 |
+
Returns:
|
441 |
+
Rotations given as a vector in axis angle form, as a tensor
|
442 |
+
of shape (..., 3), where the magnitude is the angle
|
443 |
+
turned anticlockwise in radians around the vector's
|
444 |
+
direction.
|
445 |
+
"""
|
446 |
+
return quaternion_to_axis_angle(matrix_to_quaternion(matrix))
|
447 |
+
|
448 |
+
|
449 |
+
def axis_angle_to_quaternion(axis_angle):
|
450 |
+
"""
|
451 |
+
Convert rotations given as axis/angle to quaternions.
|
452 |
+
|
453 |
+
Args:
|
454 |
+
axis_angle: Rotations given as a vector in axis angle form,
|
455 |
+
as a tensor of shape (..., 3), where the magnitude is
|
456 |
+
the angle turned anticlockwise in radians around the
|
457 |
+
vector's direction.
|
458 |
+
|
459 |
+
Returns:
|
460 |
+
quaternions with real part first, as tensor of shape (..., 4).
|
461 |
+
"""
|
462 |
+
angles = torch.norm(axis_angle, p=2, dim=-1, keepdim=True)
|
463 |
+
half_angles = 0.5 * angles
|
464 |
+
eps = 1e-6
|
465 |
+
small_angles = angles.abs() < eps
|
466 |
+
sin_half_angles_over_angles = torch.empty_like(angles)
|
467 |
+
sin_half_angles_over_angles[~small_angles] = (
|
468 |
+
torch.sin(half_angles[~small_angles]) / angles[~small_angles]
|
469 |
+
)
|
470 |
+
# for x small, sin(x/2) is about x/2 - (x/2)^3/6
|
471 |
+
# so sin(x/2)/x is about 1/2 - (x*x)/48
|
472 |
+
sin_half_angles_over_angles[small_angles] = (
|
473 |
+
0.5 - (angles[small_angles] * angles[small_angles]) / 48
|
474 |
+
)
|
475 |
+
quaternions = torch.cat(
|
476 |
+
[torch.cos(half_angles), axis_angle * sin_half_angles_over_angles], dim=-1
|
477 |
+
)
|
478 |
+
return quaternions
|
479 |
+
|
480 |
+
|
481 |
+
def quaternion_to_axis_angle(quaternions):
|
482 |
+
"""
|
483 |
+
Convert rotations given as quaternions to axis/angle.
|
484 |
+
|
485 |
+
Args:
|
486 |
+
quaternions: quaternions with real part first,
|
487 |
+
as tensor of shape (..., 4).
|
488 |
+
|
489 |
+
Returns:
|
490 |
+
Rotations given as a vector in axis angle form, as a tensor
|
491 |
+
of shape (..., 3), where the magnitude is the angle
|
492 |
+
turned anticlockwise in radians around the vector's
|
493 |
+
direction.
|
494 |
+
"""
|
495 |
+
norms = torch.norm(quaternions[..., 1:], p=2, dim=-1, keepdim=True)
|
496 |
+
half_angles = torch.atan2(norms, quaternions[..., :1])
|
497 |
+
angles = 2 * half_angles
|
498 |
+
eps = 1e-6
|
499 |
+
small_angles = angles.abs() < eps
|
500 |
+
sin_half_angles_over_angles = torch.empty_like(angles)
|
501 |
+
sin_half_angles_over_angles[~small_angles] = (
|
502 |
+
torch.sin(half_angles[~small_angles]) / angles[~small_angles]
|
503 |
+
)
|
504 |
+
# for x small, sin(x/2) is about x/2 - (x/2)^3/6
|
505 |
+
# so sin(x/2)/x is about 1/2 - (x*x)/48
|
506 |
+
sin_half_angles_over_angles[small_angles] = (
|
507 |
+
0.5 - (angles[small_angles] * angles[small_angles]) / 48
|
508 |
+
)
|
509 |
+
return quaternions[..., 1:] / sin_half_angles_over_angles
|
510 |
+
|
511 |
+
|
512 |
+
def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
|
513 |
+
"""
|
514 |
+
Converts 6D rotation representation by Zhou et al. [1] to rotation matrix
|
515 |
+
using Gram--Schmidt orthogonalisation per Section B of [1].
|
516 |
+
Args:
|
517 |
+
d6: 6D rotation representation, of size (*, 6)
|
518 |
+
|
519 |
+
Returns:
|
520 |
+
batch of rotation matrices of size (*, 3, 3)
|
521 |
+
|
522 |
+
[1] Zhou, Y., Barnes, C., Lu, J., Yang, J., & Li, H.
|
523 |
+
On the Continuity of Rotation Representations in Neural Networks.
|
524 |
+
IEEE Conference on Computer Vision and Pattern Recognition, 2019.
|
525 |
+
Retrieved from http://arxiv.org/abs/1812.07035
|
526 |
+
"""
|
527 |
+
|
528 |
+
a1, a2 = d6[..., :3], d6[..., 3:]
|
529 |
+
b1 = F.normalize(a1, dim=-1)
|
530 |
+
b2 = a2 - (b1 * a2).sum(-1, keepdim=True) * b1
|
531 |
+
b2 = F.normalize(b2, dim=-1)
|
532 |
+
b3 = torch.cross(b1, b2, dim=-1)
|
533 |
+
return torch.stack((b1, b2, b3), dim=-2)
|
534 |
+
|
535 |
+
|
536 |
+
def matrix_to_rotation_6d(matrix: torch.Tensor) -> torch.Tensor:
|
537 |
+
"""
|
538 |
+
Converts rotation matrices to 6D rotation representation by Zhou et al. [1]
|
539 |
+
by dropping the last row. Note that 6D representation is not unique.
|
540 |
+
Args:
|
541 |
+
matrix: batch of rotation matrices of size (*, 3, 3)
|
542 |
+
|
543 |
+
Returns:
|
544 |
+
6D rotation representation, of size (*, 6)
|
545 |
+
|
546 |
+
[1] Zhou, Y., Barnes, C., Lu, J., Yang, J., & Li, H.
|
547 |
+
On the Continuity of Rotation Representations in Neural Networks.
|
548 |
+
IEEE Conference on Computer Vision and Pattern Recognition, 2019.
|
549 |
+
Retrieved from http://arxiv.org/abs/1812.07035
|
550 |
+
"""
|
551 |
+
return matrix[..., :2, :].clone().reshape(*matrix.size()[:-2], 6)
|
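# --- Hedged usage sketch (not part of this commit) ---
# The pose conversion used by _load_them_all() above, written out with the helpers
# from rotation_conversion.py: a (T, 55, 3) axis-angle SMPL-X pose is mapped to the
# continuous 6D representation and back. The random tensor is purely illustrative;
# the recovered axis-angle encodes the same rotations (up to the usual axis-angle
# ambiguity).
import torch
from data_utils.rotation_conversion import (
    axis_angle_to_matrix, matrix_to_rotation_6d,
    rotation_6d_to_matrix, matrix_to_axis_angle,
)

poses_aa = torch.randn(8, 55, 3)                                    # 8 frames, 55 joints
poses_6d = matrix_to_rotation_6d(axis_angle_to_matrix(poses_aa))    # (8, 55, 6)
poses_6d_flat = poses_6d.reshape(-1, 330)                           # layout stored by the loader
recovered = matrix_to_axis_angle(rotation_6d_to_matrix(poses_6d))   # (8, 55, 3)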
data_utils/split_more_than_2s.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2df6e745cdf7473f13ce3ae2ed759c3cceb60c9197e7f3fd65110e7bc20b6f2d
+size 2398875
data_utils/split_train_val_test.py
ADDED
@@ -0,0 +1,27 @@
+import os
+import json
+import shutil
+
+if __name__ == '__main__':
+    id_list = "chemistry conan oliver seth"
+    id_list = id_list.split(' ')
+
+    old_root = '/home/usename/talkshow_data/ExpressiveWholeBodyDatasetReleaseV1.0'
+    new_root = '/home/usename/talkshow_data/ExpressiveWholeBodyDatasetReleaseV1.0/talkshow_data_splited'
+
+    with open('train_val_test.json') as f:
+        split_info = json.load(f)
+    phase_list = ['train', 'val', 'test']
+    for phase in phase_list:
+        phase_path_list = split_info[phase]
+        for p in phase_path_list:
+            old_path = os.path.join(old_root, p)
+            if not os.path.exists(old_path):
+                print(f'{old_path} not found, continue')
+                continue
+            new_path = os.path.join(new_root, phase, p)
+            dir_name = os.path.dirname(new_path)
+            if not os.path.isdir(dir_name):
+                os.makedirs(dir_name, exist_ok=True)
+            shutil.move(old_path, new_path)
+
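# --- Hedged usage note (not part of this commit) ---
# The script above is a one-off reorganisation step: after pointing old_root /
# new_root at the downloaded ExpressiveWholeBodyDatasetReleaseV1.0 folder, it is
# run from the data_utils directory so that train_val_test.json is found on the
# relative path, e.g.:
#
#     cd data_utils && python split_train_val_test.py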
data_utils/train_val_test.json
ADDED
The diff for this file is too large to render.
See raw diff
data_utils/utils.py
ADDED
@@ -0,0 +1,318 @@
1 |
+
import numpy as np
|
2 |
+
# import librosa #has to do this cause librosa is not supported on my server
|
3 |
+
import python_speech_features
|
4 |
+
from scipy.io import wavfile
|
5 |
+
from scipy import signal
|
6 |
+
import librosa
|
7 |
+
import torch
|
8 |
+
import torchaudio as ta
|
9 |
+
import torchaudio.functional as ta_F
|
10 |
+
import torchaudio.transforms as ta_T
|
11 |
+
# import pyloudnorm as pyln
|
12 |
+
|
13 |
+
|
14 |
+
def load_wav_old(audio_fn, sr = 16000):
|
15 |
+
sample_rate, sig = wavfile.read(audio_fn)
|
16 |
+
if sample_rate != sr:
|
17 |
+
result = int((sig.shape[0]) / sample_rate * sr)
|
18 |
+
x_resampled = signal.resample(sig, result)
|
19 |
+
x_resampled = x_resampled.astype(np.float64)
|
20 |
+
return x_resampled, sr
|
21 |
+
|
22 |
+
sig = sig / (2**15)
|
23 |
+
return sig, sample_rate
|
24 |
+
|
25 |
+
|
26 |
+
def get_mfcc(audio_fn, eps=1e-6, fps=25, smlpx=False, sr=16000, n_mfcc=64, win_size=None):
|
27 |
+
|
28 |
+
y, sr = librosa.load(audio_fn, sr=sr, mono=True)
|
29 |
+
|
30 |
+
if win_size is None:
|
31 |
+
hop_len=int(sr / fps)
|
32 |
+
else:
|
33 |
+
hop_len=int(sr / win_size)
|
34 |
+
|
35 |
+
n_fft=2048
|
36 |
+
|
37 |
+
C = librosa.feature.mfcc(
|
38 |
+
y = y,
|
39 |
+
sr = sr,
|
40 |
+
n_mfcc = n_mfcc,
|
41 |
+
hop_length = hop_len,
|
42 |
+
n_fft = n_fft
|
43 |
+
)
|
44 |
+
|
45 |
+
if C.shape[0] == n_mfcc:
|
46 |
+
C = C.transpose(1, 0)
|
47 |
+
|
48 |
+
return C
|
49 |
+
|
50 |
+
|
51 |
+
def get_melspec(audio_fn, eps=1e-6, fps = 25, sr=16000, n_mels=64):
|
52 |
+
raise NotImplementedError
|
53 |
+
'''
|
54 |
+
# y, sr = load_wav(audio_fn=audio_fn, sr=sr)
|
55 |
+
|
56 |
+
# hop_len = int(sr / fps)
|
57 |
+
# n_fft = 2048
|
58 |
+
|
59 |
+
# C = librosa.feature.melspectrogram(
|
60 |
+
# y = y,
|
61 |
+
# sr = sr,
|
62 |
+
# n_fft=n_fft,
|
63 |
+
# hop_length=hop_len,
|
64 |
+
# n_mels = n_mels,
|
65 |
+
# fmin=0,
|
66 |
+
# fmax=8000)
|
67 |
+
|
68 |
+
|
69 |
+
# mask = (C == 0).astype(np.float)
|
70 |
+
# C = mask * eps + (1-mask) * C
|
71 |
+
|
72 |
+
# C = np.log(C)
|
73 |
+
# #wierd error may occur here
|
74 |
+
# assert not (np.isnan(C).any()), audio_fn
|
75 |
+
# if C.shape[0] == n_mels:
|
76 |
+
# C = C.transpose(1, 0)
|
77 |
+
|
78 |
+
# return C
|
79 |
+
'''
|
80 |
+
|
81 |
+
def extract_mfcc(audio,sample_rate=16000):
|
82 |
+
mfcc = zip(*python_speech_features.mfcc(audio,sample_rate, numcep=64, nfilt=64, nfft=2048, winstep=0.04))
|
83 |
+
mfcc = np.stack([np.array(i) for i in mfcc])
|
84 |
+
return mfcc
|
85 |
+
|
86 |
+
def get_mfcc_psf(audio_fn, eps=1e-6, fps=25, smlpx=False, sr=16000, n_mfcc=64, win_size=None):
|
87 |
+
y, sr = load_wav_old(audio_fn, sr=sr)
|
88 |
+
|
89 |
+
if y.shape.__len__() > 1:
|
90 |
+
y = (y[:,0]+y[:,1])/2
|
91 |
+
|
92 |
+
if win_size is None:
|
93 |
+
hop_len=int(sr / fps)
|
94 |
+
else:
|
95 |
+
hop_len=int(sr/ win_size)
|
96 |
+
|
97 |
+
n_fft=2048
|
98 |
+
|
99 |
+
#hard coded for 25 fps
|
100 |
+
if not smlpx:
|
101 |
+
C = python_speech_features.mfcc(y, sr, numcep=n_mfcc, nfilt=n_mfcc, nfft=n_fft, winstep=0.04)
|
102 |
+
else:
|
103 |
+
C = python_speech_features.mfcc(y, sr, numcep=n_mfcc, nfilt=n_mfcc, nfft=n_fft, winstep=1.01/15)
|
104 |
+
# if C.shape[0] == n_mfcc:
|
105 |
+
# C = C.transpose(1, 0)
|
106 |
+
|
107 |
+
return C
|
108 |
+
|
109 |
+
|
110 |
+
def get_mfcc_psf_min(audio_fn, eps=1e-6, fps=25, smlpx=False, sr=16000, n_mfcc=64, win_size=None):
|
111 |
+
y, sr = load_wav_old(audio_fn, sr=sr)
|
112 |
+
|
113 |
+
if y.shape.__len__() > 1:
|
114 |
+
y = (y[:, 0] + y[:, 1]) / 2
|
115 |
+
n_fft = 2048
|
116 |
+
|
117 |
+
slice_len = 22000 * 5
|
118 |
+
slice = y.size // slice_len
|
119 |
+
|
120 |
+
C = []
|
121 |
+
|
122 |
+
for i in range(slice):
|
123 |
+
if i != (slice - 1):
|
124 |
+
feat = python_speech_features.mfcc(y[i*slice_len:(i+1)*slice_len], sr, numcep=n_mfcc, nfilt=n_mfcc, nfft=n_fft, winstep=1.01 / 15)
|
125 |
+
else:
|
126 |
+
feat = python_speech_features.mfcc(y[i * slice_len:], sr, numcep=n_mfcc, nfilt=n_mfcc, nfft=n_fft, winstep=1.01 / 15)
|
127 |
+
|
128 |
+
C.append(feat)
|
129 |
+
|
130 |
+
return C
|
131 |
+
|
132 |
+
|
133 |
+
def audio_chunking(audio: torch.Tensor, frame_rate: int = 30, chunk_size: int = 16000):
|
134 |
+
"""
|
135 |
+
:param audio: 1 x T tensor containing a 16kHz audio signal
|
136 |
+
:param frame_rate: frame rate for video (we need one audio chunk per video frame)
|
137 |
+
:param chunk_size: number of audio samples per chunk
|
138 |
+
:return: num_chunks x chunk_size tensor containing sliced audio
|
139 |
+
"""
|
140 |
+
samples_per_frame = chunk_size // frame_rate
|
141 |
+
padding = (chunk_size - samples_per_frame) // 2
|
142 |
+
audio = torch.nn.functional.pad(audio.unsqueeze(0), pad=[padding, padding]).squeeze(0)
|
143 |
+
anchor_points = list(range(chunk_size//2, audio.shape[-1]-chunk_size//2, samples_per_frame))
|
144 |
+
audio = torch.cat([audio[:, i-chunk_size//2:i+chunk_size//2] for i in anchor_points], dim=0)
|
145 |
+
return audio
|
146 |
+
|
147 |
+
|
148 |
+
def get_mfcc_ta(audio_fn, eps=1e-6, fps=15, smlpx=False, sr=16000, n_mfcc=64, win_size=None, type='mfcc', am=None, am_sr=None, encoder_choice='mfcc'):
|
149 |
+
if am is None:
|
150 |
+
audio, sr_0 = ta.load(audio_fn)
|
151 |
+
if sr != sr_0:
|
152 |
+
audio = ta.transforms.Resample(sr_0, sr)(audio)
|
153 |
+
if audio.shape[0] > 1:
|
154 |
+
audio = torch.mean(audio, dim=0, keepdim=True)
|
155 |
+
|
156 |
+
n_fft = 2048
|
157 |
+
if fps == 15:
|
158 |
+
hop_length = 1467
|
159 |
+
elif fps == 30:
|
160 |
+
hop_length = 734
|
161 |
+
win_length = hop_length * 2
|
162 |
+
n_mels = 256
|
163 |
+
n_mfcc = 64
|
164 |
+
|
165 |
+
if type == 'mfcc':
|
166 |
+
mfcc_transform = ta_T.MFCC(
|
167 |
+
sample_rate=sr,
|
168 |
+
n_mfcc=n_mfcc,
|
169 |
+
melkwargs={
|
170 |
+
"n_fft": n_fft,
|
171 |
+
"n_mels": n_mels,
|
172 |
+
# "win_length": win_length,
|
173 |
+
"hop_length": hop_length,
|
174 |
+
"mel_scale": "htk",
|
175 |
+
},
|
176 |
+
)
|
177 |
+
audio_ft = mfcc_transform(audio).squeeze(dim=0).transpose(0,1).numpy()
|
178 |
+
elif type == 'mel':
|
179 |
+
# audio = 0.01 * audio / torch.mean(torch.abs(audio))
|
180 |
+
mel_transform = ta_T.MelSpectrogram(
|
181 |
+
sample_rate=sr, n_fft=n_fft, win_length=None, hop_length=hop_length, n_mels=n_mels
|
182 |
+
)
|
183 |
+
audio_ft = mel_transform(audio).squeeze(0).transpose(0,1).numpy()
|
184 |
+
# audio_ft = torch.log(audio_ft.clamp(min=1e-10, max=None)).transpose(0,1).numpy()
|
185 |
+
elif type == 'mel_mul':
|
186 |
+
audio = 0.01 * audio / torch.mean(torch.abs(audio))
|
187 |
+
audio = audio_chunking(audio, frame_rate=fps, chunk_size=sr)
|
188 |
+
mel_transform = ta_T.MelSpectrogram(
|
189 |
+
sample_rate=sr, n_fft=n_fft, win_length=int(sr/20), hop_length=int(sr/100), n_mels=n_mels
|
190 |
+
)
|
191 |
+
audio_ft = mel_transform(audio).squeeze(1)
|
192 |
+
audio_ft = torch.log(audio_ft.clamp(min=1e-10, max=None)).numpy()
|
193 |
+
else:
|
194 |
+
speech_array, sampling_rate = librosa.load(audio_fn, sr=16000)
|
195 |
+
|
196 |
+
if encoder_choice == 'faceformer':
|
197 |
+
# audio_ft = np.squeeze(am(speech_array, sampling_rate=16000).input_values).reshape(-1, 1)
|
198 |
+
audio_ft = speech_array.reshape(-1, 1)
|
199 |
+
elif encoder_choice == 'meshtalk':
|
200 |
+
audio_ft = 0.01 * speech_array / np.mean(np.abs(speech_array))
|
201 |
+
elif encoder_choice == 'onset':
|
202 |
+
audio_ft = librosa.onset.onset_detect(y=speech_array, sr=16000, units='time').reshape(-1, 1)
|
203 |
+
else:
|
204 |
+
audio, sr_0 = ta.load(audio_fn)
|
205 |
+
if sr != sr_0:
|
206 |
+
audio = ta.transforms.Resample(sr_0, sr)(audio)
|
207 |
+
if audio.shape[0] > 1:
|
208 |
+
audio = torch.mean(audio, dim=0, keepdim=True)
|
209 |
+
|
210 |
+
n_fft = 2048
|
211 |
+
if fps == 15:
|
212 |
+
hop_length = 1467
|
213 |
+
elif fps == 30:
|
214 |
+
hop_length = 734
|
215 |
+
win_length = hop_length * 2
|
216 |
+
n_mels = 256
|
217 |
+
n_mfcc = 64
|
218 |
+
|
219 |
+
mfcc_transform = ta_T.MFCC(
|
220 |
+
sample_rate=sr,
|
221 |
+
n_mfcc=n_mfcc,
|
222 |
+
melkwargs={
|
223 |
+
"n_fft": n_fft,
|
224 |
+
"n_mels": n_mels,
|
225 |
+
# "win_length": win_length,
|
226 |
+
"hop_length": hop_length,
|
227 |
+
"mel_scale": "htk",
|
228 |
+
},
|
229 |
+
)
|
230 |
+
audio_ft = mfcc_transform(audio).squeeze(dim=0).transpose(0, 1).numpy()
|
231 |
+
return audio_ft
|
232 |
+
|
233 |
+
|
234 |
+
def get_mfcc_sepa(audio_fn, fps=15, sr=16000):
|
235 |
+
audio, sr_0 = ta.load(audio_fn)
|
236 |
+
if sr != sr_0:
|
237 |
+
audio = ta.transforms.Resample(sr_0, sr)(audio)
|
238 |
+
if audio.shape[0] > 1:
|
239 |
+
audio = torch.mean(audio, dim=0, keepdim=True)
|
240 |
+
|
241 |
+
n_fft = 2048
|
242 |
+
if fps == 15:
|
243 |
+
hop_length = 1467
|
244 |
+
elif fps == 30:
|
245 |
+
hop_length = 734
|
246 |
+
n_mels = 256
|
247 |
+
n_mfcc = 64
|
248 |
+
|
249 |
+
mfcc_transform = ta_T.MFCC(
|
250 |
+
sample_rate=sr,
|
251 |
+
n_mfcc=n_mfcc,
|
252 |
+
melkwargs={
|
253 |
+
"n_fft": n_fft,
|
254 |
+
"n_mels": n_mels,
|
255 |
+
# "win_length": win_length,
|
256 |
+
"hop_length": hop_length,
|
257 |
+
"mel_scale": "htk",
|
258 |
+
},
|
259 |
+
)
|
260 |
+
audio_ft_0 = mfcc_transform(audio[0, :sr*2]).squeeze(dim=0).transpose(0,1).numpy()
|
261 |
+
audio_ft_1 = mfcc_transform(audio[0, sr*2:]).squeeze(dim=0).transpose(0,1).numpy()
|
262 |
+
audio_ft = np.concatenate((audio_ft_0, audio_ft_1), axis=0)
|
263 |
+
return audio_ft, audio_ft_0.shape[0]
|
264 |
+
|
265 |
+
|
266 |
+
def get_mfcc_old(wav_file):
|
267 |
+
sig, sample_rate = load_wav_old(wav_file)
|
268 |
+
mfcc = extract_mfcc(sig)
|
269 |
+
return mfcc
|
270 |
+
|
271 |
+
|
272 |
+
def smooth_geom(geom, mask: torch.Tensor = None, filter_size: int = 9, sigma: float = 2.0):
|
273 |
+
"""
|
274 |
+
:param geom: T x V x 3 tensor containing a temporal sequence of length T with V vertices in each frame
|
275 |
+
:param mask: V-dimensional Tensor containing a mask with vertices to be smoothed
|
276 |
+
:param filter_size: size of the Gaussian filter
|
277 |
+
:param sigma: standard deviation of the Gaussian filter
|
278 |
+
:return: T x V x 3 tensor containing smoothed geometry (i.e., smoothed in the area indicated by the mask)
|
279 |
+
"""
|
280 |
+
assert filter_size % 2 == 1, f"filter size must be odd but is {filter_size}"
|
281 |
+
# Gaussian smoothing (low-pass filtering)
|
282 |
+
fltr = np.arange(-(filter_size // 2), filter_size // 2 + 1)
|
283 |
+
fltr = np.exp(-0.5 * fltr ** 2 / sigma ** 2)
|
284 |
+
fltr = torch.Tensor(fltr) / np.sum(fltr)
|
285 |
+
# apply fltr
|
286 |
+
fltr = fltr.view(1, 1, -1).to(device=geom.device)
|
287 |
+
T, V = geom.shape[1], geom.shape[2]
|
288 |
+
g = torch.nn.functional.pad(
|
289 |
+
geom.permute(2, 0, 1).view(V, 1, T),
|
290 |
+
pad=[filter_size // 2, filter_size // 2], mode='replicate'
|
291 |
+
)
|
292 |
+
g = torch.nn.functional.conv1d(g, fltr).view(V, 1, T)
|
293 |
+
smoothed = g.permute(1, 2, 0).contiguous()
|
294 |
+
# blend smoothed signal with original signal
|
295 |
+
if mask is None:
|
296 |
+
return smoothed
|
297 |
+
else:
|
298 |
+
return smoothed * mask[None, :, None] + geom * (-mask[None, :, None] + 1)
|
299 |
+
|
300 |
+
if __name__ == '__main__':
|
301 |
+
audio_fn = '../sample_audio/clip000028_tCAkv4ggPgI.wav'
|
302 |
+
|
303 |
+
C = get_mfcc_psf(audio_fn)
|
304 |
+
print(C.shape)
|
305 |
+
|
306 |
+
C_2 = get_mfcc(audio_fn)  # get_mfcc_librosa is not defined in this file; reuse get_mfcc for the sanity check
|
307 |
+
print(C.shape)
|
308 |
+
|
309 |
+
print(C)
|
310 |
+
print(C_2)
|
311 |
+
print((C == C_2).all())
|
312 |
+
# print(y.shape, sr)
|
313 |
+
# mel_spec = get_melspec(audio_fn)
|
314 |
+
# print(mel_spec.shape)
|
315 |
+
# mfcc = get_mfcc(audio_fn, sr = 16000)
|
316 |
+
# print(mfcc.shape)
|
317 |
+
# print(mel_spec.max(), mel_spec.min())
|
318 |
+
# print(mfcc.max(), mfcc.min())
|
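# --- Hedged usage sketch (not part of this commit) ---
# Extracting the default audio features with get_mfcc_ta() defined above. The wav
# path is a placeholder; with am=None, type='mfcc' and fps=30 the function returns
# a (num_frames, 64) numpy array, one MFCC row per 30-fps video frame.
from data_utils.utils import get_mfcc_ta

audio_ft = get_mfcc_ta('path/to/audio.wav',  # placeholder path
                       fps=30, sr=16000, n_mfcc=64,
                       type='mfcc', encoder_choice='mfcc')
print(audio_ft.shape)  # (T, 64)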
evaluation/FGD.py
ADDED
@@ -0,0 +1,199 @@
1 |
+
import time
|
2 |
+
|
3 |
+
import numpy as np
|
4 |
+
import torch
|
5 |
+
import torch.nn.functional as F
|
6 |
+
from scipy import linalg
|
7 |
+
import math
|
8 |
+
from data_utils.rotation_conversion import axis_angle_to_matrix, matrix_to_rotation_6d
|
9 |
+
|
10 |
+
import warnings
|
11 |
+
warnings.filterwarnings("ignore", category=RuntimeWarning) # ignore warnings
|
12 |
+
|
13 |
+
|
14 |
+
change_angle = torch.tensor([6.0181e-05, 5.1597e-05, 2.1344e-04, 2.1899e-04])
|
15 |
+
class EmbeddingSpaceEvaluator:
|
16 |
+
def __init__(self, ae, vae, device):
|
17 |
+
|
18 |
+
# init embed net
|
19 |
+
self.ae = ae
|
20 |
+
# self.vae = vae
|
21 |
+
|
22 |
+
# storage
|
23 |
+
self.real_feat_list = []
|
24 |
+
self.generated_feat_list = []
|
25 |
+
self.real_joints_list = []
|
26 |
+
self.generated_joints_list = []
|
27 |
+
self.real_6d_list = []
|
28 |
+
self.generated_6d_list = []
|
29 |
+
self.audio_beat_list = []
|
30 |
+
|
31 |
+
def reset(self):
|
32 |
+
self.real_feat_list = []
|
33 |
+
self.generated_feat_list = []
|
34 |
+
|
35 |
+
def get_no_of_samples(self):
|
36 |
+
return len(self.real_feat_list)
|
37 |
+
|
38 |
+
def push_samples(self, generated_poses, real_poses):
|
39 |
+
# self.net.eval()
|
40 |
+
# convert poses to latent features
|
41 |
+
real_feat, real_poses = self.ae.extract(real_poses)
|
42 |
+
generated_feat, generated_poses = self.ae.extract(generated_poses)
|
43 |
+
|
44 |
+
num_joints = real_poses.shape[2] // 3
|
45 |
+
|
46 |
+
real_feat = real_feat.squeeze()
|
47 |
+
generated_feat = generated_feat.reshape(generated_feat.shape[0]*generated_feat.shape[1], -1)
|
48 |
+
|
49 |
+
self.real_feat_list.append(real_feat.data.cpu().numpy())
|
50 |
+
self.generated_feat_list.append(generated_feat.data.cpu().numpy())
|
51 |
+
|
52 |
+
# real_poses = matrix_to_rotation_6d(axis_angle_to_matrix(real_poses.reshape(-1, 3))).reshape(-1, num_joints, 6)
|
53 |
+
# generated_poses = matrix_to_rotation_6d(axis_angle_to_matrix(generated_poses.reshape(-1, 3))).reshape(-1, num_joints, 6)
|
54 |
+
#
|
55 |
+
# self.real_feat_list.append(real_poses.data.cpu().numpy())
|
56 |
+
# self.generated_feat_list.append(generated_poses.data.cpu().numpy())
|
57 |
+
|
58 |
+
def push_joints(self, generated_poses, real_poses):
|
59 |
+
self.real_joints_list.append(real_poses.data.cpu())
|
60 |
+
self.generated_joints_list.append(generated_poses.squeeze().data.cpu())
|
61 |
+
|
62 |
+
def push_aud(self, aud):
|
63 |
+
self.audio_beat_list.append(aud.squeeze().data.cpu())
|
64 |
+
|
65 |
+
def get_MAAC(self):
|
66 |
+
ang_vel_list = []
|
67 |
+
for real_joints in self.real_joints_list:
|
68 |
+
real_joints[:, 15:21] = real_joints[:, 16:22]
|
69 |
+
vec = real_joints[:, 15:21] - real_joints[:, 13:19]
|
70 |
+
inner_product = torch.einsum('kij,kij->ki', [vec[:, 2:], vec[:, :-2]])
|
71 |
+
inner_product = torch.clamp(inner_product, -1, 1, out=None)
|
72 |
+
angle = torch.acos(inner_product) / math.pi
|
73 |
+
ang_vel = (angle[1:] - angle[:-1]).abs().mean(dim=0)
|
74 |
+
ang_vel_list.append(ang_vel.unsqueeze(dim=0))
|
75 |
+
all_vel = torch.cat(ang_vel_list, dim=0)
|
76 |
+
MAAC = all_vel.mean(dim=0)
|
77 |
+
return MAAC
|
78 |
+
|
79 |
+
def get_BCscore(self):
|
80 |
+
thres = 0.01
|
81 |
+
sigma = 0.1
|
82 |
+
sum_1 = 0
|
83 |
+
total_beat = 0
|
84 |
+
for joints, audio_beat_time in zip(self.generated_joints_list, self.audio_beat_list):
|
85 |
+
motion_beat_time = []
|
86 |
+
if joints.dim() == 4:
|
87 |
+
joints = joints[0]
|
88 |
+
joints[:, 15:21] = joints[:, 16:22]
|
89 |
+
vec = joints[:, 15:21] - joints[:, 13:19]
|
90 |
+
inner_product = torch.einsum('kij,kij->ki', [vec[:, 2:], vec[:, :-2]])
|
91 |
+
inner_product = torch.clamp(inner_product, -1, 1, out=None)
|
92 |
+
angle = torch.acos(inner_product) / math.pi
|
93 |
+
ang_vel = (angle[1:] - angle[:-1]).abs() / change_angle / len(change_angle)
|
94 |
+
|
95 |
+
angle_diff = torch.cat((torch.zeros(1, 4), ang_vel), dim=0)
|
96 |
+
|
97 |
+
sum_2 = 0
|
98 |
+
for i in range(angle_diff.shape[1]):
|
99 |
+
motion_beat_time = []
|
100 |
+
for t in range(1, joints.shape[0]-1):
|
101 |
+
if (angle_diff[t][i] < angle_diff[t - 1][i] and angle_diff[t][i] < angle_diff[t + 1][i]):
|
102 |
+
if (angle_diff[t - 1][i] - angle_diff[t][i] >= thres or angle_diff[t + 1][i] - angle_diff[
|
103 |
+
t][i] >= thres):
|
104 |
+
motion_beat_time.append(float(t) / 30.0)
|
105 |
+
if (len(motion_beat_time) == 0):
|
106 |
+
continue
|
107 |
+
motion_beat_time = torch.tensor(motion_beat_time)
|
108 |
+
sum = 0
|
109 |
+
for audio in audio_beat_time:
|
110 |
+
sum += np.power(math.e, -(np.power((audio.item() - motion_beat_time), 2)).min() / (2 * sigma * sigma))
|
111 |
+
sum_2 = sum_2 + sum
|
112 |
+
total_beat = total_beat + len(audio_beat_time)
|
113 |
+
sum_1 = sum_1 + sum_2
|
114 |
+
return sum_1/total_beat
|
115 |
+
|
116 |
+
|
117 |
+
def get_scores(self):
|
118 |
+
generated_feats = np.vstack(self.generated_feat_list)
|
119 |
+
real_feats = np.vstack(self.real_feat_list)
|
120 |
+
|
121 |
+
def frechet_distance(samples_A, samples_B):
|
122 |
+
A_mu = np.mean(samples_A, axis=0)
|
123 |
+
A_sigma = np.cov(samples_A, rowvar=False)
|
124 |
+
B_mu = np.mean(samples_B, axis=0)
|
125 |
+
B_sigma = np.cov(samples_B, rowvar=False)
|
126 |
+
try:
|
127 |
+
frechet_dist = self.calculate_frechet_distance(A_mu, A_sigma, B_mu, B_sigma)
|
128 |
+
except ValueError:
|
129 |
+
frechet_dist = 1e+10
|
130 |
+
return frechet_dist
|
131 |
+
|
132 |
+
####################################################################
|
133 |
+
# frechet distance
|
134 |
+
frechet_dist = frechet_distance(generated_feats, real_feats)
|
135 |
+
|
136 |
+
####################################################################
|
137 |
+
# distance between real and generated samples on the latent feature space
|
138 |
+
dists = []
|
139 |
+
for i in range(real_feats.shape[0]):
|
140 |
+
d = np.sum(np.absolute(real_feats[i] - generated_feats[i])) # MAE
|
141 |
+
dists.append(d)
|
142 |
+
feat_dist = np.mean(dists)
|
143 |
+
|
144 |
+
return frechet_dist, feat_dist
|
145 |
+
|
146 |
+
@staticmethod
|
147 |
+
def calculate_frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
|
148 |
+
""" from https://github.com/mseitzer/pytorch-fid/blob/master/fid_score.py """
|
149 |
+
"""Numpy implementation of the Frechet Distance.
|
150 |
+
The Frechet distance between two multivariate Gaussians X_1 ~ N(mu_1, C_1)
|
151 |
+
and X_2 ~ N(mu_2, C_2) is
|
152 |
+
d^2 = ||mu_1 - mu_2||^2 + Tr(C_1 + C_2 - 2*sqrt(C_1*C_2)).
|
153 |
+
Stable version by Dougal J. Sutherland.
|
154 |
+
Params:
|
155 |
+
-- mu1 : Numpy array containing the activations of a layer of the
|
156 |
+
inception net (like returned by the function 'get_predictions')
|
157 |
+
for generated samples.
|
158 |
+
-- mu2 : The sample mean over activations, precalculated on an
|
159 |
+
representative data set.
|
160 |
+
-- sigma1: The covariance matrix over activations for generated samples.
|
161 |
+
-- sigma2: The covariance matrix over activations, precalculated on an
|
162 |
+
representative data set.
|
163 |
+
Returns:
|
164 |
+
-- : The Frechet Distance.
|
165 |
+
"""
|
166 |
+
|
167 |
+
mu1 = np.atleast_1d(mu1)
|
168 |
+
mu2 = np.atleast_1d(mu2)
|
169 |
+
|
170 |
+
sigma1 = np.atleast_2d(sigma1)
|
171 |
+
sigma2 = np.atleast_2d(sigma2)
|
172 |
+
|
173 |
+
assert mu1.shape == mu2.shape, \
|
174 |
+
'Training and test mean vectors have different lengths'
|
175 |
+
assert sigma1.shape == sigma2.shape, \
|
176 |
+
'Training and test covariances have different dimensions'
|
177 |
+
|
178 |
+
diff = mu1 - mu2
|
179 |
+
|
180 |
+
# Product might be almost singular
|
181 |
+
covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
|
182 |
+
if not np.isfinite(covmean).all():
|
183 |
+
msg = ('fid calculation produces singular product; '
|
184 |
+
'adding %s to diagonal of cov estimates') % eps
|
185 |
+
print(msg)
|
186 |
+
offset = np.eye(sigma1.shape[0]) * eps
|
187 |
+
covmean = linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))
|
188 |
+
|
189 |
+
# Numerical error might give slight imaginary component
|
190 |
+
if np.iscomplexobj(covmean):
|
191 |
+
if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
|
192 |
+
m = np.max(np.abs(covmean.imag))
|
193 |
+
raise ValueError('Imaginary component {}'.format(m))
|
194 |
+
covmean = covmean.real
|
195 |
+
|
196 |
+
tr_covmean = np.trace(covmean)
|
197 |
+
|
198 |
+
return (diff.dot(diff) + np.trace(sigma1) +
|
199 |
+
np.trace(sigma2) - 2 * tr_covmean)
|
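# --- Hedged sketch (not part of this commit) ---
# The Frechet distance at the core of the FGD metric can be exercised directly via
# the static helper above, without an autoencoder, on two (N, D) feature sets. The
# random features below are purely illustrative.
import numpy as np
from evaluation.FGD import EmbeddingSpaceEvaluator

real_feats = np.random.randn(500, 32)
fake_feats = np.random.randn(500, 32) + 0.5
fgd = EmbeddingSpaceEvaluator.calculate_frechet_distance(
    real_feats.mean(axis=0), np.cov(real_feats, rowvar=False),
    fake_feats.mean(axis=0), np.cov(fake_feats, rowvar=False),
)
print(fgd)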
evaluation/__init__.py
ADDED
File without changes
evaluation/__pycache__/__init__.cpython-37.pyc
ADDED
Binary file (181 Bytes). View file

evaluation/__pycache__/metrics.cpython-37.pyc
ADDED
Binary file (3.81 kB). View file
evaluation/diversity_LVD.py
ADDED
@@ -0,0 +1,64 @@
+'''
+LVD: different initial pose
+diversity: same initial pose
+'''
+import os
+import sys
+sys.path.append(os.getcwd())
+
+from glob import glob
+
+from argparse import ArgumentParser
+import json
+
+from evaluation.util import *
+from evaluation.metrics import *
+from tqdm import tqdm
+
+parser = ArgumentParser()
+parser.add_argument('--speaker', required=True, type=str)
+parser.add_argument('--post_fix', nargs='+', default=['base'], type=str)
+args = parser.parse_args()
+
+speaker = args.speaker
+test_audios = sorted(glob('pose_dataset/videos/test_audios/%s/*.wav'%(speaker)))
+
+LVD_list = []
+diversity_list = []
+
+for aud in tqdm(test_audios):
+    base_name = os.path.splitext(aud)[0]
+    gt_path = get_full_path(aud, speaker, 'val')
+    _, gt_poses, _ = get_gts(gt_path)
+    gt_poses = gt_poses[np.newaxis,...]
+    # print(gt_poses.shape)  # (seq_len, 135*2) pose, lhand, rhand, face
+    for post_fix in args.post_fix:
+        pred_path = base_name + '_'+post_fix+'.json'
+        pred_poses = np.array(json.load(open(pred_path)))
+        # print(pred_poses.shape)  # (B, seq_len, 108)
+        pred_poses = cvt25(pred_poses, gt_poses)
+        # print(pred_poses.shape)  # (B, seq, pose_dim)
+
+        gt_valid_points = hand_points(gt_poses)
+        pred_valid_points = hand_points(pred_poses)
+
+        lvd = LVD(gt_valid_points, pred_valid_points)
+        # div = diversity(pred_valid_points)
+
+        LVD_list.append(lvd)
+        # diversity_list.append(div)
+
+        # gt_velocity = peak_velocity(gt_valid_points, order=2)
+        # pred_velocity = peak_velocity(pred_valid_points, order=2)
+
+        # gt_consistency = velocity_consistency(gt_velocity, pred_velocity)
+        # pred_consistency = velocity_consistency(pred_velocity, gt_velocity)
+
+        # gt_consistency_list.append(gt_consistency)
+        # pred_consistency_list.append(pred_consistency)
+
+lvd = np.mean(LVD_list)
+# diversity_list = np.mean(diversity_list)
+
+print('LVD:', lvd)
+# print("diversity:", diversity_list)
evaluation/get_quality_samples.py
ADDED
@@ -0,0 +1,62 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
'''
|
2 |
+
'''
|
3 |
+
import os
|
4 |
+
import sys
|
5 |
+
sys.path.append(os.getcwd())
|
6 |
+
|
7 |
+
from glob import glob
|
8 |
+
|
9 |
+
from argparse import ArgumentParser
|
10 |
+
import json
|
11 |
+
|
12 |
+
from evaluation.util import *
|
13 |
+
from evaluation.metrics import *
|
14 |
+
from tqdm import tqdm
|
15 |
+
|
16 |
+
parser = ArgumentParser()
|
17 |
+
parser.add_argument('--speaker', required=True, type=str)
|
18 |
+
parser.add_argument('--post_fix', nargs='+', default=['paper_model'], type=str)
|
19 |
+
args = parser.parse_args()
|
20 |
+
|
21 |
+
speaker = args.speaker
|
22 |
+
test_audios = sorted(glob('pose_dataset/videos/test_audios/%s/*.wav'%(speaker)))
|
23 |
+
|
24 |
+
quality_samples={'gt':[]}
|
25 |
+
for post_fix in args.post_fix:
|
26 |
+
quality_samples[post_fix] = []
|
27 |
+
|
28 |
+
for aud in tqdm(test_audios):
|
29 |
+
base_name = os.path.splitext(aud)[0]
|
30 |
+
gt_path = get_full_path(aud, speaker, 'val')
|
31 |
+
_, gt_poses, _ = get_gts(gt_path)
|
32 |
+
gt_poses = gt_poses[np.newaxis,...]
|
33 |
+
gt_valid_points = valid_points(gt_poses)
|
34 |
+
# print(gt_valid_points.shape)
|
35 |
+
quality_samples['gt'].append(gt_valid_points)
|
36 |
+
|
37 |
+
for post_fix in args.post_fix:
|
38 |
+
pred_path = base_name + '_'+post_fix+'.json'
|
39 |
+
pred_poses = np.array(json.load(open(pred_path)))
|
40 |
+
# print(pred_poses.shape)#(B, seq_len, 108)
|
41 |
+
pred_poses = cvt25(pred_poses, gt_poses)
|
42 |
+
# print(pred_poses.shape)#(B, seq, pose_dim)
|
43 |
+
|
44 |
+
pred_valid_points = valid_points(pred_poses)[0:1]
|
45 |
+
quality_samples[post_fix].append(pred_valid_points)
|
46 |
+
|
47 |
+
quality_samples['gt'] = np.concatenate(quality_samples['gt'], axis=1)
|
48 |
+
for post_fix in args.post_fix:
|
49 |
+
quality_samples[post_fix] = np.concatenate(quality_samples[post_fix], axis=1)
|
50 |
+
|
51 |
+
print('gt:', quality_samples['gt'].shape)
|
52 |
+
quality_samples['gt'] = quality_samples['gt'].tolist()
|
53 |
+
for post_fix in args.post_fix:
|
54 |
+
print(post_fix, ':', quality_samples[post_fix].shape)
|
55 |
+
quality_samples[post_fix] = quality_samples[post_fix].tolist()
|
56 |
+
|
57 |
+
save_dir = '../../experiments/'
|
58 |
+
os.makedirs(save_dir, exist_ok=True)
|
59 |
+
save_name = os.path.join(save_dir, 'quality_samples_%s.json'%(speaker))
|
60 |
+
with open(save_name, 'w') as f:
|
61 |
+
json.dump(quality_samples, f)
|
62 |
+
|
evaluation/metrics.py
ADDED
@@ -0,0 +1,109 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
'''
|
2 |
+
Warning: metrics are for reference only, may have limited significance
|
3 |
+
'''
|
4 |
+
import os
|
5 |
+
import sys
|
6 |
+
sys.path.append(os.getcwd())
|
7 |
+
import numpy as np
|
8 |
+
import torch
|
9 |
+
|
10 |
+
from data_utils.lower_body import rearrange, symmetry
|
11 |
+
import torch.nn.functional as F
|
12 |
+
|
13 |
+
def data_driven_baselines(gt_kps):
|
14 |
+
'''
|
15 |
+
gt_kps: T, D
|
16 |
+
'''
|
17 |
+
gt_velocity = np.abs(gt_kps[1:] - gt_kps[:-1])
|
18 |
+
|
19 |
+
mean= np.mean(gt_velocity, axis=0)[np.newaxis] #(1, D)
|
20 |
+
mean = np.mean(np.abs(gt_velocity-mean))
|
21 |
+
last_step = gt_kps[1] - gt_kps[0]
|
22 |
+
last_step = last_step[np.newaxis] #(1, D)
|
23 |
+
last_step = np.mean(np.abs(gt_velocity-last_step))
|
24 |
+
return last_step, mean
|
25 |
+
|
26 |
+
def Batch_LVD(gt_kps, pr_kps, symmetrical, weight):
|
27 |
+
if gt_kps.shape[0] > pr_kps.shape[1]:
|
28 |
+
length = pr_kps.shape[1]
|
29 |
+
else:
|
30 |
+
length = gt_kps.shape[0]
|
31 |
+
gt_kps = gt_kps[:length]
|
32 |
+
pr_kps = pr_kps[:, :length]
|
33 |
+
global symmetry
|
34 |
+
symmetry = torch.tensor(symmetry).bool()
|
35 |
+
|
36 |
+
if symmetrical:
|
37 |
+
# rearrange for compute symmetric. ns means non-symmetrical joints, ys means symmetrical joints.
|
38 |
+
gt_kps = gt_kps[:, rearrange]
|
39 |
+
ns_gt_kps = gt_kps[:, ~symmetry]
|
40 |
+
ys_gt_kps = gt_kps[:, symmetry]
|
41 |
+
ys_gt_kps = ys_gt_kps.reshape(ys_gt_kps.shape[0], -1, 2, 3)
|
42 |
+
ns_gt_velocity = (ns_gt_kps[1:] - ns_gt_kps[:-1]).norm(p=2, dim=-1)
|
43 |
+
ys_gt_velocity = (ys_gt_kps[1:] - ys_gt_kps[:-1]).norm(p=2, dim=-1)
|
44 |
+
left_gt_vel = ys_gt_velocity[:, :, 0].sum(dim=-1)
|
45 |
+
right_gt_vel = ys_gt_velocity[:, :, 1].sum(dim=-1)
|
46 |
+
move_side = torch.where(left_gt_vel>right_gt_vel, torch.ones(left_gt_vel.shape).cuda(), torch.zeros(left_gt_vel.shape).cuda())
|
47 |
+
ys_gt_velocity = torch.mul(ys_gt_velocity[:, :, 0].transpose(0,1), move_side) + torch.mul(ys_gt_velocity[:, :, 1].transpose(0,1), ~move_side.bool())
|
48 |
+
ys_gt_velocity = ys_gt_velocity.transpose(0,1)
|
49 |
+
gt_velocity = torch.cat([ns_gt_velocity, ys_gt_velocity], dim=1)
|
50 |
+
|
51 |
+
pr_kps = pr_kps[:, :, rearrange]
|
52 |
+
ns_pr_kps = pr_kps[:, :, ~symmetry]
|
53 |
+
ys_pr_kps = pr_kps[:, :, symmetry]
|
54 |
+
ys_pr_kps = ys_pr_kps.reshape(ys_pr_kps.shape[0], ys_pr_kps.shape[1], -1, 2, 3)
|
55 |
+
ns_pr_velocity = (ns_pr_kps[:, 1:] - ns_pr_kps[:, :-1]).norm(p=2, dim=-1)
|
56 |
+
ys_pr_velocity = (ys_pr_kps[:, 1:] - ys_pr_kps[:, :-1]).norm(p=2, dim=-1)
|
57 |
+
left_pr_vel = ys_pr_velocity[:, :, :, 0].sum(dim=-1)
|
58 |
+
right_pr_vel = ys_pr_velocity[:, :, :, 1].sum(dim=-1)
|
59 |
+
move_side = torch.where(left_pr_vel > right_pr_vel, torch.ones(left_pr_vel.shape).cuda(),
|
60 |
+
torch.zeros(left_pr_vel.shape).cuda())
|
61 |
+
ys_pr_velocity = torch.mul(ys_pr_velocity[..., 0].permute(2, 0, 1), move_side) + torch.mul(
|
62 |
+
ys_pr_velocity[..., 1].permute(2, 0, 1), ~move_side.long())
|
63 |
+
ys_pr_velocity = ys_pr_velocity.permute(1, 2, 0)
|
64 |
+
pr_velocity = torch.cat([ns_pr_velocity, ys_pr_velocity], dim=2)
|
65 |
+
else:
|
66 |
+
gt_velocity = (gt_kps[1:] - gt_kps[:-1]).norm(p=2, dim=-1)
|
67 |
+
pr_velocity = (pr_kps[:, 1:] - pr_kps[:, :-1]).norm(p=2, dim=-1)
|
68 |
+
|
69 |
+
if weight:
|
70 |
+
w = F.softmax(gt_velocity.sum(dim=1).normal_(), dim=0)
|
71 |
+
else:
|
72 |
+
w = 1 / gt_velocity.shape[0]
|
73 |
+
|
74 |
+
v_diff = ((pr_velocity - gt_velocity).abs().sum(dim=-1) * w).sum(dim=-1).mean()
|
75 |
+
|
76 |
+
return v_diff
|
77 |
+
|
78 |
+
|
79 |
+
def LVD(gt_kps, pr_kps, symmetrical=False, weight=False):
|
80 |
+
gt_kps = gt_kps.squeeze()
|
81 |
+
pr_kps = pr_kps.squeeze()
|
82 |
+
if len(pr_kps.shape) == 4:
|
83 |
+
return Batch_LVD(gt_kps, pr_kps, symmetrical, weight)
|
84 |
+
# length = np.minimum(gt_kps.shape[0], pr_kps.shape[0])
|
85 |
+
length = gt_kps.shape[0]-10
|
86 |
+
# gt_kps = gt_kps[25:length]
|
87 |
+
# pr_kps = pr_kps[25:length] #(T, D)
|
88 |
+
# if pr_kps.shape[0] < gt_kps.shape[0]:
|
89 |
+
# pr_kps = np.pad(pr_kps, [[0, int(gt_kps.shape[0]-pr_kps.shape[0])], [0, 0]], mode='constant')
|
90 |
+
|
91 |
+
gt_velocity = (gt_kps[1:] - gt_kps[:-1]).norm(p=2, dim=-1)
|
92 |
+
pr_velocity = (pr_kps[1:] - pr_kps[:-1]).norm(p=2, dim=-1)
|
93 |
+
|
94 |
+
return (pr_velocity-gt_velocity).abs().sum(dim=-1).mean()
|
95 |
+
|
96 |
+
def diversity(kps):
|
97 |
+
'''
|
98 |
+
kps: bs, seq, dim
|
99 |
+
'''
|
100 |
+
dis_list = []
|
101 |
+
#the distance between each pair
|
102 |
+
for i in range(kps.shape[0]):
|
103 |
+
for j in range(i+1, kps.shape[0]):
|
104 |
+
seq_i = kps[i]
|
105 |
+
seq_j = kps[j]
|
106 |
+
|
107 |
+
dis = np.mean(np.abs(seq_i - seq_j))
|
108 |
+
dis_list.append(dis)
|
109 |
+
return np.mean(dis_list)
|
evaluation/mode_transition.py
ADDED
@@ -0,0 +1,60 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
import sys
|
3 |
+
sys.path.append(os.getcwd())
|
4 |
+
|
5 |
+
from glob import glob
|
6 |
+
|
7 |
+
from argparse import ArgumentParser
|
8 |
+
import json
|
9 |
+
|
10 |
+
from evaluation.util import *
|
11 |
+
from evaluation.metrics import *
|
12 |
+
from tqdm import tqdm
|
13 |
+
|
14 |
+
parser = ArgumentParser()
|
15 |
+
parser.add_argument('--speaker', required=True, type=str)
|
16 |
+
parser.add_argument('--post_fix', nargs='+', default=['paper_model'], type=str)
|
17 |
+
args = parser.parse_args()
|
18 |
+
|
19 |
+
speaker = args.speaker
|
20 |
+
test_audios = sorted(glob('pose_dataset/videos/test_audios/%s/*.wav'%(speaker)))
|
21 |
+
|
22 |
+
precision_list=[]
|
23 |
+
recall_list=[]
|
24 |
+
accuracy_list=[]
|
25 |
+
|
26 |
+
for aud in tqdm(test_audios):
|
27 |
+
base_name = os.path.splitext(aud)[0]
|
28 |
+
gt_path = get_full_path(aud, speaker, 'val')
|
29 |
+
_, gt_poses, _ = get_gts(gt_path)
|
30 |
+
if gt_poses.shape[0] < 50:
|
31 |
+
continue
|
32 |
+
gt_poses = gt_poses[np.newaxis,...]
|
33 |
+
# print(gt_poses.shape)#(seq_len, 135*2)pose, lhand, rhand, face
|
34 |
+
for post_fix in args.post_fix:
|
35 |
+
pred_path = base_name + '_'+post_fix+'.json'
|
36 |
+
pred_poses = np.array(json.load(open(pred_path)))
|
37 |
+
# print(pred_poses.shape)#(B, seq_len, 108)
|
38 |
+
pred_poses = cvt25(pred_poses, gt_poses)
|
39 |
+
# print(pred_poses.shape)#(B, seq, pose_dim)
|
40 |
+
|
41 |
+
gt_valid_points = valid_points(gt_poses)
|
42 |
+
pred_valid_points = valid_points(pred_poses)
|
43 |
+
|
44 |
+
# print(gt_valid_points.shape, pred_valid_points.shape)
|
45 |
+
|
46 |
+
gt_mode_transition_seq = mode_transition_seq(gt_valid_points, speaker)#(B, N)
|
47 |
+
pred_mode_transition_seq = mode_transition_seq(pred_valid_points, speaker)#(B, N)
|
48 |
+
|
49 |
+
# baseline = np.random.randint(0, 2, size=pred_mode_transition_seq.shape)
|
50 |
+
# pred_mode_transition_seq = baseline
|
51 |
+
precision, recall, accuracy = mode_transition_consistency(pred_mode_transition_seq, gt_mode_transition_seq)
|
52 |
+
precision_list.append(precision)
|
53 |
+
recall_list.append(recall)
|
54 |
+
accuracy_list.append(accuracy)
|
55 |
+
print(len(precision_list), len(recall_list), len(accuracy_list))
|
56 |
+
precision_list = np.mean(precision_list)
|
57 |
+
recall_list = np.mean(recall_list)
|
58 |
+
accuracy_list = np.mean(accuracy_list)
|
59 |
+
|
60 |
+
print('precision, recall, accu:', precision_list, recall_list, accuracy_list)
|
evaluation/peak_velocity.py
ADDED
@@ -0,0 +1,65 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
import sys
|
3 |
+
sys.path.append(os.getcwd())
|
4 |
+
|
5 |
+
from glob import glob
|
6 |
+
|
7 |
+
from argparse import ArgumentParser
|
8 |
+
import json
|
9 |
+
|
10 |
+
from evaluation.util import *
|
11 |
+
from evaluation.metrics import *
|
12 |
+
from tqdm import tqdm
|
13 |
+
|
14 |
+
parser = ArgumentParser()
|
15 |
+
parser.add_argument('--speaker', required=True, type=str)
|
16 |
+
parser.add_argument('--post_fix', nargs='+', default=['paper_model'], type=str)
|
17 |
+
args = parser.parse_args()
|
18 |
+
|
19 |
+
speaker = args.speaker
|
20 |
+
test_audios = sorted(glob('pose_dataset/videos/test_audios/%s/*.wav'%(speaker)))
|
21 |
+
|
22 |
+
gt_consistency_list=[]
|
23 |
+
pred_consistency_list=[]
|
24 |
+
|
25 |
+
for aud in tqdm(test_audios):
|
26 |
+
base_name = os.path.splitext(aud)[0]
|
27 |
+
gt_path = get_full_path(aud, speaker, 'val')
|
28 |
+
_, gt_poses, _ = get_gts(gt_path)
|
29 |
+
gt_poses = gt_poses[np.newaxis,...]
|
30 |
+
# print(gt_poses.shape)#(seq_len, 135*2)pose, lhand, rhand, face
|
31 |
+
for post_fix in args.post_fix:
|
32 |
+
pred_path = base_name + '_'+post_fix+'.json'
|
33 |
+
pred_poses = np.array(json.load(open(pred_path)))
|
34 |
+
# print(pred_poses.shape)#(B, seq_len, 108)
|
35 |
+
pred_poses = cvt25(pred_poses, gt_poses)
|
36 |
+
# print(pred_poses.shape)#(B, seq, pose_dim)
|
37 |
+
|
38 |
+
gt_valid_points = hand_points(gt_poses)
|
39 |
+
pred_valid_points = hand_points(pred_poses)
|
40 |
+
|
41 |
+
gt_velocity = peak_velocity(gt_valid_points, order=2)
|
42 |
+
pred_velocity = peak_velocity(pred_valid_points, order=2)
|
43 |
+
|
44 |
+
gt_consistency = velocity_consistency(gt_velocity, pred_velocity)
|
45 |
+
pred_consistency = velocity_consistency(pred_velocity, gt_velocity)
|
46 |
+
|
47 |
+
gt_consistency_list.append(gt_consistency)
|
48 |
+
pred_consistency_list.append(pred_consistency)
|
49 |
+
|
50 |
+
gt_consistency_list = np.concatenate(gt_consistency_list)
|
51 |
+
pred_consistency_list = np.concatenate(pred_consistency_list)
|
52 |
+
|
53 |
+
print(gt_consistency_list.max(), gt_consistency_list.min())
|
54 |
+
print(pred_consistency_list.max(), pred_consistency_list.min())
|
55 |
+
print(np.mean(gt_consistency_list), np.mean(pred_consistency_list))
|
56 |
+
print(np.std(gt_consistency_list), np.std(pred_consistency_list))
|
57 |
+
|
58 |
+
draw_cdf(gt_consistency_list, save_name='%s_gt.jpg'%(speaker), color='slateblue')
|
59 |
+
draw_cdf(pred_consistency_list, save_name='%s_pred.jpg'%(speaker), color='lightskyblue')
|
60 |
+
|
61 |
+
to_excel(gt_consistency_list, '%s_gt.xlsx'%(speaker))
|
62 |
+
to_excel(pred_consistency_list, '%s_pred.xlsx'%(speaker))
|
63 |
+
|
64 |
+
np.save('%s_gt.npy'%(speaker), gt_consistency_list)
|
65 |
+
np.save('%s_pred.npy'%(speaker), pred_consistency_list)
|
evaluation/util.py
ADDED
@@ -0,0 +1,148 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
from glob import glob
|
3 |
+
import numpy as np
|
4 |
+
import json
|
5 |
+
from matplotlib import pyplot as plt
|
6 |
+
import pandas as pd
|
7 |
+
def get_gts(clip):
|
8 |
+
'''
|
9 |
+
clip: abs path to the clip dir
|
10 |
+
'''
|
11 |
+
keypoints_files = sorted(glob(os.path.join(clip, 'keypoints_new/person_1')+'/*.json'))
|
12 |
+
|
13 |
+
upper_body_points = list(np.arange(0, 25))
|
14 |
+
poses = []
|
15 |
+
confs = []
|
16 |
+
neck_to_nose_len = []
|
17 |
+
mean_position = []
|
18 |
+
for kp_file in keypoints_files:
|
19 |
+
kp_load = json.load(open(kp_file, 'r'))['people'][0]
|
20 |
+
posepts = kp_load['pose_keypoints_2d']
|
21 |
+
lhandpts = kp_load['hand_left_keypoints_2d']
|
22 |
+
rhandpts = kp_load['hand_right_keypoints_2d']
|
23 |
+
facepts = kp_load['face_keypoints_2d']
|
24 |
+
|
25 |
+
neck = np.array(posepts).reshape(-1,3)[1]
|
26 |
+
nose = np.array(posepts).reshape(-1,3)[0]
|
27 |
+
x_offset = abs(neck[0]-nose[0])
|
28 |
+
y_offset = abs(neck[1]-nose[1])
|
29 |
+
neck_to_nose_len.append(y_offset)
|
30 |
+
mean_position.append([neck[0],neck[1]])
|
31 |
+
|
32 |
+
keypoints=np.array(posepts+lhandpts+rhandpts+facepts).reshape(-1,3)[:,:2]
|
33 |
+
|
34 |
+
upper_body = keypoints[upper_body_points, :]
|
35 |
+
hand_points = keypoints[25:, :]
|
36 |
+
keypoints = np.vstack([upper_body, hand_points])
|
37 |
+
|
38 |
+
poses.append(keypoints)
|
39 |
+
|
40 |
+
if len(neck_to_nose_len) > 0:
|
41 |
+
scale_factor = np.mean(neck_to_nose_len)
|
42 |
+
else:
|
43 |
+
raise ValueError(clip)
|
44 |
+
mean_position = np.mean(np.array(mean_position), axis=0)
|
45 |
+
|
46 |
+
unlocalized_poses = np.array(poses).copy()
|
47 |
+
localized_poses = []
|
48 |
+
for i in range(len(poses)):
|
49 |
+
keypoints = poses[i]
|
50 |
+
neck = keypoints[1].copy()
|
51 |
+
|
52 |
+
keypoints[:, 0] = (keypoints[:, 0] - neck[0]) / scale_factor
|
53 |
+
keypoints[:, 1] = (keypoints[:, 1] - neck[1]) / scale_factor
|
54 |
+
localized_poses.append(keypoints.reshape(-1))
|
55 |
+
|
56 |
+
localized_poses=np.array(localized_poses)
|
57 |
+
return unlocalized_poses, localized_poses, (scale_factor, mean_position)
|
58 |
+
|
59 |
+
def get_full_path(wav_name, speaker, split):
|
60 |
+
'''
|
61 |
+
get clip path from aud file
|
62 |
+
'''
|
63 |
+
wav_name = os.path.basename(wav_name)
|
64 |
+
wav_name = os.path.splitext(wav_name)[0]
|
65 |
+
clip_name, vid_name = wav_name[:10], wav_name[11:]
|
66 |
+
|
67 |
+
full_path = os.path.join('pose_dataset/videos/', speaker, 'clips', vid_name, 'images/half', split, clip_name)
|
68 |
+
|
69 |
+
assert os.path.isdir(full_path), full_path
|
70 |
+
|
71 |
+
return full_path
|
72 |
+
|
73 |
+
def smooth(res):
|
74 |
+
'''
|
75 |
+
res: (B, seq_len, pose_dim)
|
76 |
+
'''
|
77 |
+
window = [res[:, 7, :], res[:, 8, :], res[:, 9, :], res[:, 10, :], res[:, 11, :], res[:, 12, :]]
|
78 |
+
w_size=7
|
79 |
+
for i in range(10, res.shape[1]-3):
|
80 |
+
window.append(res[:, i+3, :])
|
81 |
+
if len(window) > w_size:
|
82 |
+
window = window[1:]
|
83 |
+
|
84 |
+
if (i%25) in [22, 23, 24, 0, 1, 2, 3]:
|
85 |
+
res[:, i, :] = np.mean(window, axis=1)
|
86 |
+
|
87 |
+
return res
|
88 |
+
|
89 |
+
def cvt25(pred_poses, gt_poses=None):
|
90 |
+
'''
|
91 |
+
gt_poses: (1, seq_len, 270), 135 *2
|
92 |
+
pred_poses: (B, seq_len, 108), 54 * 2
|
93 |
+
'''
|
94 |
+
if gt_poses is None:
|
95 |
+
gt_poses = np.zeros_like(pred_poses)
|
96 |
+
else:
|
97 |
+
gt_poses = gt_poses.repeat(pred_poses.shape[0], axis=0)
|
98 |
+
|
99 |
+
length = min(pred_poses.shape[1], gt_poses.shape[1])
|
100 |
+
pred_poses = pred_poses[:, :length, :]
|
101 |
+
gt_poses = gt_poses[:, :length, :]
|
102 |
+
gt_poses = gt_poses.reshape(gt_poses.shape[0], gt_poses.shape[1], -1, 2)
|
103 |
+
pred_poses = pred_poses.reshape(pred_poses.shape[0], pred_poses.shape[1], -1, 2)
|
104 |
+
|
105 |
+
gt_poses[:, :, [1, 2, 3, 4, 5, 6, 7], :] = pred_poses[:, :, 1:8, :]
|
106 |
+
gt_poses[:, :, 25:25+21+21, :] = pred_poses[:, :, 12:, :]
|
107 |
+
|
108 |
+
return gt_poses.reshape(gt_poses.shape[0], gt_poses.shape[1], -1)
|
109 |
+
|
110 |
+
def hand_points(seq):
|
111 |
+
'''
|
112 |
+
seq: (B, seq_len, 135*2)
|
113 |
+
hands only
|
114 |
+
'''
|
115 |
+
hand_idx = [1, 2, 3, 4,5 ,6,7] + list(range(25, 25+21+21))
|
116 |
+
seq = seq.reshape(seq.shape[0], seq.shape[1], -1, 2)
|
117 |
+
return seq[:, :, hand_idx, :].reshape(seq.shape[0], seq.shape[1], -1)
|
118 |
+
|
119 |
+
def valid_points(seq):
|
120 |
+
'''
|
121 |
+
hands with some head points
|
122 |
+
'''
|
123 |
+
valid_idx = [0, 1, 2, 3, 4,5 ,6,7, 8, 9, 10, 11] + list(range(25, 25+21+21))
|
124 |
+
seq = seq.reshape(seq.shape[0], seq.shape[1], -1, 2)
|
125 |
+
|
126 |
+
seq = seq[:, :, valid_idx, :].reshape(seq.shape[0], seq.shape[1], -1)
|
127 |
+
assert seq.shape[-1] == 108, seq.shape
|
128 |
+
return seq
|
129 |
+
|
130 |
+
def draw_cdf(seq, save_name='cdf.jpg', color='slatebule'):
|
131 |
+
plt.figure()
|
132 |
+
plt.hist(seq, bins=100, range=(0, 100), color=color)
|
133 |
+
plt.savefig(save_name)
|
134 |
+
|
135 |
+
def to_excel(seq, save_name='res.xlsx'):
|
136 |
+
'''
|
137 |
+
seq: (T)
|
138 |
+
'''
|
139 |
+
df = pd.DataFrame(seq)
|
140 |
+
writer = pd.ExcelWriter(save_name)
|
141 |
+
df.to_excel(writer, 'sheet1')
|
142 |
+
writer.save()
|
143 |
+
writer.close()
|
144 |
+
|
145 |
+
|
146 |
+
if __name__ == '__main__':
|
147 |
+
random_data = np.random.randint(0, 10, 100)
|
148 |
+
draw_cdf(random_data)
|
losses/__init__.py
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
from .losses import *
|
losses/__pycache__/__init__.cpython-37.pyc
ADDED
Binary file (174 Bytes). View file
|
|
losses/__pycache__/losses.cpython-37.pyc
ADDED
Binary file (3.53 kB). View file
|
|
losses/losses.py
ADDED
@@ -0,0 +1,91 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
import sys
|
3 |
+
|
4 |
+
sys.path.append(os.getcwd())
|
5 |
+
|
6 |
+
import torch
|
7 |
+
import torch.nn as nn
|
8 |
+
import torch.nn.functional as F
|
9 |
+
import numpy as np
|
10 |
+
|
11 |
+
class KeypointLoss(nn.Module):
|
12 |
+
def __init__(self):
|
13 |
+
super(KeypointLoss, self).__init__()
|
14 |
+
|
15 |
+
def forward(self, pred_seq, gt_seq, gt_conf=None):
|
16 |
+
#pred_seq: (B, C, T)
|
17 |
+
if gt_conf is not None:
|
18 |
+
gt_conf = gt_conf >= 0.01
|
19 |
+
return F.mse_loss(pred_seq[gt_conf], gt_seq[gt_conf], reduction='mean')
|
20 |
+
else:
|
21 |
+
return F.mse_loss(pred_seq, gt_seq)
|
22 |
+
|
23 |
+
|
24 |
+
class KLLoss(nn.Module):
|
25 |
+
def __init__(self, kl_tolerance):
|
26 |
+
super(KLLoss, self).__init__()
|
27 |
+
self.kl_tolerance = kl_tolerance
|
28 |
+
|
29 |
+
def forward(self, mu, var, mul=1):
|
30 |
+
kl_tolerance = self.kl_tolerance * mul * var.shape[1] / 64
|
31 |
+
kld_loss = -0.5 * torch.sum(1 + var - mu**2 - var.exp(), dim=1)
|
32 |
+
# kld_loss = -0.5 * torch.sum(1 + (var-1) - (mu) ** 2 - (var-1).exp(), dim=1)
|
33 |
+
if self.kl_tolerance is not None:
|
34 |
+
# above_line = kld_loss[kld_loss > self.kl_tolerance]
|
35 |
+
# if len(above_line) > 0:
|
36 |
+
# kld_loss = torch.mean(kld_loss)
|
37 |
+
# else:
|
38 |
+
# kld_loss = 0
|
39 |
+
kld_loss = torch.where(kld_loss > kl_tolerance, kld_loss, torch.tensor(kl_tolerance, device='cuda'))
|
40 |
+
# else:
|
41 |
+
kld_loss = torch.mean(kld_loss)
|
42 |
+
return kld_loss
|
43 |
+
|
44 |
+
|
45 |
+
class L2KLLoss(nn.Module):
|
46 |
+
def __init__(self, kl_tolerance):
|
47 |
+
super(L2KLLoss, self).__init__()
|
48 |
+
self.kl_tolerance = kl_tolerance
|
49 |
+
|
50 |
+
def forward(self, x):
|
51 |
+
# TODO: check
|
52 |
+
kld_loss = torch.sum(x ** 2, dim=1)
|
53 |
+
if self.kl_tolerance is not None:
|
54 |
+
above_line = kld_loss[kld_loss > self.kl_tolerance]
|
55 |
+
if len(above_line) > 0:
|
56 |
+
kld_loss = torch.mean(kld_loss)
|
57 |
+
else:
|
58 |
+
kld_loss = 0
|
59 |
+
else:
|
60 |
+
kld_loss = torch.mean(kld_loss)
|
61 |
+
return kld_loss
|
62 |
+
|
63 |
+
class L2RegLoss(nn.Module):
|
64 |
+
def __init__(self):
|
65 |
+
super(L2RegLoss, self).__init__()
|
66 |
+
|
67 |
+
def forward(self, x):
|
68 |
+
#TODO: check
|
69 |
+
return torch.sum(x**2)
|
70 |
+
|
71 |
+
|
72 |
+
class L2Loss(nn.Module):
|
73 |
+
def __init__(self):
|
74 |
+
super(L2Loss, self).__init__()
|
75 |
+
|
76 |
+
def forward(self, x):
|
77 |
+
# TODO: check
|
78 |
+
return torch.sum(x ** 2)
|
79 |
+
|
80 |
+
|
81 |
+
class AudioLoss(nn.Module):
|
82 |
+
def __init__(self):
|
83 |
+
super(AudioLoss, self).__init__()
|
84 |
+
|
85 |
+
def forward(self, dynamics, gt_poses):
|
86 |
+
#pay attention, normalized
|
87 |
+
mean = torch.mean(gt_poses, dim=-1).unsqueeze(-1)
|
88 |
+
gt = gt_poses - mean
|
89 |
+
return F.mse_loss(dynamics, gt)
|
90 |
+
|
91 |
+
L1Loss = nn.L1Loss
|
nets/LS3DCG.py
ADDED
@@ -0,0 +1,414 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
'''
|
2 |
+
not exactly the same as the official repo but the results are good
|
3 |
+
'''
|
4 |
+
import sys
|
5 |
+
import os
|
6 |
+
|
7 |
+
from data_utils.lower_body import c_index_3d, c_index_6d
|
8 |
+
|
9 |
+
sys.path.append(os.getcwd())
|
10 |
+
|
11 |
+
import numpy as np
|
12 |
+
import torch
|
13 |
+
import torch.nn as nn
|
14 |
+
import torch.optim as optim
|
15 |
+
import torch.nn.functional as F
|
16 |
+
import math
|
17 |
+
|
18 |
+
from nets.base import TrainWrapperBaseClass
|
19 |
+
from nets.layers import SeqEncoder1D
|
20 |
+
from losses import KeypointLoss, L1Loss, KLLoss
|
21 |
+
from data_utils.utils import get_melspec, get_mfcc_psf, get_mfcc_ta
|
22 |
+
from nets.utils import denormalize
|
23 |
+
|
24 |
+
class Conv1d_tf(nn.Conv1d):
|
25 |
+
"""
|
26 |
+
Conv1d with the padding behavior from TF
|
27 |
+
modified from https://github.com/mlperf/inference/blob/482f6a3beb7af2fb0bd2d91d6185d5e71c22c55f/others/edge/object_detection/ssd_mobilenet/pytorch/utils.py
|
28 |
+
"""
|
29 |
+
|
30 |
+
def __init__(self, *args, **kwargs):
|
31 |
+
super(Conv1d_tf, self).__init__(*args, **kwargs)
|
32 |
+
self.padding = kwargs.get("padding", "same")
|
33 |
+
|
34 |
+
def _compute_padding(self, input, dim):
|
35 |
+
input_size = input.size(dim + 2)
|
36 |
+
filter_size = self.weight.size(dim + 2)
|
37 |
+
effective_filter_size = (filter_size - 1) * self.dilation[dim] + 1
|
38 |
+
out_size = (input_size + self.stride[dim] - 1) // self.stride[dim]
|
39 |
+
total_padding = max(
|
40 |
+
0, (out_size - 1) * self.stride[dim] + effective_filter_size - input_size
|
41 |
+
)
|
42 |
+
additional_padding = int(total_padding % 2 != 0)
|
43 |
+
|
44 |
+
return additional_padding, total_padding
|
45 |
+
|
46 |
+
def forward(self, input):
|
47 |
+
if self.padding == "VALID":
|
48 |
+
return F.conv1d(
|
49 |
+
input,
|
50 |
+
self.weight,
|
51 |
+
self.bias,
|
52 |
+
self.stride,
|
53 |
+
padding=0,
|
54 |
+
dilation=self.dilation,
|
55 |
+
groups=self.groups,
|
56 |
+
)
|
57 |
+
rows_odd, padding_rows = self._compute_padding(input, dim=0)
|
58 |
+
if rows_odd:
|
59 |
+
input = F.pad(input, [0, rows_odd])
|
60 |
+
|
61 |
+
return F.conv1d(
|
62 |
+
input,
|
63 |
+
self.weight,
|
64 |
+
self.bias,
|
65 |
+
self.stride,
|
66 |
+
padding=(padding_rows // 2),
|
67 |
+
dilation=self.dilation,
|
68 |
+
groups=self.groups,
|
69 |
+
)
|
70 |
+
|
71 |
+
|
72 |
+
def ConvNormRelu(in_channels, out_channels, type='1d', downsample=False, k=None, s=None, norm='bn', padding='valid'):
|
73 |
+
if k is None and s is None:
|
74 |
+
if not downsample:
|
75 |
+
k = 3
|
76 |
+
s = 1
|
77 |
+
else:
|
78 |
+
k = 4
|
79 |
+
s = 2
|
80 |
+
|
81 |
+
if type == '1d':
|
82 |
+
conv_block = Conv1d_tf(in_channels, out_channels, kernel_size=k, stride=s, padding=padding)
|
83 |
+
if norm == 'bn':
|
84 |
+
norm_block = nn.BatchNorm1d(out_channels)
|
85 |
+
elif norm == 'ln':
|
86 |
+
norm_block = nn.LayerNorm(out_channels)
|
87 |
+
elif type == '2d':
|
88 |
+
conv_block = Conv2d_tf(in_channels, out_channels, kernel_size=k, stride=s, padding=padding)
|
89 |
+
norm_block = nn.BatchNorm2d(out_channels)
|
90 |
+
else:
|
91 |
+
assert False
|
92 |
+
|
93 |
+
return nn.Sequential(
|
94 |
+
conv_block,
|
95 |
+
norm_block,
|
96 |
+
nn.LeakyReLU(0.2, True)
|
97 |
+
)
|
98 |
+
|
99 |
+
class Decoder(nn.Module):
|
100 |
+
def __init__(self, in_ch, out_ch):
|
101 |
+
super(Decoder, self).__init__()
|
102 |
+
self.up1 = nn.Sequential(
|
103 |
+
ConvNormRelu(in_ch // 2 + in_ch, in_ch // 2),
|
104 |
+
ConvNormRelu(in_ch // 2, in_ch // 2),
|
105 |
+
nn.Upsample(scale_factor=2, mode='nearest')
|
106 |
+
)
|
107 |
+
self.up2 = nn.Sequential(
|
108 |
+
ConvNormRelu(in_ch // 4 + in_ch // 2, in_ch // 4),
|
109 |
+
ConvNormRelu(in_ch // 4, in_ch // 4),
|
110 |
+
nn.Upsample(scale_factor=2, mode='nearest')
|
111 |
+
)
|
112 |
+
self.up3 = nn.Sequential(
|
113 |
+
ConvNormRelu(in_ch // 8 + in_ch // 4, in_ch // 8),
|
114 |
+
ConvNormRelu(in_ch // 8, in_ch // 8),
|
115 |
+
nn.Conv1d(in_ch // 8, out_ch, 1, 1)
|
116 |
+
)
|
117 |
+
|
118 |
+
def forward(self, x, x1, x2, x3):
|
119 |
+
x = F.interpolate(x, x3.shape[2])
|
120 |
+
x = torch.cat([x, x3], dim=1)
|
121 |
+
x = self.up1(x)
|
122 |
+
x = F.interpolate(x, x2.shape[2])
|
123 |
+
x = torch.cat([x, x2], dim=1)
|
124 |
+
x = self.up2(x)
|
125 |
+
x = F.interpolate(x, x1.shape[2])
|
126 |
+
x = torch.cat([x, x1], dim=1)
|
127 |
+
x = self.up3(x)
|
128 |
+
return x
|
129 |
+
|
130 |
+
|
131 |
+
class EncoderDecoder(nn.Module):
|
132 |
+
def __init__(self, n_frames, each_dim):
|
133 |
+
super().__init__()
|
134 |
+
self.n_frames = n_frames
|
135 |
+
|
136 |
+
self.down1 = nn.Sequential(
|
137 |
+
ConvNormRelu(64, 64, '1d', False),
|
138 |
+
ConvNormRelu(64, 128, '1d', False),
|
139 |
+
)
|
140 |
+
self.down2 = nn.Sequential(
|
141 |
+
ConvNormRelu(128, 128, '1d', False),
|
142 |
+
ConvNormRelu(128, 256, '1d', False),
|
143 |
+
)
|
144 |
+
self.down3 = nn.Sequential(
|
145 |
+
ConvNormRelu(256, 256, '1d', False),
|
146 |
+
ConvNormRelu(256, 512, '1d', False),
|
147 |
+
)
|
148 |
+
self.down4 = nn.Sequential(
|
149 |
+
ConvNormRelu(512, 512, '1d', False),
|
150 |
+
ConvNormRelu(512, 1024, '1d', False),
|
151 |
+
)
|
152 |
+
|
153 |
+
self.down = nn.MaxPool1d(kernel_size=2)
|
154 |
+
self.up = nn.Upsample(scale_factor=2, mode='nearest')
|
155 |
+
|
156 |
+
self.face_decoder = Decoder(1024, each_dim[0] + each_dim[3])
|
157 |
+
self.body_decoder = Decoder(1024, each_dim[1])
|
158 |
+
self.hand_decoder = Decoder(1024, each_dim[2])
|
159 |
+
|
160 |
+
def forward(self, spectrogram, time_steps=None):
|
161 |
+
if time_steps is None:
|
162 |
+
time_steps = self.n_frames
|
163 |
+
|
164 |
+
x1 = self.down1(spectrogram)
|
165 |
+
x = self.down(x1)
|
166 |
+
x2 = self.down2(x)
|
167 |
+
x = self.down(x2)
|
168 |
+
x3 = self.down3(x)
|
169 |
+
x = self.down(x3)
|
170 |
+
x = self.down4(x)
|
171 |
+
x = self.up(x)
|
172 |
+
|
173 |
+
face = self.face_decoder(x, x1, x2, x3)
|
174 |
+
body = self.body_decoder(x, x1, x2, x3)
|
175 |
+
hand = self.hand_decoder(x, x1, x2, x3)
|
176 |
+
|
177 |
+
return face, body, hand
|
178 |
+
|
179 |
+
|
180 |
+
class Generator(nn.Module):
|
181 |
+
def __init__(self,
|
182 |
+
each_dim,
|
183 |
+
training=False,
|
184 |
+
device=None
|
185 |
+
):
|
186 |
+
super().__init__()
|
187 |
+
|
188 |
+
self.training = training
|
189 |
+
self.device = device
|
190 |
+
|
191 |
+
self.encoderdecoder = EncoderDecoder(15, each_dim)
|
192 |
+
|
193 |
+
def forward(self, in_spec, time_steps=None):
|
194 |
+
if time_steps is not None:
|
195 |
+
self.gen_length = time_steps
|
196 |
+
|
197 |
+
face, body, hand = self.encoderdecoder(in_spec)
|
198 |
+
out = torch.cat([face, body, hand], dim=1)
|
199 |
+
out = out.transpose(1, 2)
|
200 |
+
|
201 |
+
return out
|
202 |
+
|
203 |
+
|
204 |
+
class Discriminator(nn.Module):
|
205 |
+
def __init__(self, input_dim):
|
206 |
+
super().__init__()
|
207 |
+
self.net = nn.Sequential(
|
208 |
+
ConvNormRelu(input_dim, 128, '1d'),
|
209 |
+
ConvNormRelu(128, 256, '1d'),
|
210 |
+
nn.MaxPool1d(kernel_size=2),
|
211 |
+
ConvNormRelu(256, 256, '1d'),
|
212 |
+
ConvNormRelu(256, 512, '1d'),
|
213 |
+
nn.MaxPool1d(kernel_size=2),
|
214 |
+
ConvNormRelu(512, 512, '1d'),
|
215 |
+
ConvNormRelu(512, 1024, '1d'),
|
216 |
+
nn.MaxPool1d(kernel_size=2),
|
217 |
+
nn.Conv1d(1024, 1, 1, 1),
|
218 |
+
nn.Sigmoid()
|
219 |
+
)
|
220 |
+
|
221 |
+
def forward(self, x):
|
222 |
+
x = x.transpose(1, 2)
|
223 |
+
|
224 |
+
out = self.net(x)
|
225 |
+
return out
|
226 |
+
|
227 |
+
|
228 |
+
class TrainWrapper(TrainWrapperBaseClass):
|
229 |
+
def __init__(self, args, config) -> None:
|
230 |
+
self.args = args
|
231 |
+
self.config = config
|
232 |
+
self.device = torch.device(self.args.gpu)
|
233 |
+
self.global_step = 0
|
234 |
+
self.convert_to_6d = self.config.Data.pose.convert_to_6d
|
235 |
+
self.init_params()
|
236 |
+
|
237 |
+
self.generator = Generator(
|
238 |
+
each_dim=self.each_dim,
|
239 |
+
training=not self.args.infer,
|
240 |
+
device=self.device,
|
241 |
+
).to(self.device)
|
242 |
+
self.discriminator = Discriminator(
|
243 |
+
input_dim=self.each_dim[1] + self.each_dim[2] + 64
|
244 |
+
).to(self.device)
|
245 |
+
if self.convert_to_6d:
|
246 |
+
self.c_index = c_index_6d
|
247 |
+
else:
|
248 |
+
self.c_index = c_index_3d
|
249 |
+
self.MSELoss = KeypointLoss().to(self.device)
|
250 |
+
self.L1Loss = L1Loss().to(self.device)
|
251 |
+
super().__init__(args, config)
|
252 |
+
|
253 |
+
def init_params(self):
|
254 |
+
scale = 1
|
255 |
+
|
256 |
+
global_orient = round(0 * scale)
|
257 |
+
leye_pose = reye_pose = round(0 * scale)
|
258 |
+
jaw_pose = round(3 * scale)
|
259 |
+
body_pose = round((63 - 24) * scale)
|
260 |
+
left_hand_pose = right_hand_pose = round(45 * scale)
|
261 |
+
|
262 |
+
expression = 100
|
263 |
+
|
264 |
+
b_j = 0
|
265 |
+
jaw_dim = jaw_pose
|
266 |
+
b_e = b_j + jaw_dim
|
267 |
+
eye_dim = leye_pose + reye_pose
|
268 |
+
b_b = b_e + eye_dim
|
269 |
+
body_dim = global_orient + body_pose
|
270 |
+
b_h = b_b + body_dim
|
271 |
+
hand_dim = left_hand_pose + right_hand_pose
|
272 |
+
b_f = b_h + hand_dim
|
273 |
+
face_dim = expression
|
274 |
+
|
275 |
+
self.dim_list = [b_j, b_e, b_b, b_h, b_f]
|
276 |
+
self.full_dim = jaw_dim + eye_dim + body_dim + hand_dim
|
277 |
+
self.pose = int(self.full_dim / round(3 * scale))
|
278 |
+
self.each_dim = [jaw_dim, eye_dim + body_dim, hand_dim, face_dim]
|
279 |
+
|
280 |
+
def __call__(self, bat):
|
281 |
+
assert (not self.args.infer), "infer mode"
|
282 |
+
self.global_step += 1
|
283 |
+
|
284 |
+
loss_dict = {}
|
285 |
+
|
286 |
+
aud, poses = bat['aud_feat'].to(self.device).to(torch.float32), bat['poses'].to(self.device).to(torch.float32)
|
287 |
+
expression = bat['expression'].to(self.device).to(torch.float32)
|
288 |
+
jaw = poses[:, :3, :]
|
289 |
+
poses = poses[:, self.c_index, :]
|
290 |
+
|
291 |
+
pred = self.generator(in_spec=aud)
|
292 |
+
|
293 |
+
D_loss, D_loss_dict = self.get_loss(
|
294 |
+
pred_poses=pred.detach(),
|
295 |
+
gt_poses=poses,
|
296 |
+
aud=aud,
|
297 |
+
mode='training_D',
|
298 |
+
)
|
299 |
+
|
300 |
+
self.discriminator_optimizer.zero_grad()
|
301 |
+
D_loss.backward()
|
302 |
+
self.discriminator_optimizer.step()
|
303 |
+
|
304 |
+
G_loss, G_loss_dict = self.get_loss(
|
305 |
+
pred_poses=pred,
|
306 |
+
gt_poses=poses,
|
307 |
+
aud=aud,
|
308 |
+
expression=expression,
|
309 |
+
jaw=jaw,
|
310 |
+
mode='training_G',
|
311 |
+
)
|
312 |
+
self.generator_optimizer.zero_grad()
|
313 |
+
G_loss.backward()
|
314 |
+
self.generator_optimizer.step()
|
315 |
+
|
316 |
+
total_loss = None
|
317 |
+
loss_dict = {}
|
318 |
+
for key in list(D_loss_dict.keys()) + list(G_loss_dict.keys()):
|
319 |
+
loss_dict[key] = G_loss_dict.get(key, 0) + D_loss_dict.get(key, 0)
|
320 |
+
|
321 |
+
return total_loss, loss_dict
|
322 |
+
|
323 |
+
def get_loss(self,
|
324 |
+
pred_poses,
|
325 |
+
gt_poses,
|
326 |
+
aud=None,
|
327 |
+
jaw=None,
|
328 |
+
expression=None,
|
329 |
+
mode='training_G',
|
330 |
+
):
|
331 |
+
loss_dict = {}
|
332 |
+
aud = aud.transpose(1, 2)
|
333 |
+
gt_poses = gt_poses.transpose(1, 2)
|
334 |
+
gt_aud = torch.cat([gt_poses, aud], dim=2)
|
335 |
+
pred_aud = torch.cat([pred_poses[:, :, 103:], aud], dim=2)
|
336 |
+
|
337 |
+
if mode == 'training_D':
|
338 |
+
dis_real = self.discriminator(gt_aud)
|
339 |
+
dis_fake = self.discriminator(pred_aud)
|
340 |
+
dis_error = self.MSELoss(torch.ones_like(dis_real).to(self.device), dis_real) + self.MSELoss(
|
341 |
+
torch.zeros_like(dis_fake).to(self.device), dis_fake)
|
342 |
+
loss_dict['dis'] = dis_error
|
343 |
+
|
344 |
+
return dis_error, loss_dict
|
345 |
+
elif mode == 'training_G':
|
346 |
+
jaw_loss = self.L1Loss(pred_poses[:, :, :3], jaw.transpose(1, 2))
|
347 |
+
face_loss = self.MSELoss(pred_poses[:, :, 3:103], expression.transpose(1, 2))
|
348 |
+
body_loss = self.L1Loss(pred_poses[:, :, 103:142], gt_poses[:, :, :39])
|
349 |
+
hand_loss = self.L1Loss(pred_poses[:, :, 142:], gt_poses[:, :, 39:])
|
350 |
+
l1_loss = jaw_loss + face_loss + body_loss + hand_loss
|
351 |
+
|
352 |
+
dis_output = self.discriminator(pred_aud)
|
353 |
+
gen_error = self.MSELoss(torch.ones_like(dis_output).to(self.device), dis_output)
|
354 |
+
gen_loss = self.config.Train.weights.keypoint_loss_weight * l1_loss + self.config.Train.weights.gan_loss_weight * gen_error
|
355 |
+
|
356 |
+
loss_dict['gen'] = gen_error
|
357 |
+
loss_dict['jaw_loss'] = jaw_loss
|
358 |
+
loss_dict['face_loss'] = face_loss
|
359 |
+
loss_dict['body_loss'] = body_loss
|
360 |
+
loss_dict['hand_loss'] = hand_loss
|
361 |
+
return gen_loss, loss_dict
|
362 |
+
else:
|
363 |
+
raise ValueError(mode)
|
364 |
+
|
365 |
+
def infer_on_audio(self, aud_fn, fps=30, initial_pose=None, norm_stats=None, id=None, B=1, **kwargs):
|
366 |
+
output = []
|
367 |
+
assert self.args.infer, "train mode"
|
368 |
+
self.generator.eval()
|
369 |
+
|
370 |
+
if self.config.Data.pose.normalization:
|
371 |
+
assert norm_stats is not None
|
372 |
+
data_mean = norm_stats[0]
|
373 |
+
data_std = norm_stats[1]
|
374 |
+
|
375 |
+
pre_length = self.config.Data.pose.pre_pose_length
|
376 |
+
generate_length = self.config.Data.pose.generate_length
|
377 |
+
# assert pre_length == initial_pose.shape[-1]
|
378 |
+
# pre_poses = initial_pose.permute(0, 2, 1).to(self.device).to(torch.float32)
|
379 |
+
# B = pre_poses.shape[0]
|
380 |
+
|
381 |
+
aud_feat = get_mfcc_ta(aud_fn, sr=22000, fps=fps, smlpx=True, type='mfcc').transpose(1, 0)
|
382 |
+
num_poses_to_generate = aud_feat.shape[-1]
|
383 |
+
aud_feat = aud_feat[np.newaxis, ...].repeat(B, axis=0)
|
384 |
+
aud_feat = torch.tensor(aud_feat, dtype=torch.float32).to(self.device)
|
385 |
+
|
386 |
+
with torch.no_grad():
|
387 |
+
pred_poses = self.generator(aud_feat)
|
388 |
+
pred_poses = pred_poses.cpu().numpy()
|
389 |
+
output = pred_poses.squeeze()
|
390 |
+
|
391 |
+
return output
|
392 |
+
|
393 |
+
def generate(self, aud, id):
|
394 |
+
self.generator.eval()
|
395 |
+
pred_poses = self.generator(aud)
|
396 |
+
return pred_poses
|
397 |
+
|
398 |
+
|
399 |
+
if __name__ == '__main__':
|
400 |
+
from trainer.options import parse_args
|
401 |
+
|
402 |
+
parser = parse_args()
|
403 |
+
args = parser.parse_args(
|
404 |
+
['--exp_name', '0', '--data_root', '0', '--speakers', '0', '--pre_pose_length', '4', '--generate_length', '64',
|
405 |
+
'--infer'])
|
406 |
+
|
407 |
+
generator = TrainWrapper(args)
|
408 |
+
|
409 |
+
aud_fn = '../sample_audio/jon.wav'
|
410 |
+
initial_pose = torch.randn(64, 108, 4)
|
411 |
+
norm_stats = (np.random.randn(108), np.random.randn(108))
|
412 |
+
output = generator.infer_on_audio(aud_fn, initial_pose, norm_stats)
|
413 |
+
|
414 |
+
print(output.shape)
|
nets/__init__.py
ADDED
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
from .smplx_face import TrainWrapper as s2g_face
|
2 |
+
from .smplx_body_vq import TrainWrapper as s2g_body_vq
|
3 |
+
from .smplx_body_pixel import TrainWrapper as s2g_body_pixel
|
4 |
+
from .body_ae import TrainWrapper as s2g_body_ae
|
5 |
+
from .LS3DCG import TrainWrapper as LS3DCG
|
6 |
+
from .base import TrainWrapperBaseClass
|
7 |
+
|
8 |
+
from .utils import normalize, denormalize
|
nets/__pycache__/__init__.cpython-37.pyc
ADDED
Binary file (407 Bytes). View file
|
|
nets/__pycache__/base.cpython-37.pyc
ADDED
Binary file (2.29 kB). View file
|
|
nets/__pycache__/init_model.cpython-37.pyc
ADDED
Binary file (460 Bytes). View file
|
|
nets/__pycache__/layers.cpython-37.pyc
ADDED
Binary file (22.7 kB). View file
|
|
nets/__pycache__/smplx_body_pixel.cpython-37.pyc
ADDED
Binary file (9.55 kB). View file
|
|
nets/__pycache__/smplx_body_vq.cpython-37.pyc
ADDED
Binary file (7.89 kB). View file
|
|