Using flan-t5-base and CLIP models as teachers, I have built and successfully trained a dual-shunt cross-attention adapter architecture. This is NOT a LoRA.
The adapter is currently tasked with using T5-flan-base to guide the outputs of ViT-L-14 and/or ViT-bigG-14, and the reverse direction works just as well within the architecture, meaning CLIP-G can also guide T5-flan-base.
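For a rough idea of the shape, here is an illustrative PyTorch sketch of the general pattern. To be clear, the module names, dimensions, and residual layout below are assumptions for explanation only, not the released code (ViT-L-14 text tokens are 768-wide, ViT-bigG-14 are 1280-wide, and T5-flan-base is 768-wide):

```python
import torch
import torch.nn as nn


class CrossAttentionShunt(nn.Module):
    """One direction of the shunt: source tokens reshape target tokens."""

    def __init__(self, target_dim: int, source_dim: int, heads: int = 8):
        super().__init__()
        # Project the source stream into the target stream's width.
        self.proj = nn.Linear(source_dim, target_dim)
        self.attn = nn.MultiheadAttention(target_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(target_dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        ctx = self.proj(source)                               # [B, S, target_dim]
        delta, _ = self.attn(query=target, key=ctx, value=ctx)
        return self.norm(target + delta)                      # residual update of target tokens


class DualShuntAdapter(nn.Module):
    """Both directions at once: T5 guides CLIP, and CLIP guides T5."""

    def __init__(self, clip_dim: int = 768, t5_dim: int = 768):
        super().__init__()
        self.t5_to_clip = CrossAttentionShunt(clip_dim, t5_dim)  # T5 shapes CLIP tokens
        self.clip_to_t5 = CrossAttentionShunt(t5_dim, clip_dim)  # CLIP shapes T5 tokens

    def forward(self, clip_tokens: torch.Tensor, t5_tokens: torch.Tensor):
        guided_clip = self.t5_to_clip(clip_tokens, t5_tokens)
        guided_t5 = self.clip_to_t5(t5_tokens, clip_tokens)
        return guided_clip, guided_t5
```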
These checkpoints were trained on 20 million synthetic, human-templated captions, and they can be improved considerably with additional languages, richer depiction context, or any fine-tuning task you want to apply to T5-flan-base, with little to no extra training thanks to the adapter's functionality and accuracy.
The ViT-L-14 adapters took only a couple of hours on a Colab A100, and the ViT-bigG-14 adapter took about 4 hours, so you can train many of these in short order with almost no additional overhead beyond the single shared T5-flan-base. Each adapter can be compiled, loaded, and offloaded independently.
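Since each adapter is tiny next to the encoders, swapping them is cheap. A minimal sketch of what compile/load/offload can look like, reusing the hypothetical DualShuntAdapter from the sketch above:

```python
import torch

# Hypothetical adapter class from the sketch above; dims match ViT-L-14 + T5-flan-base.
adapter = DualShuntAdapter(clip_dim=768, t5_dim=768).to("cuda")
adapter = torch.compile(adapter)        # optional one-time compile
# ... run conditioning through it ...
adapter.to("cpu")                       # offload when another adapter takes its place
```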
This is a cross-attention system meant to shape the encoded text after the output comes back from the CLIP models, and it is very fast at inference; T5-flan-base, on the other hand, isn't the fastest.
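In practice, the inference flow looks roughly like this (the model ids are the standard Hugging Face ones; the adapter call is the hypothetical sketch from above, not the actual node API):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()
t5_tok = T5Tokenizer.from_pretrained("google/flan-t5-base")
t5_enc = T5EncoderModel.from_pretrained("google/flan-t5-base").eval()

prompt = "a watercolor lighthouse at dawn, soft mist"

with torch.no_grad():
    clip_in = clip_tok(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    clip_tokens = clip_enc(**clip_in).last_hidden_state                  # [1, 77, 768]
    t5_tokens = t5_enc(**t5_tok(prompt, return_tensors="pt")).last_hidden_state

    # Shape the CLIP conditioning with the (hypothetical) adapter before it
    # reaches the diffusion model; only this small step is added at inference.
    adapter = DualShuntAdapter(clip_dim=768, t5_dim=768)
    guided_clip, _ = adapter(clip_tokens, t5_tokens)
```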
It is trained through a form of cooperative association, using a series of complex losses designed specifically for this associative process.
The adapter has individual gating for tokenization context, with a number of safeguards against overfitting during rapid learning, and it can be paired with any number of other adapters.
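The gating can be pictured as something like the following, a learned per-token blend between the original embedding and the shifted one (again, an assumption about the mechanism for the sake of explanation, not the actual implementation):

```python
import torch
import torch.nn as nn


class TokenGate(nn.Module):
    """Per-token sigmoid gate: decides how much of the adapter's shift to apply."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, original: torch.Tensor, shifted: torch.Tensor) -> torch.Tensor:
        g = self.gate(original)                      # [B, T, 1], one value per token
        return original + g * (shifted - original)   # g = 0 keeps the original token
```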
I'm currently putting together the ComfyUI nodes that will allow easy conditioning shifts and showcase the full power of this cooperative system. The nodes will be available here shortly; I just need to write them.
https://github.com/AbstractEyes/comfy-clip-shunts