Validation data
Hi! Sorry to bother you, but would it be possible to share at least the validation dataset?
All the data used in the V3 taggers is available here for inspection: https://huggingface.co/datasets/SmilingWolf/wdtagger-v3-seed
Can do. Is it for a comparison against yours?
I think the split is a little small. Originally I was basing it on JoyTag's 32,768, but I should've considered the number of tags that wouldn't end up in it.
Testing it I found these stats:
Analyzed 20116 samples in the split
Found 31169 unique tags (out of 70527 possible tags)
Tag distribution by category:
general: 14387 tags (46.6% of all general tags)
character: 9803 tags (36.4% of all character tags)
artist: 4671 tags (66.7% of all artist tags)
copyright: 2081 tags (38.8% of all copyright tags)
meta: 203 tags (62.8% of all meta tags)
year: 20 tags (100.0% of all year tags)
rating: 4 tags (100.0% of all rating tags)
That was better than expected, but of course still not the majority.
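For anyone wanting to reproduce this kind of per-category coverage figure on their own split, here is a minimal sketch. It is not the script that produced the numbers above; the function name, the `tag -> category` mapping, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def coverage_by_category(samples, tag_categories):
    """Count how many vocabulary tags of each category appear in a split.

    samples: iterable of tag lists, one list per image (hypothetical layout).
    tag_categories: dict mapping every tag in the full vocabulary to its category.
    Returns {category: (tags_found, tags_total, percent_found)}.
    """
    # Collect every vocabulary tag that shows up at least once in the split.
    seen = set()
    for tags in samples:
        seen.update(t for t in tags if t in tag_categories)

    totals = defaultdict(int)
    found = defaultdict(int)
    for tag, category in tag_categories.items():
        totals[category] += 1
        if tag in seen:
            found[category] += 1

    return {
        category: (found[category], totals[category],
                   100.0 * found[category] / totals[category])
        for category in totals
    }


# Toy vocabulary and split, purely illustrative.
vocab = {
    "1girl": "general",
    "solo": "general",
    "smile": "general",
    "hatsune_miku": "character",
    "vocaloid": "copyright",
}
split = [["1girl", "solo"], ["1girl", "hatsune_miku"]]

report = coverage_by_category(split, vocab)
for category, (n, total, pct) in sorted(report.items()):
    print(f"{category}: {n} tags ({pct:.1f}% of all {category} tags)")
```

Swapping the toy `vocab` and `split` for the real tag list and split would give a report in the same shape as the one above.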
Can do. Is it for a comparison against yours?
Busted :D
Yeah, I was mostly worried about the number of true positives in the test set.
With my filtering and 300k samples in the val split, I still end up with a minimum of 15 samples per tag, which is defo not great.
I hypothesised that with lower requirements for the number of samples per tag you may have been "saving" more images on the tags-per-image front, in turn getting a few more tags in, but it still felt somewhat low.
As a comparison, the val split in my dataset has 8106 general tags, and a minimum of 15 samples per tag.
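The "minimum samples per tag" figure can be checked with a simple count of how many images carry each tag. The sketch below is an assumption about how one might do it, not either author's actual code; `samples_per_tag`, `val`, and the threshold of 3 are hypothetical.

```python
from collections import Counter


def samples_per_tag(samples):
    """Return a Counter mapping each tag to the number of samples containing it."""
    counts = Counter()
    for tags in samples:
        # set() makes sure a tag repeated within one sample is counted once.
        counts.update(set(tags))
    return counts


# Toy validation split, purely illustrative.
val = [
    ["1girl", "solo"],
    ["1girl", "smile"],
    ["1girl", "solo", "smile"],
]

counts = samples_per_tag(val)
min_count = min(counts.values())

# Tags falling under a chosen support threshold (3 is an arbitrary example).
threshold = 3
rare = sorted(tag for tag, c in counts.items() if c < threshold)

print(f"min samples per tag: {min_count}")
print(f"tags with fewer than {threshold} samples: {rare}")
```

With a real split, a low `min_count` means per-tag metrics for the rarest tags rest on very few positives.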
Here are the stats for the test set:
=== Tag Statistics Per Sample ===
Total samples analyzed: 20116
Tags per sample:
Minimum: 29
Maximum: 628
Mean: 45.64
Median: 43.00
Standard deviation: 13.30
Percentiles (tags per sample):
10th percentile: 34.0
25th percentile: 37.0
75th percentile: 51.0
90th percentile: 61.0
95th percentile: 68.0
99th percentile: 91.0
Sample distribution:
Samples with 0 tags: 0 (0.00%)
Samples with 1-5 tags: 0 (0.00%)
Samples with 6-10 tags: 0 (0.00%)
Samples with 11-20 tags: 0 (0.00%)
Samples with 21-50 tags: 14935 (74.24%)
Samples with 51+ tags: 5181 (25.76%)
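A report like the one above can be rebuilt from the tag lists with the standard library alone. This is a sketch under assumptions: the original script is not shown, so whether it used population or sample standard deviation, and which percentile interpolation, is unknown. The names `tags_per_sample_stats` and `samples` are hypothetical.

```python
import statistics


def tags_per_sample_stats(samples):
    """Summarise the number of tags per sample.

    samples: list of tag lists, one per image (needs at least two samples
    for statistics.quantiles to work).
    """
    counts = [len(tags) for tags in samples]
    # 99 cut points; cuts[p - 1] is the p-th percentile.
    cuts = statistics.quantiles(counts, n=100, method="inclusive")
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        # Population standard deviation; an assumption, see the note above.
        "stdev": statistics.pstdev(counts),
        "percentiles": {p: cuts[p - 1] for p in (10, 25, 75, 90, 95, 99)},
    }


# Toy data: five samples carrying 1 to 5 tags each.
samples = [[f"tag{i}" for i in range(k)] for k in range(1, 6)]

stats = tags_per_sample_stats(samples)
print(f"Minimum: {stats['min']}")
print(f"Maximum: {stats['max']}")
print(f"Mean: {stats['mean']:.2f}")
print(f"Median: {stats['median']:.2f}")
print(f"Standard deviation: {stats['stdev']:.2f}")
for p, value in stats["percentiles"].items():
    print(f"{p}th percentile: {value:.1f}")
```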
Would you still like me to upload the test dataset for the performance comparison or to verify any of this? Note that my dataset doesn't include the images, just the paths, as I resized on the fly.
Would you still like me to upload the test dataset for the performance comparison or to verify any of this?
Yeah if you don't mind. Getting the images is not a problem, I'll handle that myself, as long as the danbooru ID is available somewhere.
It's in training/val_dataset.csv. Note that the rating and year tags are in a different format.
Thank you!
I'll work on it over the weekend and report the results here. Code will be uploaded somewhere public (likely GitHub).
🫡