Yann Defretin and glenn-jocher committed
Commit 38ff499 · unverified · 1 Parent(s): 6f718ce

Update autosplit() with annotated_only option (#2466)


* Be able to create dataset from annotated images only

Add the ability to create a dataset/splits only with images that have an associated annotation file, i.e. a .txt file. As we discussed, the absence of a .txt file could mean two things:

* either the image hasn't been labelled by anyone yet,
* or there is no object to detect.

While it's easy to create small datasets, when you have to create datasets with thousands of images (and more coming), it's hard to track where you are, and you don't want to wait until all of them are annotated before starting to train. This means some images would lack .txt files and annotations, resulting in label inconsistency, as you say in #2313. With the annotated_only argument, people can optionally create datasets/splits only from images that are known to be labelled.

* Cleanup and update print()

Co-authored-by: Glenn Jocher <[email protected]>

Files changed (1)
  1. utils/datasets.py +11 -7
utils/datasets.py CHANGED
@@ -1032,20 +1032,24 @@ def extract_boxes(path='../coco128/'): # from utils.datasets import *; extract_
                 b[[1, 3]] = np.clip(b[[1, 3]], 0, h)
                 assert cv2.imwrite(str(f), im[b[1]:b[3], b[0]:b[2]]), f'box failure in {f}'
 
-
-def autosplit(path='../coco128', weights=(0.9, 0.1, 0.0)):  # from utils.datasets import *; autosplit('../coco128')
+def autosplit(path='../coco128', weights=(0.9, 0.1, 0.0), annotated_only=False):
     """ Autosplit a dataset into train/val/test splits and save path/autosplit_*.txt files
-    # Arguments
-        path: Path to images directory
-        weights: Train, val, test weights (list)
+    Usage: from utils.datasets import *; autosplit('../coco128')
+    Arguments
+        path: Path to images directory
+        weights: Train, val, test weights (list)
+        annotated_only: Only use images with an annotated txt file
     """
     path = Path(path)  # images dir
-    files = list(path.rglob('*.*'))
+    files = sum([list(path.rglob(f"*.{img_ext}")) for img_ext in img_formats], [])  # image files only
     n = len(files)  # number of files
     indices = random.choices([0, 1, 2], weights=weights, k=n)  # assign each image to a split
+
     txt = ['autosplit_train.txt', 'autosplit_val.txt', 'autosplit_test.txt']  # 3 txt files
     [(path / x).unlink() for x in txt if (path / x).exists()]  # remove existing
+
+    print(f'Autosplitting images from {path}' + ', using *.txt labeled images only' * annotated_only)
     for i, img in tqdm(zip(indices, files), total=n):
-        if img.suffix[1:] in img_formats:
+        if not annotated_only or Path(img2label_paths([str(img)])[0]).exists():  # check label
             with open(path / txt[i], 'a') as f:
                 f.write(str(img) + '\n')  # add image to txt file
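
The filtering behaviour this change introduces can be tried outside the repository with a minimal, standard-library-only sketch. It is not the code from utils/datasets.py: `IMG_FORMATS`, `label_path_for`, `split_images`, and the `seed` parameter are hypothetical names introduced here for illustration. The sketch assumes labels live in a `labels` directory that mirrors the `images` directory, and it returns lists instead of writing autosplit_*.txt files.

```python
import random
from pathlib import Path

# Assumed subset of image extensions; the real img_formats list may differ.
IMG_FORMATS = ("jpg", "jpeg", "png", "bmp")


def label_path_for(img: Path) -> Path:
    """Hypothetical helper: map .../images/x.jpg to .../labels/x.txt."""
    parts = list(img.parts)
    # Walk the directory components from the right, skipping the file name.
    for k in range(len(parts) - 2, -1, -1):
        if parts[k] == "images":
            parts[k] = "labels"
            break
    return Path(*parts).with_suffix(".txt")


def split_images(path, weights=(0.9, 0.1, 0.0), annotated_only=False, seed=None):
    """Assign each image under `path` to split 0 (train), 1 (val) or 2 (test)."""
    path = Path(path)
    # Collect image files only, one rglob per extension, in a stable order.
    files = sorted(f for ext in IMG_FORMATS for f in path.rglob(f"*.{ext}"))
    rng = random.Random(seed)
    indices = rng.choices([0, 1, 2], weights=weights, k=len(files))
    splits = {0: [], 1: [], 2: []}
    for i, img in zip(indices, files):
        # Keep every image unless annotated_only is set and the label is missing.
        if not annotated_only or label_path_for(img).exists():
            splits[i].append(img)
    return splits
```

As in the diff, the split index is drawn for every image before the label check, so with annotated_only enabled the resulting split sizes follow the weights only approximately.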