## Zero-Shot Referring Expression Comprehension on RefCOCO

**Preparing Data**

1. Download the [images for RefCOCO/g/+](http://images.cocodataset.org/zips/train2014.zip) and put the extracted `train2014` folder under `eval/rec_zs_test/data/`.

2. Download the preprocessed data files into `eval/rec_zs_test/` via `gsutil cp gs://reclip-sanjays/reclip_data.tar.gz eval/rec_zs_test/`, then `cd eval/rec_zs_test` and extract them with `tar -xvzf reclip_data.tar.gz`.

**Preparing Models**

3. Download the [SAM](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth) checkpoint (ViT-H) and the [Alpha-CLIP](https://github.com/SunzeY/AlphaCLIP/blob/main/model-zoo.md) checkpoints, and put them in `./eval/rec_zs_test/ckpt/` as laid out below.

```
β”œβ”€β”€ eval
β”‚   β”œβ”€β”€ rec_zs_test
β”‚   β”‚   β”œβ”€β”€ data
β”‚   β”‚   β”‚   └── train2014
β”‚   β”‚   β”œβ”€β”€ reclip_data
β”‚   β”‚   β”‚   β”œβ”€β”€ refcoco_val.jsonl
β”‚   β”‚   β”‚   β”œβ”€β”€ refcoco_dets_dict.json
β”‚   β”‚   β”‚   └── ...
β”‚   β”‚   β”œβ”€β”€ ckpt
β”‚   β”‚   β”‚   β”œβ”€β”€ sam_vit_h_4b8939.pth
β”‚   β”‚   β”‚   └── grit1m
β”‚   β”‚   β”‚       β”œβ”€β”€ clip_b16_grit+mim_fultune_4xe.pth
β”‚   β”‚   β”‚       └── clip_l14_grit+mim_fultune_6xe.pth
β”‚   β”‚   β”œβ”€β”€ methods
β”‚   β”‚   β”œβ”€β”€ cache
β”‚   β”‚   β”œβ”€β”€ output
β”‚   β”‚   β”œβ”€β”€ main.py
β”‚   β”‚   β”œβ”€β”€ executor.py
β”‚   β”‚   β”œβ”€β”€ run.sh
β”‚   β”‚   └── ...
```
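
Before running, you can optionally sanity-check that everything landed in the right place. The following small Python snippet is not part of the repo, just a convenience; its paths mirror the tree above:

```
from pathlib import Path

# Paths mirror the directory tree above; run this from the repo root.
ROOT = Path("eval/rec_zs_test")
expected = [
    ROOT / "data/train2014",
    ROOT / "reclip_data/refcoco_val.jsonl",
    ROOT / "reclip_data/refcoco_dets_dict.json",
    ROOT / "ckpt/sam_vit_h_4b8939.pth",
    ROOT / "ckpt/grit1m/clip_b16_grit+mim_fultune_4xe.pth",
    ROOT / "ckpt/grit1m/clip_l14_grit+mim_fultune_6xe.pth",
]
for path in expected:
    print(("OK     " if path.exists() else "MISSING"), path)
```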

4. Run the test script:

```
cd eval/rec_zs_test
bash run.sh
```

or run `main.py` directly:

```
python main.py --input_file reclip_data/refcoco_val.jsonl --image_root ./data/train2014 --method parse --gradcam_alpha 0.5 0.5 --box_representation_method full,blur --box_method_aggregator sum --clip_model ViT-B/16,ViT-L/14 --detector_file reclip_data/refcoco_dets_dict.json --cache_path ./cache
```
(We recommend setting `--cache_path` so that the SAM masks for each image are generated once and reused on subsequent runs; a minimal sketch of this idea follows.)
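
For reference, the snippet below sketches this caching pattern using the standard `segment_anything` API. The actual caching logic lives in the repo's `main.py`/`executor.py`; file names here are illustrative.

```
import pickle
from pathlib import Path

import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

CACHE_DIR = Path("./cache")  # same directory as --cache_path above
CACHE_DIR.mkdir(parents=True, exist_ok=True)

# Load SAM once; generating masks is the expensive, per-image step.
sam = sam_model_registry["vit_h"](checkpoint="ckpt/sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def masks_for(image_path: str):
    """Return SAM masks for an image, computing them at most once."""
    cache_file = CACHE_DIR / (Path(image_path).stem + "_masks.pkl")
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    masks = mask_generator.generate(image)  # slow; cached for later runs
    cache_file.write_bytes(pickle.dumps(masks))
    return masks
```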

For multi-GPU testing, try:

```
bash run_multi_gpus.sh
python cal_acc.py refcoco_val
```
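
Conceptually, each GPU evaluates a shard of the split and `cal_acc.py` sums the per-shard results. The sketch below is hypothetical (the shard file name and JSON format are our assumptions; `cal_acc.py` defines the real ones), shown only to illustrate the aggregation step:

```
import json
import sys
from pathlib import Path

# Hypothetical format: assumes each GPU shard wrote a JSON file such as
# output/refcoco_val_rank0.json containing {"correct": ..., "total": ...}.
split = sys.argv[1] if len(sys.argv) > 1 else "refcoco_val"
correct = total = 0
for shard in Path("output").glob(f"{split}_rank*.json"):
    stats = json.loads(shard.read_text())
    correct += stats["correct"]
    total += stats["total"]
print(f"{split}: {correct}/{total} = {correct / total:.4f}")
```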


**Acknowledgement**

Our evaluation is built on the excellent [ReCLIP](https://github.com/allenai/reclip/tree/main) codebase. We simply replace CLIP with Alpha-CLIP and skip the image-cropping operation.
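
To illustrate the difference: ReCLIP scores a region by cropping the image to each proposal box, whereas with Alpha-CLIP the full image is kept and the proposal is encoded as an extra alpha channel. The sketch below only shows how such an alpha map can be built from a box (the function name is ours; the exact Alpha-CLIP inference API is documented in its own repo):

```
import numpy as np

def box_to_alpha(height: int, width: int, box_xyxy) -> np.ndarray:
    """Binary alpha map marking the proposal region of a full image.

    Instead of cropping the image to `box_xyxy` (as ReCLIP does), the full
    image plus this alpha map is fed to Alpha-CLIP, which focuses on the
    marked region while keeping the surrounding context visible.
    """
    alpha = np.zeros((height, width), dtype=np.float32)
    x1, y1, x2, y2 = (int(round(v)) for v in box_xyxy)
    alpha[y1:y2, x1:x2] = 1.0
    return alpha

# Example: mark a box in a 480x640 image.
alpha = box_to_alpha(480, 640, (50, 60, 130, 160))
```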



**Experimental Results**

| Method          | RefCOCO Val | RefCOCO TestA | RefCOCO TestB | RefCOCO+ Val | RefCOCO+ TestA | RefCOCO+ TestB | RefCOCOg Val | RefCOCOg Test |
|-----------------|-------------|---------------|---------------|--------------|----------------|----------------|--------------|---------------|
| CPT [67]        | 32.2        | 36.1          | 30.3          | 31.9         | 35.2           | 28.8           | 36.7         | 36.5          |
| ReCLIP [54]     | 45.8        | 46.1          | 47.1          | 47.9         | 50.1           | 45.1           | 59.3         | 59.0          |
| Red Circle [52] | 49.8        | 58.6          | 39.9          | 55.3         | 63.9           | 45.4           | 59.4         | 58.9          |
| Alpha-CLIP      | 55.7        | 61.1          | 50.3          | 55.6         | 62.7           | 46.4           | 61.2         | 62.0          |