{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "07baa7d5-89a4-4b22-9dc2-7e43ec2ff9a9",
   "metadata": {},
   "source": [
    "# Text-to-Video retrieval with S3D MIL-NCE and OpenVINO\n",
    "\n",
    "This tutorial based on [the TensorFlow tutorial](https://www.tensorflow.org/hub/tutorials/text_to_video_retrieval_with_s3d_milnce) that demonstrates how to use the [S3D MIL-NCE](https://tfhub.dev/deepmind/mil-nce/s3d/1) model from TensorFlow Hub to do text-to-video retrieval to find the most similar videos for a given text query.\n",
    "\n",
    "MIL-NCE inherits from Multiple Instance Learning (MIL) and Noise Contrastive Estimation (NCE). The method is capable of addressing visually misaligned narrations from uncurated instructional videos. Two model variations are available with different 3D CNN backbones: I3D and S3D. In this tutorial we use S3D variation. More details about the training and the model can be found in [End-to-End Learning of Visual Representations from Uncurated Instructional Videos](https://arxiv.org/abs/1912.06430) paper.\n",
    "\n",
    "This tutorial demonstrates step-by-step instructions on how to run and optimize S3D MIL-NCE model with OpenVINO. An additional part demonstrates how to run quantization with [NNCF](https://github.com/openvinotoolkit/nncf/) to speed up the inference.\n",
    "\n",
    "The tutorial consists of the following steps:\n",
    "\n",
    "#### Table of contents:\n",
    "- [Prerequisites](#Prerequisites)\n",
    "- [The original inference](#The-original-inference)\n",
    "- [Convert the model to OpenVINO IR](#Convert-the-model-to-OpenVINO-IR)\n",
    "- [Compiling models](#Compiling-models)\n",
    "- [Inference](#Inference)\n",
    "- [Optimize model using NNCF Post-training Quantization API](#Optimize-model-using-NNCF-Post-training-Quantization-API)\n",
    "    - [Prepare dataset](#Prepare-dataset)\n",
    "    - [Perform model quantization](#Perform-model-quantization)\n",
    "- [Run quantized model inference](#Run-quantized-model-inference)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3ac784f8-5511-4631-ad4c-d3dd816f9d07",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "[back to top ⬆️](#Table-of-contents:)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0ec6500-5ce4-4630-a2a4-813059126a29",
   "metadata": {},
   "outputs": [],
   "source": [
    "import platform\n",
    "\n",
    "%pip install -Uq pip\n",
    "%pip install --upgrade --pre openvino-tokenizers openvino --extra-index-url \"https://storage.openvinotoolkit.org/simple/wheels/nightly\"\n",
    "%pip install -q \"tensorflow-macos>=2.5; sys_platform == 'darwin' and platform_machine == 'arm64' and python_version > '3.8'\" # macOS M1 and M2\n",
    "%pip install -q \"tensorflow-macos>=2.5,<=2.12.0; sys_platform == 'darwin' and platform_machine == 'arm64' and python_version <= '3.8'\" # macOS M1 and M2\n",
    "%pip install -q \"tensorflow>=2.5; sys_platform == 'darwin' and platform_machine != 'arm64' and python_version > '3.8'\" # macOS x86\n",
    "%pip install -q \"tensorflow>=2.5,<=2.12.0; sys_platform == 'darwin' and platform_machine != 'arm64' and python_version <= '3.8'\" # macOS x86\n",
    "%pip install -q \"tensorflow>=2.5; sys_platform != 'darwin' and python_version > '3.8'\"\n",
    "%pip install -q \"tensorflow>=2.5,<=2.12.0; sys_platform != 'darwin' and python_version <= '3.8'\"\n",
    "\n",
    "%pip install -q tensorflow_hub tf_keras numpy \"opencv-python\" \"nncf>=2.10.0\"\n",
    "if platform.system() != \"Windows\":\n",
    "    %pip install -q \"matplotlib>=3.4\"\n",
    "else:\n",
    "    %pip install -q \"matplotlib>=3.4,<3.7\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "0153bf4b-94e1-44a8-99e6-4a2433b29c23",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "import tensorflow as tf\n",
    "import tensorflow_hub as hub\n",
    "\n",
    "import numpy as np\n",
    "import cv2\n",
    "from IPython import display\n",
    "import math\n",
    "\n",
    "os.environ[\"TFHUB_CACHE_DIR\"] = str(Path(\"./tfhub_modules\").resolve())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b6b3c5db-ae42-45a0-ada0-7f1319221621",
   "metadata": {},
   "source": [
    "Download the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "4dea4952-780b-42e1-a973-c45fd4b8508e",
   "metadata": {},
   "outputs": [],
   "source": [
    "hub_handle = \"https://www.kaggle.com/models/deepmind/mil-nce/TensorFlow1/s3d/1\"\n",
    "hub_model = hub.load(hub_handle)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8a962845-20cf-454a-b796-59dc7891ebcf",
   "metadata": {},
   "source": [
    "The model has 2 signatures, one for generating video embeddings and one for generating text embeddings. We will use these embedding to find the nearest neighbors in the embedding space as in the original tutorial. Below we will define auxiliary functions"
   ]
  },
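  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "inspect-signatures-md",
   "metadata": {},
   "source": [
    "As a quick sanity check, we can list the two signatures together with their expected inputs and outputs using standard TensorFlow `ConcreteFunction` introspection (this cell is illustrative and can be skipped):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "inspect-signatures-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List the model signatures with their input and output specifications.\n",
    "for name, signature in hub_model.signatures.items():\n",
    "    print(name)\n",
    "    print(\"  inputs: \", signature.structured_input_signature)\n",
    "    print(\"  outputs:\", signature.structured_outputs)"
   ]
  },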
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "74e1ea65-7bae-4a49-a520-8e5898adbbce",
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_embeddings(model, input_frames, input_words):\n",
    "    \"\"\"Generate embeddings from the model from video frames and input words.\"\"\"\n",
    "    # Input_frames must be normalized in [0, 1] and of the shape Batch x T x H x W x 3\n",
    "    vision_output = model.signatures[\"video\"](tf.constant(tf.cast(input_frames, dtype=tf.float32)))\n",
    "    text_output = model.signatures[\"text\"](tf.constant(input_words))\n",
    "\n",
    "    return vision_output[\"video_embedding\"], text_output[\"text_embedding\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "d762ea2b-77a4-4d1b-9121-e3b4861534cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# @title Define video loading and visualization functions  { display-mode: \"form\" }\n",
    "\n",
    "\n",
    "# Utilities to open video files using CV2\n",
    "def crop_center_square(frame):\n",
    "    y, x = frame.shape[0:2]\n",
    "    min_dim = min(y, x)\n",
    "    start_x = (x // 2) - (min_dim // 2)\n",
    "    start_y = (y // 2) - (min_dim // 2)\n",
    "    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]\n",
    "\n",
    "\n",
    "def load_video(video_url, max_frames=32, resize=(224, 224)):\n",
    "    path = tf.keras.utils.get_file(os.path.basename(video_url)[-128:], video_url)\n",
    "    cap = cv2.VideoCapture(path)\n",
    "    frames = []\n",
    "    try:\n",
    "        while True:\n",
    "            ret, frame = cap.read()\n",
    "            if not ret:\n",
    "                break\n",
    "            frame = crop_center_square(frame)\n",
    "            frame = cv2.resize(frame, resize)\n",
    "            frame = frame[:, :, [2, 1, 0]]\n",
    "            frames.append(frame)\n",
    "\n",
    "            if len(frames) == max_frames:\n",
    "                break\n",
    "    finally:\n",
    "        cap.release()\n",
    "    frames = np.array(frames)\n",
    "    if len(frames) < max_frames:\n",
    "        n_repeat = int(math.ceil(max_frames / float(len(frames))))\n",
    "        frames = frames.repeat(n_repeat, axis=0)\n",
    "    frames = frames[:max_frames]\n",
    "    return frames / 255.0\n",
    "\n",
    "\n",
    "def display_video(urls):\n",
    "    html = \"<table>\"\n",
    "    html += \"<tr><th>Video 1</th><th>Video 2</th><th>Video 3</th></tr><tr>\"\n",
    "    for url in urls:\n",
    "        html += \"<td>\"\n",
    "        html += '<img src=\"{}\" height=\"224\">'.format(url)\n",
    "        html += \"</td>\"\n",
    "    html += \"</tr></table>\"\n",
    "    return display.HTML(html)\n",
    "\n",
    "\n",
    "def display_query_and_results_video(query, urls, scores):\n",
    "    \"\"\"Display a text query and the top result videos and scores.\"\"\"\n",
    "    sorted_ix = np.argsort(-scores)\n",
    "    html = \"\"\n",
    "    html += \"<h2>Input query: <i>{}</i> </h2><div>\".format(query)\n",
    "    html += \"Results: <div>\"\n",
    "    html += \"<table>\"\n",
    "    html += \"<tr><th>Rank #1, Score:{:.2f}</th>\".format(scores[sorted_ix[0]])\n",
    "    html += \"<th>Rank #2, Score:{:.2f}</th>\".format(scores[sorted_ix[1]])\n",
    "    html += \"<th>Rank #3, Score:{:.2f}</th></tr><tr>\".format(scores[sorted_ix[2]])\n",
    "    for i, idx in enumerate(sorted_ix):\n",
    "        url = urls[sorted_ix[i]]\n",
    "        html += \"<td>\"\n",
    "        html += '<img src=\"{}\" height=\"224\">'.format(url)\n",
    "        html += \"</td>\"\n",
    "    html += \"</tr></table>\"\n",
    "\n",
    "    return html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "09e0047d-6fb8-4b18-a66d-e8e8a37154ba",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table><tr><th>Video 1</th><th>Video 2</th><th>Video 3</th></tr><tr><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/b/b0/YosriAirTerjun.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/e/e6/Guitar_solo_gif.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/3/30/2009-08-16-autodrift-by-RalfR-gif-by-wau.gif\" height=\"224\"></td></tr></table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# @title Load example videos and define text queries  { display-mode: \"form\" }\n",
    "\n",
    "video_1_url = \"https://upload.wikimedia.org/wikipedia/commons/b/b0/YosriAirTerjun.gif\"  # @param {type:\"string\"}\n",
    "video_2_url = \"https://upload.wikimedia.org/wikipedia/commons/e/e6/Guitar_solo_gif.gif\"  # @param {type:\"string\"}\n",
    "video_3_url = \"https://upload.wikimedia.org/wikipedia/commons/3/30/2009-08-16-autodrift-by-RalfR-gif-by-wau.gif\"  # @param {type:\"string\"}\n",
    "\n",
    "video_1 = load_video(video_1_url)\n",
    "video_2 = load_video(video_2_url)\n",
    "video_3 = load_video(video_3_url)\n",
    "all_videos = [video_1, video_2, video_3]\n",
    "\n",
    "query_1_video = \"waterfall\"  # @param {type:\"string\"}\n",
    "query_2_video = \"playing guitar\"  # @param {type:\"string\"}\n",
    "query_3_video = \"car drifting\"  # @param {type:\"string\"}\n",
    "all_queries_video = [query_1_video, query_2_video, query_3_video]\n",
    "all_videos_urls = [video_1_url, video_2_url, video_3_url]\n",
    "display_video(all_videos_urls)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5b91a14a-d565-4526-bd81-7a6c3e15c235",
   "metadata": {},
   "source": [
    "## The original inference\n",
    "[back to top ⬆️](#Table-of-contents:)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "0563060f-0ff4-428f-8ac6-9f7b240e570f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare video inputs.\n",
    "videos_np = np.stack(all_videos, axis=0)\n",
    "\n",
    "# Prepare text input.\n",
    "words_np = np.array(all_queries_video)\n",
    "\n",
    "# Generate the video and text embeddings.\n",
    "video_embd, text_embd = generate_embeddings(hub_model, videos_np, words_np)\n",
    "\n",
    "# Scores between video and text is computed by dot products.\n",
    "all_scores = np.dot(text_embd, tf.transpose(video_embd))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "5fac7925-676e-46b5-b48b-d2b3397d51de",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<h2>Input query: <i>waterfall</i> </h2><div>Results: <div><table><tr><th>Rank #1, Score:4.71</th><th>Rank #2, Score:-1.63</th><th>Rank #3, Score:-4.17</th></tr><tr><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/b/b0/YosriAirTerjun.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/3/30/2009-08-16-autodrift-by-RalfR-gif-by-wau.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/e/e6/Guitar_solo_gif.gif\" height=\"224\"></td></tr></table><br><h2>Input query: <i>playing guitar</i> </h2><div>Results: <div><table><tr><th>Rank #1, Score:6.50</th><th>Rank #2, Score:-1.79</th><th>Rank #3, Score:-2.67</th></tr><tr><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/e/e6/Guitar_solo_gif.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/b/b0/YosriAirTerjun.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/3/30/2009-08-16-autodrift-by-RalfR-gif-by-wau.gif\" height=\"224\"></td></tr></table><br><h2>Input query: <i>car drifting</i> </h2><div>Results: <div><table><tr><th>Rank #1, Score:8.78</th><th>Rank #2, Score:-1.07</th><th>Rank #3, Score:-2.17</th></tr><tr><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/3/30/2009-08-16-autodrift-by-RalfR-gif-by-wau.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/b/b0/YosriAirTerjun.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/e/e6/Guitar_solo_gif.gif\" height=\"224\"></td></tr></table><br>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display results.\n",
    "html = \"\"\n",
    "for i, words in enumerate(words_np):\n",
    "    html += display_query_and_results_video(words, all_videos_urls, all_scores[i, :])\n",
    "    html += \"<br>\"\n",
    "display.HTML(html)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "67a0373a-ee9e-4728-97b2-355d1ed501a3",
   "metadata": {},
   "source": [
    "## Convert the model to OpenVINO IR\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "OpenVINO supports TensorFlow models via conversion into Intermediate Representation (IR) format. We need to provide a model object, input data for model tracing to `ov.convert_model` function to obtain OpenVINO `ov.Model` object instance. Model can be saved on disk for next deployment using `ov.save_model` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "e1f51b7b-a215-44af-90be-83eb9d2cfd43",
   "metadata": {},
   "outputs": [],
   "source": [
    "import openvino_tokenizers  # NOQA Need to import conversion and operation extensions\n",
    "import openvino as ov\n",
    "\n",
    "model_path = hub.resolve(hub_handle)\n",
    "# infer on random data\n",
    "images_data = np.random.rand(3, 32, 224, 224, 3).astype(np.float32)\n",
    "words_data = np.array([\"First sentence\", \"Second one\", \"Abracadabra\"], dtype=str)\n",
    "\n",
    "ov_model = ov.convert_model(model_path, input=[(\"words\", [3]), (\"images\", [3, 32, 224, 224, 3])])"
   ]
  },
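  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "save-ir-model-md",
   "metadata": {},
   "source": [
    "Optionally, the converted model can be serialized to OpenVINO IR on disk for later reuse; a minimal sketch (the `s3d_milnce.xml` file name is chosen for this tutorial):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "save-ir-model-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Serialize the converted model to OpenVINO IR for later reuse.\n",
    "ir_model_path = Path(\"model\") / \"s3d_milnce.xml\"\n",
    "ir_model_path.parent.mkdir(exist_ok=True)\n",
    "ov.save_model(ov_model, ir_model_path)"
   ]
  },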
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "62128306-2021-41eb-a965-2494fa8abbf2",
   "metadata": {},
   "source": [
    "## Compiling models\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "Select device from dropdown list for running inference using OpenVINO."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "d3f0dd83-4803-41f8-8af2-1117b25db19f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "217b87d883b5458d924f457f26fbbf92",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Dropdown(description='Device:', index=4, options=('CPU', 'GPU.0', 'GPU.1', 'GPU.2', 'AUTO'), value='AUTO')"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import ipywidgets as widgets\n",
    "\n",
    "core = ov.Core()\n",
    "device = widgets.Dropdown(\n",
    "    options=core.available_devices + [\"AUTO\"],\n",
    "    value=\"AUTO\",\n",
    "    description=\"Device:\",\n",
    "    disabled=False,\n",
    ")\n",
    "\n",
    "device"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "c3306fcc-bcfc-41f9-adbb-79afe259e4cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "compiled_model = core.compile_model(ov_model, device.value)"
   ]
  },
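  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "compiled-model-io-md",
   "metadata": {},
   "source": [
    "After compilation we can verify that the input and output names match the ones specified during conversion:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "compiled-model-io-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect the inputs and outputs of the compiled model.\n",
    "print(\"Inputs: \", compiled_model.inputs)\n",
    "print(\"Outputs:\", compiled_model.outputs)"
   ]
  },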
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "632d8390-36b8-4e46-9226-2546d3be678b",
   "metadata": {},
   "source": [
    "## Inference\n",
    "[back to top ⬆️](#Table-of-contents:)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "0e653b36-2c45-4609-b24f-64401ed0ca13",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Redefine `generate_embeddings` function to make it possible to use the compile IR model.\n",
    "def generate_embeddings(model, input_frames, input_words):\n",
    "    \"\"\"Generate embeddings from the model from video frames and input words.\"\"\"\n",
    "    # Input_frames must be normalized in [0, 1] and of the shape Batch x T x H x W x 3\n",
    "    output = compiled_model({\"words\": input_words, \"images\": tf.cast(input_frames, dtype=tf.float32)})\n",
    "\n",
    "    return output[\"video_embedding\"], output[\"text_embedding\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "a28a44c9-9c1a-427f-b7a8-eab007d0f4e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate the video and text embeddings.\n",
    "video_embd, text_embd = generate_embeddings(compiled_model, videos_np, words_np)\n",
    "\n",
    "# Scores between video and text is computed by dot products.\n",
    "all_scores = np.dot(text_embd, tf.transpose(video_embd))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "545a657b-849f-454a-a8c9-17f1a653f3ae",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<h2>Input query: <i>waterfall</i> </h2><div>Results: <div><table><tr><th>Rank #1, Score:4.71</th><th>Rank #2, Score:-1.63</th><th>Rank #3, Score:-4.17</th></tr><tr><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/b/b0/YosriAirTerjun.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/3/30/2009-08-16-autodrift-by-RalfR-gif-by-wau.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/e/e6/Guitar_solo_gif.gif\" height=\"224\"></td></tr></table><br><h2>Input query: <i>playing guitar</i> </h2><div>Results: <div><table><tr><th>Rank #1, Score:6.50</th><th>Rank #2, Score:-1.79</th><th>Rank #3, Score:-2.67</th></tr><tr><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/e/e6/Guitar_solo_gif.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/b/b0/YosriAirTerjun.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/3/30/2009-08-16-autodrift-by-RalfR-gif-by-wau.gif\" height=\"224\"></td></tr></table><br><h2>Input query: <i>car drifting</i> </h2><div>Results: <div><table><tr><th>Rank #1, Score:8.78</th><th>Rank #2, Score:-1.07</th><th>Rank #3, Score:-2.17</th></tr><tr><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/3/30/2009-08-16-autodrift-by-RalfR-gif-by-wau.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/b/b0/YosriAirTerjun.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/e/e6/Guitar_solo_gif.gif\" height=\"224\"></td></tr></table><br>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display results.\n",
    "html = \"\"\n",
    "for i, words in enumerate(words_np):\n",
    "    html += display_query_and_results_video(words, all_videos_urls, all_scores[i, :])\n",
    "    html += \"<br>\"\n",
    "display.HTML(html)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "947cdd77-4e5c-4dbe-ac46-3581bde15b47",
   "metadata": {},
   "source": [
    "## Optimize model using NNCF Post-training Quantization API\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "[NNCF](https://github.com/openvinotoolkit/nncf) provides a suite of advanced algorithms for Neural Networks inference optimization in OpenVINO with minimal accuracy drop.\n",
    "We will use 8-bit quantization in post-training mode (without the fine-tuning pipeline).\n",
    "The optimization process contains the following steps:\n",
    "\n",
    "1. Create a Dataset for quantization.\n",
    "2. Run `nncf.quantize` for getting an optimized model.\n",
    "3. Serialize an OpenVINO IR model, using the `ov.save_model` function."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "378312a9-3d90-40e4-b5fa-8de2990556dc",
   "metadata": {},
   "source": [
    "### Prepare dataset\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "This model doesn't require a big dataset for calibration. We will use only example videos for this purpose.\n",
    "NNCF provides `nncf.Dataset` wrapper for using native framework dataloaders in quantization pipeline. Additionally, we specify transform function that will be responsible for preparing input data in model expected format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "ec4eae46-f5b0-41b8-83f6-709fc56aaaa4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nncf\n",
    "\n",
    "dataset = nncf.Dataset(((words_np, tf.cast(videos_np, dtype=tf.float32)),))"
   ]
  },
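  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "dataset-transform-md",
   "metadata": {},
   "source": [
    "Here the single calibration item is already in the model's input format, so no transform function is needed. For illustration, the same dataset could be built with an explicit transform function; a sketch (the `transform_fn` name is our own):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dataset-transform-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "def transform_fn(data_item):\n",
    "    # Convert a raw data item into the (words, images) format the model expects.\n",
    "    words, videos = data_item\n",
    "    return words, tf.cast(videos, dtype=tf.float32)\n",
    "\n",
    "\n",
    "dataset_with_transform = nncf.Dataset([(words_np, videos_np)], transform_fn)"
   ]
  },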
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a76fa7a5-bc18-456c-b2aa-326d9eae7af3",
   "metadata": {},
   "source": [
    "### Perform model quantization\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "The `nncf.quantize` function provides an interface for model quantization. It requires an instance of the OpenVINO Model and quantization dataset. \n",
    "Optionally, some additional parameters for the configuration quantization process (number of samples for quantization, preset, ignored scope etc.) can be provided."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "6e594e8d-0ec5-4c46-a971-93b4aee3ba05",
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL_DIR = Path(\"model/\")\n",
    "MODEL_DIR.mkdir(exist_ok=True)\n",
    "\n",
    "quantized_model_path = MODEL_DIR / \"quantized_model.xml\"\n",
    "\n",
    "\n",
    "if not quantized_model_path.exists():\n",
    "    quantized_model = nncf.quantize(model=ov_model, calibration_dataset=dataset, model_type=nncf.ModelType.TRANSFORMER)\n",
    "    ov.save_model(quantized_model, quantized_model_path)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e850db35-9094-4294-aa4e-b5017a80e1f5",
   "metadata": {},
   "source": [
    "## Run quantized model inference\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "There are no changes in model usage after applying quantization. Let's check the model work on the previously used example. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "aa59c47e-7d11-405e-a7e3-10d2054401ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "int8_model = core.compile_model(quantized_model_path, device.value)"
   ]
  },
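  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "latency-compare-md",
   "metadata": {},
   "source": [
    "To get a rough idea of the speedup from quantization, we can time a few inference runs of both compiled models. This is only a quick sketch; for rigorous measurements use OpenVINO's `benchmark_app`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "latency-compare-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "# Rough latency comparison between the FP32 and INT8 compiled models.\n",
    "example_input = {\"words\": words_np, \"images\": tf.cast(videos_np, dtype=tf.float32)}\n",
    "for name, model in [(\"FP32\", compiled_model), (\"INT8\", int8_model)]:\n",
    "    start = time.perf_counter()\n",
    "    for _ in range(3):\n",
    "        model(example_input)\n",
    "    print(f\"{name}: {(time.perf_counter() - start) / 3:.3f} s per inference\")"
   ]
  },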
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "3d697fcd-d283-40af-bdd3-ffbf6f0a15e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate the video and text embeddings.\n",
    "video_embd, text_embd = generate_embeddings(int8_model, videos_np, words_np)\n",
    "\n",
    "# Scores between video and text is computed by dot products.\n",
    "all_scores = np.dot(text_embd, tf.transpose(video_embd))"
   ]
  },
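  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "score-compare-md",
   "metadata": {},
   "source": [
    "As a rough accuracy check, we can compare the INT8 scores against the FP32 scores on the same inputs; a small difference indicates that quantization preserved the retrieval quality on this example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "score-compare-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare the INT8 scores against the FP32 model scores.\n",
    "fp32_video_embd, fp32_text_embd = generate_embeddings(compiled_model, videos_np, words_np)\n",
    "fp32_scores = np.dot(fp32_text_embd, tf.transpose(fp32_video_embd))\n",
    "print(\"Max absolute score difference:\", np.abs(all_scores - fp32_scores).max())"
   ]
  },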
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "0e944f55-a6be-4ab6-98e3-369e8f2ce10e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<h2>Input query: <i>waterfall</i> </h2><div>Results: <div><table><tr><th>Rank #1, Score:4.71</th><th>Rank #2, Score:-1.63</th><th>Rank #3, Score:-4.17</th></tr><tr><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/b/b0/YosriAirTerjun.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/3/30/2009-08-16-autodrift-by-RalfR-gif-by-wau.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/e/e6/Guitar_solo_gif.gif\" height=\"224\"></td></tr></table><br><h2>Input query: <i>playing guitar</i> </h2><div>Results: <div><table><tr><th>Rank #1, Score:6.50</th><th>Rank #2, Score:-1.79</th><th>Rank #3, Score:-2.67</th></tr><tr><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/e/e6/Guitar_solo_gif.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/b/b0/YosriAirTerjun.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/3/30/2009-08-16-autodrift-by-RalfR-gif-by-wau.gif\" height=\"224\"></td></tr></table><br><h2>Input query: <i>car drifting</i> </h2><div>Results: <div><table><tr><th>Rank #1, Score:8.78</th><th>Rank #2, Score:-1.07</th><th>Rank #3, Score:-2.17</th></tr><tr><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/3/30/2009-08-16-autodrift-by-RalfR-gif-by-wau.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/b/b0/YosriAirTerjun.gif\" height=\"224\"></td><td><img src=\"https://upload.wikimedia.org/wikipedia/commons/e/e6/Guitar_solo_gif.gif\" height=\"224\"></td></tr></table><br>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display results.\n",
    "html = \"\"\n",
    "for i, words in enumerate(words_np):\n",
    "    html += display_query_and_results_video(words, all_videos_urls, all_scores[i, :])\n",
    "    html += \"<br>\"\n",
    "display.HTML(html)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "openvino_notebooks": {
   "imageUrl": "https://github.com/openvinotoolkit/openvino_notebooks/assets/76171391/ba516a81-f6f7-4258-9e3b-931d6db7728c",
   "tags": {
    "categories": [
     "Model Demos"
    ],
    "libraries": [],
    "other": [],
    "tasks": [
     "Text-to-Video Retrieval"
    ]
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}