Commit ed3641c
Parent(s): 8527e35
Fix intro
sections/intro/intro.md +1 -1
sections/intro/intro.md
CHANGED
@@ -1,7 +1,7 @@
Visual Question Answering (VQA) is a task where we expect the AI to answer a question about a given image. VQA has been an active area of research for the past 4-5 years, with most datasets using natural images found online. Two examples of such datasets are [VQAv2](https://visualqa.org/challenge.html) and [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html). VQA is a particularly interesting multi-modal machine learning challenge because it has applications across several domains, including healthcare chatbots and interactive agents. **However, most VQA challenges or datasets deal with English-only captions and questions.**

In addition, even recent **approaches that have been proposed for VQA tend to be opaque**, because CNN-based object detectors are relatively difficult to use and make feature extraction more complex. For example, a Faster R-CNN approach uses the following steps:
-- an FPN (Feature Pyramid Net) over a ResNet backbone, and
+- the image features are given out by an FPN (Feature Pyramid Net) over a ResNet backbone, and
- then an RPN (Region Proposal Network) layer detects proposals in those features, and
- then the ROI (Region of Interest) heads get the box proposals in the original image, and
- then the boxes are selected using NMS (Non-max suppression),
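The final NMS step in the pipeline above can be sketched as a greedy procedure over scored boxes. This is a minimal pure-Python illustration of the idea, not the detector's actual implementation (real pipelines use vectorized ops such as `torchvision.ops.nms`):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and
    drop every remaining box that overlaps it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_threshold]
    return keep

# Two heavily overlapping boxes and one distant box: the lower-scored
# overlapping box is suppressed, the distant one survives.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]
```

The same greedy selection is what the detector applies to the RoI heads' scored box proposals before returning final detections.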