0,1,2,3,4,5,correct,choiceA_probs,choiceB_probs,choiceC_probs,choiceD_probs Statement 1| Linear regression estimator has the smallest variance among all unbiased estimators. Statement 2| The coefficients α assigned to the classifiers assembled by AdaBoost are always non-negative.,"True, True","False, False","True, False","False, True",D,False,0.007495197933167219,0.009624025784432888,0.005483604036271572,0.009624025784432888 Statement 1| RoBERTa pretrains on a corpus that is approximately 10x larger than the corpus BERT pretrained on. Statement 2| ResNeXts in 2018 usually used tanh activation functions.,"True, True","False, False","True, False","False, True",C,False,0.005815896671265364,0.007467759307473898,0.0045294249430298805,0.005132510792464018 "Statement 1| Support vector machines, like logistic regression models, give a probability distribution over the possible labels given an input example. Statement 2| We would expect the support vectors to remain the same in general as we move from a linear kernel to higher order polynomial kernels.","True, True","False, False","True, False","False, True",B,True,0.004206264857202768,0.00540095055475831,0.003077368950471282,0.004766322206705809 "A machine learning problem involves four attributes plus a class. The attributes have 3, 2, 2, and 2 possible values each. The class has 3 possible values. How many maximum possible different examples are there?",12,24,48,72,D,False,0.02486063912510872,0.028170794248580933,0.021939437836408615,0.02486063912510872 "As of 2020, which architecture is best for classifying high-resolution images?",convolutional networks,graph networks,fully connected networks,RBF networks,A,False,0.005807581823319197,0.007938023656606674,0.002743307501077652,0.002743307501077652 Statement 1| The log-likelihood of the data will always increase through successive iterations of the expectation maximization algorithm. 
Statement 2| One disadvantage of Q-learning is that it can only be used when the learner has prior knowledge of how its actions affect its environment.,"True, True","False, False","True, False","False, True",B,False,0.008215098641812801,0.008215098641812801,0.0056461491622030735,0.007249799091368914 Let us say that we have computed the gradient of our cost function and stored it in a vector g. What is the cost of one gradient descent update given the gradient?,O(D),O(N),O(ND),O(ND^2),A,False,0.012027136981487274,0.010613911785185337,0.005681217648088932,0.025461450219154358 "Statement 1| For a continuous random variable x and its probability distribution function p(x), it holds that 0 ≤ p(x) ≤ 1 for all x. Statement 2| Decision tree is learned by minimizing information gain.","True, True","False, False","True, False","False, True",B,True,0.008496910333633423,0.00962826143950224,0.00583983538672328,0.006617399863898754 Consider the Bayesian network given below. How many independent parameters are needed for this Bayesian Network H -> U <- P <- W?,2,4,8,16,C,False,0.015344777144491673,0.025299260392785072,0.015344777144491673,0.019703084602952003 "As the number of training examples goes to infinity, your model trained on that data will have:",Lower variance,Higher variance,Same variance,None of the above,A,False,0.024279175326228142,0.024279175326228142,0.01668681763112545,0.031175078824162483 Statement 1| The set of all rectangles in the 2D plane (which includes non axisaligned rectangles) can shatter a set of 5 points. 
Statement 2| The VC-dimension of the k-Nearest Neighbour classifier when k = 1 is infinite.,"True, True","False, False","True, False","False, True",A,True,0.007453136146068573,0.007453136146068573,0.005122460424900055,0.005804507993161678 _ refers to a model that can neither model the training data nor generalize to new data.,good fitting,overfitting,underfitting,all of the above,C,False,0.017412282526493073,0.028708001598715782,0.008224980905652046,0.03253042697906494 Statement 1| The F1 score can be especially useful for datasets with high class imbalance. Statement 2| The area under the ROC curve is one of the main metrics used to assess anomaly detectors.,"True, True","False, False","True, False","False, True",A,True,0.007496068719774485,0.007496068719774485,0.004546595737338066,0.005837944336235523 "Statement 1| The back-propagation algorithm learns a globally optimal neural network with hidden layers. Statement 2| The VC dimension of a line should be at most 2, since I can find at least one case of 3 points that cannot be shattered by any line.","True, True","False, False","True, False","False, True",B,True,0.006486046593636274,0.008328249678015709,0.005723916459828615,0.006486046593636274 High entropy means that the partitions in classification are,pure,not pure,useful,useless,B,True,0.01849903166294098,0.039162445813417435,0.01849903166294098,0.020962147042155266 "Statement 1| Layer Normalization is used in the original ResNet paper, not Batch Normalization. Statement 2| DCGANs use self-attention to stabilize training.","True, True","False, False","True, False","False, True",B,True,0.008398331701755524,0.00951655674725771,0.005772083066403866,0.008398331701755524 "In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively high negative value. 
This suggests that",This feature has a strong effect on the model (should be retained),This feature does not have a strong effect on the model (should be ignored),It is not possible to comment on the importance of this feature without additional information,Nothing can be determined.,C,False,0.06071256846189499,0.04728299379348755,0.013546805828809738,0.02530876360833645 "For a neural network, which one of these structural assumptions is the one that most affects the trade-off between underfitting (i.e. a high bias model) and overfitting (i.e. a high variance model):",The number of hidden nodes,The learning rate,The initial choice of weights,The use of a constant-term unit input,A,True,0.028634995222091675,0.02527029626071453,0.022300956770777702,0.022300956770777702 "For polynomial regression, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:",The polynomial degree,Whether we learn the weights by matrix inversion or gradient descent,The assumed variance of the Gaussian noise,The use of a constant-term unit input,A,True,0.03486732020974159,0.030770301818847656,0.017532389611005783,0.03486732020974159 "Statement 1| As of 2020, some models attain greater than 98% accuracy on CIFAR-10. 
Statement 2| The original ResNets were not optimized with the Adam optimizer.","True, True","False, False","True, False","False, True",A,False,0.0066759479232132435,0.008572087623178959,0.004588307812809944,0.005891503766179085 The K-means algorithm:,Requires the dimension of the feature space to be no bigger than the number of samples,Has the smallest value of the objective function when K = 1,Minimizes the within class variance for a given number of clusters,Converges to the global optimum if and only if the initial means are chosen as some of the samples themselves,C,False,0.029622497037053108,0.033566687256097794,0.029622497037053108,0.04883924126625061 Statement 1| VGGNets have convolutional kernels of smaller width and height than AlexNet's first-layer kernels. Statement 2| Data-dependent weight initialization procedures were introduced before Batch Normalization.,"True, True","False, False","True, False","False, True",A,False,0.009355410002171993,0.01361204031854868,0.007286000065505505,0.009355410002171993 "What is the rank of the following matrix? A = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]",0,1,2,3,B,True,0.02595275267958641,0.04278885945677757,0.029408320784568787,0.01574113965034485 "Statement 1| Density estimation (using say, the kernel density estimator) can be used to perform classification. Statement 2| The correspondence between logistic regression and Gaussian Naive Bayes (with identity class covariances) means that there is a one-to-one correspondence between the parameters of the two classifiers.","True, True","False, False","True, False","False, True",C,False,0.008289187215268612,0.010643526911735535,0.005027645733207464,0.010643526911735535 Suppose we would like to perform clustering on spatial data such as the geometrical locations of houses. We wish to produce clusters of many different sizes and shapes. 
Which of the following methods is the most appropriate?,Decision Trees,Density-based clustering,Model-based clustering,K-means clustering,B,False,0.015965567901730537,0.029827607795596123,0.03379910811781883,0.05572530999779701 "Statement 1| In AdaBoost weights of the misclassified examples go up by the same multiplicative factor. Statement 2| In AdaBoost, weighted training error e_t of the tth weak classifier on training data with weights D_t tends to increase as a function of t.","True, True","False, False","True, False","False, True",A,False,0.0066608283668756485,0.007547707762569189,0.004039997234940529,0.004577916115522385 MLE estimates are often undesirable because,they are biased,they have high variance,they are not consistent estimators,None of the above,B,False,0.025083720684051514,0.04686255753040314,0.0364965945482254,0.12738564610481262 "Computational complexity of Gradient descent is,",linear in D,linear in N,polynomial in D,dependent on the number of iterations,C,False,0.028647422790527344,0.032461781054735184,0.01968906633555889,0.025281261652708054 Averaging the output of multiple decision trees helps _.,Increase bias,Decrease bias,Increase variance,Decrease variance,D,False,0.03982022777199745,0.04512222483754158,0.021314231678843498,0.021314231678843498 The model obtained by applying linear regression on the identified subset of features may differ from the model obtained at the end of the process of identifying the subset during,Best-subset selection,Forward stepwise selection,Forward stage wise selection,All of the above,C,False,0.02434086613357067,0.02758181467652321,0.0214807391166687,0.04547472670674324 Neural networks:,Optimize a convex objective function,Can only be trained with stochastic gradient descent,Can use a mix of different activation functions,None of the above,C,False,0.009773345664143562,0.0076114884577691555,0.004914346616715193,0.026566704735159874 "Say the incidence of a disease D is about 5 cases per 100 people (i.e., P(D) = 
0.05). Let Boolean random variable D mean a patient “has disease D” and let Boolean random variable TP stand for ""tests positive."" Tests for disease D are known to be very accurate in the sense that the probability of testing positive when you have the disease is 0.99, and the probability of testing negative when you do not have the disease is 0.97. What is P(TP), the prior probability of testing positive.",0.0368,0.473,0.078,None of the above,C,False,0.02888551913201809,0.02888551913201809,0.022496065124869347,0.047624167054891586 "Statement 1| After mapped into feature space Q through a radial basis kernel function, 1-NN using unweighted Euclidean distance may be able to achieve better classification performance than in original space (though we can’t guarantee this). Statement 2| The VC dimension of a Perceptron is smaller than the VC dimension of a simple linear SVM.","True, True","False, False","True, False","False, True",B,True,0.0083885183557868,0.0095054367557168,0.0057653384283185005,0.0057653384283185005 The disadvantage of Grid search is,It can not be applied to non-differentiable functions.,It can not be applied to non-continuous functions.,It is hard to implement.,It runs reasonably slow for multiple linear regression.,D,True,0.03363887593150139,0.06284569203853607,0.05546112731099129,0.071213498711586 Predicting the amount of rainfall in a region based on various cues is a ______ problem.,Supervised learning,Unsupervised learning,Clustering,None of the above,A,False,0.022119274362921715,0.02506442368030548,0.010448405519127846,0.013416018337011337 Which of the following sentence is FALSE regarding regression?,It relates inputs to outputs.,It is used for prediction.,It may be used for interpretation.,It discovers causal relationships,D,True,0.026592347770929337,0.03414525091648102,0.0301330778747797,0.03869163617491722 Which one of the following is the main reason for pruning a Decision Tree?,To save computing time during testing,To save space for 
storing the Decision Tree,To make the training set error smaller,To avoid overfitting the training set,D,True,0.029143480584025383,0.03742096945643425,0.02571903169155121,0.04240351915359497 Statement 1| The kernel density estimator is equivalent to performing kernel regression with the value Yi = 1/n at each point Xi in the original data set. Statement 2| The depth of a learned decision tree can be larger than the number of training examples used to create the tree.,"True, True","False, False","True, False","False, True",B,True,0.007569699082523584,0.008577593602240086,0.0045912545174360275,0.006680235732346773 Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce the overfitting?,Increase the amount of training data.,Improve the optimisation algorithm being used for error minimisation.,Decrease the model complexity.,Reduce the noise in the training data.,B,False,0.0361868254840374,0.0361868254840374,0.04100504890084267,0.05265152081847191 Statement 1| The softmax function is commonly used in multiclass logistic regression. Statement 2| The temperature of a nonuniform softmax distribution affects its entropy.,"True, True","False, False","True, False","False, True",A,False,0.008359886705875397,0.009472993202507496,0.0065106856636703014,0.0065106856636703014 Which of the following is/are true regarding an SVM?,"For two dimensional data points, the separating hyperplane learnt by a linear SVM will be a straight line.","In theory, a Gaussian kernel SVM cannot model any complex separating hyperplane.","For every kernel function used in a SVM, one can obtain an equivalent closed form basis expansion.",Overfitting in an SVM is not a function of number of support vectors.,A,True,0.034091949462890625,0.009767508134245872,0.007146061863750219,0.014211639761924744 "Which of the following is the joint probability of H, U, P, and W described by the given Bayesian Network H -> U <- P <- W? 
[note: as the product of the conditional probabilities]","P(H, U, P, W) = P(H) * P(W) * P(P) * P(U)","P(H, U, P, W) = P(H) * P(W) * P(P | W) * P(W | H, P)","P(H, U, P, W) = P(H) * P(W) * P(P | W) * P(U | H, P)",None of the above,C,False,0.020757509395480156,0.034223347902297974,0.020757509395480156,0.034223347902297974 "Statement 1| Since the VC dimension for an SVM with a Radial Base Kernel is infinite, such an SVM must be worse than an SVM with polynomial kernel which has a finite VC dimension. Statement 2| A two layer neural network with linear activation functions is essentially a weighted combination of linear separators, trained on a given dataset; the boosting algorithm built on linear separators also finds a combination of linear separators, therefore these two algorithms will give the same result.","True, True","False, False","True, False","False, True",B,True,0.007466161157935858,0.008460269309580326,0.004528455901890993,0.004528455901890993 Statement 1| The ID3 algorithm is guaranteed to find the optimal decision tree. Statement 2| Consider a continuous probability distribution with density f() that is nonzero everywhere. 
The probability of a value x is equal to f(x).,"True, True","False, False","True, False","False, True",B,True,0.005850568879395723,0.007512279320508242,0.004556427709758282,0.005850568879395723 "Given a Neural Net with N input nodes, no hidden layers, one output node, with Entropy Loss and Sigmoid Activation Functions, which of the following algorithms (with the proper hyper-parameters and initialization) can be used to find the global optimum?",Stochastic Gradient Descent,Mini-Batch Gradient Descent,Batch Gradient Descent,All of the above,D,True,0.024749435484409332,0.03177890554070473,0.010317100211977959,0.03601021692156792 "Adding more basis functions in a linear model, pick the most probably option:",Decreases model bias,Decreases estimation bias,Decreases variance,Doesn’t affect bias and variance,A,False,0.0646430104970932,0.0646430104970932,0.05034402757883072,0.07325012981891632 Consider the Bayesian network given below. How many independent parameters would we need if we made no assumptions about independence or conditional independence H -> U <- P <- W?,3,4,7,15,D,True,0.012317998334765434,0.023013051599264145,0.020308945327997208,0.029549341648817062 Another term for out-of-distribution detection is?,anomaly detection,one-class detection,train-test mismatch robustness,background detection,A,True,0.025805694982409477,0.025805694982409477,0.00652470113709569,0.008377882651984692 "Statement 1| We learn a classifier f by boosting weak learners h. The functional form of f’s decision boundary is the same as h’s, but with different parameters. (e.g., if h was a linear classifier, then f is also a linear classifier). 
Statement 2| Cross validation can be used to select the number of iterations in boosting; this procedure may help reduce overfitting.","True, True","False, False","True, False","False, True",D,False,0.008495472371578217,0.009626630693674088,0.0045472984202206135,0.0045472984202206135 Statement 1| Highway networks were introduced after ResNets and eschew max pooling in favor of convolutions. Statement 2| DenseNets usually cost more memory than ResNets.,"True, True","False, False","True, False","False, True",D,False,0.008380785584449768,0.009496673941612244,0.005083202850073576,0.0065269614569842815 "If N is the number of instances in the training dataset, nearest neighbors has a classification run time of",O(1),O( N ),O(log N ),O( N^2 ),B,True,0.013438666239380836,0.017255587503314018,0.00634797615930438,0.011859580874443054 "Statement 1| The original ResNets and Transformers are feedforward neural networks. Statement 2| The original Transformers use self-attention, but the original ResNet does not.","True, True","False, False","True, False","False, True",A,False,0.008397446013987064,0.010782533325254917,0.005771473981440067,0.007410719525068998 "Statement 1| RELUs are not monotonic, but sigmoids are monotonic. 
Statement 2| Neural networks trained with gradient descent with high probability converge to the global optimum.","True, True","False, False","True, False","False, True",D,False,0.005920615047216415,0.00861444417387247,0.004610979463905096,0.007602219935506582 The numerical output of a sigmoid node in a neural network:,"Is unbounded, encompassing all real numbers.","Is unbounded, encompassing all integers.",Is bounded between 0 and 1.,Is bounded between -1 and 1.,C,True,0.006730129010975361,0.02833489701151848,0.0467163510620594,0.017185984179377556 Which of the following can only be used when training data are linearly separable?,Linear hard-margin SVM.,Linear Logistic Regression.,Linear Soft margin SVM.,The centroid method.,A,True,0.0358603298664093,0.03164663165807724,0.019194651395082474,0.03164663165807724 Which of the following are the spatial clustering algorithms?,Partitioning based clustering,K-means clustering,Grid based clustering,All of the above,D,True,0.017456023022532463,0.019780267030000687,0.010587613098323345,0.036954402923583984 Statement 1| The maximum margin decision boundaries that support vector machines construct have the lowest generalization error among all linear classifiers. Statement 2| Any decision boundary that we get from a generative model with classconditional Gaussian distributions could in principle be reproduced with an SVM and a polynomial kernel of degree less than or equal to three.,"True, True","False, False","True, False","False, True",D,False,0.00662295101210475,0.00750478683039546,0.0045518833212554455,0.005157959181815386 Statement 1| L2 regularization of linear models tends to make models more sparse than L1 regularization. 
Statement 2| Residual connections can be found in ResNets and Transformers.,"True, True","False, False","True, False","False, True",D,False,0.008511174470186234,0.009644423611462116,0.00662850821390748,0.008511174470186234 "Suppose we like to calculate P(H|E, F) and we have no conditional independence information. Which of the following sets of numbers are sufficient for the calculation?","P(E, F), P(H), P(E|H), P(F|H)","P(E, F), P(H), P(E, F|H)","P(H), P(E|H), P(F|H)","P(E, F), P(E|H), P(F|H)",B,False,0.04513576999306679,0.0511455312371254,0.02737622894346714,0.05795547738671303 Which among the following prevents overfitting when we perform bagging?,The use of sampling with replacement as the sampling technique,The use of weak classifiers,The use of classification algorithms which are not prone to overfitting,The practice of validation performed on every classifier trained,B,True,0.035117872059345245,0.04509223997592926,0.01658850722014904,0.027349824085831642 "Statement 1| PCA and Spectral Clustering (such as Andrew Ng’s) perform eigendecomposition on two different matrices. However, the size of these two matrices are the same. Statement 2| Since classification is a special case of regression, logistic regression is a special case of linear regression.","True, True","False, False","True, False","False, True",B,False,0.00945848785340786,0.00945848785340786,0.00506276311352849,0.007366277277469635 "Statement 1| The Stanford Sentiment Treebank contained movie reviews, not book reviews. Statement 2| The Penn Treebank has been used for language modeling.","True, True","False, False","True, False","False, True",A,False,0.006550196558237076,0.009530480951070786,0.005101298447698355,0.005780528299510479 "What is the dimensionality of the null space of the following matrix? 
A = [[3, 2, −9], [−6, −4, 18], [12, 8, −36]]",0,1,2,3,C,True,0.019275793805718422,0.03601192310452461,0.046240225434303284,0.015012003481388092 What are support vectors?,The examples farthest from the decision boundary.,The only examples necessary to compute f(x) in an SVM.,The data centroid.,All the examples that have a non-zero weight αk in a SVM.,B,False,0.03884962201118469,0.030256114900112152,0.01835126243531704,0.05652586743235588 Statement 1| Word2Vec parameters were not initialized using a Restricted Boltzman Machine. Statement 2| The tanh function is a nonlinear activation function.,"True, True","False, False","True, False","False, True",A,False,0.006626289803534746,0.008508325554430485,0.0058476803824305534,0.0075085703283548355 "If your training loss increases with number of epochs, which of the following could be a possible issue with the learning process?",Regularization is too low and model is overfitting,Regularization is too high and model is underfitting,Step size is too large,Step size is too small,C,False,0.037961557507514954,0.037961557507514954,0.020319359377026558,0.03350095823407173 "Say the incidence of a disease D is about 5 cases per 100 people (i.e., P(D) = 0.05). Let Boolean random variable D mean a patient “has disease D” and let Boolean random variable TP stand for ""tests positive."" Tests for disease D are known to be very accurate in the sense that the probability of testing positive when you have the disease is 0.99, and the probability of testing negative when you do not have the disease is 0.97. What is P(D | TP), the posterior probability that you have disease D when the test is positive?",0.0495,0.078,0.635,0.97,C,False,0.02234484814107418,0.025320028886198997,0.028691351413726807,0.04730404168367386 "Statement 1| Traditional machine learning results assume that the train and test sets are independent and identically distributed. 
Statement 2| In 2017, COCO models were usually pretrained on ImageNet.","True, True","False, False","True, False","False, True",A,False,0.004629272967576981,0.005944103933870792,0.00338684837333858,0.004629272967576981 "Statement 1| The values of the margins obtained by two different kernels K1(x, x0) and K2(x, x0) on the same training set do not tell us which classifier will perform better on the test set. Statement 2| The activation function of BERT is the GELU.","True, True","False, False","True, False","False, True",A,True,0.00942454393953085,0.008317130617797375,0.004451839253306389,0.0064773871563375 Which of the following is a clustering algorithm in machine learning?,Expectation Maximization,CART,Gaussian Naïve Bayes,Apriori,A,True,0.02543509006500244,0.02543509006500244,0.017481263726949692,0.022446388378739357 "You've just finished training a decision tree for spam classification, and it is getting abnormally bad performance on both your training and test sets. You know that your implementation has no bugs, so what could be causing the problem?",Your decision trees are too shallow.,You need to increase the learning rate.,You are overfitting.,None of the above.,A,False,0.011326703242957592,0.02717134915292263,0.05076276510953903,0.02397863194346428 K-fold cross-validation is,linear in K,quadratic in K,cubic in K,exponential in K,A,False,0.01928810030221939,0.024766409769654274,0.015021586790680885,0.01928810030221939 "Statement 1| Industrial-scale neural networks are normally trained on CPUs, not GPUs. 
Statement 2| The ResNet-50 model has over 1 billion parameters.","True, True","False, False","True, False","False, True",B,True,0.005854948423802853,0.008518899790942669,0.004283571615815163,0.0066345250234007835 "Given two Boolean random variables, A and B, where P(A) = 1/2, P(B) = 1/3, and P(A | ¬B) = 1/4, what is P(A | B)?",1/6,1/4,3/4,1,D,False,0.01848839782178402,0.01848839782178402,0.014398778788745403,0.01848839782178402 Existential risks posed by AI are most commonly associated with which of the following professors?,Nando de Freitas,Yann LeCun,Stuart Russell,Jitendra Malik,C,False,0.025624917820096016,0.02261390909552574,0.012104352936148643,0.02261390909552574 Statement 1| Maximizing the likelihood of a logistic regression model yields multiple local optima. Statement 2| No classifier can do better than a naive Bayes classifier if the distribution of the data is known.,"True, True","False, False","True, False","False, True",B,True,0.008335191756486893,0.00944500882178545,0.005728687159717083,0.007355780340731144 "For Kernel Regression, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:",Whether kernel function is Gaussian versus triangular versus box-shaped,Whether we use Euclidean versus L1 versus L∞ metrics,The kernel width,The maximum height of the kernel function,C,False,0.027784917503595352,0.021638916805386543,0.024520104750990868,0.03148443624377251 "Statement 1| The SVM learning algorithm is guaranteed to find the globally optimal hypothesis with respect to its objective function. 
Statement 2| After being mapped into feature space Q through a radial basis kernel function, a Perceptron may be able to achieve better classification performance than in its original space (though we can’t guarantee this).","True, True","False, False","True, False","False, True",A,False,0.010343422181904316,0.011720632202923298,0.007108923047780991,0.00805546622723341 "For a Gaussian Bayes classifier, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:",Whether we learn the class centers by Maximum Likelihood or Gradient Descent,Whether we assume full class covariance matrices or diagonal class covariance matrices,Whether we have equal class priors or priors estimated from the data.,Whether we allow classes to have different mean vectors or we force them to share the same mean vector,B,False,0.03293402120471001,0.03293402120471001,0.03293402120471001,0.04791871830821037 Statement 1| Overfitting is more likely when the set of training data is small. Statement 2| Overfitting is more likely when the hypothesis space is small.,"True, True","False, False","True, False","False, True",D,False,0.0066828494891524315,0.008580950088799,0.005204608663916588,0.0066828494891524315 "Statement 1| Besides EM, gradient descent can be used to perform inference or learning on Gaussian mixture model. Statement 2 | Assuming a fixed number of attributes, a Gaussian-based Bayes optimal classifier can be learned in time linear in the number of records in the dataset.","True, True","False, False","True, False","False, True",A,False,0.0059278481639921665,0.006717131473124027,0.004074146971106529,0.005231307353824377 "Statement 1| In a Bayesian network, the inference results of the junction tree algorithm are the same as the inference results of variable elimination. 
Statement 2| If two random variable X and Y are conditionally independent given another random variable Z, then in the corresponding Bayesian network, the nodes for X and Y are d-separated given Z.","True, True","False, False","True, False","False, True",C,False,0.00848615076392889,0.007489001378417015,0.004542308859527111,0.00514711020514369 "Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different clusters of such patients for which we might tailor separate treatments. What kind of learning problem is this?",Supervised learning,Unsupervised learning,Both (a) and (b),Neither (a) nor (b),B,False,0.022448088973760605,0.022448088973760605,0.00825819093734026,0.00825819093734026 What would you do in PCA to get the same projection as SVD?,Transform data to zero mean,Transform data to zero median,Not possible,None of these,A,False,0.019765831530094147,0.013584842905402184,0.009336717426776886,0.03692743182182312 "Statement 1| The training error of 1-nearest neighbor classifier is 0. Statement 2| As the number of data points grows to infinity, the MAP estimate approaches the MLE estimate for all possible priors. In other words, given enough data, the choice of prior is irrelevant.","True, True","False, False","True, False","False, True",C,False,0.007448041345924139,0.008439736440777779,0.004517465364187956,0.005800540093332529 "When doing least-squares regression with regularisation (assuming that the optimisation can be done exactly), increasing the value of the regularisation parameter λ the testing error.",will never decrease the training error.,will never increase the training error.,will never decrease the testing error.,will never increase,A,False,0.02978479489684105,0.05564532428979874,0.0630544126033783,0.07145000994205475 Which of the following best describes what discriminative approaches try to model? 
(w are the parameters in the model),"p(y|x, w)","p(y, x)","p(w|x, w)",None of the above,A,True,0.04464317858219147,0.02389577031135559,0.008790763095021248,0.04464317858219147 Statement 1| CIFAR-10 classification performance for convolutional neural networks can exceed 95%. Statement 2| Ensembles of neural networks do not improve classification accuracy since the representations they learn are highly correlated.,"True, True","False, False","True, False","False, True",C,False,0.008128383196890354,0.01043704990297556,0.01043704990297556,0.00921066477894783 Which of the following points would Bayesians and frequentists disagree on?,The use of a non-Gaussian noise model in probabilistic regression.,The use of probabilistic modelling for regression.,The use of prior distributions on the parameters in a probabilistic model.,The use of class priors in Gaussian Discriminant Analysis.,C,False,0.044645946472883224,0.03939991071820259,0.03939991071820259,0.044645946472883224 "Statement 1| The BLEU metric uses precision, while the ROUGE metric uses recall. Statement 2| Hidden Markov models were frequently used to model English sentences.","True, True","False, False","True, False","False, True",A,False,0.0093830656260252,0.012048093602061272,0.00644887937232852,0.0073075382970273495 Statement 1| ImageNet has images of various resolutions. Statement 2| Caltech-101 has more images than ImageNet.,"True, True","False, False","True, False","False, True",C,False,0.006478782743215561,0.009426574222743511,0.00504568126052618,0.006478782743215561 Which of the following is more appropriate to do feature selection?,Ridge,Lasso,both (a) and (b),neither (a) nor (b),B,False,0.019679274410009384,0.019679274410009384,0.013525353744626045,0.02229953743517399 Suppose you are given an EM algorithm that finds maximum likelihood estimates for a model with latent variables. You are asked to modify the algorithm so that it finds MAP estimates instead. 
Which step or steps do you need to modify?,Expectation,Maximization,No modification necessary,Both,B,True,0.009949712082743645,0.023868119344115257,0.006424017250537872,0.023868119344115257 "For a Gaussian Bayes classifier, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:",Whether we learn the class centers by Maximum Likelihood or Gradient Descent,Whether we assume full class covariance matrices or diagonal class covariance matrices,Whether we have equal class priors or priors estimated from the data,Whether we allow classes to have different mean vectors or we force them to share the same mean vector,B,False,0.026906101033091545,0.026906101033091545,0.026906101033091545,0.03914814442396164 "Statement 1| For any two variables x and y having joint distribution p(x, y), we always have H[x, y] ≥ H[x] + H[y] where H is entropy function. Statement 2| For some directed graphs, moralization decreases the number of edges present in the graph.","True, True","False, False","True, False","False, True",B,False,0.006806625053286552,0.005301004741340876,0.0036433241330087185,0.005301004741340876 Which of the following is NOT supervised learning?,PCA,Decision Tree,Linear Regression,Naive Bayesian,A,False,0.033930279314517975,0.043567340821027756,0.02642492577433586,0.043567340821027756 Statement 1| A neural network's convergence depends on the learning rate. 
Statement 2| Dropout multiplies randomly chosen activation values by zero.,"True, True","False, False","True, False","False, True",A,True,0.013489232398569584,0.011904205195605755,0.008181633427739143,0.009271005168557167 "Which one of the following is equal to P(A, B, C) given Boolean random variables A, B and C, and no independence or conditional independence assumptions between any of them?",P(A | B) * P(B | C) * P(C | A),"P(C | A, B) * P(A) * P(B)","P(A, B | C) * P(C)","P(A | B, C) * P(B | A, C) * P(C | A, B)",C,False,0.047638729214668274,0.03274158388376236,0.03274158388376236,0.07854298502206802 Which of the following tasks can be best solved using Clustering.,Predicting the amount of rainfall based on various cues,Detecting fraudulent credit card transactions,Training a robot to solve a maze,All of the above,B,False,0.0138056930154562,0.010751884430646896,0.00837357621639967,0.03311813622713089 "After applying a regularization penalty in linear regression, you find that some of the coefficients of w are zeroed out. Which of the following penalties might have been used?",L0 norm,L1 norm,L2 norm,either (a) or (b),D,True,0.018865585327148438,0.027449263259768486,0.011442555114626884,0.03524555265903473 "A and B are two events. If P(A, B) decreases while P(A) increases, which of the following is true?",P(A|B) decreases,P(B|A) decreases,P(B) decreases,All of above,B,False,0.0458485372364521,0.027808543294668198,0.027808543294668198,0.06670922785997391 "Statement 1| When learning an HMM for a fixed set of observations, assume we do not know the true number of hidden states (which is often the case), we can always increase the training data likelihood by permitting more hidden states. 
Statement 2| Collaborative filtering is often a useful model for modeling users' movie preference.","True, True","False, False","True, False","False, True",A,False,0.008349892683327198,0.010721474885940552,0.005064465571194887,0.00650290260091424 "You are training a linear regression model for a simple estimation task, and notice that the model is overfitting to the data. You decide to add in $\ell_2$ regularization to penalize the weights. As you increase the $\ell_2$ regularization coefficient, what will happen to the bias and variance of the model?",Bias increase ; Variance increase,Bias increase ; Variance decrease,Bias decrease ; Variance increase,Bias decrease ; Variance decrease,B,True,0.03488589823246002,0.12176375091075897,0.021159369498491287,0.02716916613280773 "Which PyTorch 1.8 command(s) produce $10\times 5$ Gaussian matrix with each entry i.i.d. sampled from $\mathcal{N}(\mu=5,\sigma^2=16)$ and a $10\times 10$ uniform matrix with each entry i.i.d. sampled from $U[-1,1)$?","\texttt{5 + torch.randn(10,5) * 16} ; \texttt{torch.rand(10,10,low=-1,high=1)}","\texttt{5 + torch.randn(10,5) * 16} ; \texttt{(torch.rand(10,10) - 0.5) / 0.5}","\texttt{5 + torch.randn(10,5) * 4} ; \texttt{2 * torch.rand(10,10) - 1}","\texttt{torch.normal(torch.ones(10,5)*5,torch.ones(5,5)*16)} ; \texttt{2 * torch.rand(10,10) - 1}",C,False,0.033230509608983994,0.022838972508907318,0.012224821373820305,0.020155323669314384 "Statement 1| The ReLU's gradient is zero for $x<0$, and the sigmoid gradient $\sigma(x)(1-\sigma(x))\le \frac{1}{4}$ for all $x$. 
Statement 2| The sigmoid has a continuous gradient and the ReLU has a discontinuous gradient.","True, True","False, False","True, False","False, True",A,False,0.01057335827499628,0.015384145081043243,0.006413065362721682,0.009330956265330315 Which is true about Batch Normalization?,"After applying batch normalization, the layer’s activations will follow a standard Gaussian distribution.",The bias parameter of affine layers becomes redundant if a batch normalization layer follows immediately afterward.,The standard weight initialization must be changed when using Batch Normalization.,Batch Normalization is equivalent to Layer Normalization for convolutional neural networks.,B,True,0.06552644073963165,0.07425118237733841,0.02731548435986042,0.05103204771876335 Suppose we have the following objective function: $\argmin_{w} \frac{1}{2} \norm{Xw-y}^2_2 + \frac{1}{2}\lambda \norm{w}^2_2$ What is the gradient of $\frac{1}{2} \norm{Xw-y}^2_2 + \frac{1}{2}\lambda \norm{w}^2_2$ with respect to $w$?,$\nabla_w f(w) = (X^\top X + \lambda I)w - X^\top y + \lambda w$,$\nabla_w f(w) = X^\top X w - X^\top y + \lambda$,$\nabla_w f(w) = X^\top X w - X^\top y + \lambda w$,$\nabla_w f(w) = X^\top X w - X^\top y + (\lambda+1) w$,C,False,0.09734734147787094,0.05210627242922783,0.05210627242922783,0.04598362743854523 Which of the following is true of a convolution kernel?,Convolving an image with $\begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$ would not change the image,Convolving an image with $\begin{bmatrix}0 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$ would not change the image,Convolving an image with $\begin{bmatrix}1 & 1 & 1\\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$ would not change the image,Convolving an image with $\begin{bmatrix}0 & 0 & 0\\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$ would not change the image,B,False,0.020475631579756737,0.018069680780172348,0.014072681777179241,0.018069680780172348 Which of the following is false?,"Semantic segmentation models
predict the class of each pixel, while multiclass image classifiers predict the class of the entire image.",A bounding box with an IoU (intersection over union) equal to $96\%$ would likely be considered a true positive.,"When a predicted bounding box does not correspond to any object in the scene, it is considered a false positive.",A bounding box with an IoU (intersection over union) equal to $3\%$ would likely be considered a false negative.,D,True,0.03361520543694496,0.03361520543694496,0.03361520543694496,0.04316278174519539 Which of the following is false?,"The following fully connected network without activation functions is linear: $g_3(g_2(g_1(x)))$, where $g_i(x) = W_i x$ and $W_i$ are matrices.","Leaky ReLU $\max\{0.01x,x\}$ is convex.",A combination of ReLUs such as $ReLU(x) - ReLU(x-1)$ is convex.,The loss $\log \sigma(x)= -\log(1+e^{-x})$ is concave,C,False,0.01943887211382389,0.024960007518529892,0.02202712744474411,0.032049283385276794 "We are training a fully connected network with two hidden layers to predict housing prices. Inputs are $100$-dimensional, and have several features such as the number of square feet, the median family income, etc. The first hidden layer has $1000$ activations. The second hidden layer has $10$ activations. The output is a scalar representing the house price. Assuming a vanilla network with affine transformations and with no batch normalization and no learnable parameters in the activation function, how many parameters does this network have?",111021,110010,111110,110011,A,False,0.017553316429257393,0.02253890410065651,0.01549074612557888,0.02894052490592003 Statement 1| The derivative of the sigmoid $\sigma(x)=(1+e^{-x})^{-1}$ with respect to $x$ is equal to $\text{Var}(B)$ where $B\sim \text{Bern}(\sigma(x))$ is a Bernoulli random variable.
Statement 2| Setting the bias parameters in each layer of a neural network to 0 changes the bias-variance trade-off such that the model's variance increases and the model's bias decreases,"True, True","False, False","True, False","False, True",C,False,0.012336432002484798,0.009607624262571335,0.006603216286748648,0.007482424378395081