1. |
A CUDA program consists of two primary components: a host and a _____. |
A. | gpu kernel |
B. | cpu kernel |
C. | os |
D. | none of the above |
Answer» A. gpu kernel |
2. |
The kernel code is identified by the ________ qualifier with void return type. |
A. | __host__ |
B. | __global__ |
C. | __device__ |
D. | void |
Answer» B. __global__ |
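For illustration, a minimal sketch of a kernel declaration (the kernel name is ours, not part of the question):

__global__ void myKernel( void )
{
    // device code: runs on the GPU, returns void
}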
3. |
The kernel code is only callable by the host |
A. | true |
B. | false |
Answer» A. true |
4. |
The kernel code is executable on the device and host |
A. | true |
B. | false |
Answer» B. false |
5. |
Calling a kernel is typically referred to as _________. |
A. | kernel thread |
B. | kernel initialization |
C. | kernel termination |
D. | kernel invocation |
Answer» D. kernel invocation |
6. |
Host code in a CUDA application can initialize a device. |
A. | true |
B. | false |
Answer» A. true |
7. |
Host code in a CUDA application can allocate GPU memory. |
A. | true |
B. | false |
Answer» A. true |
8. |
Host code in a CUDA application cannot invoke kernels. |
A. | true |
B. | false |
Answer» B. false |
9. |
CUDA offers the Chevron Syntax to configure and execute a kernel. |
A. | true |
B. | false |
Answer» A. true |
10. |
The BlockPerGrid and ThreadPerBlock parameters are related to the ________ model supported by CUDA. |
A. | host |
B. | kernel |
C. | thread abstraction |
D. | none of the above |
Answer» C. thread abstraction |
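As a hedged sketch of the chevron syntax from questions 9 and 10 (names and sizes are illustrative, not from the questions):

dim3 blocksPerGrid( 4 );       // BlockPerGrid: 4 blocks in the grid
dim3 threadsPerBlock( 256 );   // ThreadPerBlock: 256 threads in each block
myKernel<<< blocksPerGrid, threadsPerBlock >>>();   // configure and execute the kernel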
11. |
_________ is callable from the device only. |
A. | __host__ |
B. | __global__ |
C. | __device__ |
D. | none of the above |
Answer» C. __device__ |
12. |
______ is callable from the host and executes on the device. |
A. | __host__ |
B. | __global__ |
C. | __device__ |
D. | none of the above |
Answer» B. __global__ |
13. |
______ is callable from the host and executes on the host. |
A. | __host__ |
B. | __global__ |
C. | __device__ |
D. | none of the above |
Answer» A. __host__ |
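A minimal sketch contrasting the three qualifiers (function names are illustrative):

__device__ float onDevice( float x ) { return 2.0f * x; }  // callable from the device only
__host__   float onHost( float x )   { return 2.0f * x; }  // callable from the host, runs on the host
__global__ void kernel( void )                             // callable from the host, runs on the device
{
    float y = onDevice( 1.0f );   // __device__ functions are called from kernels
    (void)y;
}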
14. |
CUDA supports ____________, in which code in a single thread is executed by all other threads. |
A. | thread division |
B. | thread termination |
C. | thread abstraction |
D. | none of the above |
Answer» C. thread abstraction |
15. |
In CUDA, a single invoked kernel is referred to as a _____. |
A. | block |
B. | thread |
C. | grid |
D. | none of the above |
Answer» C. grid |
16. |
A grid is composed of ________ of threads. |
A. | blocks |
B. | bunches |
C. | hosts |
D. | none of the above |
Answer» A. blocks |
17. |
A block is composed of multiple _______. |
A. | threads |
B. | bunches |
C. | hosts |
D. | none of the above |
Answer» A. threads |
18. |
A solution to the problem of representing parallelism in an algorithm is |
A. | cud |
B. | pta |
C. | cda |
D. | cuda |
Answer» D. cuda |
19. |
Host code in a CUDA application cannot reset a device. |
A. | true |
B. | false |
Answer» B. false |
20. |
Host code in a CUDA application can transfer data to and from the device. |
A. | true |
B. | false |
Answer» A. true |
21. |
Host code in a CUDA application cannot deallocate memory on the GPU. |
A. | true |
B. | false |
Answer» B. false |
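A hedged host-side sketch covering the capabilities of questions 6-8 and 19-21 (variable and kernel names are illustrative):

__global__ void kernel( void ) { }               // illustrative empty kernel

int main( void )
{
    int a = 0;
    int *dev_a;
    cudaSetDevice( 0 );                          // initialize/select a device
    cudaMalloc( (void**)&dev_a, sizeof( int ) ); // allocate GPU memory
    cudaMemcpy( dev_a, &a, sizeof( int ), cudaMemcpyHostToDevice ); // transfer data to the device
    kernel<<< 1, 1 >>>();                        // invoke a kernel
    cudaMemcpy( &a, dev_a, sizeof( int ), cudaMemcpyDeviceToHost ); // transfer data back
    cudaFree( dev_a );                           // deallocate GPU memory
    cudaDeviceReset();                           // reset the device
    return 0;
}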
22. |
Any condition that causes a processor to stall is called a _____. |
A. | hazard |
B. | page fault |
C. | system error |
D. | none of the above |
Answer» A. hazard |
23. |
The time lost due to a branch instruction is often referred to as _____. |
A. | latency |
B. | delay |
C. | branch penalty |
D. | none of the above |
Answer» C. branch penalty |
24. |
The _____ method is used in centralized systems to perform out-of-order execution. |
A. | scorecard |
B. | scoreboarding |
C. | optimizing |
D. | redundancy |
Answer» B. scoreboarding |
25. |
The computer cluster architecture emerged as an alternative to ____. |
A. | isa |
B. | workstations |
C. | supercomputers |
D. | distributed systems |
Answer» C. supercomputers |
26. |
NVIDIA CUDA Warp is made up of how many threads? |
A. | 512 |
B. | 1024 |
C. | 312 |
D. | 32 |
Answer» D. 32 |
27. |
Out-of-order execution of instructions is not possible on GPUs. |
A. | true |
B. | false |
C. | -- |
D. | -- |
Answer» B. false |
28. |
CUDA supports programming in .... |
A. | c or c++ only |
B. | java, python, and more |
C. | c, c++, third-party wrappers for java, python, and more |
D. | pascal |
Answer» C. c, c++, third-party wrappers for java, python, and more |
29. |
FADD, FMAD, FMIN, FMAX are ----- supported by Scalar Processors of NVIDIA GPU. |
A. | 32-bit ieee floating point instructions |
B. | 32-bit integer instructions |
C. | both |
D. | none of the above |
Answer» A. 32-bit ieee floating point instructions |
30. |
Each streaming multiprocessor (SM) of CUDA hardware has ------ scalar processors (SP). |
A. | 1024 |
B. | 128 |
C. | 512 |
D. | 8 |
Answer» D. 8 |
31. |
Each NVIDIA GPU has ------ Streaming Multiprocessors |
A. | 8 |
B. | 1024 |
C. | 512 |
D. | 16 |
Answer» D. 16 |
32. |
CUDA provides ------- warp and thread scheduling. Also, the overhead of thread creation is on the order of ----. |
A. | “programming-overhead”, 2 clock |
B. | “zero-overhead”, 1 clock |
C. | 64, 2 clock |
D. | 32, 1 clock |
Answer» B. “zero-overhead”, 1 clock |
33. |
Each warp of a GPU receives a single instruction and “broadcasts” it to all of its threads. This is a ---- operation. |
A. | simd (single instruction multiple data) |
B. | simt (single instruction multiple thread) |
C. | sisd (single instruction single data) |
D. | sist (single instruction single thread) |
Answer» B. simt (single instruction multiple thread) |
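A small sketch of the SIMT model (illustrative kernel): all threads of a warp receive the same instruction, and a divergent branch within a warp is handled by serializing the two paths:

__global__ void simtExample( int *out )
{
    if ( threadIdx.x % 2 == 0 )
        out[threadIdx.x] = 1;   // even lanes of the warp take this path
    else
        out[threadIdx.x] = 2;   // odd lanes take this path; the warp serializes the branch
}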
34. |
Limitations of a CUDA kernel: |
A. | recursion, call stack, static variable declaration |
B. | no recursion, no call stack, no static variable declarations |
C. | recursion, no call stack, static variable declaration |
D. | no recursion, call stack, no static variable declarations |
Answer» B. no recursion, no call stack, no static variable declarations |
35. |
What is Unified Virtual Machine? |
A. | it is a technique that allows both cpu and gpu to read from a single virtual machine, simultaneously. |
B. | it is a technique for managing separate host and device memory spaces. |
C. | it is a technique for executing device code on host and host code on device. |
D. | it is a technique for executing general purpose programs on device instead of host. |
Answer» A. it is a technique that allows both cpu and gpu to read from a single virtual machine, simultaneously. |
36. |
_______ became the first language specifically designed by a GPU company to facilitate general-purpose computing on ____. |
A. | python, gpus. |
B. | c, cpus. |
C. | cuda c, gpus. |
D. | java, cpus. |
Answer» C. cuda c, gpus. |
37. |
The CUDA architecture consists of --------- for parallel computing kernels and functions. |
A. | risc instruction set architecture |
B. | cisc instruction set architecture |
C. | zisc instruction set architecture |
D. | ptx instruction set architecture |
Answer» D. ptx instruction set architecture |
38. |
CUDA stands for --------, designed by NVIDIA. |
A. | common union discrete architecture |
B. | complex unidentified device architecture |
C. | compute unified device architecture |
D. | complex unstructured distributed architecture |
Answer» C. compute unified device architecture |
39. |
The host processor spawns multithreaded tasks (or kernels, as they are known in CUDA) onto the GPU device. State true or false. |
A. | true |
B. | false |
C. | --- |
D. | --- |
Answer» A. true |
40. |
The NVIDIA G80 is a ---- CUDA core device, the NVIDIA G200 is a ---- CUDA core device, and the NVIDIA Fermi is a ---- CUDA core device. |
A. | 128, 256, 512 |
B. | 32, 64, 128 |
C. | 64, 128, 256 |
D. | 256, 512, 1024 |
Answer» A. 128, 256, 512 |
41. |
NVIDIA 8-series GPUs offer -------- . |
A. | 50-200 gflops |
B. | 200-400 gflops |
C. | 400-800 gflops |
D. | 800-1000 gflops |
Answer» A. 50-200 gflops |
42. |
IADD, IMUL24, IMAD24, IMIN, IMAX are ----------- supported by Scalar Processors of NVIDIA GPU. |
A. | 32-bit ieee floating point instructions |
B. | 32-bit integer instructions |
C. | both |
D. | none of the above |
Answer» B. 32-bit integer instructions |
43. |
CUDA Hardware programming model supports: |
A. | a,c,d,f |
B. | b,c,d,e |
C. | a,d,e,f |
D. | a,b,c,d,e,f |
Answer» D. a,b,c,d,e,f |
44. |
In the CUDA memory model, the following memory types are available: a) registers b) local memory c) shared memory d) global memory e) constant memory f) texture memory |
A. | a, b, d, f |
B. | a, c, d, e, f |
C. | a, b, c, d, e, f |
D. | b, c, e, f |
Answer» C. a, b, c, d, e, f |
45. |
What is the equivalent of the general C program in CUDA C: int main( void ) { printf("Hello, World!\n"); return 0; } |
A. | int main ( void ) { kernel <<<1,1>>>(); printf("hello, world!\n"); return 0; } |
B. | __global__ void kernel( void ) { } int main ( void ) { kernel <<<1,1>>>(); printf("hello, world!\n"); return 0; } |
C. | __global__ void kernel( void ) { kernel <<<1,1>>>(); printf("hello, world!\n"); return 0; } |
D. | __global__ int main ( void ) { kernel <<<1,1>>>(); printf("hello, world!\n"); return 0; } |
Answer» B. __global__ void kernel( void ) { } int main ( void ) { kernel <<<1,1>>>(); printf("hello, world!\n"); return 0; } |
46. |
Which function runs on the device (i.e., the GPU)? a) __global__ void kernel( void ) { } b) int main ( void ) { ... return 0; } |
A. | a |
B. | b |
C. | both a,b |
D. | --- |
Answer» A. a |
47. |
A simple kernel for adding two integers: __global__ void add( int *a, int *b, int *c ) { *c = *a + *b; } where __global__ is a CUDA C keyword which indicates that: |
A. | add() will execute on device, add() will be called from host |
B. | add() will execute on host, add() will be called from device |
C. | add() will be called and executed on host |
D. | add() will be called and executed on device |
Answer» A. add() will execute on device, add() will be called from host |
48. |
If a is a host variable and dev_a is a device (GPU) variable, select the correct statement to allocate memory for dev_a: |
A. | cudaMalloc( &dev_a, sizeof( int ) ) |
B. | malloc( &dev_a, sizeof( int ) ) |
C. | cudaMalloc( (void**) &dev_a, sizeof( int ) ) |
D. | malloc( (void**) &dev_a, sizeof( int ) ) |
Answer» C. cudaMalloc( (void**) &dev_a, sizeof( int ) ) |
49. |
If a is a host variable and dev_a is a device (GPU) variable, select the correct statement to copy input from variable a to variable dev_a: |
A. | memcpy( dev_a, &a, size); |
B. | cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice ); |
C. | memcpy( (void*) dev_a, &a, size); |
D. | cudaMemcpy( (void*) &dev_a, &a, size, cudaMemcpyDeviceToHost ); |
Answer» B. cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice ); |
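Putting questions 47-49 together, a hedged end-to-end sketch (values and names are illustrative):

__global__ void add( int *a, int *b, int *c ) { *c = *a + *b; }

int main( void )
{
    int a = 2, b = 7, c;
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc( (void**)&dev_a, sizeof( int ) );   // allocate device memory
    cudaMalloc( (void**)&dev_b, sizeof( int ) );
    cudaMalloc( (void**)&dev_c, sizeof( int ) );
    cudaMemcpy( dev_a, &a, sizeof( int ), cudaMemcpyHostToDevice );  // copy inputs to the device
    cudaMemcpy( dev_b, &b, sizeof( int ), cudaMemcpyHostToDevice );
    add<<< 1, 1 >>>( dev_a, dev_b, dev_c );        // kernel executes on the device
    cudaMemcpy( &c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost );  // copy result back; c == 9
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );         // free device memory
    return 0;
}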
50. |
The triple angle brackets mark a statement inside the main function; what does it indicate? |
A. | a call from host code to device code |
B. | a call from device code to host code |
C. | less than comparison |
D. | greater than comparison |
Answer» A. a call from host code to device code |
51. |
What makes CUDA code run in parallel? |
A. | __global__ indicates parallel execution of code |
B. | main() function indicates parallel execution of code |
C. | kernel name outside triple angle brackets indicates execution of the kernel n times in parallel |
D. | first parameter value inside triple angle brackets (n) indicates execution of the kernel n times in parallel |
Answer» D. first parameter value inside triple angle brackets (n) indicates execution of the kernel n times in parallel |
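A hedged sketch of the parallel form (N is illustrative): launching add<<<N,1>>> runs N copies of the kernel, one block per element, distinguished by blockIdx.x:

#define N 512

__global__ void add( int *a, int *b, int *c )
{
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];   // each block handles one element
}

// invoked from the host as: add<<< N, 1 >>>( dev_a, dev_b, dev_c );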
52. |
In ___________, the number of elements to be sorted is small enough to fit into the process's main memory. |
A. | internal sorting |
B. | internal searching |
C. | external sorting |
D. | external searching |
Answer» A. internal sorting |
53. |
______________ algorithms use auxiliary storage (such as tapes and hard disks) for sorting because the number of elements to be sorted is too large to fit into memory. |
A. | internal sorting |
B. | internal searching |
C. | external sorting |
D. | external searching |
Answer» C. external sorting |
54. |
______ can be comparison-based or non-comparison-based. |
A. | searching |
B. | sorting |
C. | both a and b |
D. | none of the above |
Answer» B. sorting |
55. |
The fundamental operation of comparison-based sorting is ________. |
A. | compare-exchange |
B. | searching |
C. | sorting |
D. | swapping |
Answer» A. compare-exchange |
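A minimal sketch of the compare-exchange operation (plain C, names are ours):

void compare_exchange( int *x, int *y )
{
    // after the call, *x <= *y: the basic step of comparison-based sorting
    if ( *x > *y ) {
        int tmp = *x;
        *x = *y;
        *y = tmp;
    }
}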
56. |
The complexity of bubble sort is Θ(n²). |
A. | true |
B. | false |
Answer» A. true |
57. |
Bubble sort is difficult to parallelize since the algorithm has no concurrency. |
A. | true |
B. | false |
Answer» A. true |
58. |
Quicksort is one of the most common sorting algorithms for sequential computers because of its simplicity, low overhead, and optimal average complexity. |
A. | true |
B. | false |
Answer» A. true |
59. |
The performance of quicksort depends critically on the quality of the ______. |
A. | non-pivot |
B. | pivot |
C. | center element |
D. | length of array |
Answer» B. pivot |
60. |
The average-case complexity of quicksort is O(n log n). |
A. | true |
B. | false |
Answer» A. true |
61. |
The main advantage of ______ is that its storage requirement is linear in the depth of the state space being searched. |
A. | bfs |
B. | dfs |
C. | a and b |
D. | none of the above |
Answer» B. dfs |
62. |
_____ algorithms use a heuristic to guide search. |
A. | bfs |
B. | dfs |
C. | a and b |
D. | none of the above |
Answer» A. bfs |
63. |
If the heuristic is admissible, BFS finds the optimal solution. |
A. | true |
B. | false |
Answer» A. true |
64. |
The search overhead factor of the parallel system is defined as the ratio of the work done by the parallel formulation to that done by the sequential formulation |
A. | true |
B. | false |
Answer» A. true |
65. |
The critical issue in parallel depth-first search algorithms is the distribution of the search space among the processors. |
A. | true |
B. | false |
Answer» A. true |
66. |
Graph search involves a closed list, where the major operation is a _______ |
A. | sorting |
B. | searching |
C. | lookup |
D. | none of the above |
Answer» C. lookup |
67. |
Breadth First Search is equivalent to which of the traversal in the Binary Trees? |
A. | pre-order traversal |
B. | post-order traversal |
C. | level-order traversal |
D. | in-order traversal |
Answer» C. level-order traversal |
68. |
What is the time complexity of Breadth First Search? (V – number of vertices, E – number of edges) |
A. | O(V + E) |
B. | O(V) |
C. | O(E) |
D. | O(V*E) |
Answer» A. O(V + E) |
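A hedged sketch of breadth-first search over an adjacency list (host-side C++, names are ours): each vertex is enqueued once and each edge scanned once, which is where the O(V + E) bound comes from, and the queue yields the level-order visit of question 67:

#include <queue>
#include <vector>

void bfs( const std::vector< std::vector<int> > &adj, int source )
{
    std::vector<bool> visited( adj.size(), false );
    std::queue<int> q;
    visited[source] = true;
    q.push( source );
    while ( !q.empty() ) {
        int u = q.front(); q.pop();   // visit u (level order)
        for ( int v : adj[u] ) {      // every edge is examined once
            if ( !visited[v] ) {
                visited[v] = true;    // every vertex is enqueued once
                q.push( v );
            }
        }
    }
}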
69. |
Which of the following is not an application of Breadth First Search? |
A. | when the graph is a binary tree |
B. | when the graph is a linked list |
C. | when the graph is a n-ary tree |
D. | when the graph is a ternary tree |
Answer» B. when the graph is a linked list |
70. |
In BFS, how many times is a node visited? |
A. | once |
B. | twice |
C. | equivalent to the indegree of the node |
D. | thrice |
Answer» C. equivalent to the indegree of the node |
71. |
Is Best First Search a searching algorithm used in graphs? |
A. | true |
B. | false |
Answer» A. true |
72. |
Which of the following is not a stable sorting algorithm in its typical implementation. |
A. | insertion sort |
B. | merge sort |
C. | quick sort |
D. | bubble sort |
Answer» C. quick sort |
73. |
Which of the following is not true about comparison-based sorting algorithms? |
A. | the minimum possible time complexity of a comparison-based sorting algorithm is O(n log n) for a random input array |
B. | any comparison-based sorting algorithm can be made stable by using position as a criterion when two elements are compared |
C. | counting sort is not a comparison-based sorting algorithm |
D. | heap sort is not a comparison-based sorting algorithm. |
Answer» D. heap sort is not a comparison-based sorting algorithm. |
74. |
Mathematically, efficiency is |
A. | E = S/p |
B. | E = p/S |
C. | E*S = p/2 |
D. | E = p + E/E |
Answer» A. E = S/p |
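As a worked example (the numbers are ours): a program that achieves a speedup of S = 6 on p = 8 processors has efficiency E = S/p = 6/8 = 0.75.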
75. |
The cost of a parallel system is sometimes referred to as ____. |
A. | work |
B. | processor-time product |
C. | both |
D. | none |
Answer» C. both |
76. |
In the scaling characteristics of parallel programs, the serial runtime Ts is |
A. | increasing |
B. | constant |
C. | decreasing |
D. | none |
Answer» B. constant |
77. |
Speedup tends to saturate and efficiency _____ as a consequence of Amdahl’s law. |
A. | increase |
B. | constant |
C. | decreases |
D. | none |
Answer» C. decreases |
78. |
Scaled speedup is obtained when the problem size is _______ linearly with the number of processing elements. |
A. | increased |
B. | constant |
C. | decreased |
D. | dependent on problem size |
Answer» A. increased |
79. |
The n × n matrix is partitioned among n processors, with each processor storing a complete ___ of the matrix. |
A. | row |
B. | column |
C. | both |
D. | depends on the processor |
Answer» A. row |
80. |
Cost-optimal parallel systems have an efficiency of ___. |
A. | 1 |
B. | n |
C. | log n |
D. | complex |
Answer» A. 1 |
81. |
The n × n matrix is partitioned among n² processors such that each processor owns a _____ element. |
A. | n |
B. | 2n |
C. | single |
D. | double |
Answer» C. single |
82. |
How many basic communication operations are used in matrix-vector multiplication? |
A. | 1 |
B. | 2 |
C. | 3 |
D. | 4 |
Answer» C. 3 |
83. |
The DNS algorithm of matrix multiplication uses |
A. | 1d partition |
B. | 2d partition |
C. | 3d partition |
D. | both a,b |
Answer» C. 3d partition |
84. |
In pipelined execution, the steps include |
A. | normalization |
B. | communication |
C. | elimination |
D. | all |
Answer» D. all |
85. |
The cost of the parallel algorithm is higher than the sequential run time by a factor of __. |
A. | 3/2 |
B. | 2/3 |
C. | 3*2 |
D. | 2/3 + 3/2 |
Answer» A. 3/2 |
86. |
The load imbalance problem in parallel Gaussian elimination can be alleviated by using a ____ mapping. |
A. | acyclic |
B. | cyclic |
C. | both |
D. | none |
Answer» B. cyclic |
87. |
A parallel algorithm is evaluated by its runtime as a function of |
A. | the input size, |
B. | the number of processors, |
C. | the communication parameters. |
D. | all |
Answer» D. all |
88. |
For a problem consisting of W units of work, p__W processors can be used optimally. |
A. | <= |
B. | >= |
C. | < |
D. | > |
Answer» A. <= |
89. |
C(W)__Θ(W) for optimality (necessary condition). |
A. | > |
B. | < |
C. | <= |
D. | equals |
Answer» D. equals |
90. |
Many interactions in practical parallel programs occur in a _____ pattern. |
A. | well-defined |
B. | zig-zag |
C. | reverse |
D. | straight |
Answer» A. well-defined |
91. |
Efficient implementation of basic communication operations can improve |
A. | performance |
B. | communication |
C. | algorithm |
D. | all |
Answer» A. performance |
92. |
Efficient use of basic communication operations can reduce |
A. | development effort |
B. | software quality |
C. | both |
D. | none |
Answer» A. development effort |
93. |
Group communication operations are built using _____ messaging primitives. |
A. | point-to-point |
B. | one-to-all |
C. | all-to-one |
D. | none |
Answer» A. point-to-point |
94. |
One processor has a piece of data and needs to send it to everyone. This is |
A. | one-to-all |
B. | all-to-one |
C. | point-to-point |
D. | all of the above |
Answer» A. one-to-all |
95. |
The dual of one-to-all broadcast is |
A. | all-to-one reduction |
B. | one-to-all reduction |
C. | point-to-point reduction |
D. | none |
Answer» A. all-to-one reduction |
96. |
Data items must be combined piece-wise and the result made available at |
A. | target processor finally |
B. | target variable finally |
C. | target receiver finally |
Answer» A. target processor finally |
97. |
The simplest way to send p-1 messages from the source to the other p-1 processors lacks |
A. | algorithm |
B. | communication |
C. | concurrency |
D. | receiver |
Answer» C. concurrency |
98. |
In an eight-node ring, node ____ is the source of the broadcast. |
A. | 1 |
B. | 2 |
C. | 8 |
D. | 0 |
Answer» D. 0 |
99. |
The processors compute the ______ product of the vector element and the local matrix. |
A. | local |
B. | global |
C. | both |
D. | none |
Answer» A. local |
100. |
One-to-all broadcast uses |
A. | recursive doubling |
B. | simple algorithm |
C. | both |
D. | none |
Answer» A. recursive doubling |
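A hedged sketch of recursive doubling for one-to-all broadcast from node 0 (send and recv are hypothetical primitives, not a specific library's API): in each of the log2(p) steps, the set of processors holding the message doubles:

void send( void *msg, int dest );     // assumed primitive: send msg to dest
void recv( void *msg, int source );   // assumed primitive: receive msg from source

void one_to_all_broadcast( int my_rank, int p, void *msg )
{
    for ( int mask = 1; mask < p; mask <<= 1 ) {
        if ( my_rank < mask )
            send( msg, my_rank + mask );   // already holds the data: forward it
        else if ( my_rank < 2 * mask )
            recv( msg, my_rank - mask );   // receive from the partner mask ranks lower
        // after this step, 2*mask processors hold the message
    }
}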