Q1 · a · 9 Marks

What is driving data deluge? Explain with one example.

Q1 · b · 9 Marks

What is data science? Differentiate between Business Intelligence and Data Science.

Q2 · a · 9 Marks

What are the sources of Big Data? Explain the model building phase with an example.

Q2 · b · 9 Marks

Explain big data analytics architecture with a diagram. What is the data discovery phase? Explain with an example.

Q3 · a · 8 Marks

Explain various data pre-processing steps. Discuss essential python libraries for preprocessing.

Q3 · b · 9 Marks

What are association rules? Explain Apriori Algorithm in brief.

Q4 · a · 8 Marks

Explain the following i) Linear Regression ii) Logistic Regression

Q4 · b · 9 Marks

Explain scikit-learn library for matplotlib with example.

Q5 · a · 9 Marks

Write short notes on i) Time series analysis ii) TF-IDF.

Q5 · b · 9 Marks

What is clustering? With a suitable example, explain the steps involved in the k-means algorithm.

Q6 · a · 9 Marks

Write short notes on i) Confusion matrix ii) AUC-ROC curve

Q6 · b · 9 Marks

Discuss Holdout method and Random Sub Sampling methods.

Q7 · a · 8 Marks

With a suitable example explain Histogram and explain its usages.

Q7 · b · 9 Marks

Describe the Data visualization tool “Tableau”. Explain its applications in brief.

Q8 · a · 8 Marks

With a suitable example explain and draw a Box plot and explain its usages.

Q8 · b · 9 Marks

Describe the challenges of data visualization. Draw box plot and explain its usages.

Subject Name	Data Science And Big Data Analytics
Semester	II
Pattern Year	2019
Subject Code	310251
Max Marks	70
Total Questions	8
Duration	2½ Hours
Paper Number	[5870] - 1133
Academic Year	T.E.
Branch Name	Computer Engineering
Exam Type	ENDSEM
Exam Session	2022 May Jun Endsem

Q1 · a · 9 Marks

What is Model Building? Elaborate on this phase of data analytics with the help of a suitable example.

Q1 · b · 8 Marks

Explain any three sources of Big Data. Differentiate between BI and Data Science.

Q2 · a · 8 Marks

What are the three characteristics of Big Data, and what are the main considerations in processing Big Data?

Q2 · b · 9 Marks

Explain Descriptive, Diagnostic, Predictive analytics.

Q3 · a · 9 Marks

Explain why decision trees are used. Draw a sample decision tree and explain its parts.

Q3 · b · 9 Marks

How does the Apriori algorithm work? Explain with a suitable example.

Q4 · a · 9 Marks

What is data preprocessing? Explain in detail the handling of missing data and the transformation of data.

Q4 · b · 9 Marks

Explain Naïve Bayes’ classifier and its applications.

Q5 · a · 8 Marks

What is text processing? Explain TF-IDF with example.

Q5 · b · 9 Marks

With a suitable example, explain the steps involved in the k-means algorithm.

Q6 · a · 8 Marks

Define following terms with respect to confusion matrix : i) Accuracy ii) Precision iii) Recall iv) AUC-ROC

Q6 · b · 9 Marks

Explain k-fold Cross Validation & Random Subsampling.

Q7 · a · 9 Marks

With a suitable example, draw a histogram and a box plot and explain their usages.

Q7 · b · 9 Marks

Describe the data visualization tool Tableau. List other data visualization tools.

Q8 · a · 9 Marks

What is Data Visualization? Describe the challenges of data visualization.

Q8 · b · 9 Marks

Explain architecture of Apache-Pig.

Subject Name	Data Science And Big Data Analytics
Semester	II
Pattern Year	2019
Subject Code	310251
Max Marks	70
Total Questions	8
Duration	2½ Hours
Paper Number	[6003]-354
Academic Year	T.E.
Branch Name	Computer Engineering
Exam Type	ENDSEM
Exam Session	2023 May Jun Endsem

Q1 · a · 8 Marks

What is the data preparation phase in the Data Analytics Lifecycle? What are the Analytics Sandbox and the ETLT process in this phase?

Q1 · b · 8 Marks

List out different stakeholders of an analytics project. What do they usually expect at the conclusion (key outputs) of a project?

Q2 · a · 8 Marks

List out the activities to be carried out in model planning and model building phase. What are different tools used for these phases?

Q2 · b · 8 Marks

What is linear regression, and what are its primary objectives? What is the difference between simple linear regression and multiple linear regression? How do you evaluate the performance of linear regression?

Q3 · a · 9 Marks

What is logistic regression, and how does it differ from linear regression? What is the sigmoid function, and what role does it play in logistic regression?

Q3 · b · 9 Marks

Suppose you are given a dataset containing information about whether emails are spam or not spam, along with two features: the presence of the word "offer" (1 for present, 0 for absent) and the presence of the word "free" (1 for present, 0 for absent). You are tasked with classifying a new email with the following feature values: "offer"=1 and "free"=1. Given the training dataset:

Email	Offer	Free	Spam
1	1	0	No
2	0	1	Yes
3	1	1	Yes
4	0	1	No
5	1	1	Yes

Calculate the probability that the new email is spam using Naive Bayes.
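One way to check a hand calculation for this question is a short script. The sketch below is only an illustration: it assumes plain Naive Bayes with no Laplace smoothing, and the names `TRAINING` and `naive_bayes_posterior` are hypothetical helpers, not part of the question.

```python
from fractions import Fraction

# Training rows copied from the question: (offer, free, spam_label)
TRAINING = [
    (1, 0, "No"),
    (0, 1, "Yes"),
    (1, 1, "Yes"),
    (0, 1, "No"),
    (1, 1, "Yes"),
]


def naive_bayes_posterior(rows, offer, free):
    """Return P(label | offer, free) for each label, without smoothing."""
    scores = {}
    for label in ("Yes", "No"):
        subset = [r for r in rows if r[2] == label]
        prior = Fraction(len(subset), len(rows))
        # Class-conditional likelihoods, treating the two features as independent.
        p_offer = Fraction(sum(1 for r in subset if r[0] == offer), len(subset))
        p_free = Fraction(sum(1 for r in subset if r[1] == free), len(subset))
        scores[label] = prior * p_offer * p_free
    # Normalise so the two posteriors sum to 1.
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


posterior = naive_bayes_posterior(TRAINING, offer=1, free=1)
print(posterior["Yes"])  # 4/5
```

Under these assumptions the unnormalised scores are 2/5 for spam and 1/10 for not spam, so the posterior probability of spam is 0.8.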

Q4 · a · 9 Marks

How does the Apriori algorithm discover frequent itemsets in a dataset? What is the role of support and confidence in the context of association rule mining using the Apriori algorithm?

Q4 · b · 9 Marks

Explain the process of building a decision tree. What are the criteria used for splitting nodes in a decision tree?

Q5 · a · 9 Marks

Suppose you have the following dataset containing the coordinates of points in a 2-dimensional space:

Point	X Coordinate	Y Coordinate
A	2	3
B	4	7
C	3	5
D	6	9
E	8	6
F	7	8

Perform K-means clustering on this dataset with K = 2. Assume the initial centroids to be (2,3) and (8,6). Compute the new centroids after each iteration until convergence, and assign points to their nearest centroids.
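The iterations asked for here can be verified with a small plain-Python sketch. It assumes squared Euclidean distance and breaks ties in favour of the lower-indexed centroid; the function name `kmeans` and its return shape are illustrative choices, not anything the question prescribes.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(
                range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                + (p[1] - centroids[k][1]) ** 2,
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # No point changed cluster, so the algorithm has converged.
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    (
                        sum(m[0] for m in members) / len(members),
                        sum(m[1] for m in members) / len(members),
                    )
                )
            else:
                new_centroids.append(centroids[k])  # Keep an empty cluster in place.
        centroids = new_centroids
    return centroids, assignment


# Points A..F and the initial centroids from the question.
points = [(2, 3), (4, 7), (3, 5), (6, 9), (8, 6), (7, 8)]
centroids, assignment = kmeans(points, [(2, 3), (8, 6)])
print(centroids)   # [(2.5, 4.0), (6.25, 7.5)]
print(assignment)  # [0, 1, 0, 1, 1, 1]
```

With these inputs the first pass groups A and C together and B, D, E, F together; the second pass leaves the assignment unchanged, so the centroids settle at (2.5, 4) and (6.25, 7.5).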

Q5 · b · 9 Marks

How do you handle noise and irrelevant information in text data during preprocessing? Explain the terms bag of words and TF IDF in text analytics.

Q6 · a · 9 Marks

Explain, with a suitable example, how hierarchical clustering can be used for visualizing hierarchical relationships in data. What are some real-world applications of hierarchical clustering?

Q6 · b · 9 Marks

What is the holdout method, and how does it work? Explain the difference between training set, validation set, and test set in the holdout method.

Q7 · a · 9 Marks

What is a histogram? How is it used to visualize the distribution of data? How is it different from a density plot?

Q7 · b · 9 Marks

What is the Hadoop ecosystem, and what are its primary components? What is MapReduce, and how does it fit into the Hadoop ecosystem?

Q8 · a · 9 Marks

What is a box plot? Explain the different components of a box plot. How do you interpret the median, quartiles, and whiskers in a box plot? What does the interquartile range (IQR) represent in a box plot?

Q8 · b · 9 Marks

Explain the role of Apache Pig in data processing workflows on Hadoop. What is Apache Spark, and how does it complement Hadoop for big data processing?

Subject Name	Data Science And Big Data Analytics
Semester	II
Pattern Year	2019
Subject Code	310251
Max Marks	70
Total Questions	8
Duration	2½ Hours
Paper Number	[6262]-43
Academic Year	T.E.
Branch Name	Computer Engineering
Exam Type	ENDSEM
Exam Session	2024 May Jun Endsem

Q1 · a · 8 Marks

What is Model Building? Elaborate on this phase of data analytics with the help of a suitable example.

Q1 · b · 9 Marks

List out different stakeholders of an analytics project. What do they usually expect at the conclusion (key outputs) of a project?

Q2 · a · 8 Marks

Explain Descriptive, Diagnostic, Predictive analytics.

Q2 · b · 9 Marks

List and explain the various activities involved in identifying potential data resources as a part of the discovery phase in the Data Analytics Life Cycle.

Q3 · a · 9 Marks

What is association rule mining? Describe the working of the Apriori algorithm with an example.

Q3 · b · 9 Marks

Explain how decision trees are constructed using information gain and entropy. Illustrate with a small example.

Q4 · a · 9 Marks

Explain Naïve Bayes’ classifier and its applications.

Q4 · b · 9 Marks

Consider a dataset with binary classes and two features: “Loan Amount” and “Default History.” Show how logistic regression could be applied for loan default prediction.

Q5 · a · 8 Marks

Explain the holdout method. Differentiate training set, validation set, and test set.

Q5 · b · 9 Marks

Given the confusion matrix below, calculate Accuracy, Precision, Recall and F1-score.

	Predicted Yes	Predicted No
Actual Yes	70	30
Actual No	20	80
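The four metrics follow directly from the cell counts. The sketch below assumes "Yes" is the positive class, so TP = 70, FN = 30, FP = 20 and TN = 80; the helper name `classification_metrics` is hypothetical.

```python
from fractions import Fraction


def classification_metrics(tp, fn, fp, tn):
    """Standard metrics computed from the four confusion-matrix cells."""
    total = tp + fn + fp + tn
    accuracy = Fraction(tp + tn, total)   # correct predictions / all predictions
    precision = Fraction(tp, tp + fp)     # predicted positives that are correct
    recall = Fraction(tp, tp + fn)        # actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Assumption: "Yes" is treated as the positive class.
metrics = classification_metrics(tp=70, fn=30, fp=20, tn=80)
for name, value in metrics.items():
    print(f"{name}: {float(value):.4f}")
```

Under that assumption the values come out as accuracy 0.75, precision 7/9 (about 0.778), recall 0.70 and F1-score 14/19 (about 0.737).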

Q6 · a · 8 Marks

Explain the following Text Analysis steps with suitable example i) Part-of-speech (POS) tagging ii) Lemmatization

Q6 · b · 9 Marks

Use K-Means Clustering for the following points and determine the centroids after one iteration. Assume initial centroids as A(1,1), B(5,7). Points: (1,2), (2,1), (3,5), (6,8), (7,6), (5,5)

Q7 · a · 9 Marks

Explain Hadoop Architecture with a neat diagram. Highlight the roles of NameNode and DataNode.

Q7 · b · 9 Marks

Compare Tableau, Power BI, and Matplotlib for data visualization. Discuss scenarios where each tool is best suited.

Q8 · a · 9 Marks

What is Data Visualization? Describe the challenges of data visualization.

Q8 · b · 9 Marks

Write short notes on the following : i) Map Reduce ii) HDFS iii) Hive

Subject Name	Data Science And Big Data Analytics
Semester	VI
Pattern Year	2019
Subject Code	310251
Max Marks	70
Total Questions	8
Duration	2½ Hours
Paper Number	[6403]-43
Academic Year	T.E.
Branch Name	Computer Engineering
Exam Type	ENDSEM
Exam Session	2025 May Jun Endsem

Q1 · a · 8 Marks

Draw the diagram of data analytics life cycle in big data and briefly explain its phases.

Q1 · b · 9 Marks

Explain in detail how the model building phase is carried out by the team in the data analytics life cycle.

Q2 · a · 8 Marks

List and explain the steps in data preparation phase of data analytics life cycle.

Q2 · b · 9 Marks

Write short note on the following: i) ETL ii) Common tools for the model building. iii) Model selection for data analytics.

Q3 · a · 9 Marks

What are the types of analytics in big data? Explain in brief.

Q3 · b · 9 Marks

Calculate the support and confidence value for all the possible item sets.

Transaction ID	Items
1	Onion, Potato, Cold Drink
2	Onion, Burger, Cold Drink
3	Eggs, Onion, Cold Drink
4	Potato, Milk, Eggs
5	Potato, Burger, Cold Drink, Milk, Eggs

Q4 · a · 9 Marks

Explain the use of logistic function in logistic regression in detail.

Q4 · b · 9 Marks

Write short note on the following: i) Removing duplicates from data set. ii) Handling missing data iii) Data transformation.

Q5 · a · 9 Marks

Suppose that the task is to cluster points (with (x, y) representing location) into three clusters, where the points are A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). The distance function is Euclidean distance. Suppose initially we assign A1, B1 and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only the three cluster centers after the first round of execution, with steps.

Q5 · b · 8 Marks

Explain the following text analysis steps with suitable example. i) Part of speech (POS) tagging ii) Lemmatization iii) Stemming

Q6 · a · 8 Marks

Given the confusion matrix, calculate accuracy, precision, recall and error rate with description on heart attack risk. (Confusion matrix provided with Heart-Attack Risk Yes/No rows and columns.)

Q6 · b · 9 Marks

Explain the TF/IDF (term frequency-inverse document frequency) terms in text analysis with suitable example.

Q7 · a · 9 Marks

List the data visualization tools and discuss any four applications of data visualization along with the use of the suitable plot.

Q7 · b · 9 Marks

List the challenges of data visualization. Explain the types of visualization with an example.

Q8 · a · 9 Marks

Explain in detail the Hadoop Ecosystem with a suitable diagram.

Q8 · b · 9 Marks

Write a short note on the following i) Map reduce. ii) Pig iii) Hive

Subject Name	Data Science And Big Data Analytics
Semester	II
Pattern Year	2019
Subject Code	310251
Max Marks	70
Total Questions	8
Duration	2½ Hours
Paper Number	[5926]-65
Academic Year	T.E.
Branch Name	Computer Engg.
Exam Type	ENDSEM
Exam Session	2022 Nov Dec Endsem

Q1 · a · 8 Marks

Explain Data Analytics Cycle with suitable diagram and its phases.

Q1 · b · 9 Marks

List and Explain the various activities involved in identifying potential data resources as a part of discovery phase in Data Analytics Life Cycle?

Q2 · a · 8 Marks

List and explain the key roles for successful analytics project.

Q2 · b · 9 Marks

Write short note on : i) Common Tools for the Model Building ii) Model selection for Data Analytics

Q3 · a · 9 Marks

List and explain the various types of analytics in Big data.

Q3 · b · 9 Marks

Calculate the support and confidence value for all the possible item sets.

Transaction ID	Items bought
1	Onion, Potato, Cold Drink
2	Onion, Burger, Cold Drink
3	Eggs, Onion, Cold Drink
4	Potato, Milk, Eggs
5	Potato, Burger, Cold Drink, Milk, Eggs
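Support and confidence for any itemset in this question can be computed mechanically. The following is a minimal sketch, assuming support is the fraction of transactions containing the itemset and confidence(A -> B) is support(A and B together) divided by support(A); the names `TRANSACTIONS`, `support` and `confidence` are illustrative.

```python
from fractions import Fraction

# The five transactions from the question, one set of items each.
TRANSACTIONS = [
    {"Onion", "Potato", "Cold Drink"},
    {"Onion", "Burger", "Cold Drink"},
    {"Eggs", "Onion", "Cold Drink"},
    {"Potato", "Milk", "Eggs"},
    {"Potato", "Burger", "Cold Drink", "Milk", "Eggs"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"Onion", "Cold Drink"}))       # 3/5
print(confidence({"Onion"}, {"Cold Drink"}))  # 1
print(confidence({"Cold Drink"}, {"Onion"}))  # 3/4
```

For example, every transaction containing Onion also contains Cold Drink, so the rule Onion -> Cold Drink has confidence 1, while the reverse rule reaches only 3/4. Looping `support` over `itertools.combinations` of the six items would cover all possible item sets.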

Q4 · a · 9 Marks

Explain the need of logistic regression along with its various types.

Q4 · b · 9 Marks

Explain the following terms with suitable example. i) Removing Duplicates from dataset. ii) Handling Missing Data

Q5 · a · 8 Marks

Suppose that the task is to cluster points (with (x, y) representing location) into three clusters, where the points are A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9). The distance function is Euclidean distance. Suppose initially we assign A1, B1 and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only the first round of execution with the cluster centers.

Q5 · b · 9 Marks

Explain the following Text Analysis steps with suitable example i) Part-of-speech(POS)tagging ii) Lemmatization

Q6 · a · 8 Marks

Given the confusion matrix, calculate Accuracy, Precision, Recall and Error rate with description on Diabetic Risk.

Actual class \ Predicted class	Diabetic Risk - Yes	Diabetic Risk - No
Diabetic Risk - Yes	90	210
Diabetic Risk - No	140	9560

Q6 · b · 9 Marks

Explain the Text Preprocessing steps with suitable example.

Q7 · a · 9 Marks

List a few data visualization tools and discuss any four applications of data visualization along with the use of the various plots with Python/R or a suitable tool.

Q7 · b · 9 Marks

List the challenges of Data Visualization. Explain the types of visualization with example.

Q8 · a · 9 Marks

Explain in detail the Hadoop Ecosystem with suitable diagram along with the various components.

Q8 · b · 9 Marks

Write a short note on the following. a) Map Reduce b) Pig

Subject Name	Data Science And Big Data Analytics
Semester	II
Pattern Year	2019
Subject Code	310251
Max Marks	70
Total Questions	8
Duration	2½ Hours
Paper Number	[6180]-53
Academic Year	T.E. (Computer Engineering)
Branch Name	Computer Engineering
Exam Type	ENDSEM
Exam Session	2023 Nov Dec Endsem

Q1 · a · 6 Marks

Draw data analytics life cycle diagram and briefly explain its phases.

Q1 · b · 6 Marks

Explain the various key roles for a successful analytics project.

Q1 · c · 6 Marks

What are the various sources of Big Data?

Q2 · a · 6 Marks

Describe a few applications of Big Data Analytics.

Q2 · b · 6 Marks

Explain Data preparation phase of data analytics lifecycle.

Q2 · c · 6 Marks

List common tools used for the model building phase of data analytics.

Q3 · a · 9 Marks

Define and explain Entropy and Information gain. Calculate the entropy of the following distribution:

Fruit Color	Taste	Count
Yellow	Sweet	10
Red	Sweet	5
Green	Sour	15
Orange	Sour	5
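The question does not say which attribute the entropy is taken over. The sketch below assumes Taste is the class label (Sweet = 10 + 5, Sour = 15 + 5) and uses base-2 logarithms; the function name `entropy` is an illustrative choice, and the same function works for the four colour counts.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    # Zero counts contribute nothing, so they are skipped to avoid log2(0).
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Assumption: Taste is the class label, so counts are pooled per taste.
taste_counts = {"Sweet": 10 + 5, "Sour": 15 + 5}
h_taste = entropy(taste_counts.values())
print(round(h_taste, 4))  # 0.9852
```

With 15 sweet and 20 sour fruits out of 35, the entropy is about 0.985 bits, close to the maximum of 1 bit for a two-class split.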

Q3 · b · 8 Marks

Explain Naïve Bayes Classifier.

Q4 · a · 9 Marks

Explain Apriori algorithm with suitable example.

Q4 · b · 8 Marks

Describe different categories of analytics.

Q5 · a · 9 Marks

What is Hierarchical clustering? Explain hierarchical clustering algorithms.

Q5 · b · 8 Marks

Write a note on: i) Holdout method ii) k-Fold Cross-Validation

Q6 · a · 9 Marks

What is Text analysis? Explain the different steps involved in the text analysis.

Q6 · b · 8 Marks

Write a note on Social network analysis. What are the applications of Social network analysis?

Q7 · a · 9 Marks

Explain Hive architecture with suitable diagram. Describe characteristics and features of hive.

Q7 · b · 9 Marks

Describe the data visualization tool Tableau.

Q8 · a · 9 Marks

What is data visualization, and what are its objectives? Why is it difficult to visualize Big Data?

Q8 · b · 9 Marks

Write a note on Microsoft Power BI and Qlik.

Subject Name	Data Science And Big Data Analytics
Semester	II
Pattern Year	2019
Subject Code	310251
Max Marks	70
Total Questions	8
Duration	2½ Hours
Paper Number	[6353]-43
Academic Year	T.E.
Branch Name	Computer Engineering
Exam Type	ENDSEM
Exam Session	2024 Nov Dec Endsem

Q1 · a · 4 Marks

What is dimensionality reduction, and what are its benefits?

Q1 · b · 5 Marks

What is data wrangling? Why do you need it?

Q1 · c · 6 Marks

What is regression? Explain different types of regression with example.

Q2 · a · 4 Marks

Differentiate between Data Science, Machine Learning and AI.

Q2 · b · 5 Marks

What does feature engineering typically include?

Q2 · c · 6 Marks

What is data discretization? Explain the forms of data discretization.

Q3 · a · 4 Marks

Write a short note on contingency table, explain with example.

Q3 · b · 5 Marks

With an example, explain Bayes' theorem. Also explain its key terms.

Q3 · c · 6 Marks

Is there a correlation between the variables in the following data set? Hours: 9, 15, 25, 14, 10, 18, 19, 16, 20, 18; Marks: 39, 56, 93, 61, 50, 75, 42, 70, 66, 32
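A quick numerical check for this question is the Pearson correlation coefficient. The sketch below assumes Pearson's r is the intended measure and uses the ten value pairs exactly as printed above; `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


hours = [9, 15, 25, 14, 10, 18, 19, 16, 20, 18]
marks = [39, 56, 93, 61, 50, 75, 42, 70, 66, 32]
r = pearson_r(hours, marks)
print(round(r, 3))
```

For the data as listed, r comes out at roughly 0.58, which points to a moderate positive correlation between hours and marks.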

Q4 · a · 4 Marks

What is a population, and how does it differ from a sample?

Q4 · b · 5 Marks

With an example, explain one-tailed & two-tailed t-tests.

Q4 · c · 6 Marks

Describe the Chi-Square Test of Independence.

Subject Name	Data Science And Big Data Analytics
Semester	II
Pattern Year	2019
Subject Code	310251
Max Marks	30
Total Questions	4
Duration	1 Hour
Paper Number	[6009]-322
Academic Year	T.E.
Branch Name	Computer Engineering
Exam Type	INSEM
Exam Session	2023 Feb Insem

Q1 · a · 5 Marks

Explain data wrangling methods with suitable example.

Q1 · b · 5 Marks

Suppose that the data for analysis includes the attribute age, given the following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. i) Use smoothing by bin means, using a bin depth of 3. ii) What other methods are there for data smoothing?
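Part i) can be checked with a few lines of code. This sketch assumes equal-frequency (equal-depth) binning over the sorted values, with every value replaced by the mean of its bin; the function name `smooth_by_bin_means` is illustrative.

```python
def smooth_by_bin_means(sorted_values, depth):
    """Replace each value with the mean of its equal-depth bin."""
    smoothed = []
    for start in range(0, len(sorted_values), depth):
        bin_values = sorted_values[start:start + depth]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
smoothed = smooth_by_bin_means(ages, depth=3)

# Print one smoothed value per bin (27 values give 9 bins of depth 3).
print([round(v, 2) for v in smoothed[::3]])
# [14.67, 18.33, 21.0, 24.0, 26.67, 33.67, 35.0, 40.33, 56.0]
```

The last bin (46, 52, 70) shows the effect most clearly: the outlier 70 is pulled down to the bin mean of 56.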

Q1 · c · 5 Marks

What is data science? Compare data science and information science.

Q2 · a · 5 Marks

Explain 5 V’s of Big Data.

Q2 · b · 5 Marks

Explain different phases of data analytics life cycle with neat diagram.

Q2 · c · 5 Marks

Compare Business Intelligence and data science.

Q3 · a · 5 Marks

Explain skewness and kurtosis. What is the purpose of finding skewness of data?

Q3 · b · 5 Marks

What is degree of freedom? Explain with example.

Q3 · c · 5 Marks

How does hypothesis testing work? Explain the steps.

Q4 · a · 5 Marks

List out measures of dispersion with their significance and mathematical formulae.

Q4 · b · 5 Marks

Describe Chi-square Goodness of Fit test.

Q4 · c · 5 Marks

Assume that a patient X took a lab test for a certain disease and tested positive. The lab test returns a positive result in 95% of the cases in which the disease is actually present, and it falsely returns a positive result in 6% of the cases in which the disease is not present. Furthermore, only 1% of the entire population has this disease. What is the probability that X actually has the disease, given that he tested positive?
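This is a direct application of Bayes' theorem, P(D | +) = P(+ | D) P(D) / [P(+ | D) P(D) + P(+ | not D) P(not D)]. A minimal sketch using exact fractions follows; the function name `posterior_given_positive` is hypothetical.

```python
from fractions import Fraction


def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior                   # P(+ | D) * P(D)
    false_pos = false_positive_rate * (1 - prior)    # P(+ | not D) * P(not D)
    return true_pos / (true_pos + false_pos)


p = posterior_given_positive(
    prior=Fraction(1, 100),                # 1% of the population has the disease
    sensitivity=Fraction(95, 100),         # positive in 95% of true cases
    false_positive_rate=Fraction(6, 100),  # positive in 6% of healthy cases
)
print(p, float(p))  # 95/689, approximately 0.1379
```

Despite the positive result, the probability that X has the disease is only about 13.8%, because the disease is rare and false positives outnumber true positives.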

Subject Name	Data Science And Big Data Analytics
Semester	II
Pattern Year	2019
Subject Code	310251
Max Marks	30
Total Questions	4
Duration	1 Hour
Paper Number	[6269]-317
Academic Year	T.E.
Branch Name	Computer Engineering
Exam Type	INSEM
Exam Session	2024 March Insem

Data Science And Big Data Analytics Question Papers - SPPU University

