Access and download Data Science And Big Data Analytics question papers from Savitribai Phule Pune University (SPPU). Our collection includes INSEM (Internal Semester) and ENDSEM (End Semester) exam papers.
We offer 9 question papers for Data Science And Big Data Analytics, covering various exam patterns and years. All papers are in PDF format for easy viewing and download.
Prepare for mid-term evaluations with Data Science And Big Data Analytics INSEM papers, aligned with the SPPU exam pattern and syllabus.
Access Data Science And Big Data Analytics ENDSEM papers covering the entire syllabus, essential for final exam preparation.
Data Science And Big Data Analytics is a key subject in the SPPU curriculum. Our question paper collection helps students understand exam patterns, practice effectively, and improve academic performance.
What is driving data deluge? Explain with one example.
What is data science? Differentiate between Business Intelligence and Data Science.
What are the sources of Big Data? Explain the model building phase with an example.
Explain big data analytics architecture with a diagram. What is the data discovery phase? Explain with an example.
Explain various data pre-processing steps. Discuss essential python libraries for preprocessing.
What are association rules? Explain Apriori Algorithm in brief.
Explain the following i) Linear Regression ii) Logistic Regression
Explain scikit-learn library for matplotlib with example.
Write short notes on i) Time series Analysis ii) TF-IDF.
What is clustering? With a suitable example, explain the steps involved in the k-means algorithm.
Write short notes on i) Confusion matrix ii) AUC-ROC curve
Discuss Holdout method and Random Sub Sampling methods.
With a suitable example explain Histogram and explain its usages.
Describe the Data visualization tool “Tableau”. Explain its applications in brief.
With a suitable example explain and draw a Box plot and explain its usages.
Describe the challenges of data visualization. Draw box plot and explain its usages.
| Subject Name | Data Science And Big Data Analytics |
|---|---|
| Semester | II |
| Pattern Year | 2019 |
| Subject Code | 310251 |
| Max Marks | 70 |
| Total Questions | 8 |
| Duration | 2½ Hours |
| Paper Number | [5870] - 1133 |
| Academic Year | T.E. |
| Branch Name | Computer Engineering |
| Exam Type | ENDSEM |
| Exam Session | 2022 May Jun Endsem |
| Watermark | ['CEGP013091', '49.248.216.238 29/06/2022 08:36:57 static-238'] |
What is Model Building? Elaborate on this phase of data analytics with the help of a suitable example.
Explain any three sources of Big Data. Differentiate between BI and Data Science.
What are the three characteristics of Big Data, and what are the main considerations in processing Big Data?
Explain Descriptive, Diagnostic, Predictive analytics.
Explain why decision trees are used. Draw a sample decision tree and explain its parts.
How Apriori Algorithm works, explain with suitable example?
What is data preprocessing? Explain in details about handling missing data and transformation of data.
Explain Naïve Bayes’ classifier and its applications.
What is text processing? Explain TF-IDF with example.
With a suitable example, explain the steps involved in the k-means algorithm.
Define the following terms with respect to the confusion matrix: i) Accuracy ii) Precision iii) Recall iv) AUC-ROC
Explain k-fold Cross Validation & Random Subsampling.
With a suitable example, draw a Histogram, boxplot and explain its usages.
Describe the data visualization tool Tableau. List other data visualization tools.
What is Data Visualization? Describe the challenges of data visualization.
Explain architecture of Apache-Pig.
| Subject Name | Data Science And Big Data Analytics |
|---|---|
| Semester | II |
| Pattern Year | 2019 |
| Subject Code | 310251 |
| Max Marks | 70 |
| Total Questions | 8 |
| Duration | 2½ Hours |
| Paper Number | [6003]-354 |
| Academic Year | T.E. |
| Branch Name | Computer Engineering |
| Exam Type | ENDSEM |
| Exam Session | 2023 May Jun Endsem |
| Watermark | ['CEGP013091', '49.248.216.238 21/06/2023 10:32:30 static-238'] |
What is the Data Preparation phase in the Data Analytics Lifecycle? What are the Analytics Sandbox and the ETLT process in this phase?
List out the different stakeholders of an analytics project. What do they usually expect at the conclusion (key outputs) of a project?
List out the activities to be carried out in model planning and model building phase. What are different tools used for these phases?
What is linear regression, and what are its primary objectives? What is the difference between simple linear regression and multiple linear regression? How do you evaluate the performance of linear regression?
What is logistic regression, and how does it differ from linear regression? What is the sigmoid function, and what role does it play in logistic regression?
Suppose you are given a dataset containing information about whether emails are spam or not spam, along with two features: the presence of the word "offer" (1 for present, 0 for absent) and the presence of the word "free" (1 for present, 0 for absent). You are tasked with classifying a new email with the following feature values: "offer"=1 and "free"=1. Given the training dataset:

| Email | Offer | Free | Spam |
|---|---|---|---|
| 1 | 1 | 0 | No |
| 2 | 0 | 1 | Yes |
| 3 | 1 | 1 | Yes |
| 4 | 0 | 1 | No |
| 5 | 1 | 1 | Yes |

Calculate the probability that the new email is spam using Naive Bayes.
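
As a hedged illustration of the calculation the spam question above asks for, the sketch below applies Naive Bayes with plain relative frequencies and no Laplace smoothing. That choice is an assumption, since the question does not say whether smoothing is expected. The function name `naive_bayes_posterior` is ours and not part of any library.

```python
from fractions import Fraction

# Training rows from the question: (offer, free, spam_label)
rows = [
    (1, 0, "No"),
    (0, 1, "Yes"),
    (1, 1, "Yes"),
    (0, 1, "No"),
    (1, 1, "Yes"),
]

def naive_bayes_posterior(rows, offer, free):
    """Return P(label | offer, free) per label, assuming no smoothing."""
    scores = {}
    for label in ("Yes", "No"):
        subset = [r for r in rows if r[2] == label]
        prior = Fraction(len(subset), len(rows))
        # Conditional likelihoods, treating the two features as independent
        p_offer = Fraction(sum(r[0] == offer for r in subset), len(subset))
        p_free = Fraction(sum(r[1] == free for r in subset), len(subset))
        scores[label] = prior * p_offer * p_free
    total = sum(scores.values())
    # Normalise so the two posteriors sum to 1
    return {label: score / total for label, score in scores.items()}

posterior = naive_bayes_posterior(rows, offer=1, free=1)
print(posterior["Yes"])  # 4/5
```

The unnormalised scores are 3/5 × 2/3 × 3/3 = 2/5 for spam and 2/5 × 1/2 × 1/2 = 1/10 for not spam, so the normalised probability of spam is 0.8.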
How does the Apriori algorithm discover frequent itemsets in a dataset? What is the role of support and confidence in the context of association rule mining using the Apriori algorithm?
Explain the process of building a decision tree? What are the criteria used for splitting nodes in a decision tree?
Suppose you have the following dataset containing the coordinates of points in a 2-dimensional space:

| Point | X Coordinate | Y Coordinate |
|---|---|---|
| A | 2 | 3 |
| B | 4 | 7 |
| C | 3 | 5 |
| D | 6 | 9 |
| E | 8 | 6 |
| F | 7 | 8 |

Perform K-means clustering on this dataset with K = 2. Assume the initial centroids to be (2,3) and (8,6). Compute the new centroids after each iteration until convergence, and assign points to their nearest centroids.
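
The K-means question above can be checked with a short sketch along the following lines. It assumes Euclidean distance and the usual assign-then-average loop; the helper name `kmeans` is hypothetical, and keeping the old centroid for an empty cluster is our own choice.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, recompute means, repeat until stable."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of the cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments no longer move the centroids
        centroids = new_centroids
    return centroids, clusters

points = [(2, 3), (4, 7), (3, 5), (6, 9), (8, 6), (7, 8)]  # A..F
centroids, clusters = kmeans(points, [(2, 3), (8, 6)])
print(centroids)  # [(2.5, 4.0), (6.25, 7.5)]
print(clusters)   # [[(2, 3), (3, 5)], [(4, 7), (6, 9), (8, 6), (7, 8)]]
```

With these starting centroids the first pass already groups A and C together and B, D, E, F together; the second pass leaves the assignment unchanged, so the loop stops.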
How do you handle noise and irrelevant information in text data during preprocessing? Explain the terms bag of words and TF-IDF in text analytics.
Explain how hierarchical clustering can be used for visualizing hierarchical relationships in data with suitable example? What are some real-world applications of hierarchical clustering?
What is the holdout method, and how does it work? Explain the difference between training set, validation set, and test set in the holdout method.
What is a histogram? How is it used to visualize the distribution of data? How is it different from a density plot?
What is the Hadoop ecosystem, and what are its primary components? What is MapReduce, and how does it fit into the Hadoop ecosystem?
What is a box plot? Explain the different components of a box plot? How do you interpret the median, quartiles, and whiskers in a box plot? What does the interquartile range (IQR) represent in a box plot?
Explain the role of Apache Pig in data processing workflows on Hadoop? What is Apache Spark, and how does it complement Hadoop for big data processing?
| Subject Name | Data Science And Big Data Analytics |
|---|---|
| Semester | II |
| Pattern Year | 2019 |
| Subject Code | 310251 |
| Max Marks | 70 |
| Total Questions | 8 |
| Duration | 2½ Hours |
| Paper Number | [6262]-43 |
| Academic Year | T.E. |
| Branch Name | Computer Engineering |
| Exam Type | ENDSEM |
| Exam Session | 2024 May Jun Endsem |
| Watermark | ['CEGP013091', '49.248.216.238 15/05/2024 09:42:06 static-238'] |
What is Model Building? Elaborate on this phase of data analytics with the help of a suitable example.
List out different stakeholders of an analytics project. What do they usually expect at the conclusion (key outputs) of a project?
Explain Descriptive, Diagnostic, Predictive analytics.
List and explain the various activities involved in identifying potential data resources as a part of discovery phase in Data Analytics Life Cycle?
What is association rule mining? Describe the working of the Apriori algorithm with an example.
Explain how decision trees are constructed using information gain and entropy. Illustrate with a small example.
Explain Naïve Bayes’ classifier and its applications.
Consider a dataset with binary classes and two features: “Loan Amount” and “Default History.” Show how logistic regression could be applied for loan default prediction.
Explain the holdout method. Differentiate training set, validation set, and test set.
Given the confusion matrix below:

| | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 70 | 30 |
| Actual No | 20 | 80 |

Calculate Accuracy, Precision, Recall and F1-score.
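
For the confusion-matrix question above, the four metrics follow directly from the cell counts. The sketch below treats "Yes" as the positive class (an assumption) and uses a hypothetical helper named `classification_metrics`.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard metrics from the four cells of a binary confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of predicted positives, how many are right
    recall = tp / (tp + fn)      # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Actual Yes row: 70 predicted Yes, 30 predicted No
# Actual No row:  20 predicted Yes, 80 predicted No
metrics = classification_metrics(tp=70, fn=30, fp=20, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

This gives accuracy 150/200 = 0.75, precision 70/90 ≈ 0.778, recall 70/100 = 0.70, and F1 = 140/190 ≈ 0.737.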
Explain the following Text Analysis steps with suitable example i) Part-of-speech (POS) tagging ii) Lemmatization
Use K-Means Clustering for the following points and determine the centroids after one iteration. Assume initial centroids as A(1,1), B(5,7). Points: (1,2), (2,1), (3,5), (6,8), (7,6), (5,5)
Explain Hadoop Architecture with a neat diagram. Highlight the roles of NameNode and DataNode.
Compare Tableau, Power BI, and Matplotlib for data visualization. Discuss scenarios where each tool is best suited.
What is Data Visualization? Describe the challenges of data visualization.
Write short notes on the following : i) Map Reduce ii) HDFS iii) Hive
| Subject Name | Data Science And Big Data Analytics |
|---|---|
| Semester | VI |
| Pattern Year | 2019 |
| Subject Code | 310251 |
| Max Marks | 70 |
| Total Questions | 8 |
| Duration | 2½ Hours |
| Paper Number | [6403]-43 |
| Academic Year | T.E. |
| Branch Name | Computer Engineering |
| Exam Type | ENDSEM |
| Exam Session | 2025 May Jun Endsem |
| Watermark | ['CEGP013091', '49.248.216.237 26/05/2025 09:37:42 static-237'] |
Draw the diagram of data analytics life cycle in big data and briefly explain its phases.
Explain in detail how the team carries out the model building phase in the data analytics life cycle.
List and explain the steps in data preparation phase of data analytics life cycle.
Write short note on the following: i) ETL ii) Common tools for the model building. iii) Model selection for data analytics.
What are the types of analytics in big data? Explain in brief.
Consider the following transactions:

| Transaction ID | Items bought |
|---|---|
| 1 | Onion, Potato, Cold drink |
| 2 | Onion, Burger, Cold drink |
| 3 | Eggs, Onion, Cold drink |
| 4 | Potato, Milk, Eggs |
| 5 | Potato, Burger, Cold drink, Milk, Eggs |

Calculate the support and confidence value for all the possible item sets.
Explain the use of logistic function in logistic regression in detail.
Write short note on the following: i) Removing duplicates from data set. ii) Handling missing data iii) Data transformation.
Suppose that, for the given data, the task is to cluster points (with (x, y) representing location) into three clusters, where the points are A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9). The distance function is Euclidean distance. Suppose initially we assign A1, B1 and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only the three cluster centers after the first round of execution, with steps.
Explain the following text analysis steps with suitable example. i) Part of speech (POS) tagging ii) Lemmatization iii) Stemming
Given the confusion matrix, calculate Accuracy, Precision, Recall and Error rate, with description, on heart attack risk. (Confusion matrix provided with Heart-Attack Risk Yes/No rows and columns.)
Explain the TF/IDF (term frequency-inverse document frequency) terms in text analysis with suitable example.
List the data visualization tools and discuss any four applications of data visualization along with the use of the suitable plot.
List the challenges of data visualization. Explain the types of visualization with example.
Explain in detail the Hadoop Ecosystem with suitable diagram
Write a short note on the following i) Map reduce. ii) Pig iii) Hive
| Subject Name | Data Science And Big Data Analytics |
|---|---|
| Semester | II |
| Pattern Year | 2019 |
| Subject Code | 310251 |
| Max Marks | 70 |
| Total Questions | 8 |
| Duration | 2½ Hours |
| Paper Number | [5926]-65 |
| Academic Year | T.E. |
| Branch Name | Computer Engineering |
| Exam Type | ENDSEM |
| Exam Session | 2022 Nov Dec Endsem |
| Watermark | ['CEGP013091', '49.248.216.238 14/01/2023 09:40:46 static-238'] |
Explain Data Analytics Cycle with suitable diagram and its phases.
List and Explain the various activities involved in identifying potential data resources as a part of discovery phase in Data Analytics Life Cycle?
List and explain the key roles for successful analytics project.
Write short note on : i) Common Tools for the Model Building ii) Model selection for Data Analytics
List and explain the various types of analytics in Big data.
Consider the following transactions:

| Transaction ID | Items bought |
|---|---|
| 1 | Onion, Potato, Cold Drink |
| 2 | Onion, Burger, Cold Drink |
| 3 | Eggs, Onion, Cold Drink |
| 4 | Potato, Milk, Eggs |
| 5 | Potato, Burger, Cold Drink, Milk, Eggs |

Calculate the support and confidence value for all the possible item sets.
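
A minimal sketch of the support and confidence calculation for the five transactions above is given below. It uses exact fractions to avoid rounding noise; the function names `support` and `confidence` are ours, and listing only 1- and 2-item sets is a simplification, since the question asks for all item sets.

```python
from fractions import Fraction
from itertools import combinations

transactions = [
    {"Onion", "Potato", "Cold Drink"},
    {"Onion", "Burger", "Cold Drink"},
    {"Eggs", "Onion", "Cold Drink"},
    {"Potato", "Milk", "Eggs"},
    {"Potato", "Burger", "Cold Drink", "Milk", "Eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(itemset <= t for t in transactions)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

items = sorted(set().union(*transactions))
for size in (1, 2):
    for itemset in combinations(items, size):
        s = support(itemset)
        if s > 0:
            print(itemset, "support =", s)

print(confidence({"Onion"}, {"Cold Drink"}))   # 1
print(confidence({"Cold Drink"}, {"Onion"}))   # 3/4
```

For instance, {Onion, Cold Drink} appears in 3 of 5 transactions, so the rule Onion → Cold Drink has confidence (3/5)/(3/5) = 1, while Cold Drink → Onion has confidence (3/5)/(4/5) = 3/4.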
Explain the need of logistic regression along with its various types.
Explain the following terms with suitable example. i) Removing Duplicates from dataset. ii) Handling Missing Data
Suppose that, for the given data, the task is to cluster points (with (x, y) representing location) into three clusters, where the points are A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9). The distance function is Euclidean distance. Suppose initially we assign A1, B1 and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only the first round of execution with the cluster centers.
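
The first round asked for in the eight-point question above can be sketched as follows, assuming Euclidean distance as stated. The helper `one_round` is a hypothetical name, and it relies on every cluster being non-empty, which holds here because each initial center is itself one of the points.

```python
import math

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4),
    "B1": (5, 8), "B2": (7, 5), "B3": (6, 4),
    "C1": (1, 2), "C2": (4, 9),
}
initial_centers = [points["A1"], points["B1"], points["C1"]]

def one_round(points, centers):
    """One k-means round: assign points to the nearest center, then average."""
    clusters = [[] for _ in centers]
    for name, p in points.items():
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[idx].append(name)
    new_centers = []
    for names in clusters:
        xs = [points[n][0] for n in names]
        ys = [points[n][1] for n in names]
        new_centers.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return clusters, new_centers

clusters, new_centers = one_round(points, initial_centers)
print(clusters)     # [['A1'], ['A3', 'B1', 'B2', 'B3', 'C2'], ['A2', 'C1']]
print(new_centers)  # [(2.0, 10.0), (6.0, 6.0), (1.5, 3.5)]
```

After this round the centers are (2, 10), (6, 6) and (1.5, 3.5).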
Explain the following Text Analysis steps with suitable example i) Part-of-speech (POS) tagging ii) Lemmatization
Given the following confusion matrix on Diabetic Risk (rows are actual classes, columns are predicted classes):

| | Predicted Diabetic Risk - Yes | Predicted Diabetic Risk - No |
|---|---|---|
| Actual Diabetic Risk - Yes | 90 | 210 |
| Actual Diabetic Risk - No | 140 | 9560 |

Calculate Accuracy, Precision, Recall and Error rate, with description.
Explain the Text Preprocessing steps with suitable example.
List a few data visualization tools and discuss any four applications of data visualization along with the use of the various plots with Python/R or a suitable tool.
List the challenges of Data Visualization. Explain the types of visualization with example.
Explain in detail the Hadoop Ecosystem with suitable diagram along with the various components.
Write a short note on the following. a) Map Reduce b) Pig
| Subject Name | Data Science And Big Data Analytics |
|---|---|
| Semester | II |
| Pattern Year | 2019 |
| Subject Code | 310251 |
| Max Marks | 70 |
| Total Questions | 8 |
| Duration | 2½ Hours |
| Paper Number | [6180]-53 |
| Academic Year | T.E. (Computer Engineering) |
| Branch Name | Computer Engineering |
| Exam Type | ENDSEM |
| Exam Session | 2023 Nov Dec Endsem |
| Watermark | ['CEGP013091', '49.248.216.238 12/12/2023 09:53:35 static-238'] |
Draw data analytics life cycle diagram and briefly explain its phases.
Explain the various key roles for a successful analytics project.
What are the various sources of Big Data?
Describe a few applications of Big Data Analytics.
Explain the Data preparation phase of the data analytics lifecycle.
List common tools used for the model building phase of data analytics.
Define and explain Entropy and Information gain. Calculate the entropy of the following distribution:

| Fruit Color | Taste | Count |
|---|---|---|
| Yellow | Sweet | 10 |
| Red | Sweet | 5 |
| Green | Sour | 15 |
| Orange | Sour | 5 |

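
The entropy question above does not say which variable the entropy should be taken over, so the sketch below computes it two ways under stated assumptions: over the four fruit colours, and over the two taste classes (Sweet vs. Sour), which is the usual choice if Taste is the target of a decision tree. Logarithms are base 2, so the results are in bits.

```python
import math

def entropy(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

rows = [
    ("Yellow", "Sweet", 10),
    ("Red", "Sweet", 5),
    ("Green", "Sour", 15),
    ("Orange", "Sour", 5),
]

# Interpretation 1 (assumption): distribution over the four colours
color_entropy = entropy([count for _, _, count in rows])

# Interpretation 2 (assumption): distribution over the taste classes
taste_totals = {}
for _, taste, count in rows:
    taste_totals[taste] = taste_totals.get(taste, 0) + count
taste_entropy = entropy(list(taste_totals.values()))

print(f"colour entropy: {color_entropy:.3f} bits")
print(f"taste entropy:  {taste_entropy:.3f} bits")
```

With 35 fruits in total, the colour proportions are 10/35, 5/35, 15/35 and 5/35, giving roughly 1.842 bits; the taste proportions are 15/35 and 20/35, giving roughly 0.985 bits.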
Explain Naïve Bayes Classifier.
Explain Apriori algorithm with suitable example.
Describe different categories of analytics
What is Hierarchical clustering? Explain hierarchical clustering algorithms.
Write a note on: i) Holdout method ii) k-Fold Cross-Validation
What is Text analysis? Explain the different steps involved in the text analysis.
Write a note on Social network analysis. What are the applications of Social network analysis?
Explain Hive architecture with suitable diagram. Describe characteristics and features of hive.
Describe the data visualization tool Tableau.
What is data visualization, and what are its objectives? Why is it difficult to visualize Big Data?
Write a note on Microsoft Power BI and Qlik
| Subject Name | Data Science And Big Data Analytics |
|---|---|
| Semester | II |
| Pattern Year | 2019 |
| Subject Code | 310251 |
| Max Marks | 70 |
| Total Questions | 8 |
| Duration | 2½ Hours |
| Paper Number | [6353]-43 |
| Academic Year | T.E. |
| Branch Name | Computer Engineering |
| Exam Type | ENDSEM |
| Exam Session | 2024 Nov Dec Endsem |
| Watermark | ['CEGP013091', '49.248.216.237 25/11/2024 09:43:38 static-237'] |
What is dimensionality reduction, and what are its benefits?
What is data wrangling? Why do you need it?
What is regression? Explain different types of regression with example.
Differentiate between Data Science, Machine Learning and AI.
What does feature engineering typically include?
What is Data Discretization? Explain the forms of data discretization.
Write a short note on contingency table, explain with example.
With an example, explain Bayes' theorem. Also explain its key terms.
Is there a correlation between the variables in the following data set? Hours: 9, 15, 25, 14, 10, 18, 19, 16, 20, 18; Marks: 39, 56, 93, 61, 50, 75, 42, 70, 66, 32
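
One way to approach the correlation question above is Pearson's correlation coefficient; choosing Pearson rather than a rank-based measure such as Spearman's is an assumption. The sketch computes it from first principles with a hypothetical helper `pearson_r`.

```python
import math

hours = [9, 15, 25, 14, 10, 18, 19, 16, 20, 18]
marks = [39, 56, 93, 61, 50, 75, 42, 70, 66, 32]

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(hours, marks)
print(f"r = {r:.3f}")
```

Here the means are 16.4 hours and 58.4 marks, the sum of cross-deviations is 456.4, and the sums of squared deviations are 202.4 and 3090.4, so r comes out at about 0.58: a moderate positive correlation.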
What is a population, and how does it differ from a sample?
With an example, explain one-tailed & two-tailed t-tests.
Describe the Chi-Square Test of Independence.
| Subject Name | Data Science And Big Data Analytics |
|---|---|
| Semester | II |
| Pattern Year | 2019 |
| Subject Code | 310251 |
| Max Marks | 30 |
| Total Questions | 4 |
| Duration | 1 Hour |
| Paper Number | [6009]-322 |
| Academic Year | T.E. |
| Branch Name | Computer Engineering |
| Exam Type | INSEM |
| Exam Session | 2023 Feb Insem |
| Watermark | ['CEGP013091', '49.248.216.238 03/04/2023 12:09:14 static-238'] |
Explain data wrangling methods with suitable example.
Suppose that the data for analysis includes the attribute age, given the following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. i) Use smoothing by bin means, using a bin depth of 3. ii) What other methods are there for data smoothing?
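
Part i) of the smoothing question above can be sketched as follows: the sorted values are split into equal-depth bins of three and each value is replaced by its bin mean. The helper name `smooth_by_bin_means` and the rounding to two decimals are our own choices.

```python
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def smooth_by_bin_means(sorted_values, depth):
    """Equal-depth binning: replace every value with the mean of its bin."""
    smoothed = []
    for start in range(0, len(sorted_values), depth):
        bin_values = sorted_values[start:start + depth]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([round(mean, 2)] * len(bin_values))
    return smoothed

smoothed = smooth_by_bin_means(ages, depth=3)
# Print one line per bin: original values and the smoothed value
for i in range(0, len(ages), 3):
    print(ages[i:i + 3], "->", smoothed[i])
```

The 27 values form nine bins, with means 14.67, 18.33, 21, 24, 26.67, 33.67, 35, 40.33 and 56.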
What is data science? Compare data science and information science.
Explain 5 V’s of Big Data.
Explain different phases of data analytics life cycle with neat diagram.
Compare Business Intelligence and data science.
Explain skewness and kurtosis. What is the purpose of finding skewness of data?
What is degree of freedom? Explain with example.
How does hypothesis testing work? Explain the steps.
List out measures of dispersion with their significance and mathematical formulae.
Describe Chi-square Goodness of Fit test.
Assume that a patient X took a lab test for a certain disease and tested positive. The lab test returns a positive result in 95% of the cases in which the disease is actually present, and it falsely returns a positive result in 6% of the cases in which the disease is not present. Furthermore, only 1% of the entire population has this disease. What is the probability that X actually has the disease, given that X tested positive?
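
The lab-test question above is a direct application of Bayes' theorem. The sketch below plugs in the figures from the question; the function and parameter names are ours.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) by Bayes' theorem."""
    true_positive = sensitivity * prevalence
    false_positive = false_positive_rate * (1 - prevalence)
    return true_positive / (true_positive + false_positive)

p = posterior_given_positive(prevalence=0.01,
                             sensitivity=0.95,
                             false_positive_rate=0.06)
print(round(p, 4))  # 0.1379
```

That is, P(disease | positive) = (0.95 × 0.01) / (0.95 × 0.01 + 0.06 × 0.99) = 0.0095 / 0.0689, or roughly 13.8%, because the disease is rare and false positives outnumber true positives.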
| Subject Name | Data Science And Big Data Analytics |
|---|---|
| Semester | II |
| Pattern Year | 2019 |
| Subject Code | 310251 |
| Max Marks | 30 |
| Total Questions | 4 |
| Duration | 1 Hour |
| Paper Number | [6269]-317 |
| Academic Year | T.E. |
| Branch Name | Computer Engineering |
| Exam Type | INSEM |
| Exam Session | 2024 March Insem |
| Watermark | ['CEGP013091', '49.248.216.238 20/03/2024 10:45:49 static-238'] |