Data Science Interview Questions: Part 1

What is the difference between supervised learning and unsupervised learning?
- Supervised learning works with labeled data: the correct output values are known before the model is trained. Regression and classification are examples. Unsupervised learning works with unlabeled data: the correct values are not known in advance, and the model has to find structure in the data on its own. Clustering is an example.

Supervised Learning	Unsupervised Learning
Input data is labeled.	Input data is unlabeled.
It uses a labeled training data set.	It uses the input data set as-is, without labels.
Used for prediction.	Used for analysis.
Enables classification and regression.	Enables clustering, density estimation, and dimensionality reduction.
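
As a rough illustration of the contrast, the sketch below fits a nearest-centroid classifier (supervised, labels given) and a one-dimensional 2-means clustering (unsupervised, no labels) on the same made-up numbers. The data, the function names, and the choice of these two simple methods are assumptions for the example only, and the clustering assumes the input has at least two distinct values.

```python
from statistics import mean

# Supervised: labels are known, so we learn one centroid per label.
def fit_centroids(xs, labels):
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: mean(vals) for label, vals in groups.items()}

def predict(centroids, x):
    # Assign the label whose centroid is closest to x.
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Unsupervised: no labels, so we can only group similar points (1-D 2-means).
def two_means(xs, iterations=10):
    c0, c1 = min(xs), max(xs)  # assumes at least two distinct values
    for _ in range(iterations):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = mean(g0), mean(g1)
    return sorted([c0, c1])

xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["small", "small", "small", "large", "large", "large"]

centroids = fit_centroids(xs, labels)
print(predict(centroids, 4.0))  # predicted label for an unseen point
print(two_means(xs))            # two cluster centres, found without labels
```

The supervised model can name the group a new point belongs to because it was told the names; the clustering only reports that two groups exist and where their centres lie.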

What are the types of biases that can occur during sampling?
1. Selection bias
2. Undercoverage bias
3. Survivorship bias

What is Selection Bias?
- Selection bias occurs when the researcher decides who or what will be studied instead of selecting at random, so the sample does not represent the population it is meant to describe. It is also referred to as the selection effect. The result is a distortion of the statistical analysis caused by the way the data was collected. If selection bias is not taken into account, some conclusions of the study may not be accurate.
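
A small simulation can make the distortion visible. The sketch below assumes a hypothetical population of satisfaction scores from 1 to 10 and compares a random sample with a sample in which only respondents scoring 6 or more take part; the scores, sizes, and cut-off are invented for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: satisfaction scores from 1 to 10.
population = [random.randint(1, 10) for _ in range(10_000)]

# Random sample: every member is equally likely to be chosen.
random_sample = random.sample(population, 500)

# Biased sample: only respondents who scored 6 or more "opt in".
biased_sample = [score for score in population if score >= 6][:500]

print(f"population mean:    {mean(population):.2f}")
print(f"random sample mean: {mean(random_sample):.2f}")
print(f"biased sample mean: {mean(biased_sample):.2f}")
```

The random sample should land close to the population mean, while the self-selected sample overstates it by a wide margin.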

What is survivorship bias?
- Survivorship bias is the logical error of focusing on the people or things that made it through a selection process and overlooking those that did not, usually because the ones that failed are less visible. Because the failures are missing from the data, the conclusions drawn tend to be overly optimistic.

Why is R used in Data Visualization?
- R is used in data visualization because it has many built-in functions and libraries for the task, including ggplot2, leaflet, and lattice. R supports exploratory data analysis as well as feature engineering, and almost any type of graph can be created with it. Many practitioners also find customizing graphics easier in R than in Python.

What makes a good Data Scientist?
- Your response to this question will tell a hiring manager a lot about how you see your role and the value you bring to an organization. In your answer, you could talk about how data science requires a rare combination of competencies and skills. A good Data Scientist needs to combine the technical skill needed to parse data and create models with the business sense necessary to understand the problems they’re tackling as well as recognize actionable insights in their data.

What is the purpose of the PYTHONPATH environment variable?
- PYTHONPATH has a role similar to PATH. It tells the Python interpreter where to look for the module files imported into a program, in addition to the default locations. It should list the directories containing the Python source code you want to import; they are added to the module search path (sys.path). PYTHONPATH is sometimes preset by the Python installer.
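
The effect can be checked with a short experiment. The sketch below writes a throwaway module (the name `greeting` and its contents are made up for this example) into a temporary directory, then starts a child interpreter with PYTHONPATH pointing at that directory and imports the module from there.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as module_dir:
    # Write a throwaway module into a directory Python does not search by default.
    with open(os.path.join(module_dir, "greeting.py"), "w") as f:
        f.write("MESSAGE = 'hello from PYTHONPATH'\n")

    # Start a child interpreter with PYTHONPATH pointing at that directory.
    env = dict(os.environ, PYTHONPATH=module_dir)
    result = subprocess.run(
        [sys.executable, "-c", "import greeting; print(greeting.MESSAGE)"],
        env=env, capture_output=True, text=True, check=True,
    )

output = result.stdout.strip()
print(output)
```

Without the `PYTHONPATH` entry in `env`, the same import would be expected to fail with `ModuleNotFoundError`, since the temporary directory is not on the default search path.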

Do you recall a situation when you had to clean and organize a big data set?
- Studies have shown that Data Scientists spend most of their time on data preparation rather than on data mining or modeling, and the quality of the cleaned data limits the quality of everything built on it. When you describe your own experience, you can walk through the typical steps in data preparation:
  1. Removing duplicate observations
  2. Fixing structural errors
  3. Filtering outliers
  4. Tackling missing data
  5. Data validation
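
The five steps above can be sketched on a tiny, made-up data set using only the standard library. The records, the 0 to 120 plausibility range for age, and the median imputation are assumptions chosen for illustration; structural fixes run first here so that duplicates become detectable.

```python
from statistics import median

# Hypothetical raw records; column names and values are made up.
raw = [
    {"city": " london ", "age": 34},
    {"city": "London", "age": 34},     # duplicate once the text is normalised
    {"city": "PARIS", "age": None},    # missing value
    {"city": "Paris", "age": 29},
    {"city": "Berlin", "age": 41},
    {"city": "Berlin", "age": 420},    # implausible outlier
]

# Step 2. Fix structural errors: inconsistent whitespace and capitalisation.
rows = [{"city": r["city"].strip().title(), "age": r["age"]} for r in raw]

# Step 1. Remove duplicate observations (order-preserving).
seen, deduped = set(), []
for r in rows:
    key = (r["city"], r["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Step 3. Filter outliers with a simple plausibility range (an assumed rule).
plausible = [r for r in deduped if r["age"] is None or 0 <= r["age"] <= 120]

# Step 4. Tackle missing data: impute the median of the known ages.
known_ages = [r["age"] for r in plausible if r["age"] is not None]
fill = median(known_ages)
clean = [{**r, "age": fill if r["age"] is None else r["age"]} for r in plausible]

# Step 5. Validate: every row must satisfy the rules relied on downstream.
assert all(isinstance(r["age"], (int, float)) and 0 <= r["age"] <= 120 for r in clean)
assert all(r["city"] == r["city"].strip().title() for r in clean)

print(clean)
```

On real data the same steps are usually done with a data-frame library, but the order of operations and the need for an explicit validation pass stay the same.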

Explain the steps in making a decision tree.
1. Take the entire data set as input
2. Calculate the entropy of the target variable, and the entropy of the target within each value of every predictor attribute
3. Calculate the information gain of each attribute, that is, how much splitting on that attribute reduces the entropy of the target (we gain information by sorting different objects from each other)
4. Choose the attribute with the highest information gain as the root node
5. Repeat the same procedure on every branch until each branch ends in a leaf node
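
Steps 2 to 4 can be sketched in a few lines. The example below computes Shannon entropy and information gain for a made-up four-row data set and picks the root attribute; the column names (`outlook`, `windy`, `play`) and the helper names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    before = entropy([row[target] for row in rows])
    after = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

# Hypothetical toy data set.
rows = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no",  "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
root = max(gains, key=gains.get)  # attribute with the highest information gain
print(gains, root)
```

Here `outlook` separates the classes perfectly (gain of 1 bit) while `windy` tells us nothing (gain of 0), so `outlook` becomes the root. A full tree builder would apply the same selection recursively to each branch.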

How do you build a random forest model?
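- A random forest is built from many decision trees. The training data is resampled into many random subsets (typically bootstrap samples drawn with replacement), a decision tree is trained on each subset, usually considering only a random subset of the features at each split, and the random forest combines the predictions of all those trees: by majority vote for classification or by averaging for regression.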
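
The idea can be sketched with deliberately tiny "trees". The code below is a simplification under stated assumptions, not a production random forest: each tree is a one-split decision stump trained on a bootstrap sample and restricted to one randomly chosen feature, and the forest predicts by majority vote. The data and all function names are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature):
    """One-split 'tree': a threshold on one feature, chosen to minimise errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda row: left_label if row[feature] <= threshold else right_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0])
    trees = []
    for _ in range(n_trees):
        # Bootstrap: sample row indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # A random feature per tree keeps the trees from all looking alike.
        feature = rng.randrange(n_features)
        trees.append(fit_stump([rows[i] for i in idx],
                               [labels[i] for i in idx], feature))
    return trees

def predict(trees, row):
    # Combine the individual trees by majority vote.
    return majority([tree(row) for tree in trees])

# Hypothetical data: two numeric features, two well-separated classes.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

forest = fit_forest(rows, labels)
print(predict(forest, (0, 0)), predict(forest, (10, 10)))
```

Real implementations grow full trees and re-draw the candidate features at every split, but the three ingredients shown here (resampling, per-tree randomness, aggregation) are the ones the answer above describes.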

How is Data Science different from traditional application programming?
- The primary and vital difference between Data Science and traditional application programming is that in traditional programming, one has to create rules to translate the input into output. In Data Science, the rules are automatically produced from the data.

What is root cause analysis?
- Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique for isolating the root causes of faults or problems. A factor is called a root cause if removing it from the problem-fault sequence prevents the final undesirable event from recurring.

What is the difference between an error and a residual?
- An error is the difference between an observed value and the true value in the underlying population, for example the deviation of an observation from the population mean. Because the true value is usually unknown, an error is generally unobservable. A residual is the difference between an observed value and the value estimated from the sample, such as the sample mean or a model's prediction, so residuals can be computed and visualized on a graph. In short, an error represents how observed data differs from the actual population, while a residual represents how observed data differs from the estimate built on the sample. In everyday data science usage, "error" also refers to the difference between predicted and actual values, and the most popular summary metrics are Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
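
The distinction can be illustrated with a simulation, where the population mean is known by construction. The sketch below draws a small sample from a normal distribution (the mean of 50, spread of 5, and sample size of 10 are arbitrary assumptions) and compares deviations from the true mean with deviations from the sample mean.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

# In a simulation we know the true population mean; in practice we would not.
TRUE_MEAN = 50.0
sample = [random.gauss(TRUE_MEAN, 5.0) for _ in range(10)]
sample_mean = mean(sample)

# Errors: deviations from the true (normally unobservable) population value.
errors = [x - TRUE_MEAN for x in sample]

# Residuals: deviations from the value estimated from the sample itself.
residuals = [x - sample_mean for x in sample]

print(f"sum of errors:    {sum(errors):.4f}")     # generally not zero
print(f"sum of residuals: {sum(residuals):.4f}")  # zero up to rounding
```

Residuals from the sample mean always sum to zero, which is one practical way to see that they are not the same quantity as the errors.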
