Top 50+ Pandas Interview Questions in 2023

Pandas Interview Questions

1. Define the Pandas/Python pandas?

Pandas is an open-source library that provides high-performance data manipulation in Python. The name is derived from "panel data", an econometrics term for multidimensional data sets. The library is used for data analysis in Python and was developed by Wes McKinney in 2008. It supports the five typical steps of processing and analyzing data, regardless of the data's origin: load, manipulate, prepare, model, and analyze.

2. What is Python Panda used for?

Pandas is a data manipulation and analysis software library for the Python programming language. It includes data structures and methods for manipulating numerical tables and time series, in particular. Pandas is open-source software licensed under the BSD three-clause license.

3. Which are the different types of Data Structures in Pandas?

Pandas provides two main data structures, Series and DataFrame. Both are built on top of NumPy.

A Series is a one-dimensional data structure in pandas, whereas the DataFrame is the two-dimensional data structure in pandas.

4. What is a Series in Pandas?

A Series is a one-dimensional labeled array that can store values of various data types. Its row labels are called the index. Using the pd.Series() constructor, we can easily convert a list, tuple, or dictionary into a Series. A Series cannot contain multiple columns.

5. How can we calculate the standard deviation from the Series?

The std() method calculates the standard deviation of the values in a Series. The same method is available on a DataFrame, where it works over its columns or rows. By default (ddof=1) it returns the sample standard deviation.

 Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
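As a minimal sketch of the call above, the following computes both the sample and the population standard deviation of a small, made-up Series. The values are chosen only for illustration.

```python
import pandas as pd

# Illustrative values, not taken from any real data set
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = s.std()            # ddof=1: sample standard deviation (the default)
population_std = s.std(ddof=0)  # ddof=0: population standard deviation

print(sample_std)
print(population_std)
```

Passing ddof=0 divides by N instead of N - 1, which matches the default behavior of NumPy's std().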

6. Explain about the operation on Series in Pandas?

A pandas Series is a one-dimensional labeled array that can hold any type of data (Python objects, strings, integers, floating-point numbers, etc.). The axis labels are collectively referred to as the index. A Series is comparable to a single column in an Excel spreadsheet.

Putting Together a Pandas Series

In practice, a pandas Series is usually built by loading data from existing storage, which can be a SQL database, a CSV file, or an Excel file. A Series can also be made from lists, dictionaries, and other objects. Here is one example:

Creating a series from an array: To construct a Series from an array, we first import NumPy and then use its array() function.

import pandas as pd
import numpy as np

# A simple NumPy array of single characters
data = np.array(['D', 'A', 'T', 'A', 'C', 'A', 'D', 'E', 'M', 'Y'])

# Build a Series from the array; a default integer index is assigned
ser = pd.Series(data)
print(ser)


7. Give a brief description about time series in Panda?

A time series is an ordered collection of data points that shows how a quantity evolves over time. Pandas has a wide range of capabilities and tools for working with time-series data in all fields.

Time-series features supported by pandas:

  • Parsing time-series data from a variety of sources and formats
  • Creating sequences of dates and times with fixed frequencies
  • Manipulating and converting dates and times with time zone information
  • Resampling or converting a time series to a particular frequency
  • Performing date and time arithmetic with absolute or relative time increments
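The features listed above can be sketched in a few lines. The dates, values, and time zone below are assumptions chosen for illustration only.

```python
import pandas as pd

# A sequence of five consecutive days with a fixed daily frequency
idx = pd.date_range("2023-01-01", periods=5, freq="D")
ts = pd.Series([1, 2, 3, 4, 5], index=idx)

# Resample to 2-day bins and sum the values that fall into each bin
two_day = ts.resample("2D").sum()

# Attach a time zone, then convert to another one
ts_utc = ts.tz_localize("UTC")
ts_ny = ts_utc.tz_convert("America/New_York")

# Date arithmetic with a relative time increment
shifted = idx + pd.Timedelta(days=7)

print(two_day)
```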

8. What are the significant features of the pandas Library?

The key features of the pandas library are as follows:

  • Memory Efficient
  • Data Alignment
  • Reshaping
  • Merge and join
  • Time Series

9. What is the name of Pandas library tools used to create a scatter plot matrix?

The scatter_matrix() function, found in the pandas.plotting module.

10. Explain Reindexing in pandas?

Reindexing conforms a DataFrame to a new index with optional filling logic. It places NA/NaN in locations that have no value in the previous index. It returns a new object unless the new index is equivalent to the current one and copy=False. It can be used to change both the row index and the column index of a DataFrame.

11. What is DataFrame in Pandas?

A DataFrame is a widely used pandas data structure: a two-dimensional array with labeled axes (rows and columns). It is a standard way to store data and has two different indexes, i.e., a row index and a column index. It has the following properties:

  • The columns can be of heterogeneous types, such as int and bool.
  • It can be seen as a dictionary of Series in which both the rows and the columns are indexed. The column labels are denoted as “columns” and the row labels as “index”.

12. Explain Categorical data in Pandas?

Categorical data is defined as a Pandas data type that corresponds to a categorical variable in statistics. A categorical variable is generally used to take a limited and usually fixed number of possible values. Examples: gender, country affiliation, blood type, social class, observation time, or rating via Likert scales. All values of categorical data are either in categories or np.nan.

This data type is useful in the following cases:

  • It is useful for a string variable that consists of only a few different values. If we want to save some memory, we can convert a string variable to a categorical variable.
  • It is useful when the lexical order of a variable is not the same as its logical order ("one", "two", "three"). By converting the variable to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
  • It is useful as a signal to other Python libraries because this column should be treated as a categorical variable.
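The difference between lexical and logical order can be seen in a short sketch. The sample values and the variable names are assumptions made for illustration.

```python
import pandas as pd

sizes = pd.Series(["one", "three", "two", "one"])

# Plain strings sort lexically: "one" < "three" < "two"
lexical = sizes.sort_values().tolist()

# An ordered categorical sorts by the declared category order instead
order = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
cat = sizes.astype(order)
logical = cat.sort_values().tolist()

# min/max also follow the logical order
smallest, largest = cat.min(), cat.max()

print(lexical)
print(logical)
```

Converting a string column with few distinct values to a categorical can also reduce memory use, because each distinct value is stored only once.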

 

13. Define the different ways a DataFrame can be created in pandas?

We can create a DataFrame in the following ways:

  • Lists
  • Dict of ndarrays

Example-1: Create a DataFrame using List:

import pandas as pd

# A list of strings
a = ['Python', 'Pandas']

# Call the DataFrame constructor on the list
info = pd.DataFrame(a)
print(info)


Example-2: Create a DataFrame from dict of ndarrays:

import pandas as pd

# Each dictionary key becomes a column; the values must have equal length
info = {'ID': [101, 102, 103], 'Department': ['B.Sc', 'B.Tech', 'M.Tech']}
info = pd.DataFrame(info)
print(info)


14. How will you create a series from dict in Pandas?

A Series is a one-dimensional labeled array that can store values of various data types.

Create a Series from dict:

We can create a Series by passing a dictionary to the constructor. If no index is specified, the dictionary keys are used to construct the index; current pandas versions keep the keys in insertion order rather than sorting them.

If an index is passed, the values that correspond to the labels in that index are extracted from the dictionary.

import pandas as pd

# The dictionary keys become the index labels
info = {'x': 0., 'y': 1., 'z': 2.}
a = pd.Series(info)
print(a)


15. How can we create a copy of the series in Pandas?

We can create the copy of series by using the following syntax:

  • Series.copy(deep=True)

The above statement makes a deep copy that includes a copy of the data and the indices. If we set the value of deep to False, neither the indices nor the data are copied.

Note: Even with deep=True, the data is copied, but the actual Python objects it contains are not copied recursively; only the references to those objects are copied.
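A minimal sketch of a deep copy, using made-up values: changing the copy leaves the original Series untouched.

```python
import pandas as pd

original = pd.Series([1, 2, 3], index=["a", "b", "c"])

# deep=True (the default) copies both the data and the index
deep = original.copy(deep=True)
deep["a"] = 100

print(original["a"])  # the original still holds 1
print(deep["a"])      # only the copy holds 100
```

How writes to a shallow copy (deep=False) interact with the original depends on the pandas version and on whether Copy-on-Write is enabled, so it is safer not to rely on that behavior.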

16. Characterize the Data Frames in Pandas?

A DataFrame is a pandas data structure that works with a two-dimensional array with labeled axes (rows and columns). A DataFrame is a standard way of storing data that has two separate indexes, namely a row index and a column index. It has the following characteristics:

The columns can be of heterogeneous types, such as int and bool.

It can be thought of as a dictionary of Series in which both the rows and the columns are indexed. The column labels are denoted as “columns” and the row labels as “index”.

Syntax:

import pandas as pd

df = pd.DataFrame()

17. How will you create an empty DataFrame in Pandas?

A DataFrame is a widely used pandas data structure: a two-dimensional array with labeled axes (rows and columns). It is a standard way to store data and has two different indexes, i.e., a row index and a column index.

Create an empty DataFrame:

The below code shows how to create an empty DataFrame in Pandas:

# Import the pandas library
import pandas as pd

# Calling the constructor without arguments gives an empty DataFrame
info = pd.DataFrame()
print(info)


18. How will you add a column to a pandas DataFrame?

We can add any new column to an existing DataFrame. The below code demonstrates how to add any new column to an existing DataFrame:

# importing the pandas library    
import pandas as pd      
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),    
             'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}       
info = pd.DataFrame(info)    
# Add a new column to an existing DataFrame object     
print ("Add new column by passing series")    
info['three']=pd.Series([20,40,60],index=['a','b','c'])    
print (info)    
print ("Add new column using existing DataFrame columns")    
info['four']=info['one']+info['three']    
print (info)  


19. How to add an Index, row, or column to a Pandas DataFrame?

Adding an Index to a DataFrame

Pandas lets you pass inputs to the index argument when you create a DataFrame, which ensures that you get the desired index. If you don't specify an index, the DataFrame gets, by default, a numeric index that starts at 0 and ends at the last row of the DataFrame.

Adding Rows to a DataFrame

We can use .loc and .iloc to work with the rows of a DataFrame. Older code also uses .ix, which was deprecated and then removed in pandas 1.0.

  • loc works on the labels of the index. For example, loc[4] refers to the values of the DataFrame whose index is labeled 4.
  • iloc works on the positions in the index. For example, iloc[4] refers to the values of the DataFrame at position 4.
  • ix was a more complex case: if the index was integer-based, ix[4] looked for the values whose index is labeled 4. If the index was not purely integer-based, ix worked with positions, like iloc.

Adding Columns to a DataFrame

To add a column to a DataFrame, we can follow a similar approach: assign values to a new column label, either with loc or directly with square brackets.
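The three cases above (index, row, column) can be sketched as follows. The labels and values are assumptions used only for illustration.

```python
import pandas as pd

# Index supplied through the index argument at construction time
df = pd.DataFrame({"score": [10, 20, 30]}, index=["a", "b", "c"])

# Add a row by assigning to an index label that does not exist yet
df.loc["d"] = [40]

# Add a column by assigning to a new column label
df["passed"] = df["score"] >= 20

print(df)
```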

20. Tell us now how to retrieve a single column from a Panda Dataframe?

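A single column can be retrieved by passing its label in square brackets, for example df['column_name'], which returns the column as a Series. Passing a list with one label, df[['column_name']], returns a one-column DataFrame instead. If the column name is a valid Python identifier, attribute access such as df.column_name also works.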
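As a minimal sketch of retrieving a single column from a DataFrame, the example below selects one column by its label. The column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 27]})

# Square brackets with one label return the column as a Series
ages = df["age"]

# A list with one label returns a one-column DataFrame
ages_frame = df[["age"]]

print(type(ages).__name__)
print(type(ages_frame).__name__)
```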

21. How to Delete Indices, Rows or Columns From a Pandas Data Frame?

Deleting an Index from Your DataFrame

To remove the index from a DataFrame, you can do the following:

  • Reset the index of the DataFrame with reset_index().
  • Remove the index name, for example by setting df.index.name = None.
  • Remove duplicate index values by resetting the index, dropping the duplicates from the resulting index column, and setting that column as the index again.
  • Remove an index value together with its row by using drop().

Deleting a Column from Your DataFrame

You can use the drop() method to delete a column from the DataFrame.

The axis argument passed to drop() is 0 to drop rows and 1 to drop columns.

You can set the inplace argument to True to delete the column without reassigning the DataFrame.

You can also remove duplicate values in a column by using the drop_duplicates() method.

Removing a Row from Your DataFrame

By using df.drop_duplicates(), we can remove duplicate rows from the DataFrame.

You can use the drop() method and pass the index labels of the rows that you want to remove from the DataFrame.
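The deletion options described in this answer can be sketched with a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 2, 4], "B": [5, 6, 6, 8]},
    index=["w", "x", "y", "z"],
)

no_col = df.drop("B", axis=1)      # delete a column
no_row = df.drop("w", axis=0)      # delete a row by its index label
deduped = df.drop_duplicates()     # drop rows that repeat an earlier row
flat = df.reset_index(drop=True)   # discard the index labels entirely

print(deduped)
```

Each call returns a new DataFrame and leaves df unchanged, because inplace is left at its default value of False.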

22. How To Select an Index or Column From a Pandas DataFrame?

Before you start with adding, deleting and renaming the components of your DataFrame, you first need to know how you can select these elements.

So, how do you do this?

Well, in essence, selecting an index, column or value from your DataFrame isn’t that hard. It’s really very similar to what you see in other languages that are used for data analysis (and which you might already know!).

Let’s take R for example. You use the [,] notation to access the data frame’s values. In Pandas DataFrames, this is not too much different: the most important constructions to use are, without a doubt, loc and iloc. The subtle differences between these two will be discussed in the next sections. For now, it suffices to know that you can either access the values by calling them by their label or by their position in the index or column.
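A minimal sketch of label-based and position-based selection, assuming a small DataFrame with string row labels:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

by_label = df.loc["y", "B"]    # row labeled "y", column labeled "B"
by_position = df.iloc[1, 1]    # second row, second column
row = df.loc["x"]              # a whole row, returned as a Series
column = df["A"]               # a whole column, returned as a Series

print(by_label, by_position)
```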

23. How to Rename the Index or Columns of a Pandas DataFrame?

You can use the .rename method to give different values to the columns or the index values of DataFrame.
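As a short sketch, rename() accepts dictionaries that map old labels to new ones. The labels below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Map old labels to new labels; labels that are not mentioned stay as they are
renamed = df.rename(columns={"A": "alpha"}, index={0: "first"})

print(renamed)
```

By default rename() returns a new DataFrame, so df keeps its original labels.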

24. How will you combine different Data Frames in Panda?

The following are ways to combine different DataFrames in pandas:

-> append() method: This stacks the rows of one DataFrame below another (vertical stacking). It was deprecated in pandas 1.4 and removed in pandas 2.0, so concat() should be used in current versions.

Syntax: df1.append(df2)

-> concat() method: This stacks DataFrames sequentially along an axis, vertically by default. It works best when the DataFrames have the same columns.

Syntax: pd.concat([df1, df2])

-> join() method: This combines the columns of different DataFrames, aligning them on the index by default or on a key column given through the on parameter.

Syntax: df1.join(df2)
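A minimal sketch of concat() and join(), using three small made-up DataFrames:

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2]}, index=["x", "y"])
more_rows = pd.DataFrame({"A": [3]}, index=["z"])
right = pd.DataFrame({"B": [10, 20]}, index=["x", "y"])

# concat stacks rows vertically by default
stacked = pd.concat([left, more_rows])

# join aligns on the index and adds the columns of the other DataFrame
joined = left.join(right)

print(stacked)
print(joined)
```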

25. How to iterate over a Pandas DataFrame?

You can iterate over the rows of the DataFrame by using for loop in combination with an iterrows() call on the DataFrame.
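A short sketch of such a loop, with hypothetical column names. Each iteration yields the index label and the row as a Series.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 27]})

seen = []
for index, row in df.iterrows():
    # row is a Series; its values are accessed by column label
    seen.append((index, row["name"], row["age"]))

print(seen)
```

For large DataFrames, vectorized operations are usually much faster than row-by-row iteration.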

26. How to get the items of series A not present in series B?

We can remove the items present in p2 from p1 by using the isin() method together with the ~ (negation) operator.

import pandas as pd

p1 = pd.Series([2, 4, 6, 8, 10])
p2 = pd.Series([8, 10, 12, 14, 16])

# Keep only the items of p1 that do not appear in p2
print(p1[~p1.isin(p2)])


27. How to get the items not common to both series A and series B?

We can get all the items of p1 and p2 that are not common to both as in the example below:

import pandas as pd
import numpy as np

p1 = pd.Series([2, 4, 6, 8, 10])
p2 = pd.Series([8, 10, 12, 14, 16])

p_u = pd.Series(np.union1d(p1, p2))      # union
p_i = pd.Series(np.intersect1d(p1, p2))  # intersection

# Items in the union that are not in the intersection
print(p_u[~p_u.isin(p_i)])


28. How to get the minimum, 25th percentile, median, 75th, and max of a numeric series?

We can compute the minimum, 25th percentile, median, 75th percentile, and maximum of p as in the example below:

import pandas as pd
import numpy as np

# A seeded random state makes the sample reproducible
state = np.random.RandomState(120)
p = pd.Series(state.normal(14, 6, 22))

# Minimum, 25th percentile, median, 75th percentile, and maximum
print(np.percentile(p, q=[0, 25, 50, 75, 100]))


29. How to get frequency counts of unique items of a series?

We can calculate the frequency counts of each unique value of p as in the example below:

import pandas as pd
import numpy as np

# 17 random letters drawn from 'pqrstu'
p = pd.Series(np.take(list('pqrstu'), np.random.randint(6, size=17)))

# Count how often each unique value occurs
print(p.value_counts())


30. How to convert a numpy array to a dataframe of given shape?

We can reshape the series p, which holds 35 values, into a DataFrame with 7 rows and 5 columns as in the example below:

import pandas as pd
import numpy as np

# Input: 35 random integers between 1 and 6
p = pd.Series(np.random.randint(1, 7, 35))

# Reshape the underlying NumPy array to 7 x 5 and wrap it in a DataFrame
info = pd.DataFrame(p.values.reshape(7, 5))
print(info)


31. How can we convert a Series to DataFrame?

The Pandas Series.to_frame() function is used to convert the series object to the DataFrame.

Series.to_frame(name=None)  

name: The name to use for the resulting column. Its default value is None, in which case the name of the Series is used; if a name is passed, it replaces the Series name.

import pandas as pd

s = pd.Series(["a", "b", "c"], name="vals")
print(s.to_frame())


32. What is Pandas NumPy array?

NumPy (Numerical Python) is a Python package for numerical computation and for processing single-dimensional and multidimensional arrays. Calculations on NumPy arrays are faster than the equivalent calculations on ordinary Python lists.

33. Define ReIndexing?

Reindexing changes the row labels and column labels of a DataFrame. We can reindex one or more rows by using the reindex() method. Labels in the new index that are not present in the original DataFrame are assigned NaN by default.

DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)
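A minimal sketch of reindex() with and without a fill value. The labels and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])

# "d" is not in the original index, so its row is filled with NaN
reindexed = df.reindex(["a", "c", "d"])

# fill_value replaces the default NaN for new labels
filled = df.reindex(["a", "c", "d"], fill_value=0)

print(reindexed)
print(filled)
```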

34. How can we convert DataFrame into a NumPy array?

For performing some high-level mathematical functions, we can convert Pandas DataFrame to numpy arrays. It uses the DataFrame.to_numpy() function.

The DataFrame.to_numpy() function is applied to the DataFrame that returns the numpy ndarray.

DataFrame.to_numpy(dtype=None, copy=False)
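A short sketch of the conversion, assuming a DataFrame with one integer and one float column:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3.5, 4.5]})

# Mixed int/float columns are converted to a common dtype (float64 here)
arr = df.to_numpy()

print(arr)
print(arr.dtype)
```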

35. How can we convert DataFrame into an excel file?

We can export a DataFrame to an Excel file by using the to_excel() function. To write a single object to the Excel file, we have to specify the target file name. To write to multiple sheets, we need to create an ExcelWriter object with the target file name and specify the sheet to write to.

36. What is Time Offset?

A time offset is a relative time increment that follows calendar rules and is represented by the DateOffset class and its subclasses. We can add a DateOffset to a date to move it forward to the next valid date.
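As a minimal sketch, the example below adds a calendar-month offset and a business-day offset to two dates. The dates are chosen only for illustration.

```python
import pandas as pd

# Adding one calendar month to 31 January lands on the last valid day of February
month_end = pd.Timestamp("2023-01-31")
next_month = month_end + pd.DateOffset(months=1)

# Adding one business day to a Friday skips the weekend
friday = pd.Timestamp("2023-01-06")
next_business_day = friday + pd.offsets.BDay(1)

print(next_month)
print(next_business_day)
```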

37. How can we sort the DataFrame?

We can sort a DataFrame efficiently in two ways:

  • By label
  • By Actual value

By label

The DataFrame can be sorted by using the sort_index() method. It can be done by passing the axis arguments and the order of sorting. The sorting is done on row labels in ascending order by default.

By Actual Value

Sorting by value is the other way to sort a DataFrame. Just as sort_index() sorts by labels, sort_values() sorts by the values themselves.

It also provides a feature in which we can specify the column name of the DataFrame with which values are to be sorted. It is done by passing the ‘by‘ argument.
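Both kinds of sorting can be sketched with a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["c", "a", "b"], "score": [2, 3, 1]},
    index=[2, 0, 1],
)

# Sort by row labels (ascending by default)
by_label = df.sort_index()

# Sort by the values of a column, here in descending order
by_value = df.sort_values(by="score", ascending=False)

print(by_label)
print(by_value)
```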

38. What is the Time Series in Pandas?

Time series data is an essential source of information that is used to guide strategy in many businesses. From the conventional finance industry to the education industry, such data carries a lot of detail about how quantities change over time.

Time series forecasting is the machine learning task of predicting future values from time series data.

39. Define Time Periods?

Time periods represent time spans, e.g., days, months, quarters, or years. In pandas they are modeled by the Period class, which also allows us to convert a period from one frequency to another.
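A minimal sketch of working with a monthly period. The month is chosen only for illustration.

```python
import pandas as pd

# A period that spans the whole month of January 2023
p = pd.Period("2023-01", freq="M")

next_p = p + 1                 # the following month
start = p.start_time           # first instant of the period
end = p.end_time               # last instant of the period
quarter = p.asfreq("Q")        # the quarter that contains the month

print(p, next_p, quarter)
```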

40. What is Data Aggregation?

The main task of Data Aggregation is to apply some aggregation to one or more columns. It uses the following:

  • sum: It is used to return the sum of the values for the requested axis.
  • min: It is used to return a minimum of the values for the requested axis.
  • max: It is used to return a maximum value for the requested axis.
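The aggregations listed above can be applied in a single call with agg(). The column names and values below are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Apply several aggregations to every column at once
summary = df.agg(["sum", "min", "max"])

# A single aggregation on one column returns a scalar
total_a = df["A"].sum()

print(summary)
```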

41. What is the Pandas Index?

Pandas Index is defined as a vital tool that selects particular rows and columns of data from a DataFrame. Its task is to organize the data and to provide fast access to data. It can also be called a Subset Selection.

42. Define Multiple Indexing?

Multiple indexing is defined as essential indexing because it deals with data analysis and manipulation, especially for working with higher dimensional data. It also enables us to store and manipulate data with the arbitrary number of dimensions in lower-dimensional data structures like Series and DataFrame.

43. How to Set the index?

We can set the index column while creating a DataFrame. Sometimes, however, a DataFrame is built from two or more other DataFrames, and the index then needs to be changed afterwards. This is done with the set_index() method.

44. How to Reset the index?

The reset_index() method resets the index of a DataFrame to the default integer index. If the DataFrame has a MultiIndex, this method can remove one or more levels.
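Setting and resetting the index can be sketched together. The column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"id": [101, 102], "dept": ["B.Sc", "B.Tech"]})

# Use the "id" column as the index
indexed = df.set_index("id")

# Move the index back into a regular column and restore the default index
restored = indexed.reset_index()

print(indexed)
print(restored)
```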

45. How to convert String to date?

The below code demonstrates how to convert the string to date:

from datetime import datetime

# Define dates as strings
dmy_str1 = 'Saturday, July 14, 2018'
dmy_str2 = '14/7/17'
dmy_str3 = '14-07-2017'

# Convert the strings to datetime objects; each format must match its string
dmy_dt1 = datetime.strptime(dmy_str1, '%A, %B %d, %Y')
dmy_dt2 = datetime.strptime(dmy_str2, '%d/%m/%y')
dmy_dt3 = datetime.strptime(dmy_str3, '%d-%m-%Y')

# Print the converted dates
print(dmy_dt1)
print(dmy_dt2)
print(dmy_dt3)

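Within pandas itself, the same conversion is usually done with pd.to_datetime(), which also works on a whole Series. The following is a minimal sketch with made-up date strings.

```python
import pandas as pd

# Convert a single string using an explicit format
single = pd.to_datetime("14/07/17", format="%d/%m/%y")

# Convert a whole Series of strings at once
dates = pd.to_datetime(pd.Series(["14-07-2017", "15-07-2017"]), format="%d-%m-%Y")

print(single)
print(dates)
```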

46. Describe Data Operations in Pandas?

In Pandas, there are different useful data operations for DataFrame, which are as follows:

  • Row and column selection

We can select any row or column of a DataFrame by passing its name. A single selected row or column is one-dimensional and is returned as a Series.

  • Filter Data

We can filter the data by providing some of the boolean expressions in DataFrame.

  • Null values

A Null value occurs when no data is provided to the items. The various columns may contain no values, which are usually represented as NaN.
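The three operations above can be sketched as follows. The data is made up, with one missing value inserted on purpose.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4, 5, 6]})

# Column selection returns a Series
column = df["A"]

# Filtering with a boolean expression keeps the rows where it is True
filtered = df[df["B"] > 4]

# Missing values are represented as NaN and can be detected with isna()
missing = df["A"].isna()

print(filtered)
print(missing)
```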

47. Define GroupBy in Pandas?

In pandas, the groupby() function is widely used on real-world data sets. Its primary task is to split the data into groups based on some criteria, such as the values of one or more columns, so that a function can be applied to each group and the results combined.
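A minimal sketch of grouping and aggregating, with hypothetical department and salary columns:

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["A", "B", "A", "B"],
    "salary": [10, 20, 30, 40],
})

# Split the rows by department, then sum the salaries within each group
totals = df.groupby("dept")["salary"].sum()

print(totals)
```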

48. Can you explain multi-indexing columns in Pandas?

Multi-indexing is important because it supports data analysis and manipulation, especially when working with higher-dimensional data. It allows us to store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures such as Series and DataFrame.

Multiple Index Columns

In this case, two columns are used as index columns. In set_index(), the drop parameter controls whether the columns used for the index are removed from the DataFrame, whereas the append parameter controls whether the given columns are appended to the index that already exists.

Example: 

# Import the pandas library
import pandas as pd

# Create the data
Information = {'name': ["Jon", "Mikel", "Joy", "Bill"],
               'Jobs': ["Software Developer", "System Engineer",
                        "Footballer", "Singer"],
               'Annual Salary(L.P.A)': [12.4, 5.6, 9.3, 10]}

# Build a DataFrame from the dictionary defined above
df = pd.DataFrame(Information)

# Show the data
print(df)

Output:

    name                Jobs  Annual Salary(L.P.A)
0    Jon  Software Developer                  12.4
1  Mikel     System Engineer                   5.6
2    Joy          Footballer                   9.3
3   Bill              Singer                  10.0
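To actually use two columns as index columns, set_index() can be called with a list of column names. The sketch below reuses the same sample data under a shortened, hypothetical column name ("salary") and also shows the drop and append parameters.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jon", "Mikel", "Joy", "Bill"],
    "Jobs": ["Software Developer", "System Engineer", "Footballer", "Singer"],
    "salary": [12.4, 5.6, 9.3, 10.0],
})

# Two columns become the two levels of a MultiIndex
multi = df.set_index(["name", "Jobs"])

# append=True adds a column to the index that already exists
appended = df.set_index("name").set_index("Jobs", append=True)

# drop=False keeps the column in the DataFrame as well as in the index
kept = df.set_index("name", drop=False)

print(multi)
```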

49. What is the full form of pandas?

Pandas is sometimes expanded as "Python Data Analysis Library". According to the Wikipedia article, the name itself is derived from "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. Either way, it is a catchy moniker for a fantastic Python package!

50. What type of inputs are accepted by pandas?

Like Series, DataFrame accepts many different kinds of input:

  • Dict of 1D ndarrays, lists, dicts, or Series.
  • 2-D numpy.ndarray.
  • Structured or record ndarray.
  • A Series.
  • Another DataFrame.

51. Which are the data structures available with Pandas?

Series and Data Frames are the two basic types of data structures supported by Pandas. Series is a one-dimensional data structure, whereas DataFrames are two-dimensional data structures.

52. List some statistical functions in Python Pandas?

Some of the statistical functions in Python Pandas are:

sum() – it returns the sum of the values.

mean() – returns the mean that is the average of the values.

std() – returns the standard deviation of the numerical columns.

min() – returns the minimum value.

max() – returns the maximum value.

abs() – returns the absolute value.

prod() – returns the product of the values.

53. How to convert a DataFrame to an array in Pandas?

The function to_numpy() is used to convert the DataFrame to a NumPy array.

# syntax
DataFrame.to_numpy(dtype=None, copy=False)

The dtype parameter sets the data type of the resulting array, and copy=True ensures that the returned value is not a view on another array.

54. List some alternatives of Python Pandas?

Some of the alternatives to the Python Pandas are:

  • NumPy
  • R language
  • Anaconda
  • SciPy
  • PySpark
  • Dask
  • Pentaho Data

55. What is Vectorization in Python pandas?

Vectorization is the process of running an operation on an entire array at once. This reduces the amount of explicit iteration that has to be performed. Pandas has a number of vectorized functions, such as aggregations and string functions, that are optimized to operate on Series and DataFrames. It is therefore preferable to use the vectorized pandas functions to execute operations quickly.
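The difference can be sketched by computing the same result with an explicit loop and with a vectorized expression. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))

# Explicit Python-level iteration over the elements
looped = pd.Series([x * 2 for x in s])

# Vectorized: the operation is applied to the whole Series at once
vectorized = s * 2

# Vectorized string function
upper = pd.Series(["a", "b"]).str.upper()

print(vectorized.tolist())
```

Both approaches give the same values; the vectorized form is shorter and typically much faster on large inputs.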

56. How to Apply function to every row in a Pandas DataFrame?

Python is a great language for data analysis tasks. It provides a large number of classes and functions that make analyzing and manipulating data easier.

We can use the apply() function to apply a function to every row of a given DataFrame. Let's see how to do this.

Example

# Import pandas package 
import pandas as pd
 
# Function to add
def add(a, b, c):
    return a + b + c
 
def main():
     
    # create a dictionary with
    # three fields each
    data = {
            'A':[1, 2, 3], 
            'B':[4, 5, 6], 
            'C':[7, 8, 9] }
     
    # Convert the dictionary into DataFrame 
    df = pd.DataFrame(data)
    print("Original DataFrame:\n", df)
     
    df['add'] = df.apply(lambda row : add(row['A'],
                     row['B'], row['C']), axis = 1)
  
    print('\nAfter Applying Function: ')
    # printing the new dataframe
    print(df)
  
if __name__ == '__main__':
    main()


57. How To Format The Data in Your Pandas DataFrame?

Most of the times, you will also want to be able to do some operations on the actual values that are in your DataFrame.

Keep on reading to find out what the most common Pandas questions are when it comes to formatting your DataFrame’s values!

Replacing All Occurrences of a String in a DataFrame:

To replace certain Strings in your DataFrame, you can easily use replace(): pass the values that you would like to change, followed by the values you want to replace them by.

Note that there is also a regex argument that can help you out tremendously when you’re faced with strange string combinations. In short, replace() is mostly what you need to deal with when you want to replace values or strings in your DataFrame by others.

Removing Parts From Strings in the Cells of Your DataFrame:

Removing unwanted parts of strings is cumbersome work. Luckily, there is a solution in place! You use map() on the result column to apply a lambda function to each element of the column. The function takes the string value, strips any + or - located on the left, and also strips any of the six characters aAbBcC on the right.
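A minimal sketch of replace() and of stripping characters with map(). The column names ("class", "result") and the values are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["Awful", "Poor", "Awful"],
    "result": ["+3b", "-2a", "+5C"],
})

# Replace every occurrence of one string with another
replaced = df.replace("Awful", "Bad")

# With regex=True the first argument is treated as a regular expression
regex_replaced = df.replace(r"^Aw", "B", regex=True)

# Strip + or - on the left and any of aAbBcC on the right of each element
df["result"] = df["result"].map(lambda x: x.lstrip("+-").rstrip("aAbBcC"))

print(replaced)
print(df)
```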

Splitting Text in a Column into Multiple Rows in a DataFrame:

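Splitting the text in a column into multiple rows takes a few steps. One common approach is to split each string into a list with str.split() and then call explode() so that every list element gets its own row.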
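One possible way to split text into multiple rows is sketched below, assuming a hypothetical "tags" column with comma-separated values. This is one approach among several, not necessarily the one the original walkthrough used.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "tags": ["a,b", "c"]})

# Split each string into a list, then give every list element its own row
long = df.assign(tags=df["tags"].str.split(",")).explode("tags")

print(long)
```

The original index labels are repeated for rows that came from the same cell; reset_index(drop=True) can be used afterwards if a fresh index is needed.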

Applying A Function to Your Pandas DataFrame’s Columns or Rows:

You might want to adjust the data in your DataFrame by applying a function to it. Use apply() to apply a function along the rows or columns of a DataFrame, and map() to apply a function to each element of a single column.

58. Does Pandas Recognize Dates When Importing Data?

Pandas can recognize dates, but you need to help it a tiny bit: add the parse_dates argument when you're reading in data from, let's say, a comma-separated values (CSV) file.

There are, however, always weird date-time formats.

(Honestly, who has never had this?)

In such cases, you can construct your own parser to deal with this. You could, for example, make a lambda function that takes your DateTime and controls it with a format string.
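Both cases can be sketched with in-memory CSV text. The column names and the unusual "|" date format are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Standard date strings: parse_dates converts the column while reading
csv_text = "date,value\n2023-01-05,1\n2023-02-10,2\n"
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Unusual format: read the column as text, then convert with an explicit format
odd_text = "date,value\n05|01|2023,1\n"
odd = pd.read_csv(io.StringIO(odd_text))
odd["date"] = pd.to_datetime(odd["date"], format="%d|%m|%Y")

print(df.dtypes)
print(odd)
```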

59. When, Why And How You Should Reshape Your Pandas DataFrame

Reshaping your DataFrame is basically transforming it so that the resulting structure makes it more suitable for your data analysis.

In other words, reshaping is not so much concerned with formatting the values that are contained within the DataFrame, but more about transforming the shape of it.

This answers the when and why. Now onto the how of reshaping your DataFrame.

There are three ways of reshaping that frequently raise questions with users: pivoting, stacking and unstacking, and melting.

Keep on reading to find out more!


Pivoting Your DataFrame

You can use the pivot() function to create a new derived table out of your original one. When you use the function, you can pass three arguments:

  • Values: this argument allows you to specify which values of your original DataFrame you want to see in your pivot table.
  • Columns: whatever you pass to this argument will become a column in your resulting table.
  • Index: whatever you pass to this argument will become an index in your resulting table.

When you don't specify which values you expect to be present in your resulting table, you will pivot by multiple columns. Note that your data cannot have rows with duplicate values for the columns that you specify. If it does, you will get an error message. If you can't ensure the uniqueness of your data, you will want to use the pivot_table() method instead.
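A minimal sketch of pivot() and of pivot_table() as the fallback for duplicate entries. The column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "item": ["a", "b", "a", "b"],
    "price": [1, 2, 3, 4],
})

# Each (date, item) pair is unique, so pivot() works
wide = df.pivot(index="date", columns="item", values="price")

# Duplicate the first row; pivot() would now raise an error,
# but pivot_table() aggregates the duplicates (here with the mean)
dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)
averaged = dup.pivot_table(index="date", columns="item", values="price", aggfunc="mean")

print(wide)
print(averaged)
```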

Using stack() and unstack() to Reshape Your Pandas DataFrame

When you stack a DataFrame, you make it taller: you move the innermost column index to become the innermost row index. The result has an index with a new innermost level of row labels.

The inverse of stacking is called unstacking. Much like stack(), you use unstack() to move the innermost row index to become the innermost column index.
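Stacking and unstacking can be sketched as a round trip on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

# Column labels become the innermost level of the row index
stacked = df.stack()

# The innermost row level moves back to the columns
unstacked = stacked.unstack()

print(stacked)
print(unstacked)
```

Because df has only one level of column labels, stack() returns a Series with a two-level row index.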

Reshaping Your DataFrame With Melt()

Melting is useful when you have data with one or more columns that are identifier variables, while all other columns are considered measured variables.

These measured variables are all “unpivoted” to the row axis. That is, whereas the measured variables were spread out over the width of the DataFrame, melt() places them along its height. In other words, your DataFrame becomes longer instead of wider.

As a result, you just have two non-identifier columns, namely, ‘variable’ and ‘value’.
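A minimal sketch of melt(), with one hypothetical identifier column ("id") and two measured columns:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "height": [170, 180],
    "weight": [60, 80],
})

# "id" stays as the identifier; the other columns are unpivoted into rows
long = pd.melt(df, id_vars="id")

print(long)
```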

60. How To Write a Pandas DataFrame to a File

When you have done your data munging and manipulation with Pandas, you might want to export the DataFrame to another format. This section will cover two ways of outputting your DataFrame: to a CSV or to an Excel file.

Outputting a DataFrame to CSV

To output a Pandas DataFrame as a CSV file, you can use to_csv().
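For instance, a sketch with a made-up DataFrame; the file name in the comment is hypothetical, and the example keeps the CSV text in memory so that nothing is written to disk:

```python
import io
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"name": ["Sam", "Alex"], "age": [14, 55]})

# Without a path, to_csv() returns the CSV text as a string.
# To write a file instead: df.to_csv("people.csv", index=False)
csv_text = df.to_csv(index=False)
print(csv_text)

# Reading the text back gives the same values
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Passing index=False leaves the row labels out of the file, which is usually what you want when the index is just the default 0, 1, 2, ...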

Writing a DataFrame to Excel

Very similar to what you did to output your DataFrame to CSV, you can use to_excel() to write your table to Excel.
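A hedged sketch of the Excel case follows. Writing .xlsx output needs an optional engine such as openpyxl, so the example guards against it being missing; the sheet name and the data are made up, and an in-memory buffer stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"name": ["Sam", "Alex"], "age": [14, 55]})

# An in-memory buffer stands in for a path such as "people.xlsx"
buffer = io.BytesIO()
try:
    df.to_excel(buffer, sheet_name="people", index=False)
    excel_written = True
except ImportError:
    # No Excel engine (for example openpyxl) is installed
    excel_written = False

print(excel_written)
```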

61. How will you get the top 2 rows from a DataFrame in pandas?

# Select the first 2 rows of the DataFrame
# (empDfObj is assumed to be an existing DataFrame)
dfObj1 = empDfObj.head(2)
print("First 2 rows of the DataFrame:")
print(dfObj1)


62. How will you get the average of values of a column in pandas DataFrame? 

The Pandas DataFrame.mean() function returns the mean of the values over the requested axis. If the method is applied to a pandas Series object, it returns a scalar value, which is the mean of all the observations in the Series. If the method is applied to a pandas DataFrame object, it returns a pandas Series object that contains the mean of the values over the specified axis.

Syntax (pandas 2.x): DataFrame.mean(axis=0, skipna=True, numeric_only=False, **kwargs)

Parameters :

axis : {index (0), columns (1)}

skipna : Exclude NA/null values when computing the result (default True)

numeric_only : Include only float, int, boolean columns (default False)

level : Accepted only by pandas versions before 2.0. For a MultiIndex (hierarchical) axis, group with groupby(level=...) and call mean() on the groups instead

Returns : Series or scalar

Example : Use mean() function to find the mean of all the observations over the index axis.

# importing pandas as pd
import pandas as pd
 
# Creating the dataframe 
df = pd.DataFrame({"A":[12, 4, 5, 44, 1],
                   "B":[5, 2, 54, 3, 2], 
                   "C":[20, 16, 7, 3, 8],
                   "D":[14, 3, 17, 2, 6]})
 
# Print the dataframe
df

Output:

    A   B   C   D
0  12   5  20  14
1   4   2  16   3
2   5  54   7  17
3  44   3   3   2
4   1   2   8   6

Let’s use the dataframe.mean() function to find the mean over the index axis.

# Even if we do not specify axis = 0,
# the method will return the mean over
# the index axis by default
df.mean(axis = 0)

Output:

A    13.2
B    13.2
C    10.8
D     8.4
dtype: float64
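Since the question asks about a single column, here is a short sketch that selects one column first. It reuses two columns of the example data; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, 44, 1],
                   "B": [5, 2, 54, 3, 2]})

# Mean of one column: selecting a column gives a Series,
# so mean() returns a scalar
mean_a = df["A"].mean()

# Mean of each row instead: one value per index label
row_means = df.mean(axis=1)

print(mean_a)
print(row_means)
```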

63. How can you check if a DataFrame is empty in pandas?

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. It is the primary data structure of Pandas.

The Pandas DataFrame.empty attribute checks whether the DataFrame is empty. It returns True if the DataFrame is empty; otherwise it returns False.

Syntax: DataFrame.empty

Parameter : None

Returns : bool

Example #1: Use DataFrame.empty attribute to check if the given dataframe is empty or not

 
# importing pandas as pd
import pandas as pd
 
# Creating the DataFrame
df = pd.DataFrame({'Weight':[45, 88, 56, 15, 71],
                   'Name':['Sam', 'Andrea', 'Alex', 'Robin', 'Kia'],
                   'Age':[14, 25, 55, 8, 21]})
 
# Create the index
index_ = ['Row_1', 'Row_2', 'Row_3', 'Row_4', 'Row_5']
 
# Set the index
df.index = index_
 
# Print the DataFrame
print(df) 

Output :

       Weight    Name  Age
Row_1      45     Sam   14
Row_2      88  Andrea   25
Row_3      56    Alex   55
Row_4      15   Robin    8
Row_5      71     Kia   21

Now we will use DataFrame.empty attribute to check if the given dataframe is empty or not.

# check if there is any element
# in the given dataframe or not
result = df.empty
# Print the result
print(result)

Output :

False

As we can see in the output, the DataFrame.empty attribute has returned False indicating that the given dataframe is not empty.
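For contrast, a small sketch of cases where the attribute is True. The frames below are made up; note that a DataFrame holding only NaN values is not considered empty, because it still has rows.

```python
import numpy as np
import pandas as pd

# A frame with columns but no rows is empty
no_rows = pd.DataFrame(columns=["Weight", "Name", "Age"])

# A frame whose only value is NaN still has one row, so it is not empty
only_nan = pd.DataFrame({"Weight": [np.nan]})

print(no_rows.empty)            # True
print(only_nan.empty)           # False
print(only_nan.dropna().empty)  # True once the NaN row is dropped
```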

64. How can we retrieve a row in pandas DataFrame ?

Pandas provides label-based indexing to retrieve rows from a DataFrame. DataFrame.loc[] is an indexer that takes index labels and returns a row or a DataFrame if the labels exist in the calling DataFrame; a missing label raises a KeyError.

Syntax: pandas.DataFrame.loc[ ]

Parameters:

Index label: A string or a list of strings naming the index labels of the rows

Return type: DataFrame or Series, depending on the parameters
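A minimal sketch, assuming a made-up DataFrame with string row labels; iloc is shown alongside only as the position-based counterpart:

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame(
    {"Weight": [45, 88, 56], "Age": [14, 25, 55]},
    index=["Row_1", "Row_2", "Row_3"],
)

# A single label returns the row as a Series
one_row = df.loc["Row_2"]

# A list of labels returns a DataFrame
two_rows = df.loc[["Row_1", "Row_3"]]

# iloc selects by integer position instead of by label
first_row = df.iloc[0]

print(one_row)
print(two_rows)
```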

65. What is pylab?

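PyLab is a module that ships with Matplotlib and bulk-imports matplotlib.pyplot and NumPy into a single namespace. It does not include SciPy, and the Matplotlib documentation discourages its use in favor of importing each library explicitly.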

66. What are operations on Series in pandas?

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Pandas Series is comparable to a single column in an Excel sheet.

Creating a Pandas Series

In practice, a Pandas Series is usually created by loading data from existing storage, such as a SQL database, a CSV file, or an Excel file. A Series can also be created from a list, a dictionary, a scalar value, and so on. Here are some of the ways to create a Series:

Creating a Series from an array: To create a Series from an array, import the numpy module and use its array() function.

# import pandas as pd
import pandas as pd
 
# import numpy as np
import numpy as np
 
# simple array
data = np.array(['p', 'a', 'n', 'd', 'a', 's'])
 
ser = pd.Series(data)
print(ser)

Output : a Series holding the six letters p, a, n, d, a, s, labeled with the default integer index 0 to 5.
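The other creation routes mentioned above (a list, a dictionary, and a scalar value) can be sketched the same way. The values and labels below are made up for illustration.

```python
import pandas as pd

# From a list: the index defaults to 0, 1, 2, ...
from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"a": 1, "b": 2})

# From a scalar: the value is repeated for every label in the given index
from_scalar = pd.Series(5, index=["x", "y", "z"])

print(from_list)
print(from_dict)
print(from_scalar)
```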