Querying Numpy Arrays Using the Where() Method in Python

NumPy Tutorial: How to use numpy.where()

When you are working with a large data set, it is difficult to pick out the entries that satisfy a particular condition, and harder still to change them.

Do you want to know how to transform an array (or list) in one step based on some condition? Then read along. In this article, we discuss the numpy.where() function and the different ways of using it.

After reading this article, you will know :

  • What is numpy.where() and its syntax.
  • Using numpy.where() with single condition.
  • Using numpy.where() with multiple conditions.
  • Using numpy.where() with conditions and optional arguments x,y.
  • Using numpy.where() on 1D,2D and multi-dimensional Arrays.
  • Real World Use-Cases of numpy.where()


Syntax of numpy.where()

numpy.where(condition[, x, y])

Parameters :

condition – a NumPy Boolean array, an array-like object (such as a list) holding True/False values, or a conditional expression that evaluates to a single Boolean value or to a Boolean array.

For example, 

  • It can be a NumPy Boolean array.
#example for NumPy Boolean array
import numpy as np

arr_bool=np.array([True,False,True])
print(arr_bool)

###
[ True False  True]
  • It can be an expression that evaluates and returns a NumPy Boolean array.
arr1=np.array([1,2,3,4,5,6])
#expression that returns a NumPy Boolean array
print(arr1>3)

###
[False False False  True  True  True]
  • It can be an array-like object (list).
#Array-like object
arr2=[True,False,False]
print(arr2)

###
[True, False, False]
  • A simple conditional expression that evaluates to either True or False.

Eg: 9>5 

  • A multi conditional expression that evaluates to either True or False. 

Eg : (x>y) & (y<z). The parentheses are required because & binds more tightly than the comparison operators.
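As a small illustration of the last bullet, the sketch below combines two comparisons element-wise. The arrays x, y and z hold made-up values chosen only for this example.

```python
import numpy as np

# Hypothetical sample values, used only for illustration
x = np.array([5, 1, 7, 3])
y = np.array([2, 4, 6, 8])
z = np.array([3, 2, 5, 9])

# Each comparison is wrapped in parentheses before combining
both = (x > y) & (y < z)     # True where both comparisons hold
either = (x > y) | (y < z)   # True where at least one holds

print(both)    # [ True False False False]
print(either)  # [ True False  True  True]
```

Either Boolean array can then be passed to numpy.where() as the condition argument.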

x,y – These are Optional Arguments. Either both x and y are passed or both are not passed.

Where the condition is True, the corresponding element of x is placed in the result array.

Where the condition is False, the corresponding element of y is placed in the result array.

x,y can be 

  • Single Values. 
  • Variables 
  • NumPy arrays
  • Array-like objects(lists) 

NOTE: If all three parameters (condition, x, y) are arrays, their shapes must be broadcastable to a common shape. Otherwise, a ValueError is raised.

Return Value: 

This function returns a NumPy array when x and y are passed, and a tuple of NumPy arrays (one per dimension) when only the condition is passed.

Working :

  • When only the condition is passed to the function, it returns a tuple of NumPy arrays, one per dimension, that contain the indices of the elements satisfying the condition.
  • When the condition is passed along with x and y, the function returns a NumPy array that takes its elements from x where the condition is True and from y where the condition is False.

Now that we know the syntax, let us try to understand the behavior of this function in different scenarios with examples.
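The two call forms can be compared side by side. This is a minimal sketch on a made-up array (arr is not part of the examples that follow); it only prints the type and contents of each result.

```python
import numpy as np

# Hypothetical sample array, used only for illustration
arr = np.array([4, -1, 7, -3])

# Condition only: a tuple with one index array per dimension
idx = np.where(arr > 0)

# Condition with x and y: an array built element by element
picked = np.where(arr > 0, arr, 0)

print(type(idx).__name__, idx[0])     # tuple [0 2]
print(type(picked).__name__, picked)  # ndarray [4 0 7 0]
```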

Example 1: numpy.where() with a single condition and x,y as single values.

Consider an example where you have to find out whether the integers in a given array are positive or negative.

import numpy as np

#an array with integers
arr1=np.array([-1,-2,2,3,4,1,-8,-5,-3])

result= np.where(arr1>0,"Positive","Negative")
print(result)

The output of the above program will be :

['Negative' 'Negative' 'Positive' 'Positive' 'Positive' 'Positive'
'Negative' 'Negative' 'Negative']

Explanation: In the above example, we first create an array. We then pass the conditional expression arr1>0 to np.where(). This expression evaluates to [False, False, True, True, True, True, False, False, False]. For every True entry, Positive is placed in the result array. For every False entry, Negative is placed in the result array.


Example 2: numpy.where() with single condition and x,y are variables

Consider the example below, where we pass the variables x and y.

import numpy as np

#an array with integers
arr1=np.array([-1,-2,2,3,4,1,-8,-5,-3])

# assigning required values to the variables
x="Positive"
y="Negative"

#Note that we are passing these variables in the function call
result= np.where(arr1>0,x,y)
print(result)

The output is:

['Negative' 'Negative' 'Positive' 'Positive' 'Positive' 'Positive'
'Negative' 'Negative' 'Negative']

Explanation: The logic remains the same as the previous example. Instead of using single values like “Positive”, “Negative”, we are using variables to store these values and passing the variables to the function call.

Example 3: numpy.where() with single condition and x,y as array-like objects

Consider an example where we have to replace the negative elements of an array with their positive counterparts (by multiplying them by -1).

import numpy as np

#an array with integers
arr1=np.array([-1,-2,2,3,4,1,-8,-5,-3])

result= np.where(arr1>0,arr1,arr1*(-1))
print(result)

The output is : 

[1 2 2 3 4 1 8 5 3]

Explanation :

In the above example, 

  • arr1>0 is a condition expression that returns a Boolean array  

[False, False, True, True, True, True, False, False, False]

  • x array is arr1 with values [-1,-2,2,3,4,1,-8,-5,-3]
  • y array is arr1*(-1) that evaluates to [1,2,-2,-3,-4,-1,8,5,3]
  • Every False in the Boolean array is replaced with an element of y array i.e arr1*(-1) in our case. So, the first element will be 1, the second will be 2, and so on.
  • Every True in the Boolean array is replaced with an element of x array i.e arr1 in our case. So, the third element will be picked from arr1 i.e 2, the fourth element will be picked from arr1 i.e 3, and so on.
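As a quick sanity check on the steps above, the sketch below compares the np.where() result with NumPy's built-in np.abs(), which computes the same absolute values directly. The comparison is an addition for illustration and not part of the original example.

```python
import numpy as np

arr1 = np.array([-1, -2, 2, 3, 4, 1, -8, -5, -3])

# Same call as in Example 3
result = np.where(arr1 > 0, arr1, arr1 * (-1))

# np.abs() yields the absolute value of every element
same = np.array_equal(result, np.abs(arr1))

print(result)  # [1 2 2 3 4 1 8 5 3]
print(same)    # True
```

For this particular task np.abs() is the shorter route; np.where() becomes useful when the two branches are not simply sign flips of each other.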

Note that the parameters within the function call, namely, condition, x,y should be of broadcastable shape. Else, there will be an error.

For example, let us see what happens if we had a condition like [True, False] instead of arr1>0

import numpy as np

#an array with integers
arr1=np.array([-1,-2,2,3,4,1,-8,-5,-3])

#creating a boolean array to pass as an argument to the function call
con = np.array([True,False])

result= np.where(con,arr1,arr1*(-1))
print(result)

An error is raised as shown below,

Traceback (most recent call last):
  File "C:\Users\admin\Desktop\beapythondev\main.py", line 6, in <module>
    result= np.where(con,arr1,arr1*(-1))
  File "<__array_function__ internals>", line 5, in where
ValueError: operands could not be broadcast together with shapes 
  (2,) (9,) (9,) 

The error is seen in this case, as the boolean array is not broadcastable to the shape of the x,y arrays. 

If the Boolean array had just one value ([True] or [False]), it would be broadcastable and no error would be raised. Refer to the example below.

import numpy as np

#an array with integers
arr1=np.array([-1,-2,2,3,4,1,-8,-5,-3])

#creating a boolean array to pass as an argument to the function call
con = np.array([True])

result= np.where(con,arr1,arr1*(-1))
print(result)

The output is 

[-1 -2  2  3  4  1 -8 -5 -3]

Explanation: In this case, the boolean array could broadcast itself as [True, True, True, True, True, True, True, True, True]. Thus, every True value is replaced with an element from x array i.e arr1 in this case.
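Broadcasting also applies across dimensions. The sketch below uses made-up values (cond, x and y are hypothetical names for this illustration): a condition of shape (2, 3), a one-dimensional x of shape (3,) and a scalar y. Here x is reused for every row and y is reused for every element.

```python
import numpy as np

# Hypothetical 2D Boolean condition, shape (2, 3)
cond = np.array([[True, False, True],
                 [False, True, False]])

x = np.array([10, 20, 30])  # shape (3,), broadcast along the rows
y = -1                      # scalar, broadcast to every position

result = np.where(cond, x, y)

print(result.shape)  # (2, 3)
print(result)
```

The result takes 10, 20 or 30 (depending on the column) where cond is True and -1 everywhere else.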

Example 5:  numpy.where() with a condition only

Consider an example where you have to find the indices of the non-zero elements in an array. To achieve this, we can pass only the condition argument without the array arguments.

import numpy as np

#an array with integers
arr1=np.array([0,1,3,4,0,0,0,8,6,0,4])

#note that only condition is passed without the array arguments
result= np.where(arr1!=0)
print(result)

The output would be :

(array([ 1,  2,  3,  7,  8, 10], dtype=int64),)

Explanation: The conditional expression arr1!=0 evaluates to [False True True True False False False True True False True]. The indices of the elements that evaluate to True are placed in the result array. Thus, the result array holds the values [1,2,3,7,8,10], which are the indices of the non-zero elements in the given array. Note that the function returns a tuple containing one NumPy array of index values per axis. In this case, the tuple contains just one array because arr1 is a 1D array.

Now, let us see what happens when we have a 2D array.
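The returned tuple can be used directly as an index into the original array. The following sketch repeats the array from the example and uses the tuple to pull out the matching elements.

```python
import numpy as np

arr1 = np.array([0, 1, 3, 4, 0, 0, 0, 8, 6, 0, 4])

# Tuple holding one array of indices (arr1 is 1D)
idx = np.where(arr1 != 0)

# Indexing with the tuple selects the elements at those positions
non_zero = arr1[idx]

print(idx[0])    # [ 1  2  3  7  8 10]
print(non_zero)  # [1 3 4 8 6 4]
```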

import numpy as np

#create a 2D array
arr1=np.array([[0,1,3,4],
              [0,0,0,8],
              [6,0,4,9]])

#note that only condition is passed without the array arguments
result= np.where(arr1!=0)
print(result)

The output would be 

(array([0, 0, 0, 1, 2, 2, 2], dtype=int64), 
array([1, 2, 3, 3, 0, 2, 3], dtype=int64))

Explanation

  • In this case, the conditional expression evaluates to a 2D array

[[False  True  True  True]
 [False False False  True]
 [ True False  True  True]]

  • The function returns a tuple, that contains two numpy arrays(one for each dimension) holding the index values of the elements that evaluate to True.
  • Reading the two arrays in parallel gives the (row, column) position of each matching element: the first match is at row 0, column 1, the second at row 0, column 2, and so on.
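The pairing of the two index arrays can be sketched in code as follows. The variable names rows, cols and coords are chosen for this illustration only.

```python
import numpy as np

arr1 = np.array([[0, 1, 3, 4],
                 [0, 0, 0, 8],
                 [6, 0, 4, 9]])

# One index array per dimension: row indices and column indices
rows, cols = np.where(arr1 != 0)

# Pair them up to get (row, column) coordinates
coords = list(zip(rows.tolist(), cols.tolist()))
print(coords)
# [(0, 1), (0, 2), (0, 3), (1, 3), (2, 0), (2, 2), (2, 3)]

# Indexing with both arrays returns the matching elements
print(arr1[rows, cols])  # [1 3 4 8 6 4 9]
```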

Real-World Use Cases

Now, let us see a few examples where numpy.where() can be used in real-world scenarios.

Creating/Modifying Columns based on Condition in Pandas 

If you have worked with SQL, you will know the WHERE clause, which specifies the condition used to select entries. The numpy.where() function can be used in a similar way, not just to fetch the entries that satisfy a condition but also to act upon them.

Consider a pandas data frame as shown below

import pandas as pd
import numpy as np

# create a pandas dataframe with values representing
# the credit card expenditure of people
df  = pd.DataFrame({"Customer_Name"    :["Aaron", "Baron", "Carrel",
                                        "Dinky", "Pinky", "Scott"],
                  "Grocery_Purchases": [180, 400, 90, 79, 468, 78],
                  "Total_Purchase"   :[1800, 900, 500,400, 1000, 100] })

print(df)

The output is 

  Customer_Name  Grocery_Purchases  Total_Purchase
0         Aaron                180            1800
1         Baron                400             900
2        Carrel                 90             500
3         Dinky                 79             400
4         Pinky                468            1000
5         Scott                 78             100

This is a simple data set representing the credit card expenditure of a bank's customers. Let’s say the bank decides to give a cashback of 5% of the total purchase to customers who spend at least $700 per month, provided their grocery expenditure is at least $100.

To calculate the Cashback for eligible customers we can use numpy.where() as shown below

df['Cashback']=np.where((df.Total_Purchase>=700) & 
(df.Grocery_Purchases>=100), df.Total_Purchase*0.05 , 0 )
print(df)

The output is 

  Customer_Name  Grocery_Purchases  Total_Purchase  Cashback
0         Aaron                180            1800      90.0
1         Baron                400             900      45.0
2        Carrel                 90             500       0.0
3         Dinky                 79             400       0.0
4         Pinky                468            1000      50.0
5         Scott                 78             100       0.0

Explanation: First we check for entries whose Total_Purchase value is greater than or equal to 700 and whose Grocery_Purchases value is greater than or equal to 100. We create a column named Cashback. The cashback value is calculated only if both conditions are met. Otherwise, the Cashback value is 0.
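To see what the combined condition looks like on its own, the sketch below repeats the calculation on plain NumPy arrays holding the same figures. The names total, grocery and eligible are introduced here for illustration and do not appear in the data frame code.

```python
import numpy as np

# Same figures as the data frame columns
total = np.array([1800, 900, 500, 400, 1000, 100])
grocery = np.array([180, 400, 90, 79, 468, 78])

# Both conditions must hold; each comparison is parenthesised
eligible = (total >= 700) & (grocery >= 100)

# 5% of the total purchase where eligible, otherwise 0
cashback = np.where(eligible, total * 0.05, 0)

print(eligible)
print(cashback)
```

Only the first, second and fifth customers satisfy both conditions, which matches the non-zero Cashback rows in the output above.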

Data Masking involving boolean arrays as conditions

Consider a calculation as shown, ab+(1-a)c

where the value of a can either be 0 or 1.

Expressions of this form appear in various statistical calculations. Example: binary cross-entropy, -(i * log(j) + (1 - i) * log(1 - j)).mean()

Here a -> i, b -> log(j) and c -> log(1-j).

We can use the numpy.where() function in such cases. Consider the example below.

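import numpy as np

# create a 3D array holding the values 0 to 26
b=np.arange(0,27).reshape(3,3,3)

# create a second 3D array holding the values 28 to 54
c=np.arange(28,55).reshape(3,3,3)

# Create an array that contains values 0 and 1 
# to represent a boolean array. The values are random,
# so they differ from run to run; one possible result is shown.
a=np.random.randint(0, 2, size=(3, 3, 3))

print(a)

>>> [[[0 0 0]
  [0 1 0]
  [0 1 0]]

[[0 0 0]
  [1 0 0]
  [0 1 1]]

[[0 0 1]
  [0 1 0]
  [0 0 0]]]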

print(b)

>>>[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

[[ 9 10 11]
  [12 13 14]
  [15 16 17]]

[[18 19 20]
  [21 22 23]
  [24 25 26]]]

print(c)

>>>[[[28 29 30]
  [31 32 33]
  [34 35 36]]

[[37 38 39]
  [40 41 42]
  [43 44 45]]

[[46 47 48]
  [49 50 51]
  [52 53 54]]]

# The expression ab+(1-a)c yields
a*b + (1-a)*c

>>>array([[[28, 29, 30],
        [31,  4, 33],
        [34,  7, 36]],

      [[37, 38, 39],
        [12, 41, 42],
        [43, 16, 17]],

      [[46, 47, 20],
        [49, 22, 51],
        [52, 53, 54]]])

When you perform np.where(a,b,c) you get the same result.

np.where(a,b,c)

>>> array([[[28, 29, 30],
        [31,  4, 33],
        [34,  7, 36]],

      [[37, 38, 39],
        [12, 41, 42],
        [43, 16, 17]],

      [[46, 47, 20],
        [49, 22, 51],
        [52, 53, 54]]])

Note that in real-world scenarios, the values of b and c won’t be this simple and will typically come from logarithmic or trigonometric expressions.
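As a concrete instance, here is a minimal sketch of the binary cross-entropy expression mentioned earlier, computed once with the arithmetic form and once with np.where(). The labels i and predicted probabilities j are made-up values for illustration.

```python
import numpy as np

# Hypothetical labels (0 or 1) and predicted probabilities
i = np.array([1, 0, 1, 0])
j = np.array([0.9, 0.2, 0.6, 0.4])

# Arithmetic form: a*b + (1-a)*c with a=i, b=log(j), c=log(1-j)
arithmetic = -(i * np.log(j) + (1 - i) * np.log(1 - j)).mean()

# Same selection expressed with np.where()
masked = -np.where(i == 1, np.log(j), np.log(1 - j)).mean()

print(np.isclose(arithmetic, masked))  # True
```

Note that np.where() still evaluates both log(j) and log(1-j) for every element before choosing between them, so it does not skip work on the branch that is not selected.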

That’s All.

We hope this tutorial has been informative. Stay tuned for more such tutorials.

Till then happy Pythoning!

Authored by: Anusha Pai
About: Anusha is a Software Engineer with good experience in the IT industry. She has been using Python throughout her career. She has a passion for writing. She loves writing about Windows and Python-related stuff. In her free time, she practices Yoga.

