Introduction to Statistics with NumPy

Introduction

You’re a citizen scientist who has started collecting data about rising water in the river next to where you live. For months, you painstakingly measure the water levels and enter your findings into a notebook. But at the end of it, what exactly do you have? What can all this data tell you?

In this lesson, we’ll explore how we can use NumPy to analyze data. We’ll learn different methods to calculate common statistical properties of a dataset, such as the mean and standard deviation. By the end, you’ll be able to do basic analysis of a dataset and understand how we can use statistics to draw conclusions from data.

The statistical concepts that we’ll cover include:

- Mean
- Median
- Percentiles
- Interquartile Range
- Outliers
- Standard Deviation
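
As a rough preview, here is a minimal sketch of how each concept in the list above might be computed with NumPy. The water-level readings are made-up values for illustration, the variable names are our own, and the 1.5 × IQR cutoff for outliers is one common convention that we are assuming here, not the only possible definition.

```python
import numpy as np

# Hypothetical water-level readings in centimeters (invented for illustration)
water_levels = np.array([10, 12, 11, 13, 12, 14, 30, 12])

mean = np.mean(water_levels)      # 14.25
median = np.median(water_levels)  # 12.0

# 25th and 75th percentiles (NumPy interpolates linearly by default)
q1 = np.percentile(water_levels, 25)  # 11.75
q3 = np.percentile(water_levels, 75)  # 13.25

# Interquartile range: the spread of the middle half of the data
iqr = q3 - q1  # 1.5

# Assumed convention: flag values more than 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (water_levels < lower_fence) | (water_levels > upper_fence)
outliers = water_levels[is_outlier]  # array([30])

# Standard deviation (np.std computes the population version by default)
std = np.std(water_levels)

print(mean, median, q1, q3, iqr, outliers, round(std, 2))
```

Notice that the single reading of 30 pulls the mean above the median. Comparing the two is often a quick first hint that a dataset may contain an outlier.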

To start, we’ll be analyzing *single-variable* datasets. One way to think of a single-variable dataset is that it contains answers to a question. For instance, we might ask 100 people, “How tall are you?” Their heights in inches would form our dataset.

For our purposes, we’ll be organizing our datasets into NumPy arrays. To learn more about NumPy arrays, take our course Learn NumPy: Introduction.
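
To make this concrete, here is a small sketch of a single-variable dataset stored as a NumPy array. The six heights are invented for illustration (a real survey would have all 100 answers), and the name `heights` is our own choice.

```python
import numpy as np

# Hypothetical answers to "How tall are you?", in inches (invented values)
heights = np.array([64, 70, 68, 62, 71, 66])

# A single-variable dataset fits in a one-dimensional array:
# one number per person who answered the question.
print(heights.ndim)   # 1
print(heights.shape)  # (6,)

# Simple summaries are available directly on the array
print(heights.min(), heights.max())  # 62 71
```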

To your right, you’ll find a Jupyter notebook with some example calculations using NumPy. We won’t be using Jupyter notebooks in this lesson, but they’re a great way of combining text, code, and visualization. You can find out more about them on the Jupyter website.

Don’t worry about understanding the individual lines of code; this example is just meant to show you the types of things that you’ll be learning in this lesson.

When you’re ready, continue to the first exercise.