Data Distribution
Earlier in this tutorial we have worked with very small amounts of data in our examples, just to understand the different concepts.
In the real world, data sets are much bigger, but real-world data can be difficult to gather, at least at an early stage of a project.
To create big data sets for testing, we use the Python module NumPy, which comes with a number of methods for creating random data sets of any size.
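As a minimal sketch of that idea, the snippet below creates an array of random floats with NumPy's `numpy.random.uniform`. The size (250) and the range (0 to 5) are assumptions, chosen to match the numbers used later on this page.

```python
import numpy

# Assumed parameters: 250 random floats drawn uniformly from 0.0 up to 5.0.
x = numpy.random.uniform(0.0, 5.0, 250)

# The values are random, so the printed array differs on every run.
print(x)
```

Changing the third argument changes how many values are generated.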
Histogram
To visualize the data set we can draw a histogram with the data we collected.
We will use the Python module Matplotlib to draw a histogram.
Learn about the Matplotlib module in our Matplotlib Tutorial.
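A histogram like this could be drawn as sketched below. The data (250 random floats between 0 and 5) and the output file name `histogram.png` are assumptions made for illustration. Passing `range=(0, 5)` is a choice made here so that each of the 5 bars covers exactly one unit.

```python
import numpy
import matplotlib.pyplot as plt

# Assumed data: 250 random floats between 0 and 5.
x = numpy.random.uniform(0.0, 5.0, 250)

# Draw 5 bars; range=(0, 5) puts the bar edges at 0, 1, 2, 3, 4 and 5.
counts, edges, patches = plt.hist(x, bins=5, range=(0, 5))

# Save the figure to a file (hypothetical name).
# Call plt.show() instead to open the plot in a window.
plt.savefig("histogram.png")
```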
We draw a histogram with 5 bars from an array of 250 random values between 0 and 5.
The first bar represents how many values in the array are between 0 and 1.
The second bar represents how many values are between 1 and 2.
And so on for the remaining bars.
Because the values are random, the exact counts differ from run to run. One run gave this result:
- 52 values are between 0 and 1
- 48 values are between 1 and 2
- 49 values are between 2 and 3
- 51 values are between 3 and 4
- 50 values are between 4 and 5
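The same per-bar counts can also be produced as text, without drawing anything. The sketch below uses `numpy.histogram` for this, again assuming 250 random floats between 0 and 5. Your counts will not match the list above exactly, since the data is random.

```python
import numpy

# Assumed data: 250 random floats between 0 and 5.
x = numpy.random.uniform(0.0, 5.0, 250)

# Count the values in 5 equal-width intervals between 0 and 5.
counts, edges = numpy.histogram(x, bins=5, range=(0, 5))

# Print one line per interval, in the same form as the list above.
for low, high, count in zip(edges[:-1], edges[1:], counts):
    print(f"{count} values are between {low:g} and {high:g}")
```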
Big Data Distributions
An array of 250 values is not considered very big, but you now know how to create a random set of values, and by changing the parameters you can make the data set as big as you want.
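To illustrate changing the parameters, the sketch below wraps the random-data call in a small helper. The name `make_data_set`, its default range, and the size of 100000 are all hypothetical choices for this example, not part of NumPy.

```python
import numpy

def make_data_set(size, low=0.0, high=5.0):
    """Hypothetical helper: return `size` random floats between low and high."""
    return numpy.random.uniform(low, high, size)

# Same call, different size parameter.
small = make_data_set(250)
big = make_data_set(100000)

print(len(small), len(big))
```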