Plotting charts in Python vs. Excel: A Demo

We previously discussed here that plotting charts in Python is second priority for beginners. Indeed, Python’s matplotlib library is very useful for creating elegant charts from large data sets. However, you need to remember a lot of code to make it work well!

Example: COVID-19 public data set

We will use the COVID-19 Case Surveillance Public Data to demonstrate the difference between Excel and Python for plotting charts. For example, suppose we want to see whether we’ve improved at protecting senior citizens. Then, the two simplest charts we could look at are:

  • How many cases there were over time in each age group.
  • The percentage of total cases in each age group.

The data set is too big for Excel, so we need to analyze it in Python first

First, we download the .csv data file and read it to Python. As we’ve probably guessed from reading the news, the data has several million rows.

Code for importing libraries and reading the file. The file has 4.5 million rows.

Each row in the data set represents one COVID-19 patient, and contains details about them. We can look at the data columns and their data types using the code below. Hence, we find that every column has text data.

Output of the .dtypes attribute for the DataFrame of COVID-19 surveillance data. Every column is an "object" datatype, meaning that it contains text data.

Before we do any plotting, we need to first transform the date columns from text into dates. To this end, we use the pd.to_datetime function. Subsequently, we also check whether the DataFrame has missing dates, using .isna(). Because the positive test date (pos_spec_dt) and the symptom onset date (onset_dt) each have more than 2 million missing values, we can only use the date that the case was reported to CDC (cdc_report_dt).

Code converting 3 date columns: the CDC report date, positive test date, and symptom onset date into datetime values.

Cleaning the data and grouping into daily time series by age group

Second, we want to clean the data to just the records which have a patient age and are after March 1. With the code below, we can see that we still have 99.8% of the original rows. Therefore, our data should still be fairly representative of the original set.

Code to filter for only patients with a known age group, and data collected on or after 1 March 2020.

Third, we can use pd.crosstab to group the data with the CDC report date as rows, and the number of patients in each age group for that date as columns. We will then get 199 days’ worth of data from March 1 to September 15, which we can write to a .csv file.

Code and output to show how to use pd.crosstab to get the number of patients per day in each age group.

At this point, we can take a look at the daily trends in Excel. Below, I created the two charts using Excel’s default settings. The only change I made was to edit the chart titles to show clearly what we are plotting. We can see that the data is rather noisy, with huge spikes on a few days. Therefore, we can’t really see if there was a spike in senior citizen case numbers, even though we can tell that the percentage of total cases affecting senior citizens aged 60 and above has decreased over time from over 30% of the total, to stabilize at about 20%.

Charts of daily COVID-19 cases by age group - stacked bar chart of absolute cases, and area chart showing the percentage of total cases.

At this point, hopefully you appreciate how Python can convert 4.5 million rows of data into 199 rows that you can plot in Excel, with just 5 lines of code! Still, we can do more to make the charts easier to interpret.

Regrouping the data into weeks for plotting charts in Python

Fourth and last, we want to group the data into weekly totals, to get rid of the sharp daily variations. Because the .resample method groups weeks from Monday to Sunday, I filtered the data to the earliest Monday and the latest Sunday before using it. Also, I set the index of the DataFrame to the CDC report date, because .resample only works when the dates are in the index column.

Code for resampling the daily data into weekly totals. We end up with a DataFrame showing the end date of each week, and the sum of cases in each age group for the week in the columns.

You can see “method chaining” in the code above, meaning that we string together as many methods as possible in the order that we want to perform them. Experienced programmers like to do this, in order to use fewer lines of code. Also, the back slash after set_index indicates that the code will continue on to a new line. This is useful when you have a line of code that is wider than the page, and want others to read your code without having to scroll right.

Plotting charts in Python vs. Excel: weekly stacked bar chart

As a start, let’s try plotting the stacked bar chart of total weekly cases in Excel and Python. If we stick to only the bare minimum edits (i.e. just the chart title), neither chart is particularly reader-friendly.

Screen shots of plotting data in Python vs. Excel with default settings, for stacked bar chart of all COVID-19 cases by age group.
Problems with Python chartProblems with Excel chart
1. Legend overlaps with the data.
2. Horizontal axis has date-times instead of dates, so it looks cluttered.
3. Horizontal axis has a title cdc_report_dt from the raw data, which is unsightly.
1. Columns are spaced out too widely. This is because Excel expects there to be daily data, as the horizontal axis has a date format. However, we only have one data row per week.

Next, let’s try to fix the various visual problems. In Python, we need 2 more lines of code to replace the horizontal axis tick marks with dates, and remove the axis title. We also have to put one more keyword argument when calling the .plot method, to make the plot bigger. Conversely, for the Excel chart, we only need to re-format the horizontal axis from a date format to a text format, to get rid of the unsightly gaps.

Comparison of the changes to fix readability issues in the charts when plotting data in Python vs. Excel.

Plotting a stacked area chart in Python vs. Excel

Even though our charts are more aesthetically pleasing now, we still can’t easily draw conclusions about the impact on senior citizens. Because the overall bar heights are so different, it’s difficult to tell if the proportion of cases that are senior citizens has increased or decreased. Therefore, we can try creating a stacked area chart instead.

In Excel, we can do this in a minute or two – just select the data, choose a stacked area chart and edit the title. Not only do the results look presentable, they are easy to read and interpret.

Unlike Excel, we need substantial coding to create the same chart in Python. Firstly, we have to convert the data from raw case counts to the percentage of total in each week for each age group.

Code to convert the data from the total case count to the percentage of weekly totals, using .apply(lambda x: x/x.sum(), axis = 1), for plotting area charts in Python.

Secondly, we need to configure the plotting command for the area chart differently from the stacked bar chart, because of restrictions in the chart types supported by the various plotting methods. When plotting charts, we are using a mixture of three different types of code: the .plot method in pandas, the pyplot API* in matplotlib, and occasionally directly modifying axes objects in the charts. It can get very confusing unless you have a good “cookbook” style reference text, and have the time to consult Stack Overflow when you’re stuck.

* API = Application Programming Interface. Think of it as a “service” where you structure a request in a specific way to get back data, or in this case, to get specific changes to your chart. That way, you don’t need to understand or program all the code that it took to perform that task.

Plotting charts in Python with an area chart - with the default code, there are substantial problems on aesthetics and readability.

Fixing the problems with the stacked area chart

Our plot now has 3 problems:

  • The legend overlaps with the data, obscuring the lower right hand corner of the plot.
  • There is a strip of blank space above 1 on the vertical axis, which is unsightly. It also makes no sense because all the values should add to 1.
  • cdc_report_dt (the name of the DataFrame index) is the default horizontal axis title. Again, this is unsightly.
  • It’s also unclear what the start date is, as the chart only shows the major tick labels. Actually, the minor tick labels are blank.

We need substantially more code to fix the area chart than the bar chart, specifically because we can’t map the x-axis to all its tick labels at one shot.

Code for plotting area charts in Python with the fixes to the aesthetics on the x- and y-axis.

However, after all this work, the chart looks just as clean as the Excel one. It just took almost 20 more lines of code to get there!

Plotting charts in Python with an area chart, after fixing the aesthetic issues.

Afternote: This code would have worked if the major tick marks were evenly spaced to every 4 minor tick marks. Subsequently, I realized that the major tick marks have uneven spacings. Therefore, the coding process is slightly more complex than described above. I’ve posted a tutorial about the final process on my Youtube channel as below:

Conclusion: Python is more flexible but much more complex than Excel for plotting charts.

“Show, don’t tell”. With this mantra, I set out to create this demo – and ended up with only half of the planned content after all these words! Hopefully, this will demonstrate to you that Python allows a lot of flexibility to tweak different parts of charts. Unfortunately, that can create too much complexity. For example, I created the Excel version of the stacked area chart in a couple of minutes, but took 2 hours to debug the re-labeling of the x-axis in Python.

Therefore, I’ve deliberately left charts out of the Pandas for Productivity series scope. This doesn’t mean that you will never use matplotlib and seaborn. Rather, it means that you’ll choose to use it if you need to make complex statistical plots (e.g. box plots, violin plots, probability density functions). Otherwise, you might use it for scatter plots of more than a million rows of data. But for straightforward business charts, it’s much quicker to use Excel.

Happy plotting!

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.