Pandas: Do work quickly. And learn programming!

I love pandas! This kind, of course…

Tare Panda… squeee!!! (Image by Picgifs)

… but also the Python Data Analysis Library, an open-source software library. Even if you’re not trained as a programmer (like me), it can become a valuable time-saving tool. And here’s why.

The data looks like Excel. So, it’s familiar.

Pandas uses two basic formats (“objects”, in programming-speak) to store data: DataFrames and Series. And reassuringly, a DataFrame looks like an Excel spreadsheet:

Example of a pandas data frame
Source: COVID-19 Case Surveillance Public Use Data

Furthermore, a Series is like a DataFrame, with just one column of data. In other words, it’s like a column in an Excel spreadsheet. So far, so good.

You already know Excel inside out. Why use pandas?

However, data volume is growing so quickly that Excel can’t always keep up. Here are some examples when you need pandas:

  • You have more than a million rows of data. ‘Nuff said.
  • There’s data from flat files (e.g. external vendor reports) that you want to bring together with internal data… and don’t want to spend hours creating vlookups, that then take even longer to populate.
  • You’re still poking around to look for a story to tell, and you want to explore different splits of your data without bombarding your database with repeated SQL queries.
  • A report will repeat regularly (e.g. a monthly financial report) and you want to refresh it in minutes. For further efficiency, you can rotate with other colleagues to re-run it reliably.

It might sound crazy, but yes, even as a non-programmer, if you’re a heavy user of Excel, pandas can be a time-saving productivity tool for you. In all the examples above, working with Excel can take hours, whereas pandas shortens your turnaround time to minutes.

But I’m a total beginner and I have no time to spend on “Hello World”!

That’s what the Pandas for Productivity series is for! With my help, you get to focus on getting work done, while learning Python at the same time. That’s because each post maps directly to a specific task in Excel, and gives you the exact syntax to use the corresponding function or method in Pandas. Best of all, it will explain the underlying Python concepts in the moment, as and when you need to know them. Furthermore, if you do data analysis only occasionally, don’t fret. Concepts and explanations will be repeated in every post that you encounter them, so that you can refresh your memory.

Whatever you do in Excel, here’s how you do it with Python:

Think of this series as a “helpdesk” to guide you for working with Python as if you were working in Excel instead. To this end, the table below contains a directory of posts that pertain to specific tasks.

Housekeep your Jupyter notebooks

Equivalent Excel taskPosts and links
Create and save Excel workbooks, use “Undo”Download Anaconda, create Jupyter notebooks, and save your work using checkpoints.
Copy paste data into worksheetsImport the necessary libraries. Read flat files or the results of SQL code. Write output to a CSV file.
Refresh data manually to re-run your own or a colleague’s analysis.Run or re-run multiple blocks of code in a Jupyter notebook. Stop a run that’s taking too long.

Examine your data and customize the data format for your needs

Equivalent Excel taskPosts and links
Visually inspect the data in your spreadsheetExplore data in a DataFrame and its individual columns.
Rename data to be more user-friendly for your readerRename / reformat the columns in your DataFrame to be more user-friendly to either your computer, or your readers.
Calculate new metrics using formulas and smart fill.Create functions to calculate new metrics, creating new columns in your DataFrame.
Format columns as dates, use YEAR(), MONTH() etc formulas, and perform calculations with dates.Format DataFrame columns as dates, extract date parts, perform calculations with TimeDeltas and date offsets.

Look for insights from your data

Equivalent Excel taskPosts and links
Apply filters to the header rows in your Excel sheets or pivot tables.Filter and query DataFrames.
Filter for blank values in an Excel sheet.Find missing values and fill them where desired.
Create Pivot TablesAggregate rows in a DataFrame with Groupby, pivot table, and crosstab.
Create vlookups between Excel sheets.
Paste new data with the same columns to an existing Excel sheet.
Join or concatenate data in 2 or more DataFrames.
Sort your Excel sheet data using Custom Sort. Rank data using the Excel RANK() function.Sort any column(s) in a DataFrame. Use .rank method to calculate rankings, combined with .groupby to rank within categories.

Where Python isn’t necessarily more productive: Charts

The one exception where Python isn’t necessarily more efficient than Excel, is plotting charts. To illustrate, this post walks through the steps to create two standard Excel charts with Python’s matplotlib library. In summary, Python is too complex for simple business charts, but useful for statistical charts. Furthermore, if you need to plot more than a million rows of data, Python is your only choice. By then, you should not be intimidated to use Stack Overflow, the matplotlib documentation, and a “cookbook” reference text to create and customize charts.

What’s the end-goal after reading this series?

Learning a new programming language opens up endless possibilities. Consequently, you might find yourself inspired to dive deeper after this initial journey! Nonetheless, the series was designed to allow you to gain work-relevant knowledge within a short time. So, after reading 12 posts, what can you do?

As a start, it should help you to make sense of your colleagues’ code. For example, someone from your analytics or data science team wrote a report for you to refresh, you should be able to broadly follow their thought process. Next, you’ll be aware of situations where Excel is time-consuming or unable to handle the volume of data you’re working with. And you’ll start automating your work in Python, using the relevant posts as references when needed.

Finally, after you have grasped the basics, take your coding to the next level with a comprehensive reference book. For example, I used the Pandas 1.x Cookbook and have reviewed it here. Additionally, you will constantly use the official documentation and Stack Overflow as continuing references.

Happy scripting!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.