Data Science Applied to Management Consulting Engagements — Tutorial I
You may have realized as a management consultant at the analyst/associate level that a decent knowledge of data science would have been instrumental in some client engagements for achieving high impact analyses. How often did your analysis requires processing a large amount of data too big for Excel but not big enough to be called Big Data — the “not so Big Data” challenge? When did you wish you knew more than a regression analysis for identifying relationships? How clueless were you when your manager asked you to do an optimal allocation of 100+ resources? How frustrated were you of not finding a way to present data that made your insights obvious? How disappointed was the client when you told her that she could not use your work in production?
However, you are asking yourself which skills I need to acquire and how deep I need to be to be impactful. We have experienced situations where the client expected the consultant to have strong business acumen and outstanding analytics, be the expert in machine learning and statistics, and have better than average programming skills. While some individuals are real polymaths, most of us need to make some compromises. Do you want to be a translator or a data scientist? It is easier for a consultant without a strong background in computer science (or related disciplines) to aspire to be a translator than a data scientist. We suggest wanting to be between a translator and a data scientist, i.e., putting an emphasis on business intelligence and analytics, and learning just enough about programming, statistics, machine learning, or mathematical programming to quickly get to the right insights. If you are a consultant at a top management consulting firm, you already have the intrinsics required for business intelligence and analytics. Experience, curiosity, and creativity will make you better at these skills. The other abilities are complementary but can make a meaningful difference if you are time/budget constrained. A short engagement with high impact usually leads to a proper one as you have now persuaded the client about the benefits. After you have convinced the client about the opportunities, you can always help her structure the critical elements of a project that will involve experts in programming, machine learning, etc., to have a “production” version of your “proof-of-concept”.
A typical engagement involving advanced analytics requires doing the following four steps:
- Request/gather data from various sources (e.g., client, Bloomberg, Internet, etc.) that could come in different formats (e.g., xls, csv, xml, json)
- Put these data into a form that make them easily analyzable (aka data wrangling)
- Perform numerous analyses
- Communicate insights and observations
If step 1 leads to too many files or step 3 requires blending a large amount of data, Excel becomes quickly a very inefficient tool. We can use a combination of commercial tools requiring limited programming skills to do steps 2, 3 and 4 but they are usually expensive and not always flexible. In the other hand, open source tools are free and very flexible but involve a learning curve for most consultants.
Our objective with this project is to put together a series of tutorials for showing how a consultant can use open source tools to achieve steps 2, 3 and 4 on disguised client engagements — we find it more comfortable to learn from examples (top-down approach) than reading the necessary documents before getting started (bottom-up approach). We are planning so far three tutorials:
- How to perform essential data wrangling/blending and analyses (this tutorial)
- How to use mathematical programming for accomplishing more sophisticated/insightful analyses
- How to use machine learning techniques for identifying patterns that can lead to commercial insights
All tutorials revolve around the following stack:
- Julia for mathematical programming and numerical computing
- Jupyter for literate programming (mostly used with Julia and Python)
- Python for data wrangling and analyses,
- R for analyses and communicating results, insights and observations — we can re-use some of the graphics produced in R for creating a PowerPoint document.
- SQLite for storing and querying the data (“blending”)
We provide a curated list of data science resources that you can consult to get more familiar with these tools and approaches.
Why going through the troubles of learning three programming languages (i.e., Julia, Python, and R)? Based on our experience, Python is perfect for “gluing” things together — it has an extensive collection of libraries; R has a very polished literate programming environment (through RMarkdown) and outstanding graphics that allow producing client-ready deliverables; and, Julia has the flexibility of a scripting language but the performance of a compiled language (e.g., C and C++).
Hopefully, this tutorial should ease the learning curve of using open source tools and lead to better insights — our problem-solving capability should be the limiting factor, not the tools we use.
You will find all the source files associated with this tutorial (including a markdown version of this article) on GitHub.
You can send any feedback or questions to firstname.lastname@example.org.
About the tutorial
You will learn in this tutorial the following elements:
- How to create a relational database using SQLite and SQL
- How to populate the database with the client’s data in Python
- How to run the analyses in R and use RMarkdown to communicate the findings — see this notebook as an example
The client owns an efficient gas-fired power plant that is offered every day in the spot market. The dispatch team is responsible for scheduling the plant to maximize its margin while meeting all its dynamic constraints (e.g., start-up time, ramping, etc.). We have been asked to identify potential dispatch opportunities, i.e., ways to schedule the plant that lead to more profits.
This is how we organize our folders:
│ ├── client # Client's data
│ ├── processed # Data that we have processed
├── images # Images common to various documents
├── notebooks # Analyses
├── reports # Progress review documents
└── src # Applications, scripts, etc.
To perform the analysis, the client provides the historical spot prices associated with power, natural gas and carbon, and past generation. The following Excel files include all the client’s data:
data/client/gas and carbon prices.xlsx(natural gas and carbon daily spot prices)
data/client/power prices.xlsx(power hourly spot prices)
Our objective is to structure the client’s data correctly and produce informative labels (e.g., types of periods for time series) such that it will be easy for us to do our analysis.
We have implemented a script in Python
src/python/make_dataset.py that reads the client's Excel files and populates an SQL database. We create first the database by running the SQL code in
src/sql/create_tables.sql. We have documented both the script and SQL code — please refer to the code for more details.
Analyses and communication
We have built a notebook (
notebooks/r/analyses_v1.Rmd) using RMarkdown (with RStudio) to run the analyses and communicate the insights. RMarkdown is a form of literate programming beneficial for communicating complicated analyses.
These instructions will get you a copy of the project on your machine.
We would have used Docker if it was more popular among management consultants — it allows you to automatically install the various applications and packages required for a given project without interfering with your current environment.
You need at least to install the following software:
- SQLite (and you may want to get DB Browser for SQLite)
- Python 3.5+, or
- Anaconda (better environment for Python that comes with many libraries already installed)
- R Studio
We do our development on MacOS and use the following software (in addition to the above):
- Atom a “hackable text editor for the 21st Century”
- Git a version control system for tracking changes
- GNU Make a tool which controls the generation of executables and other non-source files of a program from the program’s source files
If you are on Windows, you can get Git and GNU Make (and other Unix utilities) by installing Cygwin.
You need to clone the Git repository by running
git clone https://github.com/pyxidr/data-science-consulting-part1.git or download it directly from GitHub.
To ensure you have all the Python packages, run the following in the root directory:
conda install --yes --file requirements.txt or
pip install -r requirements.txt
Building the database
There are 2 approaches:
Using GNU Make
In the root directory (where you will find the file
Makefile), run simply
In the root directory, run the following commands:
python make_dataset.py -d ../../data/processed/tutorial1.db -V
You can compact the database by running
sqlite3 tutorial1.db 'VACUUM;' in folder
Running the analyses
In RStudio, open project
notebooks/r/R.Rproj and then file
notebooks/r/analyses_v1.Rmd. Once you have opened this file, click on button Knit to produce an html version of the notebook.