Pro Bono: IEEE Chicago Section Statistician - Part 3
This post is the third in a series on my pro bono efforts outside of the workplace as Statistician for the IEEE Chicago Section, a new position that I filled this past year after the membership development team realized that ongoing data analysis and reporting were needed around declining enrollment.
Over the past year, the bulk of my time on this work went to fixing existing graphs that I had inherited, moving from Microsoft Excel to the R language, maintaining existing data sets, creating new data sets, and creating formal presentations that I started distributing on a monthly basis.
After creating re-runnable R scripts earlier this year for the monthly graphs that the team came to expect, I turned my attention to creating new visualizations that help explain the data, and I began spending more time communicating my analyses to a nonanalytical audience.
My first post discussed some of my initial thoughts surrounding my early experiences on the team, and presented one of the new visualizations that I created, a heat map that breaks down the top technical interest areas of members, which the Membership Development Chair began using in his marketing efforts.
In my second post, I walked through some of the issues in the existing graphs that I had inherited, and showed how I fixed them in Microsoft Excel and ported them to the R language with a consistent look and feel, as well as how I began including them in IEEE Chicago Section presentations.
One of my early observations while first becoming familiar with the data was that membership declines drastically every February. After discussing this with the team, I came to realize that these declines are a result of members going into arrears status after not renewing their memberships on time.
When looking at the initial graphs, one will notice that total membership each January is lower than in the prior year, with total membership in February following suit. One goal that the team initially verbalized is closing the membership gap with the previous year.
The problem with this goal is that because February declines are so drastic, it can be argued that the remainder of each year is spent climbing back in terms of enrollment, either with new members or renewed members, and "closing the membership gap" is highly dependent on first regaining the previous position.
In other words, although the team is responsible for helping to increase new membership, and attempts to encourage members in arrears to renew, "closing the membership gap" is not only a weighty task, but should really be seen from multiple perspectives, not just total membership.
Below are a couple of examples of a new visualization that I began including in regular communications to the IEEE Chicago Section. These are based on the same data sets as the graphs I presented in my second post, but now start in February rather than January, and use a y-axis of percent difference from the February baseline.
The first example above demonstrates that even though a "membership gap" exists, in terms of percentage growth from the annual February baseline, 2013 is essentially in line with growth a few years ago. Now, as I mentioned to the membership development team, one needs to be careful with this type of analysis, because the number of years being investigated is rather small, and each February baseline contains fewer members than the previous year's, so each member added in subsequent months produces a higher percentage growth.
However, a cursory review of the above example also shows that even though membership is declining each year, it does not mean that percentage growth automatically increases each year. If this were the case, 2013 would be at the top of the graph, and 2008 and 2009 would be at the bottom of the graph (the year 2007 was purposely excluded because, as explained in my last post, it is the first year of data that is available for this analysis, and it is also a partial year that does not go back to February).
The second example below is for a specific member grade, senior members, rather than total membership. The situation is the same here as well. The years 2008 and 2009 might be expected to be at the bottom of the graph, but they are not. There is significant overlap between years as well, although it needs to be kept in mind that senior members comprise a small slice of total membership. One curiosity that was discovered with this visualization is that senior membership percentage growth drops between December and January, an anomaly that does not occur in any of the other membership grades.
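The y-axis used in both examples, percent difference from the February baseline, can be sketched in a few lines. This is a minimal illustration in Python rather than the R used for the actual graphs, and the monthly counts are made-up numbers, not IEEE Chicago Section data. It also shows the caveat noted earlier: the same number of added members yields a larger percentage when the baseline is smaller.

```python
def percent_diff_from_baseline(counts):
    """Return each count's percent difference from the first (baseline) count."""
    baseline = counts[0]
    if baseline == 0:
        raise ValueError("baseline must be nonzero")
    return [100.0 * (c - baseline) / baseline for c in counts]


# Hypothetical month-end totals, February through May (not real data).
# Both years add the same number of members each month after February.
larger_year = [5000, 5100, 5200, 5250]
smaller_year = [4000, 4100, 4200, 4250]

for label, counts in (("larger baseline", larger_year),
                      ("smaller baseline", smaller_year)):
    pct = percent_diff_from_baseline(counts)
    print(label, [round(p, 2) for p in pct])
```

With these invented numbers, both years gain 250 members by May, yet the year with the smaller February baseline plots higher on a percent-difference axis.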
This past month, I was particularly pleased with the results of my combination box plots and beeswarms. Because the R language ggplot2 package does not provide this capability, I used the standard R language box plot along with the beeswarm package. The beeswarm package provides the ability to group individual data points with different colors, which I assigned according to the color palette used with the ggplot2 visualizations, although the consistent look and feel was largely lost.
In my first monthly presentation that used this visualization for member additions and deletions, I included the year 2007, but data anomalies associated with the first month of data (August) that has been made available by IEEE skew these graphs, so I removed this year from subsequent iterations. In addition, the February deletions mentioned above skew these graphs each year for total members to such an extent that I now provide separate plots that include and exclude February.
A perfect example of how removing February improves readability is the pair of plots of higher grade member deletions. In the first plot below, February deletions are far above those of any other month, and the distribution for February across the 6 years that have been included is relatively narrow. Because of these consistent outliers, the data for the other 11 months is largely unreadable. The data points for the month of July, for example, appear to be the same across all years (note that when using a beeswarm, all data points are always plotted, and to enable visibility without data point overlap, data points are plotted next to each other). In the second plot, the true distribution of July is made visible.
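The effect of the February outliers on readability can be sketched numerically. The example below is in Python rather than R, and the monthly deletion counts are invented for illustration (the real data is not reproduced here). It compares the y-axis extent a plot must cover with and without February.

```python
# Hypothetical deletions per month for one year (not real IEEE data).
deletions = {
    "Jan": 12, "Feb": 480, "Mar": 9, "Apr": 14, "May": 11, "Jun": 8,
    "Jul": 10, "Aug": 13, "Sep": 7, "Oct": 15, "Nov": 9, "Dec": 11,
}


def value_range(values):
    """Return (min, max, span): the y-axis extent a plot of these values must cover."""
    lo, hi = min(values), max(values)
    return lo, hi, hi - lo


with_feb = value_range(list(deletions.values()))
without_feb = value_range([v for m, v in deletions.items() if m != "Feb"])

print("including February:", with_feb)
print("excluding February:", without_feb)
```

With these made-up counts, the other 11 months vary by only 8 deletions, which is under 2% of the axis when February is included. That is why their distributions look flat until February is dropped.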
Pro Bono: IEEE Chicago Section Statistician – Part 1, Part 2, Part 4, Part 5