Susan Currie Sivek

Writer, researcher, hiker, knitter. Data Science Journalist for Alteryx, Inc. Former journalism professor. Curious about everything.

Jun 6, 2020
Published on: Towards Data Science

Photo by Katherine Hanlon on Unsplash

This week’s episode of the Alter Everything podcast features a discussion about career paths and success. A key point in the discussion is the differences among genders in how people approach mentorship and asking for support or promotions at work — and how important it is to promote diverse voices and perspectives. With that issue in mind, I wanted to look further into available data on gender diversity within data science.

One source of information on the variety of people working in this field is the 2019 Kaggle ML & DS Survey, which asks professionals and students to respond to an array (data science pun intended) of questions about their gender, job title, income, nationality and more. (Race is not one of the included questions.) The survey data from 2017 and 2018 are also available, so this is a nice opportunity for longitudinal analysis.

I thought I’d take a closer look at the Kaggle data with Alteryx Designer to see if any notable patterns in gender emerged, particularly around students learning data science. After all, today’s students will help determine the future of diversity in this field.

Change Over Time

I first wanted to see whether gender diversity among students had changed over the three years of the Kaggle survey. From 2017 to 2019, the proportion of student survey respondents identifying as female decreased slightly. The number of respondents identifying as male and those offering other answers (which may have included “prefer not to say” or “prefer to self-describe”) held steady. There hasn’t been a shift toward greater equity among genders, at least in these data.
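The analysis above was done in Alteryx Designer, so the following is not the workflow used here. For readers who prefer code, this is a minimal Python sketch of the same kind of calculation: the share of student respondents in each gender group, per survey year. The sample rows, the `bucket` helper, and the answer labels are assumptions made up for illustration; they are not Kaggle data or the real column layout.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows: (survey_year, gender_answer, is_student).
# These values are invented for illustration and are not Kaggle data.
responses = [
    (2017, "Female", True),
    (2017, "Male", True),
    (2017, "Male", True),
    (2017, "Male", False),  # not a student, so it is ignored below
    (2018, "Female", True),
    (2018, "Male", True),
    (2018, "Male", True),
    (2018, "Prefer not to say", True),
]


def bucket(answer):
    """Collapse every answer other than Female or Male into one group."""
    return answer if answer in ("Female", "Male") else "Other responses"


def gender_share_by_year(rows):
    """Return {year: {gender_group: share of that year's student respondents}}."""
    counts = defaultdict(Counter)
    for year, answer, is_student in rows:
        if is_student:
            counts[year][bucket(answer)] += 1
    return {
        year: {group: n / sum(c.values()) for group, n in c.items()}
        for year, c in counts.items()
    }


shares = gender_share_by_year(responses)
for year in sorted(shares):
    print(year, shares[year])
```

Comparing shares rather than raw counts matters here because the number of survey respondents can differ from year to year.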

Gender and Nationality

Which countries are producing the most female data science students? I looked at just the 2019 Kaggle responses and broke them down by nationality and gender. I then joined those data with countries’ population data and calculated how many data science students each country had per 10,000 people in its population — and how many were female. (Of course, these are just Kaggle survey respondents, presenting complications I’ll discuss more below.)

The 15 countries with the highest number of data science students are included in the chart below, with the number of female students displayed as well. Even in the countries with the largest numbers of surveyed students, the proportion of women is relatively low.
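The join and per-capita step described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the Alteryx Designer workflow behind the chart: the country names, population figures, and the `students_per_10k` function are all hypothetical.

```python
from collections import Counter

# Hypothetical inputs, invented for illustration.
# Each student row is (country, gender_answer).
students = [
    ("Country A", "Female"),
    ("Country A", "Male"),
    ("Country A", "Male"),
    ("Country B", "Female"),
    ("Country C", "Male"),  # no population figure below, so it drops out
]
population = {"Country A": 30_000, "Country B": 10_000}


def students_per_10k(student_rows, population_by_country):
    """Join student counts to population and scale to a per-10,000 rate."""
    totals = Counter(country for country, _ in student_rows)
    female = Counter(c for c, gender in student_rows if gender == "Female")
    result = {}
    for country, n in totals.items():
        pop = population_by_country.get(country)
        if not pop:
            # Behaves like an inner join: skip countries without population data.
            continue
        result[country] = {
            "students_per_10k": n / pop * 10_000,
            "female_per_10k": female[country] / pop * 10_000,
            "female_share": female[country] / n,
        }
    return result


rates = students_per_10k(students, population)

# Rank countries by students per 10,000 people, highest first.
ranked = sorted(rates, key=lambda c: rates[c]["students_per_10k"], reverse=True)
print(ranked)
```

Skipping countries with no population match is one design choice among several. Keeping them with a missing rate would make it easier to spot names that failed to match between the two sources.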

Gender and Education of Survey Respondents

Students of data science can pursue a variety of credentials. Again using the 2019 survey dataset, I examined differences in education levels among non-student respondents by gender. A slightly higher proportion of female respondents had either master’s degrees or doctoral degrees than did male respondents.

Though the difference isn’t enormous, it’s interesting to consider the potential reasons for it. Are women simply achieving advanced degrees at higher rates than men? Are women with higher degrees more likely to complete a survey, for whatever reason? Are women held to a higher educational standard by hiring managers in order to obtain their positions?
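The education comparison above can also be expressed in a few lines of Python. As before, this is a sketch under assumptions and not the method used for this article: the sample rows are invented, and the degree labels in `ADVANCED` are assumed spellings that would need to match the survey's actual answer text.

```python
from collections import Counter

# Hypothetical sample rows: (gender_answer, highest_education, is_student).
# Invented for illustration; not Kaggle data.
respondents = [
    ("Female", "Master's degree", False),
    ("Female", "Bachelor's degree", False),
    ("Male", "Doctoral degree", False),
    ("Male", "Bachelor's degree", False),
    ("Male", "Bachelor's degree", False),
    ("Male", "Master's degree", True),  # student, excluded from this comparison
]

# Assumed labels for the two advanced-degree answers.
ADVANCED = {"Master's degree", "Doctoral degree"}


def advanced_degree_share(rows):
    """Return {gender: share of non-student respondents with an advanced degree}."""
    totals = Counter()
    advanced = Counter()
    for gender, education, is_student in rows:
        if is_student:
            continue
        totals[gender] += 1
        if education in ADVANCED:
            advanced[gender] += 1
    return {gender: advanced[gender] / totals[gender] for gender in totals}


degree_shares = advanced_degree_share(respondents)
print(degree_shares)
```

A gap in these shares only describes who answered the survey. It cannot distinguish among the possible explanations raised above.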

Data Limitations

Using Kaggle data to look at these issues isn’t an ideal approach, but it raises some interesting questions in itself. This survey was voluntary and (as far as I can tell) was provided only in English, even though respondents came from 171 different countries, so participation was probably limited to people comfortable responding in English. Additionally, Kaggle is in large part a competition website, where users respond to various challenges to prove their data mettle. That format may not be equally inviting to data science students and professionals of all genders and backgrounds.

Asking survey respondents about gender is itself difficult, and my own analysis here is flawed because my “other responses” category lumps together every answer to the gender question other than male or female, including “prefer not to say” and “prefer to self-describe.” Additionally, the answer options for that question differed on the 2017 version of the survey.

Takeaways for the Future

Other data from national and global educational and professional institutions would offer additional insights, and perhaps more reliable ones. But as we explore paths to achieve greater diversity in the data professions, it’s interesting to observe these patterns and consider how to address them.

For those seeking a path into the data professions, the Alteryx ADAPT (Advancing Data & Analytic Potential Together) Program is a free online training opportunity that includes a software license for use in the program, collaborative discussions, data science resources, and certification. The program is available to anyone whose employment has been affected by the COVID-19 pandemic, including people who are unemployed or furloughed, or who have lost internship or post-graduation opportunities. Topics include an introduction to data fundamentals, Core Certification for Alteryx Designer, and predictive analytics for business. The program’s self-paced structure is well suited for people from varied backgrounds — including those who may have found it challenging to pursue other upskilling opportunities.

It will take all of us working together and supporting one another to improve diversity in the data professions. For more insights and firsthand experiences on how we can support each other’s efforts in this area, check out this week’s Alter Everything podcast episode.