Why “data for good” lacks precision.
I just returned from a fantastic week in Stockholm attending the International Conference on Machine Learning (ICML) 2018. One of the most active informal communities at ICML was the “data for good” community. We organized a few spontaneous lunches where I met some incredible researchers and applied practitioners. However, our discussions as a group made me revisit a gut reaction I have had for a while: “data for good” has become an arbitrary term, to the detriment of the movement’s goals.
“Data for good” says little about the tools being used, the goals of the endeavor, or who we are serving. It is similar to the frequent use of “AI” to describe everything vaguely related to machine learning. Both terms generate excitement and have broad appeal, but they lack precision from the perspective of technical practitioners.
I accept that “data for good” is a useful shortcut when talking to a broad audience (I use it myself, for example in my Twitter bio). My concern is the lack of precision when we talk about data for good amongst ourselves. This post is aimed at technically skilled individuals and organizations who opt in to the “data for good” umbrella.
We need more rigorous language to describe the work we are doing and, most importantly, to identify where we can do better. In this post, I’ll describe the key criteria frequently used to qualify an initiative as “data for good.” I will use this rough taxonomy to discuss some open challenges. Finally, I will suggest that the most important bucket, machine learning education programs, is rarely discussed and painfully underserved by our community.
What is data for good?
“Data for good” is a very new term. I am a researcher at Google Brain, and four years ago I started a 501(c)(3) non-profit called Delta Analytics. At the time, I was only aware of a handful of other organizations that had self-categorized under the “data for good” umbrella (for example: DataKind, Bayes Impact, and Data Science for Social Good).
In parallel, there is a tremendous amount of interest in this area. I am often asked to speak on panels with long waitlists, the number of applicants for the Delta fellowship grows each year, and many new organizations have since formed, such as AI4ALL, hack4impact, and Uptake.org. To channel this incredible outpouring of interest efficiently, we need a better framework for discussing our work and the areas that deserve the most attention.
Let’s start by asking what we mean by “data.” I will constrain the scope of our discussion by defining “data” as referring to a project that extracts information from an existing dataset or involves the collection of new data. This often entails data collection and cleaning, and the application of statistical tools or machine learning models. It can also involve building technical tools for data collection or model deployment.
“Data for good” refers to a subset of data projects. It is an odd descriptor because it implies that some data is not being used for good, or is at least neutral in its application. The subjective nature of the word “good” as a qualifier means that multiple valid definitions may be in use at the same time.
I have frequently seen four criteria used to qualify a project as falling under the “data for good” umbrella:
1. Skilled volunteers develop and deliver the data product.
2. Data tools are provided to the organization or individual for free or at a heavily subsidized rate.
3. The end recipient of the data product is a non-profit or government agency.
4. Educational training is provided to improve the data skills of an underserved community.
While this is a crude taxonomy, it is a useful starting point for a more rigorous treatment of each bucket. Whenever there is ambiguity about the meaning of a term, it is important to clarify which definition is being used. Unless we articulate these definitions, it is difficult to have a rigorous conversation about whether we are prioritizing initiatives in a valuable way. As a community, we need to move away from self-congratulatory forums and instead have a candid conversation about these trade-offs, which I take up in the following sections.
1. Skilled volunteers develop and deliver the data product (for free or at a subsidized rate).
Delta Analytics is an example of an organization that calls itself “data for good,” in part because we connect technical experts who volunteer for free with data projects all over the world.
Skilled volunteering is a powerful way to bridge the severe technical gap between the expertise concentrated at tech companies and universities and the rest of the world. Ideally, volunteers work on problems that are underserved, whether because of the nature of the problem or because of insufficient technical expertise within the host organization or geographic region.
That said, we must also take note of the shortcomings of relying on volunteers to advance data for good.
Depending on volunteers can lead to sporadic, unpredictable progress. Volunteers may be juggling multiple engagements and have to prioritize paid responsibilities, family, and downtime first. They may only be available for a limited amount of time, which may require handing the project off between volunteers with different timelines.
“Cutting-edge” problems are favored. The easiest projects to get volunteers excited about involve interesting technical challenges. For example, last year I volunteered with three other Delta fellows on a project with Rainforest Connection to detect chainsaws using audio streamed from recycled cell phones. The problem was fascinating because of data scarcity, the difference between the training distribution and the test distribution in the rainforests where we deployed, and the engineering challenges involved.
It is easy to attract highly skilled volunteers to detect illegal deforestation using deep learning. However, 99% of data problems are not as flashy, yet they still deserve our attention. Most involve very little data and instead require insight into data-cleaning best practices and into estimating uncertainty given small sample sizes. In fact, often what is most needed is help figuring out what data to collect in the first place.
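To make that concrete, here is a minimal sketch in Python of the kind of unglamorous analysis these projects actually need: estimating a confidence interval for a mean from a small sample using the t-distribution, which widens the interval honestly when n is small. The dataset and numbers here are invented purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical toy sample: weekly attendance at a non-profit's program (n = 8).
attendance = np.array([23, 31, 27, 19, 35, 28, 22, 30])

n = len(attendance)
mean = attendance.mean()
sem = stats.sem(attendance)  # standard error of the mean (ddof=1 by default)

# 95% confidence interval for the true mean attendance. With only 8
# observations, the t-distribution gives a wider, more honest interval
# than a large-sample normal approximation would.
low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```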
Why not just work on these problems? The impact of contributing a solution to any of them is far-reaching. However, the time frame required for problems of this nature (such as collecting the right data) is often simply not a fit for a volunteer who can only contribute a few hours a week.
Not all volunteering engagements are created equal. Weekend or one-day hackathons aim to connect time-constrained volunteers with non-profits that need their help. Delta has never hosted a one-day hackathon; we prefer to focus on six-month engagements with non-profits, because it takes an incredible amount of effort to make a hackathon a worthwhile endeavor for the end recipient of the data product.
Often, preparing for a hackathon carries overhead for the non-profit that is not justified by the data prototypes that emerge at the end of the weekend. The non-profit and the hackathon hosts must invest significant time in the data documentation and dataset cleaning needed for attendees to quickly orient themselves in a new database.
Non-profits are resource-constrained, and there is rarely a dedicated “data” person. Unless hackathons are carefully scoped and meticulously planned, the data products produced rarely justify the time invested. That said, often the goal of a hackathon is to serve as a taster event for skilled volunteers who may decide to become more regularly engaged. In that case, success depends upon a clear set of desired outcomes for continued engagement, defined before the hackathon, so that the event serves as a useful stepping stone toward longer-term engagement.
2. Tools for data work are donated or heavily subsidized.
This category can be the most problematic characterization of “data for good.” Almost all large tech companies have programs that provide non-profits with hardware, licenses, and computational resources at a heavily subsidized rate or for free. While these efforts are well intentioned, the current formulation of most of them feels painfully tone-deaf. Here is why:
The tools that companies are most eager to donate are not suited to the vast majority of non-profits, which have limited data and technical expertise. We forget that most organizations still use Excel and consider moving their data to Salesforce a big technical step. Providing cloud credits, hardware, or expensive visualization licenses for free is useful to an extremely small group of organizations. However, it is problematic when companies equate this type of very specialized donation with having a “data for good” program. At the very least, these initiatives should include dedicated support and training for non-profits, who often lack technical expertise and/or are using the software in unusual and unanticipated ways.
Equating in-kind donations with “data for good” absolves tech of the responsibility for more meaningful participation. Most non-profits will tell you that their most common pain point is not software but technical training. The largest contribution of organizations like Delta Analytics is not technical innovation but empowering non-profits to use their data with more confidence. Tech companies can have the most impact by pairing access to in-kind resources with admittedly more costly, but more meaningful, initiatives that provide educational outreach and skilled volunteers.
3. A non-profit or government agency is the recipient of the data product.
An initiative is often classified under “data for good” if the end beneficiary is a non-profit or government agency.
While this may feel like an intuitive way to categorize a project as “data for good,” it does not guarantee that we work on the most meaningful questions. We should be flexible and prioritize identifying impactful questions. For example, every year Delta receives applications for our grant recipient program from non-profits around the world. We never accept applications that involve helping non-profits prepare data for grant proposals. Is this type of data work useful? Almost certainly. Are there more impactful questions to work on? Absolutely. Given limited resources and expertise, we must triage how we allocate them.
A triage-based approach would not ignore social-impact organizations that are for-profit. One of the most meaningful projects I worked on as a Delta volunteer was with Eneza Education in Nairobi, Kenya. Eneza is a for-profit organization that uses pre-smartphone technology to deliver quiz-based resources to primary and secondary students preparing for end-of-year exams.
Eneza had an unparalleled dataset on how students are learning across East Africa. Moreover, the Eneza team agreed that we could share insights from it publicly with a wider audience. What emerged was a rich picture of how students learn, how to sequence quizzes to retain students, and how families across Africa are using pre-smartphone technology in innovative ways.
Finally, a very exciting and growing body of research initiatives is not aimed at individual organizations at all. Instead, researchers aim to produce generalizable insights for underserved domains. For example, the UN Pulse Lab in Kampala uses satellite imagery to estimate regional poverty from features like the material used to construct roofs. The Sustainability and Artificial Intelligence Lab is engaged in multiple projects, including predicting poverty from satellite imagery and predicting crop yields from remote sensing data. We should encourage more institutional support, like the Einstein grants from Salesforce, that makes this type of research possible.
4. Educational programs that aim to build technical capacity in underserved communities.
Education programs fall under “data for good” when they focus on underserved communities that do not have alternative training programs. For example, Uptake.org runs a program in Chicago training non-profit professionals in machine learning and security practices. Delta teaching fellows taught an introductory machine learning course in Nairobi, Kenya, and will teach later this year in Agadir, Morocco.
At the beginning of this post, I suggested that education programs are both the most important “data for good” initiative as well as the most underserved.
Why? Because skilled volunteering is inherently a short-term endeavor. While we absolutely need skilled volunteering, in parallel we need more educators. This is the difference between a data product that is handed off at the end of an engagement and a data product that has buy-in and will be used by the end recipient over the long term.
Equally important, problems benefit from local experts. Skilled volunteers often suggest inappropriate solutions because it is hard to set aside our preferred everyday toolkit. As a researcher at Google Brain, I don’t normally worry about data quality or quantity, because I rely on a few large, public, clean datasets. However, the vast majority of real-world problems do not involve that much data and do not require neural networks. I would like to think that I am not locked into my most recent research and can still provide value when I volunteer my skills on very different problems. But what if someone were thinking about problems under those very different constraints every day? Would he or she be better placed to come up with an innovative solution? At the very least, our solutions might be quite different.
The reason education is an underserved area is not that we don’t care. We do! It is that skills training is fundamentally a harder problem than temporarily bridging the skill gap with volunteers. To build capacity, we must build ecosystems capable of sustaining it. This is challenging but not impossible. Earlier this summer, I visited Kenya to teach a tutorial at Data Science Africa. On my way back to San Francisco, I stopped at Andela, which I have visited every year for the last three years. Andela is an engineering powerhouse in Africa with campuses in Lagos, Nairobi, and Kampala.
It is widely perceived to be one of the best ways for an engineer to place with companies outside of Africa. In fact, some students who have already completed undergraduate degrees in computer science still join Andela, even though the program specializes in training engineers with no assumed background. They join because Andela employs its engineers to work with companies all over the world. Andela has been incredibly successful because it recruits talent on the assumption that brilliance matters more than prior experience, relying on an extremely low acceptance rate to select the most promising developers in Africa.
Another example of a huge capital-investment effort is the recently announced African Masters in Machine Intelligence at the African Institute for Mathematical Sciences (AIMS), co-sponsored by Facebook and Google. Andela and AIMS require large capital investment and support from institutional partners. At a smaller level, however, we can all push to integrate education into our efforts. Fast.ai offers remote diversity scholarships for each deep learning course it teaches at the Data Institute. Data Science Nigeria will host a conference in November where machine learning experts in the US and Europe will teach tutorials over Google Hangouts.
When we partner skilled volunteers with non-profits, we should do a mental check: “Are the tools we are using sustainable for reuse by the non-profit once we are no longer in the picture?” Most importantly, we should not be afraid of teaching. I often encounter hesitation from very technical individuals about whether they are qualified to teach. You should reverse the question and ask what would disqualify you from teaching. The world is big, and knowledge is currently concentrated in a handful of cities and individuals. We must all play our part as educators and mentors.
Parting thoughts.
I started this note by suggesting that the way we talk about “data for good” is imprecise, and I surveyed some common criteria used to qualify a project as “data for good.” One motivation for more precise language is to have a common discourse that efficiently channels the immense excitement, energy, and resources that follow the term. Another, perhaps more important, reason is to hold ourselves accountable by reflecting on whether our current efforts are best placed to serve communities around the world.
Acknowledgements
Thank you for the rich feedback on this article from Melissa Fabros, Brian Spiering, Simon Kornblith, Anna Bethke, Jonathan Wang, and Kumar Agrawal. In particular, I want to thank Amanda Su for many useful soft edits and for reading multiple iterations of this article.