Building equitable products requires balancing a deep understanding of the people using the product with responsible data practices. Collecting demographic data can be challenging, but this primer offers a starting point for ethical data collection.
It provides a framework for product teams to address common challenges and risks, and to begin standardizing practices that are often difficult to align. While not an exhaustive resource on all aspects of data collection, this primer presents crucial considerations for data practices in advancing Product Equity.
This guide encourages collaboration across different sectors, including civil society, academia, and industry, while providing resources to support further learning and the development of industry-wide standards. To join this effort, contact the Product Equity Working Group at TechAccountabilityCoalition@AspenInstitute.org. For more information or resources, visit the Product Equity Hub.
This evolving resource is primarily intended for product teams—such as researchers, designers, product managers, and strategists—and data teams, including scientists, architects, and analysts, to guide their work on Product Equity and responsible data practices. Practitioners in compliance, security, legal, and academia may also find this resource helpful.
The examples and case studies are based on extensive research and consultation with Product Equity practitioners and civil society groups.
To effectively build products for all, and especially for systemically marginalized communities, we need to deepen our relationship with the people using the products. Anyone engaged in product design and development needs to understand the diverse identities, contexts, and experiences that inform how people interact with the world—and with products. Product Equity is not only about eliminating and minimizing harm—it also seeks to introduce and enhance positive experiences, ultimately enabling people to thrive.
Doing so means intentionally creating an experience that empowers individuals to succeed, grow, and feel fulfilled in their interactions with a product, service, or platform. It goes beyond just meeting people’s basic needs or expectations, focusing instead on fostering positive, empowering experiences that help them achieve their goals, improve their well-being, and develop their potential. In the context of digital platforms, fostering individual thriving can include elements like personalized experiences, ease of use, accessibility, and support for long-term success, contributing to overall user satisfaction and loyalty.
To meaningfully engage with systemically marginalized communities and address their needs, we turn to data—because we can’t mitigate disparities without first understanding them.
Demographic data can be a powerful tool for assessing and improving equity and fairness in products, but there are multiple ways to analyze and apply this data once collected.
Here are some example approaches for using demographic data:
Companies already collect data about their customers, but that data is not always helpful for Product Equity purposes.
If companies do collect demographic data, it might be:
(1) collected in a way that’s not inclusive;
(2) segmented in a way that doesn’t tell us enough information about particular user needs and experiences; and/or
(3) collected in a way that makes the data unreliable for statistical analysis.
Responsible data practices require an eye toward harm mitigation. They also require an acknowledgment that demographic data is not like other data. For example, multinational companies may be interested in improving experiences for individuals in the LGBTQIA+ community, but they may also operate in countries where identifying as LGBTQIA+ is criminalized. How might such a company build inclusive products for this community without putting those individuals at risk of harm?
For further reading on the potential harms of demographic data collection:
McCullough, Eliza and Villeneuve, Sarah. Participatory and Inclusive Demographic Data Guidelines, p. 5
Partnership on AI. “Appendix 5: Detailed Summary of Challenges & Risks Associated with Demographic Data Collection & Analysis.” Eyes Off My Data: Exploring Differentially Private Federated Statistics To Support Algorithmic Bias Assessments Across Demographic Groups
(1) Acknowledge historical mistrust and mitigate potential harms.
(2) Prioritize user rights, including consent, ownership, and user agency (the ability and power of individuals to make choices, control their actions, and influence their experience within a product), and apply the highest standard of data protection law available in your geographic scope, or globally.
(3) Promote company transparency and accountability to the customer.
(4) Conduct regular audits and assessments to ensure data collection methods remain aligned with intended use cases and evolving product needs.
These actions represent cycles of iteration and review. A commitment to responsible data collection means not only establishing sound processes and principles but also actively seeking feedback and adapting to evolving laws, shifting user needs, and other relevant factors.
This primer gives an overview of the main concerns product practitioners and their teams should consider before and during demographic data collection. Our goal is to empower practitioners to collect data safely and effectively, making harm mitigation a foundational part of how products are developed.
This primer is not the only effort in the broader Product Equity space to address responsible data collection. Miranda Bogen’s report, Navigating Demographic Measurement for Fairness and Equity (May 2024), highlights the growing need for AI developers and policymakers to identify and mitigate bias in AI systems, emphasizing responsible data practices for measuring fairness.
There are ethical dilemmas associated with collecting and using data to make decisions that impact people’s lives. Responsible data practices are essential in addressing these challenges. For example, considering how an individual’s gender or marital status might impact their creditworthiness raises important questions about fairness and transparency, as well as the potential for discrimination based on data-driven insights and decisions.
Discriminatory practices and bias in data collection: Algorithms may inadvertently or intentionally discriminate against certain groups based on characteristics, experiences, and/or demographics due to incomplete data collection. For example, Daneshjou et al. (2022) document how dermatology AI models trained on datasets composed predominantly of lighter-skinned patients performed worse at predicting skin diseases for patients with darker skin tones.
Discriminatory practices and bias in data-based decision-making: Biased data can perpetuate and even exacerbate historical and social inequalities. For example, in 2018, Amazon scrapped an AI-based recruiting tool because it favored male candidates over female candidates. The tool was trained on resumes submitted over a 10-year period, most of which came from men, reinforcing the existing gender imbalance in tech. The AI system penalized resumes that included terms like “women’s” and downgraded graduates of women-only colleges. This also points to the need for fairness-aware machine learning and algorithmic transparency to combat bias.
Unequal access and experiences: Companies use collected data to develop products and services that may not benefit all individuals equally. Early versions of facial recognition technology were less effective at recognizing people with darker skin tones, a problem rooted in the use of biased training data that lacked sufficient representation of darker skin tones. This created unequal experiences for some people and perpetuated the digital divide by making certain features less accessible to some systemically marginalized groups.
Due to the potential harms listed above, ask yourself and your stakeholders if data collection is the only way forward. Perhaps there are other ways of understanding your target or current audience, such as data you can infer by proxy or publicly available sources of data.
For further reading on algorithmic harms and responsible data practices, consider these books:
Defining your use case ahead of time helps protect against collecting more sensitive data than is necessary. Establishing clear use case boundaries can also reduce the likelihood of later misuse. If you don’t know why you’re collecting a category of personal identity data, chances are your customers won’t know either, which can damage credibility and trust.
*Note: Carefully consider the tradeoffs around this approach. For more on the limitations of synthetic and proxy data, see:
Rocher, Luc. “Misinterpretation, privacy and data protection challenges – putting proxy data under the spotlight,” 2022
McCullough, Eliza and Villeneuve, Sarah. Participatory and Inclusive Demographic Data Guidelines, p. 9
To collect demographic data effectively, teams need to define what data is required and how it will be used. For the purposes of this primer, we define demographic data as measurable traits of any given population, such as, but not limited to, age, gender, and race.
We acknowledge there are many types of personal data, including demographic data. For more on defining personal data, see The Wired Guide to Your Personal Data (and Who Is Using It), 2019.
Before data collection, the practitioner’s goal is to understand how the data will be used in as much detail as possible. This practice may include any of the approaches outlined in the Using Demographic Data to Advance Product Equity section, including fairness testing. Fairness testing, however, is only necessary if it is a focus of your analysis, and will be explored further in the sections below.
As a best practice, we recommend drafting a pre-analysis plan before beginning data collection. This plan is a short document that details the following: (1) the structures, governance, and datasets already in place; (2) potential harms and risks, along with mitigation and retention safeguards; (3) the product being evaluated and the type of fairness or equity of primary concern; (4) the data points needed, including their granularity and intersectionality; (5) the statistical approach, if a fairness test is planned; and (6) anticipated outcomes and potential remediations.
This pre-analysis plan is designed to establish a shared understanding among all stakeholders regarding safe and effective data collection and the subsequent steps.
The following sections detail each of the six components of the pre-analysis plan, outlining key considerations, promoting responsible data practices, and identifying relevant stakeholders.
Fairness testing helps advance Product Equity by ensuring that algorithms, models, and decision-making systems operate without bias. Product Equity strategies shape fairness testing by identifying areas where marginalized groups may face disproportionate impacts. Together, they promote inclusive and fair system design. However, statistical fairness testing alone cannot address algorithmic harms. Mathematical definitions of “fairness” often overlook systemic discrimination, racial capitalism, and the ethics of risk classification. For more, see Rodrigo Ochigame’s The Long History of Algorithmic Fairness (2020).
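As a minimal illustration of what a statistical fairness check can look like (and of its limits), the sketch below compares positive-decision rates across demographic groups, a check often called demographic parity. The data, column names, and 80% screening threshold are hypothetical assumptions for illustration, not a prescribed methodology.

```python
import pandas as pd

# Hypothetical decision log: one row per person, with the system's decision and a
# self-reported demographic attribute. All values and column names are illustrative.
decisions = pd.DataFrame({
    "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,   0,   1,   1,   0,   0,   1,   0],
})

# Demographic parity compares approval rates across groups.
rates = decisions.groupby("group")["approved"].mean()
print(rates)

# A common (and contested) screening heuristic: flag any group whose rate falls
# below 80% of the highest group's rate. This surfaces a disparity; it does not
# explain it or establish that the system is fair or unfair.
flagged = rates[rates / rates.max() < 0.8]
print("Groups flagged for review:", list(flagged.index))
```

As noted above, a metric like this can surface a disparity, but it cannot account for the systemic factors behind it.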
The first step in the process is to understand the structures, governance, and datasets already in place. Companies rarely start with a blank slate, and a practitioner should not dive into data collection without developing a sense of the landscape.
Data assessments should evaluate the impact of your company’s data collection practices on systemically marginalized groups to identify areas for improvement. Explore other data collection and retention policies that exist. Your team might determine that regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) are not doing enough for your use case.
For example, see Meta’s approach, which involves secure multiparty computation (SMPC) to encrypt and split survey responses into fragments. These encrypted shares are distributed among third-party facilitators who aggregate the data without revealing individual responses. This approach allows Meta to analyze the data for fairness across different racial and ethnic groups without violating privacy.
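To make the idea concrete, the sketch below illustrates additive secret sharing, one building block behind approaches like SMPC: each response is split into random shares so that no single facilitator sees the true value, yet the shares still combine into the correct aggregate. This is a simplified teaching example under assumed parameters, not a description of Meta's actual system, and it omits the protections a production protocol requires.

```python
import random

PRIME = 2**61 - 1  # arithmetic is done modulo a large prime

def split_into_shares(value, n_facilitators=3):
    """Split one survey response into random shares that sum to the value (mod PRIME)."""
    shares = [random.randrange(PRIME) for _ in range(n_facilitators - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Hypothetical binary survey responses (1 = reported experiencing an issue).
responses = [1, 0, 1, 1, 0]

# Each facilitator receives one share per respondent and sums only its own shares.
facilitator_totals = [0, 0, 0]
for r in responses:
    for i, share in enumerate(split_into_shares(r)):
        facilitator_totals[i] = (facilitator_totals[i] + share) % PRIME

# Combining the facilitators' totals reveals only the aggregate count,
# not any individual response.
aggregate = sum(facilitator_totals) % PRIME
print(aggregate)  # 3
```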
Are there better models and/or approaches your company can adopt to ensure responsible data practices?
Minimizing harm by anticipating what could go wrong when collecting data is a top priority. Harms can arise from improper use, insecure storage, or data leaks, and they can manifest in many forms.
To address these risks:
Even if your company has established data retention policies and timelines, revisit those and decide if stricter policies are needed.
Tips for data retention:
To minimize risks associated with data breaches, regularly review and purge data that is no longer needed. Data leaks or theft can expose individuals to malicious actors, identity theft, financial loss, and other serious consequences. People who hold systemically marginalized identities are more vulnerable to these risks, and leaked data may expose these communities to additional negative—potentially lifelong—consequences.
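Retention policies are easier to uphold when purging is routine rather than ad hoc. Below is a minimal sketch of a scheduled purge over a hypothetical table of survey records; the schema, file name, and one-year window are assumptions to replace with your own policy.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 365  # hypothetical policy: purge demographic survey data after one year

conn = sqlite3.connect("survey_responses.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS responses (id INTEGER PRIMARY KEY, collected_at TEXT, payload TEXT)"
)

# Delete records older than the retention window. Run this on a schedule
# (e.g., a nightly job) and log how many rows were removed for audit purposes.
cutoff = (datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)).isoformat()
deleted = conn.execute("DELETE FROM responses WHERE collected_at < ?", (cutoff,)).rowcount
conn.commit()
print(f"Purged {deleted} expired records")
```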
Equity assessments and fairness tests must be aligned with the product being evaluated. If the product is designed to address a specific user experience, the focus should be on measuring disparities within that experience. This, too, should inform the design of the assessment or fairness test.
The nature of the product also shapes the discussions that follow the evaluation. Findings from an assessment or fairness test do not exist in a vacuum; any burdens or benefits identified for one group need to be weighed against other considerations. If the goal is to modify the product based on the findings, product teams should set realistic expectations about what can and can’t be changed.
At this stage, the team should align on the specific types of fairness or equity that are the primary concern, ensuring that responsible data practices are integrated into the process.
For example, if the team is conducting an equity-focused assessment, they might consider questions like:
In contrast, a fairness test may be used to evaluate an algorithm. For instance, suppose a new algorithm is developed to decide whether an individual qualifies for a loan. The team might ask:
Gather input from a diverse range of stakeholder teams, including:
Engaging with these groups will help support a robust discussion about what fairness and equity mean in the specific context of the product being tested.
Once the outcome to be measured is selected, the pre-analysis plan should specify which data points are required. The type of demographic data needed will depend on the dimensions being assessed—for instance, is the evaluation going to measure disparities by national origin, gender, race, or some other attribute? You will want to prioritize the dimensions most relevant to your product; there is no one-size-fits-all solution.
For example:
The team should also identify any existing data relevant to the assessment or fairness test. For example, in the previous loan scenario, past default rates or income might provide valuable context. The goal is to identify all data that might help explain a disparity.
When choosing attributes, consider intersectionality to address the nuanced experiences of people who belong to multiple marginalized groups. Without capturing intersectionality, there is a risk that these individuals’ experiences are lost in the aggregation of one of the identities. For example:
The level of granularity (or disaggregation) is another facet of acknowledging the nuanced experiences of your customers. When considering user groups, examine the level of granularity or coarseness a dimension of identity requires. For example:
Since many identity categories are based on social constructs—many of which are not created by members within those identity categories—it is important to apply responsible data practices to determine when granularity is necessary.
While identity data is valuable for measuring inclusivity, using such data to directly personalize user experiences without explicit, opt-in consent can be harmful. For example:
The decision on the level of granularity is not one-size-fits-all:
Deciding on the right balance between intersectionality and disaggregation should be guided by the size of your data set. This is often an iterative process. This resource from the Partnership on AI grapples with the privacy/accuracy trade-offs at length. It can help anticipate challenges during the analysis phase and inform the establishment of feedback mechanisms and monitoring practices to validate that your chosen trade-off yields useful insights.
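One practical way to balance disaggregation against reliability and privacy is to suppress intersectional groups whose sample is too small to report. The sketch below applies a minimum cell-size threshold to hypothetical self-ID data; the threshold, categories, and column names are illustrative, and the right value depends on your data set and the privacy/accuracy trade-offs discussed in the Partnership on AI resource.

```python
import pandas as pd

MIN_CELL_SIZE = 30  # hypothetical reporting threshold; tune to your data and risk tolerance

# Hypothetical self-ID data with two dimensions of identity.
df = pd.DataFrame({
    "gender":    ["woman"] * 80 + ["man"] * 80 + ["nonbinary"] * 12,
    "race":      (["Black"] * 40 + ["Asian"] * 40) * 2 + ["White"] * 12,
    "satisfied": [1, 0] * 86,
})

# Disaggregate by the intersection of gender and race, then suppress cells
# that are too small to analyze or report responsibly.
cells = df.groupby(["gender", "race"]).agg(
    n=("satisfied", "size"),
    satisfaction_rate=("satisfied", "mean"),
)
reportable = cells[cells["n"] >= MIN_CELL_SIZE]
suppressed = cells[cells["n"] < MIN_CELL_SIZE]
print(reportable)
print(f"{len(suppressed)} intersectional groups suppressed (n < {MIN_CELL_SIZE})")
```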
This process ensures that the data collected is both relevant and compliant, setting a strong foundation for equity and fairness evaluations.
If the team is pursuing a fairness test, then once there is agreement on the type of fairness being tested, the groups of interest, and the relevant variables, the pre-analysis plan should specify the statistical approach used to assess fairness, as in the sketch below.
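Here is a minimal sketch of such a pre-specified test, continuing the hypothetical loan example. All counts, groups, and thresholds are illustrative, and the two-proportion z-test is just one of many approaches a plan might name.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical pre-specified test from a pre-analysis plan: compare loan approval
# rates for two demographic groups with a two-proportion z-test. Counts are illustrative.
approved_a, total_a = 180, 400   # group A
approved_b, total_b = 210, 400   # group B

p_a, p_b = approved_a / total_a, approved_b / total_b
p_pool = (approved_a + approved_b) / (total_a + total_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
z = (p_a - p_b) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Approval rates: A={p_a:.2f}, B={p_b:.2f}, z={z:.2f}, p={p_value:.3f}")
# Per the plan, a pre-registered threshold (e.g., p < 0.05 together with a
# practically meaningful gap) determines whether the finding triggers review.
```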
Describing statistical tests in advance is considered the gold standard in the social sciences because it enforces discipline and mitigates hindsight bias. This upfront specification helps mitigate the risk of teams inadvertently manipulating testing by running repeated analyses to achieve more favorable results. Even in good faith, a team might add or adjust variables after finding a problematic result.
A well-defined pre-analysis plan clarifies the statistical approach from the outset, highlighting any deviations from the planned approach. Responsible data practices ensure transparency in these decisions. Departures are often warranted as new things are learned or new data comes in, but being clear that these are departures from the plan is key.
With the data and testing approach clearly outlined, the pre-analysis plan should anticipate potential outcomes and remediations.
In the plan, write out all possible findings—e.g., group A does worse than group B; group A is equal to group B; group A does better than group B; and so on. For each of these findings, describe potential remediations, like retraining an algorithm, modifying the product, delaying the product launch, and so on.
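One lightweight way to keep this mapping explicit and reviewable is to record it as structured data alongside the pre-analysis plan. The findings and remediations below are hypothetical placeholders.

```python
# Hypothetical findings-to-remediation mapping from a pre-analysis plan.
# Entries are illustrative; the right remediations depend on the product.
remediation_plan = {
    "group_A_approval_rate_significantly_lower": [
        "Retrain the algorithm with rebalanced or additional data",
        "Modify the product flow that drives the disparity",
        "Delay launch pending an equity review",
    ],
    "no_statistically_significant_disparity": [
        "Document the result and continue periodic monitoring",
    ],
    "group_A_approval_rate_significantly_higher": [
        "Investigate whether group B faces barriers elsewhere in the funnel",
    ],
}

for finding, remediations in remediation_plan.items():
    print(finding, "->", "; ".join(remediations))
```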
All teams should weigh in:
The period before data collection should be devoted to outlining a thorough pre-analysis plan. In doing so, the team can coalesce around a strategy and answer all the relevant questions necessary to embark on responsible data collection.
Once you’ve determined your use cases, established policies, and built safeguards, consider user rights around transparency, consent, and ownership during the data collection process.
When planning data collection, consider adopting a co-design approach, which emphasizes collaboration with participants and product team stakeholders to create solutions that are both relevant and effective for those impacted.
Ideally, this process is owned by user experience (UX) stakeholders, such as UX researchers and UX designers. Co-design fosters direct engagement with communities, enabling teams to build with, rather than just for, these groups.
Key elements of collaboration include:
Data collection should also encompass self-identification (self-ID), which allows people to share their demographic information and capture their unique backgrounds and experiences.
When selecting tools or vendors for data collection, confirm they align with your team’s privacy and security standards, as third-party platforms may have differing safeguards.
Additionally, determine whether data collection will be conducted in-person, online, within a product experience, or separately. Each method requires careful consideration of user rights, transparency, consent, and data ownership.
Transparency is essential in explaining the purpose for data collection and addressing common concerns such as privacy, data storage, and retention.
Consider creating a Frequently Asked Questions (FAQ) resource to answer key questions like “Who will have access to this data?” and “What data will they access?” Use clear and accessible language so that all participants, including those with disabilities or limited English proficiency, can understand. Keep communications concise and link to more detailed policies as needed.
Legal review of company policies and user communication is recommended at this stage. Clearly articulate the value exchange—how data collection will benefit participants—as this fosters trust and helps them make informed decisions.
Informed consent is a critical process often facilitated through written documents, such as Non-Disclosure Agreements (NDAs). However, participants may overlook these materials or be overwhelmed by legal jargon. To promote genuine informed consent, include a verbal explanation in moderated data collection settings, outlining the research approach, data types, access permissions, and any sensitive topics that will be discussed.
Obtain affirmative consent for each element shared, and emphasize participants’ right to withdraw at any time without any negative consequences. To enhance understanding, consider using visual aids or videos, similar to Patient Decision Aids (PDAs) in healthcare, which help individuals make informed decisions based on their values and preferences.
Anonymity is another important consideration. Clearly communicate the actual level of anonymity participants can expect and their rights regarding data retention and retrieval. Participants should be made aware of the risks associated with sharing identifying information, from severe consequences like exposure to harm due to their identity or experiences, to milder yet still significant risks, like exclusion from systems. Never share personally identifiable information (PII), and ensure participants are aware of this from the outset.
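One common safeguard is to separate direct identifiers from responses before analysis, replacing them with random participant IDs held in a restricted key table. The sketch below shows that pseudonymization step with made-up records; note that pseudonymized data is not fully anonymous, which is part of what should be communicated honestly to participants.

```python
import uuid

# Hypothetical raw records containing direct identifiers alongside responses.
raw_records = [
    {"email": "participant1@example.com", "response": "Agree"},
    {"email": "participant2@example.com", "response": "Disagree"},
]

key_table = {}          # identifier -> participant ID; store separately with restricted access
analysis_records = []   # what analysts actually see: no direct identifiers

for record in raw_records:
    participant_id = key_table.setdefault(record["email"], str(uuid.uuid4()))
    analysis_records.append({"participant_id": participant_id, "response": record["response"]})

print(analysis_records)
```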
Before launching externally, conduct internal testing with representative groups, such as employee resource groups (ERGs), to gather feedback and make necessary adjustments.
Always offer participants the option to opt-out or refuse participation in any user research or data collection process. Additionally, individuals should have the option to withdraw their data after it has been collected. This mechanism plays a crucial role in ensuring that data collection is ethical, respects privacy, and adheres to consent principles.
Deciding which groups to include in data collection should align with the predetermined research goals and intended use cases, as well as with the groups you may be co-designing with.
It is important to also address potential biases in the dataset. You may aim for proportional representation to mirror the general population or over-sample specific groups for more granular insights.
For example, oversampling is vital for disaggregating data by race and generating meaningful results for smaller populations like American Indian/Alaska Native and Asian American/Pacific Islander groups, often overlooked in broad categories. Even if you’re targeting a narrow market, it is beneficial to include diverse perspectives to gain a broader understanding or uncover contrasting insights.
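When a group is oversampled, population-level estimates should be reweighted so the oversample does not skew aggregate results, while the larger subsample still supports disaggregated analysis. Below is a minimal sketch with hypothetical population shares, sample counts, and satisfaction rates.

```python
# Hypothetical example: group X is 2% of the population but was deliberately
# oversampled to 20% of responses so its disaggregated results are meaningful.
population_share = {"group_X": 0.02, "everyone_else": 0.98}
sample_counts    = {"group_X": 200,  "everyone_else": 800}
satisfaction     = {"group_X": 0.61, "everyone_else": 0.74}  # observed rate per group

total_n = sum(sample_counts.values())

# Weight each group by (population share) / (sample share) so population-level
# estimates reflect the population, not the sample design.
weights = {g: population_share[g] / (sample_counts[g] / total_n) for g in sample_counts}

unweighted = sum(satisfaction[g] * sample_counts[g] for g in sample_counts) / total_n
weighted = (
    sum(satisfaction[g] * sample_counts[g] * weights[g] for g in sample_counts)
    / sum(sample_counts[g] * weights[g] for g in sample_counts)
)

print(f"Unweighted overall rate: {unweighted:.3f}")   # skewed toward the oversampled group
print(f"Population-weighted rate: {weighted:.3f}")
```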
We hope this guide serves as a valuable starting point for product and data teams to adopt responsible data practices in advancing Product Equity. By using this resource, teams can better navigate the complexities of ethical data collection, minimize potential harms, and build more inclusive, equitable products. Future iterations will incorporate broader insights and case studies to further refine and strengthen this resource.
We invite you to collaborate with the Product Equity Working Group to help shape industry-wide best practices. Through collective effort, we can accelerate change and work toward a future where more people have access to all that tech has to offer.
If you are interested in joining the Product Equity Working Group, please email us at TechAccountabilityCoalition@AspenInstitute.org. For more Product Equity resources or to learn about the Product Equity Working Group, visit our Resource Hub.
Across the Product Equity resources published on Aspen Digital, we use the term systemically marginalized to describe individuals and communities who have historically faced systemic injustices, continue to face them today, or are newly marginalized due to evolving structural disparities. These forces shape how they engage with or are excluded from digital products.
We believe the language we use to talk about people is immensely important, alive and evolving, context-specific and connotative, and highly political. To ensure intentionality in our language, we carefully reviewed how companies, civil society organizations, and academic institutions are talking about vulnerable communities. Based on these considerations, we have adopted the term systemically marginalized.
As an organization, we aim for inclusion in our language and recognize that grouping such a diverse set of communities together into one umbrella term may cause some generalizations. We also aim for precision and will specify wherever possible to honor the unique experiences, histories, challenges, and strengths of communities. We hope this language helps us advance the emerging field of Product Equity and welcome the iterations and evolutions this term may take in the upcoming years of this work.
We extend immense gratitude to the members of the Product Equity Working Group, both past and present, for sharing their time and expertise over the past two years to develop this foundational guide on responsible data practices. Their perspectives were crucial in ensuring we approached this work with care and consideration. A special thanks to our Chair, Dr. Madihah Akther, who drove this initiative forward, ensuring care and consideration at every step of the process.
Thank you to the subject matter experts who reviewed drafts of this primer and were immensely generous with their time and expertise:
This effort was led by the following team members at Aspen Digital:
Responsible Data Practices for Product Equity by Aspen Digital is licensed under CC-BY 4.0.