Reclaiming AI for Development

Benchmarking What Matters

August 20, 2025

AI is accelerating fast, but so far, most of its power has been optimized for a narrow set of goals that, in many cases, do not reflect public needs and aspirations. As with past waves of innovation, metrics of success are crafted to achieve specific outcomes—often business-oriented—that ultimately set the trajectory of technological evolution. This logic of measurement has long shaped not only innovation but also international development, where similar tensions between measurable outcomes and public needs have persisted. Today, AI benchmarks risk reproducing those blind spots, but this can be avoided with deliberate coordination and community engagement.

This piece explores how accumulated knowledge about measurement and development can help steer AI toward public needs, and how AI, in turn, can contribute to addressing development goals more systemically. Rethinking what we measure—and how we measure it—may be key to advancing AI for development. Building on historical patterns and structural incentives, it argues for a shift toward community-aligned AI benchmarks that bridge innovation with lived realities, delivering outcomes that are not only measurable but truly transformative.

Few fields have embraced the language of measurement as wholeheartedly as international development. From sweeping global strategies to modest pilot projects, everything is planned, executed, and evaluated within a grammar of verifiable outcomes. This logic shows up in development goals and indicators, and runs through a wide range of planning and funding instruments. The very definition of what counts as “success” tends to be quantified.

This way of structuring action solidified in the 1970s and 1980s. As the developmentalist model—which prioritized state-led industrialization and social equity over short-term efficiency—fell out of favor, and the role of the state retreated under economic pressure, new modes of global governance took root, prizing results-based management. Development, especially from the perspective of the Global North, came to be seen less as a long-term political process and more as a sequence of targeted interventions.

Tools like the Logical Framework Approach (LFA) gained prominence. First introduced by the US Agency for International Development (USAID), LFA offered a deceptively simple structure: ideas became matrices, purposes turned into indicators, and complex problems were reduced to lines of action. The approach was soon adopted by bilateral agencies like the Canadian International Development Agency, the German Technical Cooperation Agency (GTZ), and also by several United Nations programs. By the 1990s, it became a formal requirement in European Union cooperation programs, and eventually permeated the project cycles of multilateral development banks like the World Bank or the Inter-American Development Bank (IDB), consolidating a common language of planning and monitoring that remains dominant today.

By promoting common structures and clearer expectations, measurement frameworks have sharpened objectives and enabled comparisons across vastly different contexts. Indicators can turn broad ambitions into actionable targets that help guide decisions. This discipline has led to more structured coordination and a greater capacity to track whether commitments are being met.

But in doing so, the system has also reinforced a narrow worldview where definitions of success tend to reflect the values of funders rather than the lived realities of communities. Complex systemic interactions and dynamics are routinely overlooked. Numbers become stand-ins for real impact, and anything that doesn’t fit neatly into a results matrix is often sidelined or ignored. By deciding what gets counted, these indicators also decide what counts, in turn defining the problems we see and the solutions we pursue.

The indicators are met, but the real goals often aren’t. Examples are hard to miss. Many countries raised school enrollment rates to meet global education targets, only to find that, well before the pandemic, more than half of 10-year-olds in low- and middle-income countries couldn’t read and understand a simple text. Today, that figure is nearing 70%, and the number of children and youth out of school globally has barely budged, down just 1% in a decade. In health, more patients may be treated, but care quality stagnates. The development sector’s obsession with quantifiables has bred a culture of box-checking, where what matters is looking good on paper. Even the World Bank now acknowledges that ending extreme poverty by 2030 is no longer a viable goal. Around 700 million people still live on less than $2.15 a day, and progress has slowed to a crawl.

The paradox is striking: despite mountains of evaluations and “lessons learned,” few are scaled or translated into real-world impact, and many reports are not even read (thank you, by the way, if you’re still here). At best, systems “teach to the test” and succeed in improving the chosen indicators—a meaningful gain, though often insufficient to address deeper structural challenges. At worst, data is massaged to avoid scrutiny, rewarding appearances rather than transformation. Solutions exist, but they rarely overcome the political and operational bottlenecks that prevent their broader adoption. And if a solution fails to reach the people it’s meant to help, can we really call it a solution?

The lesson, therefore, is not to abandon measurement, but to rethink it. Indicators can be a powerful scaffold for development, but without community ownership to fill in the structure, they risk standing as an empty frame. A measurement culture oriented this way could transform indicators from static reporting tools into dynamic instruments for learning and adaptation. The information itself must be converted into innovative mechanisms that enhance the effectiveness of implemented solutions. By capturing patterns across contexts and over time, new frameworks could help turn “lessons learned” into lessons applied, enabling progress that is both measurable and meaningful.

The difficulty of achieving systemic change extends beyond development. Too often, technologies have relied on narrowly set targets, with little regard for the broader social and cultural systems into which they were diffused, constraining their potential to drive meaningful transformation. 

The history of innovation offers plenty of examples. Consider the invention of the cotton gin in the United States in 1793. By rapidly separating cotton fibers from seeds, the machine transformed agricultural productivity and turned cotton into the country’s leading export. Yet its broader impact was far from uniformly positive. Rather than reducing reliance on enslaved labor, as some anticipated, the cotton gin made slavery more profitable, fueling the expansion of plantations across the South and entrenching the institution at the heart of the American economy. The economic gains were undeniable but deeply uneven: technological progress reinforced existing social and economic hierarchies, locking in patterns of exclusion that shaped U.S. society for decades to come.

Just as mechanization and electrification defined transformative waves in the 20th century, today we stand amid another revolution: AI. Every week seems to bring a new model that’s smarter and more jaw-droppingly powerful than the last. The hype is everywhere, in academic papers, investor decks, corporate manifestos, and, of course, on your LinkedIn feed. Like fire or electricity (or most innovations), AI is not inherently good or bad. It’s a force in search of direction. There are inspiring use cases, like AlphaFold’s medical breakthroughs in protein structure prediction, which is helping researchers develop treatments for cancer and other complex diseases, or grassroots “AI for Good” efforts, but these remain isolated projects driven by individual vision or institutional will. 

What’s missing is an intentional, coordinated cross-sector strategy to embed AI for development in a way that is scalable and systemic. Like previous waves of technological change, AI will not steer itself—it requires deliberate coordination to ensure its power is harnessed for broad and inclusive progress. The core of the coordination could lie in benchmarks.

Technically, an AI benchmark is a standardized dataset paired with metrics for a specific task, such as answering multiple-choice questions or identifying objects in an image. Standardization allows for easy performance comparisons across models. It also fuels an arms race of leaderboard rankings. And rankings matter: finishing at the top of a benchmark can mean prestige and investment for AI researchers and developers.
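As a minimal sketch, a benchmark in this sense is just a fixed dataset plus a scoring function. The tiny task, examples, and "model" interface below are hypothetical illustrations, not any real benchmark:

```python
# Minimal sketch of an AI benchmark: a fixed dataset plus a metric.
# The task, examples, and model interface here are illustrative assumptions.

def accuracy(predictions, labels):
    """Fraction of examples the model answers correctly."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# A toy multiple-choice "dataset": (question, choices, index of correct answer)
DATASET = [
    ("2 + 2 = ?", ["3", "4", "5"], 1),
    ("Capital of France?", ["Paris", "Rome", "Lima"], 0),
]

def evaluate(model):
    """Run a model over every example and reduce it to one leaderboard score."""
    preds = [model(question, choices) for question, choices, _ in DATASET]
    labels = [answer for _, _, answer in DATASET]
    return accuracy(preds, labels)

# A trivial stand-in "model" that always picks the first choice.
baseline = lambda question, choices: 0
print(evaluate(baseline))  # 0.5 on this toy dataset
```

Because every model is scored against the same frozen dataset and metric, scores are directly comparable, which is precisely what makes leaderboards both useful and so influential.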

AI benchmarks act as both compass and currency: they steer research and signal what kinds of “intelligence” are worth pursuing. Since the early days of deep learning, these benchmarks have set the direction of the field. ImageNet, launched in 2010 with millions of labeled images, prompted a deluge of models designed for recognizing the content of images. In natural language processing, GLUE (2018) and its successor SuperGLUE (2019) aggregated tasks like sentiment analysis, question answering, and textual inference to produce an overall score for language model comparison. 

This dynamic creates an economy of attention and prestige that shapes research priorities. Models are trained for what’s measured. Everything else becomes an afterthought. Like development indicators, AI benchmarks play a structuring role: they decide which tasks matter, which datasets get built, and what progress is celebrated. 

On the positive side, by offering standardized, quantitative targets, benchmarks allow researchers to compare models under controlled conditions, pushing the boundaries of performance. Many landmark achievements, like the success of deep convolutional neural networks in image recognition, were catalyzed by massive jumps in benchmark scores.

But the framework also comes with its share of drawbacks. One of them is that most of today’s dominant benchmarks were built in—and for—a narrow set of contexts: data-rich, primarily English-language, and with privileged access to compute. And if a model performs well on a test set, it might not mean much in the wild, where data is noisy and potentially very different from the examples used in testing. What may seem like a harmless geographic or linguistic over- or under-representation in a data sample can therefore become a systemic distortion that, left unchecked, compounds today’s disparities over time.

Additionally, in chasing this vision, benchmarks often end up fostering what some scholars call the “technology illusion,” where tools like algorithms and platforms are mistaken for goals in and of themselves, rather than instruments serving broader agendas. As a result, the development of AI risks becoming self-referential, optimizing for technical performance while losing sight of the real-world problems it could help solve.

Perhaps most importantly, often the tasks evaluated by AI benchmarks were not chosen to meet the world’s most urgent needs. They were selected for reasons that, in general, aligned with the business incentives of those funding the research. Examples include being easy to measure for a specific task, offering quick wins that are attractive for investor relations, leveraging datasets that are readily available in English, or just fitting a narrow vision of “human intelligence” in pursuit of Artificial General Intelligence, or “AGI”. 

The result is predictable: communities in the Global South, along with marginalized groups in the Global North, are left out of the optimization loop. Their needs and culture are underrepresented or outright invisible in prioritized benchmark datasets. And if an important task, like predicting crop resilience or diagnosing disease in underserved regions, isn’t represented by a benchmark, it simply doesn’t exist in the incentive structure.

At Aspen Digital, we’re building an initiative that sits right at this critical crossroads: Community-Aligned AI Benchmarks. Its goal is to foster the adoption of AI benchmarks that act not just as technical tools but as strategic instruments to align research with real public priorities. This approach offers a concrete pathway to advancing AI for development.

Acknowledging that current benchmarks often fail to reflect the world’s most urgent development challenges, this initiative focuses on those captured in the Sustainable Development Goals (SDGs). It begins where the gap is most glaring, and the need most urgent: food security (SDG 2), the number-one public concern globally according to multilateral surveys. (To learn more about our research process, see Intelligence in the Public Interest.)

By centering the expertise of food systems practitioners and affected populations, the initiative works to avoid the pitfalls of previous development and AI metrics and instead identify concrete, high-impact problems that AI could be developed to solve. A global survey is currently underway to surface these priorities, and we’re actively inviting others to help share it and contribute to expanding its reach.

The next step is to translate the insights gathered from this survey, findings from systematic reviews by international development organizations, and workshops and conversations with local stakeholders, into concrete benchmark tasks, co-designed with AI developers. These benchmarks will be technically rigorous, but socially grounded, helping steer innovation toward real-world utility and establishing a credible foundation for AI for development.

The logic of metrics has long shaped both international development and AI, determining which goals get attention and which strategies are pursued. Yet, as we have seen, when these systems operate in isolation from the broader realities they aim to influence, they can deliver impressive numbers while falling short of meaningful change. 

Community-Aligned AI Benchmarks offer a tangible, potent methodology to connect these two measurement cultures into a more responsive, integrated architecture for advancing AI for development. Crucially, benchmarks often serve as pre-design indicators rather than post-deployment evaluations, meaning that by shaping them upfront we can influence the direction of AI systems from the outset and not merely mitigate risks or correct biases after they emerge.

AI allows us to rapidly systematize the extraordinary record of information that too often sits unused in reports and archives. When processed through well-designed benchmarks, this body of evidence can be transformed into structured, regularly updated reference points that inform both policy and technology design. This would make it possible to draw on decades of accumulated knowledge at the speed and scale required by today’s development challenges.

The depth and contextual richness of international development expertise can, in turn, serve AI itself. By embedding these insights into benchmark design, innovation can be shaped in real time by the priorities and values of the communities it is meant to serve. This two-way exchange, in which AI amplifies the reach of development knowledge and development grounds AI in real-world priorities, offers a path toward systems that are both technically rigorous and socially meaningful.

Creating clusters of benchmarks can redefine how we use AI for development. By grouping related metrics that together represent a larger, systemic problem, these clusters would guide AI researchers toward applications that address root causes rather than isolated symptoms. A “food security cluster,” for example, could combine benchmarks on agricultural productivity, supply chain efficiency, climate resilience, and nutrition outcomes, pushing AI systems toward solutions that capture the interdependencies that determine whether people actually have enough quality food to eat.
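One way to picture the mechanism: a cluster is a set of member benchmarks whose scores are aggregated into a single target, so a system cannot rank highly while ignoring part of the problem. The benchmark names and weights below are purely illustrative assumptions, not an actual cluster design:

```python
# Hypothetical sketch of a "benchmark cluster": related benchmarks grouped
# so a system is scored on a systemic problem, not a single isolated task.
# All benchmark names and weights are illustrative assumptions.

FOOD_SECURITY_CLUSTER = {
    "agricultural_productivity": 0.30,
    "supply_chain_efficiency": 0.25,
    "climate_resilience": 0.25,
    "nutrition_outcomes": 0.20,
}

def cluster_score(per_benchmark_scores, cluster=FOOD_SECURITY_CLUSTER):
    """Weighted aggregate over member benchmarks. Missing tasks score 0,
    so a model cannot top the cluster by optimizing only one dimension."""
    return sum(
        weight * per_benchmark_scores.get(name, 0.0)
        for name, weight in cluster.items()
    )

# A system strong on two dimensions but silent on the other two:
scores = {
    "agricultural_productivity": 0.9,
    "supply_chain_efficiency": 0.8,
}
print(round(cluster_score(scores), 2))  # 0.47: partial solutions score poorly
```

The design choice doing the work here is the default of zero for unaddressed benchmarks: the incentive structure itself rewards systems that engage with the interdependencies, mirroring the argument above.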

Beyond shaping AI design, clusters could transform how policy decisions are made. By capturing interactions across sectors, they would allow governments and development agencies to simulate the ripple effects of proposed interventions before implementing them. This means that an AI for development tool trained on a cluster could test, for example, how a new agricultural subsidy might affect market prices, nutrition, and environmental resilience, turning benchmarks into engines for policy.

Importantly, building this capability does not require starting from zero. Across much of the Global South, some of the foundational components are already in place: national statistical systems capable of generating reliable datasets, open data platforms that enhance transparency, digital transformation agendas that set long-term priorities, and public research institutions (and people) with deep local knowledge. These assets represent a critical starting point for embedding development priorities into AI benchmark design. 

But mobilizing these existing assets calls for catalytic funding and stronger institutional coordination. Gaps persist in infrastructure and regulatory capacity. These limitations constrain the ability to design benchmarks that fully align with national development priorities. 

Since the incentive structure behind dominant benchmarks leaves little room for deviation, there is little reason for researchers or companies to look elsewhere. Shifting this trajectory will not happen organically; it requires deliberate intervention from actors with the authority and capacity to expand the terms of the game.

From the early days of the American Defense Advanced Research Projects Agency’s (DARPA) foundational AI research, to mission-driven projects like the Internet and GPS, to emergency-driven breakthroughs such as COVID-19 vaccines, coordinated multi-actor efforts have repeatedly proven their capacity to steer innovation toward public benefit. And this is not a phenomenon limited to the Global North: Argentina’s ARSAT program and Brazil’s Amazonia-1 satellite show how nationally led initiatives, grounded in strategic vision and alliances, can deliver transformative technologies. What unites these cases is the ability to forge effective public–private partnerships and sustain transnational cooperation to build enabling ecosystems from the ground up, a quality that AI for development urgently needs.

Today, multilateral development banks, with their experience in building cross-border public goods, are well positioned to lead the challenge. They can bring together governments, firms, universities, research consortia, and local actors to co-develop benchmark frameworks that are locally relevant yet globally interoperable. By embedding these initiatives within broader development strategies, these institutions can help ensure that AI innovation serves the communities it is meant to benefit. 

What is distinctive here is the proposal to shape systems from the outset rather than relying on after-the-fact fixes or building one-off AI solutions for each individual project. Many development institutions have already been promoting AI for development for years—to enhance internal efficiency and to design concrete, project-specific interventions—but this could be a channel to embed AI for development strategies in a more systemic way.

There’s a range of financial instruments that, if deployed strategically, could accelerate both the creation and the adoption of benchmark clusters. These are not abstract ideas; many of these tools are already in use for other policy priorities and could be adapted to the frontier of AI for development:

  • Policy-based loans, for example, have long been used to support fiscal reforms and institutional upgrades. In the AI for development context, they could be retooled to build the regulatory capacity needed to oversee benchmark-driven innovation. By tying disbursements to concrete governance milestones, these instruments can ensure that technical adoption goes hand in hand with institutional readiness.
  • Blended finance offers another powerful lever, combining public, private, and philanthropic capital to de-risk innovation in areas where market incentives alone would fall short. Initiatives like the Sustainable Development Investment Partnership have shown how blended models can mobilize large-scale investment for infrastructure aligned with the Sustainable Development Goals. Applied to AI for development benchmarks, such structures could bring in investors and public institutions to co-finance systems that address complex development challenges while distributing both the risks and the benefits.
  • Results-based financing takes a different approach, linking funding directly to verified outcomes. Already tested in sectors such as health and water, this model could be used to reward the measurable alignment between AI deployments and development outcomes, for instance, documented improvements in clusters of nutrition indicators, education access, or climate resilience. 
  • Guarantees and targeted investment loans can also help close the gap between promising prototypes and large-scale deployment. By reducing the perceived risk for private investors, these instruments can attract capital to AI for development projects that would otherwise struggle to reach market viability. The European Investment Bank, for instance, uses such guarantees to unlock financing for sustainable infrastructure, an approach that could be mirrored in the AI and development space.
  • Finally, thematic funds and coordinated multilateral financing could provide the scale and continuity needed for systemic change. The Climate Investment Funds, for example, have demonstrated how pooled resources from multiple development banks can back transformative technologies in renewable energy and resilience. A similar mechanism, focused on AI for development, could ensure sustained funding for benchmark development, across multiple regions, with contributions and governance shared among participating countries.

Deploying these instruments effectively will depend on strengthening the institutional and technical capacity to govern benchmark clusters. Without the ability to generate and maintain relevant, high-quality benchmarks, even the most sophisticated financing tools will fail to deliver their intended impact. This makes capacity-building a central pillar of any AI for development strategy.

To advance meaningfully, what matters most is not only the invention of new tools, but the clarity of purpose guiding them. Means gain value only when they serve ends worth pursuing. Policies and instruments long tested can, when guided by renewed intention, transform the digital shift into a force for inclusion rather than a catalyst for new divides.

Fostering innovation and prioritizing human development are not contradictory goals. Democratic societies are built on a conception of freedom that prioritizes the development of individuals within the framework of an organized community. A fundamental maxim of these societies is that “the principle of scientific autonomy is complemented by the principle of responsibility,” ensuring that progress serves the public good as well as technical advancement. When AI is anchored in community priorities and guided by robust benchmarks, it becomes a lever for innovation that is both sovereign and resilient. 

The core of this proposal lies in building meaningful alignment between what AI measures and what truly matters for development. Advancing AI for development doesn’t require more AI experts; it requires more domain experts who can understand the logic behind this technology and bridge deep contextual knowledge with technical tools, ensuring that innovation is grounded in real-world priorities and capable of addressing systemic challenges.

This is an opportunity to reimagine not only what we measure, but why we measure it, how we measure it, and who gets to decide. Without intentional design, AI risks replicating the exclusions and blind spots that international development has spent decades trying to overcome. With strategic investment and knowledge deeply rooted in communities, it is possible to build a generation of AI systems based on community priorities and aligned with the public interest.

AI benchmarks are shaping the frontier of innovation; aligning them with human priorities can mark a milestone that expands the frontier of development.

Thanks to B Cavello, Eleanor Tursman, and Jacob Wentz for their support and contributions.


Francisco Jure is an economist working at the intersection of technology, innovation, and development through the lens of political economy. He is a MAIR candidate at Johns Hopkins SAIS, where he serves as a Public Service Fellow, President of the Latin American and Caribbean Studies Club, and participated in the White House’s Trilateral Technology Leaders Program. He is currently a Google Public Policy Fellow at Aspen Digital, working on the intersection of AI and development, and contributes to quantum technology research at the European Centre for International Political Economy. Previously, he worked with multilateral institutions such as the Inter-American Development Bank and CAF – Development Bank of Latin America, served as Director of Strategic Priorities for the Government of Argentina, and advised in the legislature of Córdoba Province.
