About
Hello there! I’m Kehang.
I obtained my Ph.D. from Harvard University in 2026, jointly advised by Prof. John Horton from MIT Sloan’s IT group and Prof. David Parkes from Harvard’s EconCS group. I had the pleasure of interning at Google DeepMind and Amazon Science.
I study AI agents as proxies for human decision-making and how individuals collaborate with these agents in economic environments.
My research has two primary strands.
- First, I examine the capabilities, benefits, and trade-offs of deploying large language models (LLMs) as autonomous agents. I develop novel AI tools and run large-scale economic experiments to understand when and how LLM agents replicate, augment, or diverge from human behavior.
- Second, I use LLM-based simulations to uncover new patterns in human decision-making and to design more effective interventions, market mechanisms, and organizational policies. By combining computational modeling with experimental economics, my work aims to inform the responsible deployment of AI in markets and institutions.
My Research
Working Papers
Reject and Resubmit at the Quarterly Journal of Economics.
Extended abstract at the ACM Conference on Economics & Computation (EC '26).
EC '26 Exemplary Paper Award
We present an approach for automatically generating and testing, in silico, social scientific hypotheses. This automation is made possible by recent advances in large language models (LLMs), but the key feature of the approach is the use of structural causal models. Structural causal models provide a language to state hypotheses, a blueprint for constructing LLM-based agents, an experimental design, and a plan for data analysis. The fitted structural causal model becomes an object available for prediction or the planning of follow-on experiments. We demonstrate the approach with several scenarios: a negotiation, a bail hearing, a job interview, and an auction. In each case, causal relationships are both proposed and tested by the system, finding evidence for some and not others. We provide evidence that the insights from these simulations of social interactions are not available to the LLM purely through direct elicitation. When given its proposed structural causal model for each scenario, the LLM is good at predicting the signs of estimated effects, but it cannot reliably predict the magnitudes of those estimates. In the auction experiment, the in silico simulation results closely match the predictions of auction theory, but elicited predictions of the clearing prices from the LLM are inaccurate. However, the LLM's predictions are dramatically improved if the model can condition on the fitted structural causal model. In short, the LLM knows more than it can (immediately) tell.
Submitted to CSCW.
AAMAS 2026 ES Workshop.
As AI usage becomes more prevalent in social contexts, understanding agent-user interaction is critical to designing systems that improve both individual and group outcomes. We present an online behavioral experiment (N=243) in which participants play three multi-turn bargaining games in groups of three. Each game, presented in randomized order, grants access to a single LLM assistance modality: proactive recommendations from an Advisor, reactive feedback from a Coach, or autonomous execution by a Delegate. All three modalities are powered by an LLM with super-human performance within this negotiation setting. On each turn, participants privately decide whether to act manually or use the AI modality available in that game. We document a preference-performance misalignment: participants strongly prefer the higher-control Advisor (44%) over the Delegate (19%), yet groups only significantly increase collective surplus under Delegate access. Adjusting for voluntary non-compliance, delegating to the AI yields suggestive individual welfare gains, roughly 1.5x the intent-to-treat estimate. A mechanism analysis traces this gap to a human filter: AI-generated proposals create more joint surplus than manual proposals across all conditions, but in the Advisor and Coach modes users modify, override, or ignore the AI's suggestions, reverting toward human-baseline trade patterns. The Delegate advantage arises not from a different AI capability but from bypassing this filtering step altogether. Realizing these welfare gains depends not only on model capability, but on the interaction structure through which that capability is delivered. We argue that assistance modalities should be designed as mechanisms with endogenous participation; adoption-compatible interaction rules are a prerequisite to improving welfare with automated assistance.
NeurIPS 2024 (Workshop), EC 2024 (Poster).
This paper investigates the behavior of large language models (LLMs) in auctions, introducing a novel synthetic data-generating process to help facilitate the study and design of auctions. We use the method of moments to estimate human bidding strategies in the results of different experiments reported in the economics literature, for first-price and second-price auction formats and different value environments. We also consider empirical benchmarks from field studies of eBay-style proxy-bidding auctions. In comparing LLMs and humans, we find that LLMs reproduce several qualitative regularities emphasized in the auctions literature—systematic departures from equilibrium in IPV sealed-bid auctions, susceptibility to the winner's curse in common-value settings, and bid sniping under hard-close rules in eBay-style environments—and that LLM behavior is highly responsive to how the incentive mechanisms are presented: describing a strategy-proof sealed-bid auction via an ascending-clock threshold process makes bidding substantially more truthful. We further find that seemingly small design changes—such as moving from sealed-bid to clock-style implementations or modifying the closing rule—induce large and systematic shifts in LLM behavior (as with humans). However, even while summary statistics of LLM bids are similar to those of human bids, the bid distributions diverge in salient ways, revealing where human-likeness breaks down. In this way, we identify regimes in which LLM behavior resembles human behavior and where it does not—offering a methodological template for future work using LLMs as proxies for human behavior in auctions. We release the framework as a reproducible, open-source standard for evaluating economic reasoning in LLMs, flexible enough to run auction experiments with many different models and across a wide range of auction designs.
Peer-reviewed Conference Proceedings
ACM Conference on Intelligent User Interfaces (IUI), 2026.
Markets increasingly accommodate large language models (LLMs) as autonomous decision-making agents. As this transition occurs, it becomes critical to evaluate how these agents behave relative to their human and task-specific statistical predecessors. In this work, we present results from an empirical study comparing humans (N=216), multiple frontier LLMs, and customized Bayesian agents in dynamic multi-player bargaining games under identical conditions. Bayesian agents extract the highest surplus with aggressive trade proposals that are frequently rejected. Humans and LLMs achieve comparable aggregate surplus within their groups, but exhibit different trading strategies. LLMs favor conservative, concessionary proposals that are usually accepted by other LLMs, while humans propose trades that are consistent with fairness norms but are more likely to be rejected. These findings highlight that performance parity — a common benchmark in agent evaluation — can mask substantive procedural differences in how LLMs behave in complex multi-agent interactions.
ACM Conference on Human Factors in Computing Systems (CHI), 2024.
In our current visual-centric digital age, the capability to interpret, understand, and produce visual representations of data—termed visualization literacy—is paramount. However, not everyone is adept at navigating this visual terrain. This paper explores the barriers that individuals who misread a visualization encounter, aiming to understand their specific mental gaps. Utilizing a mixed-method approach, we administered the Visualization Literacy Assessment Test (VLAT) to a group of 120 participants drawn from diverse demographic backgrounds, which provided us with 1774 task completions. We augmented the standard VLAT test to capture quantitative and qualitative data on participants' errors. We collected participant sketches and open-ended text about their analysis approach, providing insight into users' mental models and rationale. Our findings reveal that individuals who incorrectly answer visualization literacy questions often misread visual channels, confound chart labels with data values, or struggle to translate data-driven questions into visual queries. Recognizing and bridging visualization literacy gaps not only ensures inclusivity but also enhances the overall effectiveness of visual communication in our society.
Selected Work in Progress
The theory of simplicity in economic design makes a claim about minds: that certain strategic problems are intrinsically hard to reason about, and that simple design can improve reasoning and recover good outcomes. Large language models (LLMs) represent a new kind of intelligence—neither human nor superhuman—and offer a rare opportunity: to test whether the cognitive constraints that drive simplicity in economic design are specific to humans or reflect something deeper. Existing results establish that LLM agents make mistakes participating in strategy-proof auctions such as the second-price sealed-bid auction, but improve play in simpler, obviously strategy-proof designs; we extend these findings across model families and to two-sided matching, and take this as evidence that the representation of a mechanism—how a game is presented, not just what it implements—shapes LLM reasoning in the same direction it shapes human reasoning. To understand where these cognitive constraints bind, we systematically test prompt-based interventions along three axes from this theory—contingent reasoning, forward planning, and belief formation—as well as descriptions of the mechanisms themselves to make their incentive properties transparent. We find that scaffolding for contingent reasoning, and making incentive properties transparent, substantially improve play, while prompting models to plan forward or to reason about others' beliefs consistently worsens it. The conceptual vocabulary of simple mechanism design, it appears, also describes the limits of an intelligence we did not build the theory for. Understanding why will matter as artificial agents enter economic life.
Selected Honors
- Google DeepMind Seed Fund, 2024
- Introduction to Technical AI Safety Fellowship, 2023
- Purcell Fellowship (Harvard), 2021
- Guo Moruo Scholarship (Highest honor for USTC undergraduate students), 2020
- Yan Jici Scholarship (Highest honor for Physics department undergraduate students), 2020
