Haipeng Chen

Email: hpchen@seas.harvard.edu; haipengkeepon@gmail.com

I am a CRCS postdoctoral fellow at Harvard University, where I have worked with Professor Milind Tambe and Professor Hima Lakkaraju since July 2020. Previously, I was a postdoctoral researcher in the Department of Computer Science at Dartmouth College from July 2018, working with Professor V.S. Subrahmanian. I obtained my Ph.D. from the Interdisciplinary Graduate School (IGS), Nanyang Technological University (NTU) in 2018, advised by Professor Bo An. I received my B.S. in Physics from the University of Science and Technology of China (USTC) in 2013.

I study AI techniques that help us understand the world and benefit society. In particular, I am interested in reinforcement learning, graph/network-structured learning models, generative models, and predictive decision making. I am also interested in social aspects of AI systems such as interpretability, fairness, and security. I work in domains such as public health, cybersecurity, transportation, conservation, and online social networks.



  • I am co-organizing the Synthetic Data Generation: Quality, Privacy, Bias workshop @ICLR2021.
  • December, 2020 - Our paper, 'Active Screening for Recurrent Diseases: A Reinforcement Learning Approach' is accepted by AAMAS'2021!
  • December, 2020 - Our workshop proposal, 'Synthetic Data Generation: Quality, Privacy, Bias' is accepted by ICLR'2021!
  • December, 2020 - Invited to serve as Senior PC of IJCAI'2021!
  • December, 2020 - Our paper, 'EvaLDA: Efficient Evasion Attacks Towards Latent Dirichlet Allocation' is accepted by AAAI'2021!
  • September, 2020 - Organizing the Harvard CRCS workshop on Using AI for Social Good!
  • August, 2020 - Our team AMI (Bo An, Haipeng Chen, Xu He, Rundong Wang, and Youzhi Zhang; alphabetically ordered) reached the finals (top 25 out of 1,000+ teams) in the KDD'20 reinforcement learning competition track!


Last update: Dec-2019

Selected Research Projects

Governments face a worsening problem of traffic congestion in urban areas. To alleviate road congestion, a number of approaches have been proposed, among which Electronic Toll Collection (ETC) has been reported to be effective in many countries and areas (e.g., Singapore's ERP and Norway's AutoPASS). Tolls differ across roads and time periods, so that vehicles are indirectly steered toward less congested roads with lower tolls. However, although current ETC schemes vary tolls across time periods within a day, the tolls are predetermined and remain fixed for the same periods from day to day.

To make tolling fully dynamic, adaptive, and optimized, we handle dynamic tolling at the city scale, and model the dynamics of a tolled road system as a high-dimensional Markov Decision Process (MDP).

Figure 1 - A multi-agent deep RL model with graph convolutional networks representation of policy/value functions.

The state of the formulated MDP is the number of vehicles on each road heading to each destination, and the action is the toll on each road. Our solution to the formulated MDP is characterized by the following properties (see Figure 1 for an illustration):
  • A novel deep reinforcement learning (RL) approach that uses policy gradient with a Beta-distribution-based policy function.
  • A multi-agent RL perspective that decomposes the original centralized training into distributed training across agents.
  • A graph convolutional network (GCN) representation of the policy/value functions that exploits the graph structure of the traffic network.
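The appeal of a Beta-distribution policy is that, unlike a Gaussian, its support is naturally bounded, matching a toll that must lie in a fixed range. The minimal sketch below illustrates only this sampling idea; the linear state-to-parameter map, the state vector, and the toll range are placeholders standing in for the GCN and the real MDP state described above.

```python
import math
import random

def softplus(x):
    return math.log1p(math.exp(x))

class BetaTollPolicy:
    """Sketch of a per-road toll policy: the toll is drawn from a Beta
    distribution rescaled to [0, max_toll]. The linear state->parameter
    map here is a hypothetical stand-in for the GCN in the paper."""

    def __init__(self, state_dim, max_toll=5.0, seed=0):
        self.w_alpha = [0.1] * state_dim  # placeholder parameters
        self.w_beta = [0.1] * state_dim
        self.max_toll = max_toll
        self.rng = random.Random(seed)

    def params(self, state):
        # softplus(.) + 1 keeps alpha, beta > 1, so the density is
        # unimodal and bounded on (0, 1)
        a = softplus(sum(w * s for w, s in zip(self.w_alpha, state))) + 1.0
        b = softplus(sum(w * s for w, s in zip(self.w_beta, state))) + 1.0
        return a, b

    def sample(self, state):
        a, b = self.params(state)
        return self.rng.betavariate(a, b) * self.max_toll

policy = BetaTollPolicy(state_dim=3)
toll = policy.sample([4.0, 2.0, 1.0])  # e.g. vehicle counts per destination
```

A policy-gradient method would then score sampled tolls by the observed congestion cost and differentiate the Beta log-density with respect to the parameters; that training loop is omitted here.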


Other papers related to traffic control with reinforcement learning:

In complex multiagent systems, sequential decision making has long been a significant challenge with several key characteristics. One prominent characteristic is the interaction among agents. In many practical scenarios (e.g., the Stag-Hunt game), agents need to cooperate with each other to achieve a common goal with high rewards (e.g., hunting stags), while each individual agent is self-interested and might deviate from cooperation toward less risky goals with lower rewards (e.g., hunting rabbits). Another critical characteristic is uncertainty of various kinds. One type arises from the lack of accurate knowledge of the environment and the other agents, where the uncertain information can be modeled probabilistically. A more challenging type lies in environment-related factors that we do not know how to model.

These two challenges are significantly amplified in sequential decision making, where we must consider not only short-term rewards but also rewards in the long run, and therefore the subsequent effects of the current action, especially in a dynamic environment. A third crucial characteristic is the limited number of learning trials. On the Minecraft platform, one episode of the game usually takes several seconds to complete, so learning an effective policy is extremely time-consuming.

The Microsoft Malmo Collaborative AI Challenge (MCAC), designed to encourage research on problems in collaborative AI, builds on a Minecraft mini-game called “Pig Chase”. Pig Chase is played on a 9 × 9 grid where agents can either work together to catch the pig and achieve high scores, or give up cooperation and settle for low scores. After a certain number of episodes (e.g., 100), the agent with the highest average score wins the challenge. Despite its simple rules, this game exhibits all of the key characteristics stated above. Although numerous papers study sequential decision making in complex multiagent systems, they address only a subset of these characteristics. We hope to shed some light on solving this class of problems by presenting HogRider, the champion agent of the 2017 Microsoft Malmo Collaborative AI Challenge.

The solutions underlying HogRider are characterized by:

  • A novel agent-type hypothesis approach for dealing with uncertainty about the type of the other agent and the observation of the other agent's actions.
  • A novel Q-learning framework that learns an optimal policy against each type of agent, which 1) employs state-action abstraction to reduce the state-action space, 2) warm-starts the Q-functions with pre-training, and 3) exploits active-ε-greedy exploration of potentially more beneficial actions.
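For readers unfamiliar with the underlying machinery, here is a minimal sketch of tabular Q-learning with ordinary ε-greedy exploration. It is a generic illustration only: HogRider's state-action abstraction, warm-started Q-functions, and active-ε-greedy variant are not reproduced, and the single-state "chase vs. give up" example is invented for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps, rng=random):
    """With probability eps explore uniformly; otherwise act greedily."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One-step Q-learning: Q(s,a) += alpha * (TD target - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Toy single-state example: chasing the pig yields reward 1, giving up 0.
Q = defaultdict(float)  # a warm start would pre-fill this from pre-training
actions = ["chase", "give_up"]
for _ in range(200):
    for a in actions:
        q_update(Q, "s0", a, 1.0 if a == "chase" else 0.0, "s0", actions)
```

After training, `epsilon_greedy(Q, "s0", actions, eps=0.0)` returns the greedy action; during learning a nonzero `eps` keeps the agent trying alternatives, which is the role the active-ε-greedy scheme refines.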

Paper and media coverage:

Other papers at the intersections of Data, Machine Learning, and Game Theory:

The number of software vulnerabilities disclosed every year is staggering, leading to increasing risks for system security officers. Because patching is expensive (patch installation time, patch purchase cost, and the risk of disrupting production systems), many vulnerabilities go unpatched given the limited resources available for thousands of patching tasks. To help alleviate the situation, when a new vulnerability is discovered, a Common Vulnerabilities and Exposures (CVE) numbering authority (such as the MITRE Corporation) assigns a CVE number to that vulnerability, along with a brief description. The US National Institute of Standards and Technology (NIST) then studies the CVE and releases a severity score, as well as its associated attributes (e.g., attack vector, attack complexity), via the Common Vulnerability Scoring System (CVSS). However, we have observed that, on average, it takes NIST around 130 days to conduct this investigation and analysis. In this project, we extract online discussions of vulnerabilities (e.g., on Twitter) and predict i) when a vulnerability will be exploited, and ii) how severe that vulnerability is.
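To make the prediction task concrete, one simple baseline is to treat "will this vulnerability be exploited?" as binary text classification over the online discussions. The sketch below hand-rolls a tiny bag-of-words logistic regression; the vocabulary, example texts, and labels are all invented for illustration and do not reflect the project's actual features or data.

```python
import math
import re

# Hypothetical signal words one might look for in vulnerability chatter.
VOCAB = ["exploit", "poc", "rce", "patch", "critical"]

def featurize(text):
    """Binary bag-of-words over VOCAB."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [float(w in tokens) for w in VOCAB]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, lr=0.5, epochs=200):
    """Plain gradient descent on the logistic log-loss."""
    w, b = [0.0] * len(VOCAB), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Toy labeled examples (invented): 1 = exploited, 0 = not exploited.
texts = ["PoC exploit released, RCE confirmed", "vendor patch available, low risk"]
w, b = train([featurize(t) for t in texts], [1, 0])
score = sigmoid(sum(wi * xi for wi, xi in zip(w, featurize("new RCE exploit PoC"))) + b)
```

A realistic system would of course use far richer features (timing, account credibility, CVSS attributes) and a stronger model; this only shows the shape of the supervised formulation.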



C=Conference, J=Journal, D=Demo, W=Workshop, A=Arxiv, P=Patent







Awards & Certifications

Professional Services

Excellent Reading Materials