I started my research journey in 2015, when I was a sophomore majoring in physics at Tsinghua University. At that time, the field of deep learning was just beginning to bloom. The success of AlexNet in 2012 had sparked a wave of enthusiasm, and by 2015 we were seeing rapid advances in image classification, object detection, and natural language processing. New model architectures and training techniques were being proposed constantly, and the community was eager to explore the potential of deep learning in various domains. There was also a direction of research parallel to deep learning: Spiking Neural Networks (SNNs). I was fascinated by the idea of mimicking the brain’s neural activity and decided to explore this area. As an undergraduate, I published my first paper on SNNs in 2017, on a learning algorithm that does not require gradient backpropagation. It gave me a lot of confidence and motivation to continue in research. However, I soon realized that this direction was not going to lead to fundamental breakthroughs in AI, since the work in this area was all trying to prove the equivalence between SNNs and ANNs, rather than proposing a fundamentally new learning paradigm.

Starting in 2017, as I began my Ph.D. studies in Computer Science at Tsinghua University, I shifted my focus to the robustness of deep learning models, particularly in the context of computer vision. The discovery of adversarial examples in 2014 had revealed a critical vulnerability in deep learning models, and I was intrigued by its implications: the models we had been building relied on “spurious features” that could be easily exploited by adversarial attacks, a problem caused by both the limitations of the model architecture and the training process. I wanted to understand why this was happening and how we could fix it. At that time, this field was still in its infancy; many methods were proposed to defend against adversarial attacks, but almost all of them were soon broken by stronger attacks. I knew that this was a fundamental problem, and I wanted to find a path toward a solution. A seemingly promising direction at the time was to use the brain as a source of inspiration, since the human visual system is known to be robust to adversarial perturbations. I spent a significant amount of time trying to understand the mechanisms of the human visual system and how they could be applied to improve the robustness of deep learning models. I talked to many neuroscientists and read a lot of neuroscience papers. I proposed and experimented with various biologically inspired architectures and training methods, but none of them achieved significant improvements in robustness under strong attacks, just like the other methods in the field. I eventually realized that the problem was not with the specific methods, but with the entire research paradigm. Brain-inspired methods did contribute to our understanding of the problem, but they were far from a solution, as solid theories and algorithms were still missing. Solving the robustness problem requires a fundamental shift in our understanding of how models learn and generalize, rather than just adding more “biological features” to the models. That period was the low point of my research journey: I barely published any papers for over two years. On the other hand, it was also the most enlightening, as it taught me the importance of finding the right research paradigm and the right research questions.
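To make the vulnerability concrete: the attacks that kept breaking these defenses can be remarkably simple. As a representative example (not one of my own methods), the fast gradient sign method (FGSM) of Goodfellow et al. perturbs an input $x$ with label $y$ by a single gradient step:

$$x_{\text{adv}} = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(f_\theta(x), y)\big)$$

With an $\epsilon$ small enough that the perturbation is imperceptible to humans, this single step is often enough to flip the prediction of an undefended model, which is exactly the “spurious features” problem in action.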

Since adversarial examples in the digital world are something of a “toy setting” and do not directly cause real-world harm, in 2020 I started researching their counterpart in the physical world, turning from a theoretical problem to a more practical one. With the clear goal of “red-teaming” the robustness of real-world deployed models, I found new research questions and proposed new methods for physical adversarial examples. This finally led to a fruitful research direction in terms of both academic and real-world impact, and two of the resulting papers were accepted as oral presentations at top-tier conferences. However, I knew that this was still not the end of the story. Demonstrating the real-world harms of adversarial examples does not bring us closer to a solution; it just rings the alarm bell louder.

After the rise of Large Language Models (LLMs) in 2022, many researchers in related fields shifted their focus to the safety and security of LLMs, a more immediate and practical problem. So did I, beginning with my postdoctoral research at UC Berkeley, where I have been researching and publishing papers on topics related to jailbreaking and prompt injection. However, the same pattern repeats: people propose defense methods, and they are soon broken by stronger attacks. The problem is still the same: we are treating the symptoms rather than the root cause. The root cause is that LLMs are still “brittle” models that rely on “spurious features” and “shortcuts” to achieve good performance on the training data. The scaling law tells us that the performance of LLMs on regular data will continue to improve as we scale up the model size and the training data. However, the robustness of LLMs shows no sign of improvement, and sometimes it even gets worse, as larger models are more capable in both good and bad ways. Of course, one can adopt an idea similar to “adversarial training” to directly optimize the robustness of LLMs, scaling up the training data with SFT or RL. However, as long as the attacker (the red team) moves second, it will always be able to find new vulnerabilities, such as exploiting longer context lengths or finding new attack scenarios (vulnerabilities in facts, reasoning, etc.). The space of possible attacks is nearly infinite, and covering it fully might require super-polynomial scaling. This applies not only to safety and security issues, but also to other types of robustness issues such as hallucination and out-of-distribution generalization. It reflects a fundamental limitation of the current learning paradigm, and we need a new one: not just a new architecture or a new training method, but a fundamentally new way of learning that leads to robust models by design. The potential solution might come from the direction of causal inference and invariant learning, combined with some form of world model, which aim to learn the underlying causal structure of the data rather than just the statistical correlations. But these lines of research are still in their early stages, and there are many open questions and challenges to be addressed. Moreover, it will require cross-disciplinary efforts, so that we can leverage insights from different subfields of AI, as well as from other disciplines such as neuroscience and cognitive science.
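To see why “the attacker moves second” is a structural problem rather than a matter of effort, consider the standard robust-optimization view of adversarial training (in the spirit of Madry et al.), stated here as a sketch rather than any specific LLM training recipe:

$$\min_\theta \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \max_{\delta \in \Delta} \mathcal{L}\big(f_\theta(x + \delta),\, y\big) \Big]$$

The inner maximization only ranges over a fixed, pre-declared threat model $\Delta$ (e.g., bounded pixel perturbations, or a fixed set of jailbreak templates). A real attacker is free to step outside $\Delta$, with longer contexts, new modalities, or entirely new attack scenarios, so the defender is always optimizing against yesterday’s game.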

As a researcher, I am always looking for the next big challenge, and I am excited to continue exploring the frontiers of AI research. My ultimate goal is to contribute to the development of Artificial General Intelligence (AGI) that is both smart enough and responsible enough to benefit humanity. I believe that the pursuit of robustness is not just about fixing vulnerabilities, but about probing the fundamental “knowledge and reasoning” capabilities of deep learning models.


I used LLMs to help refine the wording and structure of this post.