Yau Awards Archive 2020 — 2025

CHAPTER SEVEN 第七章

AI 工具在丘成桐中学科学奖中的合理使用 Using AI Tools Responsibly in the Yau Award

07 · From the Whitepaper, v22.0 · May 2026 摘自白皮书 v22.0 · 2026 年 5 月


进入 2023 年之后,几乎每一个进入丘奖总决赛的参赛团队,或多或少都在研究过程中使用过生成式 AI、机器学习模型或 AI 辅助的数据分析工具。然而与 ISEF 自 2024 年起对 AI 使用做出明确披露要求[1]不同,丘成桐中学科学奖目前并未在公开规则中专门列出关于 AI 使用的条款。这种「规则真空」并不意味着学生可以随意使用 AI;恰恰相反,由于评委(绝大多数为中外顶尖科学家及一线学者)对原创性、独立思考与数学严谨性有近乎苛刻的要求,AI 的滥用反而比在 ISEF 中更容易造成致命伤害。本章将系统梳理:丘奖评委对 AI 使用的真实态度、各学科 AI 使用的差异、研究流程中应该用哪些工具、如何在论文与答辩中合规披露 AI 协助,以及在评委追问下如何答得有底气。

为什么需要专门谈丘奖中的 AI 使用

第一,丘奖与 ISEF、英特尔奖(现 Regeneron STS)一脉相承,但更强调论文。 丘成桐先生在多个公开访谈中强调,丘奖「不是考试,是做一个论文」,整个比赛的核心是研究报告本身——所有创新性、严谨性、写作水平都凝结于这一份提交给国际评审委员会的文本。一旦评委发现论文中存在 AI 直接代笔的痕迹,无论实验数据多么扎实,都会被判定为学术不诚信。

第二,丘奖没有现场展板,但有英文答辩。 不同于 ISEF 的 trade-show 形式,丘奖总决赛在清华大学举行,全程使用英文答辩,由国际评审委员会主持。 答辩中评委会针对论文细节进行深度追问。若论文是 AI「润色」过的,但学生口头表达能力却与之严重不符,评委一眼就能看出落差。

第三,丘奖鼓励独立思考。 丘成桐本人在专访中说:「奥数是出个题目给你做,丘奖是自己出题目自己做。这是一个很重要的能力,做研究总是要自己找题目」,「在现有研究基础上,加上一点儿原创性的想法,就很好」。 这与「让 ChatGPT 帮我想个研究方向」形成尖锐对立。学生若把选题、文献综述、结果讨论这些研究核心环节统统外包给 LLM,本质上违背了丘奖的设立初衷。

第四,丘奖评审委员会成员的学术敏锐度极高。 顾问委员会与评审委员会中包括菲尔兹奖得主、诺奖得主、多位美国及中国两院院士。 这些学者长期处于学术诚信审查的第一线,对 AI 生成文本特征(过度对称的句式、空泛的形容词、不自然的过渡段、过于流畅的英文文献综述)以及 ML 项目中的常见漏洞(数据泄漏、过拟合、未交叉验证)有高度直觉。 简言之,AI 痕迹在丘奖现场比在大多数学生想象中要明显得多。

第五,规则真空既是机会也是风险。 当规则不明确时,学生可能误以为「不用披露就不算违规」。 但学术诚信是普适标准,不依赖于某次比赛的具体条款。 我们建议参赛者主动按 ISEF 的披露标准要求自己,把 AI 使用情况清晰写入论文的方法学(Methods)与致谢(Acknowledgements)部分。

丘奖评委对 AI 使用的真实态度

通过对过去三届(2023–2025)总决赛公开访谈、获奖论文及科学论坛[2]内容的梳理,我们整理出以下几点评委共识。这些共识并未出现在任何官方文件中,但反复出现在评委的口头反馈与媒体采访中:

  1. 评委默认你会用 AI,但不允许你不理解 AI。 2024 与 2025 两届计算机奖的金奖论文都以 AI 为核心(2024 金奖《LLM Mathematical Reasoning Grounded with Formal Verification》;2025 金奖《PV-Care: Using Low-Density EEG and AI to Provide Proactive Help for MCI》)。 评委对学生「用 AI」毫无成见,但对「学生不懂自己用的 AI 在做什么」零容忍。 一句「我用了 GPT-4o 跑了一下」在丘奖答辩现场是死亡回答。

  2. 评委会问到你的训练细节而非概念。 与一般想象不同,评委不会问「什么是 transformer」,他们默认你会。 他们会问:「你的数据是怎么切分的?训练集与验证集是否同分布?」「你用 cross-attention 的具体维度是多少,为什么这样设计?」「你模型在 baseline 上的提升是 1% 还是 10%?这个差异在你的样本规模下显著吗?」 这些问题考察的是真实操作经验,临阵抱佛脚很难蒙混过关。

  3. 数学奖评委对 AI 极度警惕。 数学奖的评委多为纯数学领域学者(丘成桐本人、李骏、肖杰、朱熹平等)。 他们最反感的就是「定理是 AI 帮我推的、证明是 AI 帮我写的」。 在数学奖的答辩中,被要求当场写出证明的每一步是常态,几乎所有数学奖金、银奖得主都能在白板上重现自己的关键引理。 这一点请所有想用 AI 推数学证明的学生引以为戒。

  4. 评委不只关心「用没用」,更关心「为什么用、为什么这样用」。 丘奖论文通常需要在答辩中展示研究动机(why this problem)、研究方法(why this approach)、研究意义(why this matters)。 如果学生回答「我用 ResNet 做图像分类」却答不出「为什么不用 ViT」「为什么不用更小的 MobileNet」,评委会判断学生只是在套用现成方案,缺乏研究品味。

  5. 评委对原创性的判定有自己的内部尺度。 丘成桐在专访中曾说,「在现有研究基础上,加上一点儿原创性的想法,就很好」。 这句话翻译成 AI 时代的话语:评委可以接受你用 AI 做了 90% 的工程实现,但你必须有 10% 是真正属于你自己的、AI 无法替代的洞察——可能是一个新的损失函数、一个新颖的数据预处理思路、一个对模型失败模式的独到分析。

各学科 AI 使用差异

丘奖六大学科对 AI 的接受度差异极大。 套用同一套 AI 使用策略到不同学科上,是导致大量入围作品停步于半决赛的隐形原因。

数学奖——AI 是辅助工具,不是合作者。 数学奖的核心是证明,而 LLM 在严肃证明上仍然不可靠,且经常产生幻觉。 评委对「AI 协助推导」抱有结构性怀疑。 可以接受的使用:用 SymPy/Mathematica 验证手推公式;用 Lean、Coq 或 Isabelle 等形式化证明系统对关键步骤进行机器验证(2024 年金奖论文正是这一思路);用 LLM 检索相关文献。 不可接受的使用:将 LLM 生成的证明粘贴入论文;让 LLM 替你「想」一个数学问题。

物理奖——AI 在数据分析中合理,在理论建模中要慎重。 物理奖既有理论方向,也有大量实验/数值方向。 评委可以接受用 PINN(物理信息神经网络)、用 CNN 处理实验图像、用 ML 做参数反演——2025 年物理奖入围奖《PINN-LOCK: Efficient Density Current Simulation via Physics-Constrained Loss》即为代表。 但要警惕:如果你的物理建模本身是「让 GPT 列了几个方程然后我代了数」,评委会通过追问偏微分方程的边界条件、量纲分析或哈密顿量来揭穿。

化学奖——计算化学方向欢迎 AI,实验化学方向 AI 价值有限。 化学奖的评委对 DFT 计算、MOF/COF 设计、分子动力学模拟中使用机器学习评分函数完全没有意见——2024 年化学奖优胜论文《Computational Screening and Design of Metal-Organic Frameworks for CO2 Separation》就是范本。 但若你的研究是合成化学或材料制备,AI 的角色应该限制在文献检索与表征数据处理上,不应替代真实实验。

生物奖——AI 几乎已成标配,但要管好「数据来源」这一关。 2024、2025 连续两届生物奖中均出现 AI 辅助药物设计、单细胞分析、医学影像分析等论文(如 2024 优胜《AI-Guided Design and Preliminary Validation of Anti-Tuberculosis Subunit Vaccine》)。 生物奖评委对 AI 工具本身很熟悉,但他们最关心的是:你的训练数据从哪里来?是否有伦理审批?是否处理了 batch effect 与 confounding?是否做了独立验证集?这些问题答不出,AI 再花哨也没用。

计算机奖——AI 就是这个学科本身。 计算机奖近三届金奖、银奖几乎全是 LLM、生成式模型、Agent 系统、多模态学习方向(2025 银奖《CraftMesh》使用泊松融合做生成式 3D 网格操作;铜奖中有基于 LLM 多智能体博弈的研究)。 评委默认你会调用 OpenAI/HuggingFace API、会用 PyTorch、会做 fine-tuning。 在这个学科里,敢用 AI 不是加分项,而是入场券。 真正的加分点是:在已有大模型生态之外做出独立贡献——比如新的训练范式、新的评测基准、对模型行为的新解释。

经济金融建模奖——AI 在数据处理与建模中合理,在经济推理中危险。 经济金融建模奖里,使用机器学习做股票收益预测、用 NLP 处理财报文本、用强化学习做交易策略都已属常态——2025 金奖《Firm-Level Impacts of Artificial Intelligence on Labor Demand》直接把 AI 本身作为研究对象。 但评委对一类 AI 使用极度反感:让 LLM 替学生「推理」经济学机制(如「让 GPT 帮我解释为什么这个变量与那个变量相关」)。 经济学的核心是因果识别(causal identification),这件事 LLM 目前完全做不来,评委一问你的工具变量是什么、平行趋势假设是否成立,AI 给你的答案立刻露馅。

研究流程中的 AI 工具栈

下面按照丘奖标准的研究流程——选题、文献综述、实验/算法实现、数据分析、论文写作——分阶段推荐工具。 推荐的依据是 2024–2025 年获奖学生反馈与 Thinker Education 教研团队的实操经验。

阶段一:选题与可行性评估

  • Perplexity(perplexity.ai):与传统搜索引擎相比,Perplexity 会给出带引用来源的答案,适合快速判断「我这个想法在文献中是否已有人做过」。 注意:Perplexity 的引用质量参差,需要逐条点开核实。

  • Elicit(elicit.com):专注于学术文献的 AI 工具,可以输入研究问题后返回相关论文清单与每篇论文的关键发现摘要,对快速判断方向的成熟度非常有用。

  • Consensus(consensus.app):用于对某个具体科学命题(如「Omega-3 对认知改善有效吗」)寻找文献中正反两方的证据。 经济金融建模奖与生物奖选题时尤其好用。

  • ChatGPT / Claude 用于头脑风暴:可以向 LLM 描述你的兴趣方向,让它列举 10 个可能的研究问题,然后筛选。 但要切记:LLM 的建议只是起点,最终选题必须经过你与指导老师反复论证,不能拿来即用。

使用建议:选题阶段的 AI 使用边界是「帮助你发散」,而不是「替你决策」。 把 LLM 当作一个不知疲倦的讨论伙伴,但每一个候选选题必须经过你自己的可行性评估(实验设备、时间、知识储备)之后才能确定。

阶段二:文献综述

  • Connected Papers(connectedpapers.com):输入一篇种子论文,自动绘制相关论文的网络图,对于发现「同一研究脉络上的关键文献」非常高效。

  • Semantic Scholar / Google Scholar:传统但仍然最权威的学术搜索工具。 Semantic Scholar 提供 AI 生成的论文 TLDR 摘要,可大幅提升筛选效率。

  • Notion AI / Obsidian + Smart Connections 插件:将精读过的论文笔记导入 Notion 或 Obsidian,借助 AI 在自己的笔记库中做语义搜索,帮助你在写作时快速定位「我之前读到过哪一篇说了类似的事」。

  • NotebookLM(Google):可以将多篇 PDF 文献上传,LLM 在你限定的语料范围内回答问题。 比让 ChatGPT 「裸答」可靠得多,且引用清晰,便于核对。

使用建议:文献综述的核心目标是「让评委看出你真的读过这些论文」。 任何一篇被你引用的文献,你应能在答辩中复述其核心方法、关键结论与局限性。 用 AI 工具加速筛选可以,但精读环节没有任何可以走的捷径。

阶段三:实验/算法实现

  • Cursor / Windsurf(AI 代码编辑器):集成了 GPT-4/Claude 的代码补全与对话能力,适合从零搭建一个完整工程(数据预处理、模型训练、可视化)。

  • GitHub Copilot:行级与函数级补全,对熟悉某个 stack(PyTorch、scikit-learn 等)的学生提速明显。

  • Claude / ChatGPT 用于 debug:遇到难以定位的报错,把堆栈贴给 LLM 让它分析,比独自盯屏幕高效。 但要主动核对 LLM 的解释,LLM 有时会自信地给出错误归因。

  • HuggingFace Transformers / Diffusers:对于需要预训练大模型的计算机奖、生物奖项目,HuggingFace 是事实上的工具源。

  • Wolfram Alpha / Mathematica:数学奖、物理奖在符号计算环节的标准工具。 评委对此完全接受。

使用建议:所有 AI 生成的代码,你都必须自己看懂每一行——尤其是模型定义、损失函数与训练循环。 答辩时评委可以指着你 PPT 中的某一段伪代码问你「这一步在做什么」,回答不出来等同于宣告这段代码不是你的。

阶段四:数据分析

  • Pandas + Matplotlib + Seaborn:数据清洗与可视化的核心栈。 可以让 ChatGPT 帮你写绘图代码,但所有图表的统计学意义必须由你自己解读。

  • Statsmodels / R + tidyverse:经济金融建模奖中做计量经济回归、显著性检验时优先使用 statsmodels 而非 sklearn——前者输出标准误、p 值、稳健性诊断,是评委关心的。

  • ChatGPT Code Interpreter(Advanced Data Analysis):可以上传 CSV 让 ChatGPT 直接执行 Python 分析。 适合做探索式数据分析(EDA),但任何写入论文的统计结果都必须在本地脚本中复现,不能直接复制 ChatGPT 的输出。

  • Weights & Biases(wandb.ai):用于深度学习实验的训练日志与超参数对比。 不是 AI 工具,但是答辩时展示你做了系统化调参的最佳证据。

使用建议:数据分析阶段最容易出现的 AI 滥用,是让 LLM「解释这些数字代表什么」。 评委对此类「事后合理化」非常敏感——你需要在做分析之前就形成假设,分析之后用数据验证或推翻假设。 把统计推断让给 AI 等于放弃了研究的灵魂。

阶段五:论文写作与英文润色

  • Grammarly / LanguageTool:用于语法、拼写、句式检查。 这一层 AI 使用没有任何争议,评委也不会反感。

  • DeepL / 谷歌翻译:对于母语为中文的学生,可以先用中文写出初稿,再用 DeepL 翻译并自行修改。 切勿一稿到底依赖机器翻译——丘奖论文的英文质量是评委对学生综合素养判断的重要依据。

  • ChatGPT / Claude 做表达润色:可以让 LLM 对某一段做风格润色,但不可让其改写或重组段落结构。 一个稳妥的提示词是:「请保留我的论证结构与术语选择,仅在语法与流畅度上做最小幅度的修改。」

  • Paperpal / Writefull:针对学术英文做了专门训练的润色工具,比通用 LLM 更注重学术语境(如时态、被动语态规范)。

使用建议:论文的核心论证、章节框架、关键句子(尤其是摘要的最后一句、引言的研究空白定位、结论中的「我们的贡献是…」)必须出自你自己。 AI 可以帮你「写得更好」,但绝不能替你「想出该写什么」。

披露与诚信:在丘奖中应如何表述 AI 使用

虽然丘成桐奖目前没有强制披露条款,但我们强烈建议参赛者参照 ISEF 2024 起的标准,主动在论文中披露 AI 使用情况。 这不仅是学术诚信的体现,也能在评委追问时让你处于完全主动的位置。 具体披露的位置与措辞建议如下。

致谢部分(Acknowledgements):对所有非核心的 AI 协助统一披露。 推荐措辞示例:

We acknowledge the use of ChatGPT (GPT-4, OpenAI) and Claude
(Anthropic) for language polishing and code-level debugging
assistance throughout this project. All algorithmic design,
experimental decisions, data analysis, and core conclusions
reported in this paper are the work of the authors. Generative
AI was not used to formulate research questions, generate
data, or derive theoretical results.

方法学部分(Methods):当 AI 工具是研究方法的实质组成部分时,须在 Methods 中具体说明模型、版本、训练参数、评估指标。 例如:

For semantic classification, we fine-tuned the BERT-base-uncased
model (Devlin et al., 2019) on a custom dataset of 4,217
manually annotated samples. Training used AdamW optimizer
(learning rate 2e-5, batch size 16) for 5 epochs on an
NVIDIA A100 GPU. Model performance was evaluated using
five-fold cross-validation with macro-F1 as the primary
metric.

在答辩中如何回应「你用 AI 了吗」:应避免两种极端反应——一种是慌张否认(容易被识破),另一种是过度坦白(把每件小事都报告一遍,给评委「你的工作绝大部分是 AI 做的」的印象)。 推荐的回应框架是「分层披露 + 强调独立贡献」:

是的,在 X 环节我使用了 {tool} 来 {specific task},主要原因是 {reason}。 不过 {core insight / decision} 是我经过 {specific reasoning process} 后做出的,AI 在这一环节没有起决定作用。如果您想了解这一部分的细节,我可以现场推导/演示。

这套话术的核心是把球踢回到你最有把握的细节上,让评委看到你随时可以脱离 AI 完成自己的工作。

评委追问的典型问题与应对(沙盘推演)

下面整理了 2023–2025 三届总决赛中评委高频追问的 AI 相关问题,每题给出「推荐方向」与「应避免的陷阱」,其中前四题另附「不应回答」的示例。 建议参赛学生在赛前对照自己的项目逐题预演。

  1. 问:「你用 ChatGPT 写论文了吗?」

    • 不应回答:「没有,完全没用过。」(一旦评委从行文风格中捕捉到 LLM 特征,将直接判定不诚信。)

    • 推荐方向:坦承用于语法与表达润色,明确指出研究设计、文献综述、结果讨论均出自本人,并主动说明你校核了 LLM 修改后的每一处变化。

    • 陷阱:把「我让 GPT 帮我改 abstract」也说成「我没用 AI」,这种小事被揭穿后会拖累整体可信度。

  2. 问:「你这个模型为什么会给出这个结果?」

    • 不应回答:「这是模型自己学出来的。」

    • 推荐方向:从模型架构(如 attention 机制如何捕获长距离依赖)、训练数据特征、损失函数设计三个层次给出机制性解释,必要时辅以 attention map 或 saliency 可视化。

    • 陷阱:把「黑箱」当作答案。 评委的潜台词是:你是否真的理解你训练的东西?

  3. 问:「如果不让你用 AI,你还能完成这个研究吗?」

    • 不应回答:「不能。」或「完全可以。」(前者贬低自己,后者显得在撒谎。)

    • 推荐方向:坦诚指出 AI 在哪几个具体环节大幅压缩了时间(如代码实现、文献筛选),但研究的核心思想——问题定义与方法选择——是你独立完成的;如果没有 AI,时间成本会显著增加,但研究的本质结论不会改变。

    • 陷阱:表现出对 AI 的过度依赖。

  4. 问:「你的训练数据是从哪里来的?标签是怎么得到的?」

    • 不应回答:「就是从网上下载的。」

    • 推荐方向:说明数据集来源(公开数据集请引用其原始论文及 license;自采数据请说明采集协议与样本规模),标签获取方式(人工标注请说明标注人数与一致性度量,弱监督请说明启发式规则),并主动谈及数据偏差与覆盖范围的局限。

    • 陷阱:忽视数据合规与隐私问题。 涉及人体数据时,伦理审批是评委最常追问的点。

  5. 问:「你的 baseline 是什么?为什么选这些 baseline?」

    • 推荐方向:至少给出三层 baseline:一个朴素方法(如随机猜测或线性回归),一个传统强基线(如 XGBoost、SVM),一个本领域近期 SOTA。 说明你选择这三个 baseline 是为了从不同维度证明你方法的提升不是来自调参或偶然。

    • 陷阱:只与一个弱 baseline 对比,或与不公平的 baseline 对比(如让自己的方法用大模型而 baseline 用小模型)。

  6. 问:「你怎么验证 AI 没有出错?」

    • 推荐方向:从三方面回答——内部验证(交叉验证、消融实验)、外部验证(独立测试集或外部数据集)、机制验证(attention/feature attribution 是否符合领域常识)。

    • 陷阱:只报告训练集精度。 评委一听就知道你没有真正做过 generalization 测试。

  7. 问:「如果我把你的方法搬到 {相关但不同的领域},它还能用吗?」

    • 推荐方向:分析你的方法在哪些假设下成立(如数据分布、样本规模、特征类型),并诚实指出在新领域可能需要的修改(如重新预训练、调整 architecture、引入领域知识)。

    • 陷阱:盲目宣称「适用所有领域」。 这是经典的过度自信信号。

  8. 问:「你引用的这篇论文,能不能用一句话总结它的核心贡献?」

    • 推荐方向:精读过的论文应能在 15 秒内说出其问题、方法与结论。 凡是引用的文献都要做到这一点。

    • 陷阱:用 LLM 帮你筛选并引用了大量你没读过的论文。 评委随便抽一篇就能戳穿。

  9. 问(数学奖专属):「你这个引理 / 定理能不能在白板上重写一遍证明?」

    • 推荐方向:直接拿笔上前推导。 不要犹豫,不要拒绝。

    • 陷阱:含糊地说「证明很长,我现在记不全」。 这相当于宣告这不是你的工作。

  10. 问(经济金融建模奖专属):「你的因果识别策略是什么?工具变量是否满足排他性约束?」

    • 推荐方向:明确指出识别策略(DID、IV、RDD 或随机实验),交代关键假设,给出对假设的稳健性检验(placebo、bandwidth sensitivity 等)。

    • 陷阱:把简单回归的相关性解读为因果。 这是经济金融建模奖中评委最高频的扣分点。

从过往获奖论文看 AI 应用范式

为了让上述抽象建议具象化,我们从 2024、2025 两届获奖论文中选取若干具有代表性的 AI 应用案例进行简析。 以下分析仅基于论文标题与公开摘要,不对未公开的论文方法细节做臆测。

案例一:2024 计算机奖金奖《LLM Mathematical Reasoning Grounded with Formal Verification》(华润小径湾贝赛思国际学校,何坤朗,指导老师:上海交大严骏驰)。 这是一个堪称范式级的 AI 项目:研究者并不简单地「调用 GPT 做数学题」,而是将 LLM 的推理过程与 Lean 等形式化证明系统结合,让 LLM 给出的每一步推理都接受机器验证。 这种「LLM 提出候选 + 形式化系统过滤」的架构正是 2023 年以来学术界(如 DeepMind AlphaProof)的前沿方向。 范式启示:不要把 LLM 当成黑箱终点,把它当作一个需要被监督的中间组件,是高分项目的共同特征。

案例二:2025 计算机奖金奖《Beyond Reactive Assistance: PV-Care Using Low-Density EEG and AI》(上海中学国际部)。 这一项目用低密度 EEG 信号训练模型,主动识别轻度认知障碍患者的状态并提前介入。 核心贡献并非「我用了一个 transformer」,而是把 EEG 信号处理、AI 推理与具体临床场景(MCI 患者的预防性辅助)耦合在一起。 范式启示:评委最看重的是 AI 解决了一个具体的、有社会意义的问题,而不是技术本身的复杂度。

案例三:2025 计算机奖银奖《CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion》(合肥安生学校)。 这一项目将经典图形学方法(泊松融合)与生成式 AI 结合,解决三维网格编辑中的接缝问题。 范式启示:把领域知识(这里是几何处理)与生成式 AI 结合,往往比纯 AI 方法更受评委青睐——因为它显示出学生既懂传统方法又懂前沿模型。

案例四:2025 物理奖银奖《Development of a High-Efficiency Objective-Prism Stellar Spectrograph And Construction of its Dedicated AI Classification Model》(北师大附属实验中学)。 这一项目是「硬件 + AI」的典范:学生自己搭建了一台物端棱镜光谱仪(硬核物理工程),并训练了 AI 分类模型对采集到的恒星光谱进行自动归类。 范式启示:物理奖中纯 AI 的项目较难胜出,但当 AI 成为「让你的物理实验成倍提速」的工具时,会成为巨大加分项。

案例五:2024 生物奖优胜《AI-Guided Design and Preliminary Validation of Anti-Tuberculosis Subunit Vaccine》。 这一项目展示了 AI 辅助药物/疫苗设计的标准范式:用 AI(蛋白结构预测、表位预测、亲和力计算)筛选候选,再通过湿实验验证若干个 top candidates。 范式启示:评委非常看重「AI 筛选 + 真实实验验证」的闭环——只有计算结果没有实验,会被视为「纸上谈兵」;只有实验没有计算,则被视为「没有用上 AI 时代的方法论」。

案例六:2025 经济金融建模奖金奖《Firm-Level Impacts of Artificial Intelligence on Labor Demand: Evidence from Online Job Postings》(上海星河湾双语学校)。 这一项目把 AI 本身作为研究对象,用网络爬取的招聘文本构建数据,用 NLP 模型识别岗位的 AI 相关性,再做计量识别。 范式启示:经济金融建模奖中,AI 作为研究对象 + 计量识别作为研究方法,是近两年公认的高分组合。

常见误区

误区一:用 AI 越多,论文越「现代」。 评委见过太多堆砌 transformer、扩散模型、强化学习的论文。 模型本身的「现代性」并不能带来加分;研究问题是否值得做、AI 是否解决了问题,才是判定标准。 一篇用 logistic regression 解决真问题的论文,比一篇用 100 亿参数大模型炫技但不解决问题的论文要强得多。

误区二:评委不懂 AI,所以可以糊弄。 丘奖顾问委员会与评审委员会中有大量计算机、统计、生物信息背景的教授,他们对 AI 的了解程度远高于绝大多数高中生。 即便是数学、物理领域的纯理论评委,在日常科研中接触 AI 的频率也远高于学生想象。 想用术语堆砌或刻意复杂化来糊弄评委,结果只会被反问得更深、更狠。

误区三:让 LLM 替我写中文 / 英文摘要,没人能发现。 LLM 生成的文本有显著特征:过度对称的句式、空泛的形容词(「significant」「novel」「remarkable」高频出现)、不自然的过渡词(「moreover」「furthermore」连用)、避免具体数字的倾向。 评委——尤其是常年带博士生的学者——对这些特征非常敏感。 一旦摘要的英文风格与你答辩时的口语水平形成明显落差,怀疑就已经产生。

误区四:用了开源模型就一定合规。 开源模型并不意味着可以无条件使用。 不同模型的 license 差异极大:LLaMA 系列有商业使用限制;HuggingFace 上很多模型禁止用于医疗、法律等高风险场景;某些数据集(如 ImageNet)禁止商业用途。 评委不会查每一个 license,但若涉及发表或后续推广,license 问题会成为隐患。

误区五:AI 工具的使用不需要在参考文献中标注。 对于实质性参与研究的 AI 工具(如某个特定的预训练模型、某个数据分析平台),应该像引用学术论文一样在参考文献中标明。 例如使用 BERT 模型应引用 Devlin et al. (2019),使用 GPT-4 可引用 OpenAI 的 technical report。 对于通用润色(语法检查等),可在致谢中合并说明。

误区六:AI 给我的统计结果可以直接放进论文。 ChatGPT 的 Code Interpreter、Claude 的 Analysis 模式都可以输出统计结果,但这些结果可能存在浮点精度问题、版本差异、未公开的预处理步骤。 任何写入论文的统计数字都应在你本地 Python/R 脚本中独立复现,并存档完整的可执行 notebook,以备评委追问。

自查清单:提交论文前的 AI 使用合规自查

在你最终提交论文给丘奖之前,请逐条核对以下清单。 任何一项打不上勾,都应回去补救。

  1. 我能解释论文中每一段 AI 协助过的文字,并能在不借助 AI 的情况下用自己的话重述其内容。

  2. 我能在白板上推导论文中所有关键公式 / 算法步骤,无需查阅任何资料。

  3. 我能脱离 PPT,仅凭口头讲述完整地汇报本研究的研究问题、方法、结果与结论。

  4. 我能精确说出训练数据的来源、规模、清洗规则、标签获取方式与潜在偏差。

  5. 我能精确说出模型的架构、参数量级、训练超参数(学习率、batch size、epochs)与训练设备。

  6. 我做了至少一次完整的交叉验证或独立测试集评估,结果与论文中报告的指标一致,且我保留了完整的训练日志(如 Weights & Biases 链接或本地 CSV)。

  7. 我至少与三个不同强度的 baseline(朴素、传统强基线、近期 SOTA)进行了对比,并能在答辩中说明为什么选这些 baseline。

  8. 我在论文的致谢与/或方法学中明确披露了 AI 工具的使用范围(具体到工具名称、版本与用途)。

  9. 我引用了所使用的预训练模型 / 数据集 / 算法的原始论文,未把 AI 生成的引文直接复制粘贴。

  10. 我已针对「AI 在你研究中扮演什么角色」「不用 AI 你还能做到吗」等开放式问题准备好了回答框架。

  11. 我对论文中任何 AI 生成的图表与数字都做了本地复现,并保存了可执行脚本。

  12. 我的论文摘要、引言开头与结论的关键句子,是我亲手写的,没有让 AI 改写或重组结构。

结语。 AI 在丘奖中的角色,正在从「需要解释为什么用」转向「需要解释为什么不用」。 但即便如此,评委关注的核心从未改变——你是否真正提出了一个值得回答的问题,你是否真正理解你所用的方法,你是否在原创性上做出了哪怕一点点真正属于你自己的贡献。 工具是新的,标准是旧的。 这正是丘成桐奖在 AI 时代仍然具备甄别能力的根本原因。


  1. ISEF 2024 起新增的 AI 披露要求详见美国科学与公众学会(Society for Science)官网。

  2. 2024 年丘成桐科学论坛及答辩现场记录参见 https://www.yau-awards.com/show-89-5.html 与 https://www.yau-awards.com/show-89-6.html

Since 2023, almost every team reaching the Yau Award grand final has, to a greater or lesser degree, used generative AI, machine-learning models, or AI-assisted data-analysis tools in the course of its research. Yet unlike ISEF, which since 2024 has imposed an explicit disclosure requirement for AI use[1], the S.T. Yau High School Science Award currently has no clause on AI use in its public rules. This "rule vacuum" does not mean students may use AI as they please; quite the opposite. Because the judges (overwhelmingly top scientists and front-line scholars from China and abroad) hold exacting standards for originality, independent thinking, and mathematical rigor, misuse of AI can do fatal damage more easily than at ISEF. This chapter reviews the judges' actual attitudes toward AI use, how AI use differs across subjects, which tools to use at each stage of the research, how to disclose AI assistance properly in the paper and the defense, and how to answer with confidence under a judge's probing.

Why AI Use in the Yau Award Needs Its Own Discussion

First, the Yau Award is of a piece with ISEF and the Intel Awards (now Regeneron STS), but it places more emphasis on the paper. In several public interviews, Mr. Yau has stressed that the Award is "not an exam, but the writing of a paper" — the core of the whole competition is the research report itself, in which all originality, rigor, and quality of writing are crystallized in a single text submitted to the international review committee. Once judges find traces of AI directly ghost-writing the paper, then no matter how solid the experimental data, it will be deemed academic dishonesty.

Second, the Yau Award has no on-site poster, but it does have an English defense. Unlike ISEF's trade-show format, the Yau grand final is held at Tsinghua University, conducted entirely in English and chaired by the international review committee. In the defense, judges probe deeply into the details of the paper. If a paper has been "polished" by AI but the student's oral ability falls badly short of it, the judges will see the gap at a glance.

Third, the Yau Award encourages independent thinking. In an interview Yau himself said: "In the math olympiad, a problem is set for you to solve; in the Yau Award, you set the problem and solve it yourself. This is a very important ability — in research you always have to find your own problem," and "building on existing research, adding just a little originality of your own, is already very good." This stands in sharp opposition to "let ChatGPT think of a research direction for me." A student who outsources the core research steps — topic selection, literature review, results discussion — entirely to an LLM has fundamentally betrayed the Award's founding purpose.

Fourth, the members of the Yau review committee have extremely keen academic instincts. The advisory and review committees include Fields and Nobel laureates and many members of the U.S. and Chinese academies. These scholars have long worked on the front line of academic-integrity review and have a strong intuition for the features of AI-generated text (overly symmetric sentence structures, vague adjectives, unnatural transition paragraphs, an over-fluent English literature review) and for the common pitfalls of ML projects (data leakage, overfitting, lack of cross-validation). In short, traces of AI are far more obvious on the Yau stage than most students imagine.

Fifth, the rule vacuum is both an opportunity and a risk. When the rules are unclear, students may mistakenly think "if I don't have to disclose, it isn't a violation." But academic integrity is a universal standard, not dependent on the specific terms of any one competition. We advise entrants to hold themselves to ISEF's disclosure standard on their own initiative and to write their AI use clearly into the Methods and Acknowledgements sections of the paper.

The Judges' True Attitudes Toward AI Use

By reviewing public interviews from the past three grand finals (2023–2025), the winning papers, and the content of the science forums[2], we have distilled the following points of consensus among the judges. None of them appears in any official document, but all recur in the judges' verbal feedback and media interviews:

  1. Judges assume you will use AI, but they will not tolerate your failing to understand it. The gold-prize computer-science papers of both 2024 and 2025 put AI at their core (2024 gold: LLM Mathematical Reasoning Grounded with Formal Verification; 2025 gold: PV-Care: Using Low-Density EEG and AI to Provide Proactive Help for MCI). Judges have no prejudice against a student "using AI," but zero tolerance for a student who does not understand what that AI is doing. An answer like "I just ran it through GPT-4o" is fatal in a Yau defense.

  2. Judges will ask about your training details, not concepts. Contrary to common belief, judges will not ask "what is a transformer" — they assume you know. They will ask: "How did you split your data? Are the training and validation sets from the same distribution?" "What exactly are the dimensions of your cross-attention, and why design it that way?" "Is your model's improvement over the baseline 1% or 10%? Is that difference significant at your sample size?" These questions probe real hands-on experience; last-minute cramming rarely gets past them.

  3. Mathematics judges are extremely wary of AI. The mathematics judges are mostly pure mathematicians (Yau himself, Jun Li, Jie Xiao, Xiping Zhu, and others). What they most dislike is "the AI helped me derive the theorem, the AI helped me write the proof." In a mathematics defense, being asked to write out every step of a proof on the spot is the norm; almost all gold- and silver-prize winners in mathematics can reproduce their key lemmas at the whiteboard. Any student tempted to have AI derive a mathematical proof should take this as a warning.

  4. Judges care not only whether you used AI, but why you used it and why you used it that way. A Yau paper usually needs to show, in the defense, its research motivation (why this problem), its method (why this approach), and its significance (why this matters). If a student answers "I used ResNet for image classification" but cannot say "why not ViT" or "why not a smaller MobileNet," the judges will conclude the student is merely applying an off-the-shelf solution and lacks research taste.

  5. Judges have their own internal yardstick for originality. Yau once said in an interview, "building on existing research, adding just a little originality of your own, is already very good." Translated into the language of the AI era: judges can accept that you used AI for 90% of the engineering implementation, but you must have 10% that is truly your own, that the AI cannot replace — perhaps a new loss function, a novel data-preprocessing idea, or an original analysis of the model's failure modes.
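The question in item 2 about whether a gain over the baseline is "significant at your sample size" can be rehearsed with a small calculation. The sketch below is one possible way to do it, not a method the judges prescribe: it applies an exact McNemar test to paired per-example correctness flags. The function names and all the counts are hypothetical and chosen only for illustration.

```python
from math import comb


def mcnemar_exact(baseline_correct, model_correct):
    """Two-sided exact McNemar p-value from paired per-example correctness flags."""
    if len(baseline_correct) != len(model_correct):
        raise ValueError("both systems must be scored on the same test examples")
    pairs = list(zip(baseline_correct, model_correct))
    # Only discordant pairs (one system right, the other wrong) are informative.
    model_only = sum(1 for b, m in pairs if m and not b)
    baseline_only = sum(1 for b, m in pairs if b and not m)
    n = model_only + baseline_only
    if n == 0:
        return 1.0
    k = min(model_only, baseline_only)
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_flags(both, baseline_only, model_only, neither):
    """Build illustrative correctness lists from the four cell counts."""
    baseline = [True] * (both + baseline_only) + [False] * (model_only + neither)
    model = [True] * both + [False] * baseline_only + [True] * model_only + [False] * neither
    return baseline, model


# Hypothetical numbers: accuracy goes from 80% to 83% on 100 examples, then on 1,000.
small = mcnemar_exact(*paired_flags(78, 2, 5, 15))
large = mcnemar_exact(*paired_flags(780, 20, 50, 150))
print(f"n=100: p={small:.3f}   n=1000: p={large:.5f}")
```

With these made-up counts, the same three-point gain is far from significant on 100 test examples and clearly significant on 1,000, which is the distinction the judges' question is probing.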

Differences in AI Use Across Subjects

The six Yau subjects differ enormously in how much AI they accept. Applying one and the same AI strategy across different subjects is a hidden reason why a great many shortlisted entries stall at the semi-final.

Mathematics — AI is an auxiliary tool, not a collaborator. The core of the mathematics award is proof, and LLMs remain unreliable at serious proofs and often hallucinate. Judges are structurally skeptical of "AI-assisted derivation." Acceptable uses: verifying hand-derived formulas with SymPy/Mathematica; machine-verifying key steps with formal-proof systems such as Lean, Coq, or Isabelle (the 2024 gold-prize paper was exactly this approach); using an LLM to search the literature. Unacceptable uses: pasting an LLM-generated proof into the paper; letting an LLM "think up" a mathematical problem for you.
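To make "machine-verifying a key step" concrete, here is a deliberately tiny sketch in Lean 4 that uses only the core library (no Mathlib). The lemma and its name are illustrative stand-ins, not taken from any award paper; a real project would state its own key lemma in the same style and let the proof checker accept or reject it.

```lean
-- Toy key step: the sum of two even natural numbers is even.
-- `even_add_even` is a hypothetical name chosen for this example.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ m, b = 2 * m) :
    ∃ n, a + b = 2 * n :=
  match ha, hb with
  | ⟨k, hk⟩, ⟨m, hm⟩ =>
    -- Witness n = k + m; rewriting leaves 2 * k + 2 * m on both sides.
    ⟨k + m, by rw [hk, hm, Nat.mul_add]⟩
```

The point of such a check is that the student still has to state the lemma and choose the witness; the system only confirms that each step follows.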

Physics — AI is reasonable in data analysis, but use it with caution in theoretical modeling. The physics award has both theoretical and many experimental/numerical directions. Judges can accept using a PINN (physics-informed neural network), using a CNN to process experimental images, or using ML for parameter inversion — the 2025 physics finalist paper PINN-LOCK: Efficient Density Current Simulation via Physics-Constrained Loss is a case in point. But beware: if your physical model itself is "I had GPT list a few equations and then plugged in numbers," the judges will expose it by probing the boundary conditions of the PDE, dimensional analysis, or the Hamiltonian.
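For readers unfamiliar with the idea behind a physics-constrained loss, the following sketch shows its general shape under strong simplifying assumptions. It is not the method of the cited paper. A closed-form trial function u(t) = A·exp(-r·t) stands in for the neural network, the governing equation is assumed to be du/dt = -k·u, and all names and numbers are hypothetical.

```python
import math


def physics_informed_loss(params, observations, collocation_times, decay_rate, weight=1.0):
    """Data misfit plus the mean squared residual of du/dt + k*u = 0."""
    amplitude, rate = params

    def u(t):
        return amplitude * math.exp(-rate * t)

    def du_dt(t):
        # Analytic derivative of the trial function; a real PINN would use autodiff.
        return -rate * amplitude * math.exp(-rate * t)

    data_term = sum((u(t) - y) ** 2 for t, y in observations) / len(observations)
    residual_term = sum(
        (du_dt(t) + decay_rate * u(t)) ** 2 for t in collocation_times
    ) / len(collocation_times)
    return data_term + weight * residual_term


# Noise-free synthetic measurements generated from A = 2, k = 0.5 (illustration only).
k_true = 0.5
observations = [(t, 2.0 * math.exp(-k_true * t)) for t in (0.0, 1.0, 2.0)]
collocation_times = [0.25 * i for i in range(13)]

loss_at_truth = physics_informed_loss((2.0, 0.5), observations, collocation_times, k_true)
loss_off = physics_informed_loss((2.0, 0.8), observations, collocation_times, k_true)
print(loss_at_truth, loss_off)
```

The residual term is evaluated at collocation points where no measurement exists, which is how the physics constrains the fit; being able to explain that term, its units, and its weighting is exactly what the boundary-condition and dimensional-analysis questions test.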

Chemistry — the computational-chemistry direction welcomes AI; in experimental chemistry AI's value is limited. Chemistry judges have no objection at all to DFT calculations, MOF/COF design, or the use of machine-learning scoring functions in molecular-dynamics simulation — the 2024 chemistry merit paper Computational Screening and Design of Metal-Organic Frameworks for CO2 Separation is a model. But if your research is synthetic chemistry or materials preparation, AI's role should be limited to literature search and characterization-data processing, and should not replace real experiments.

Biology — AI has become almost standard, but you must manage the "data source" gate. In both 2024 and 2025, the biology award saw papers on AI-assisted drug design, single-cell analysis, and medical-image analysis (such as the 2024 merit paper AI-Guided Design and Preliminary Validation of Anti-Tuberculosis Subunit Vaccine). Biology judges are very familiar with AI tools themselves, but what they most care about is: where did your training data come from? Was there ethics approval? Did you handle batch effects and confounding? Did you have an independent validation set? If you cannot answer these, no amount of fancy AI will help.
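One common way an "independent validation set" quietly fails is when samples from the same patient or batch appear on both sides of the split. The sketch below illustrates a group-aware split under the assumption that each sample carries a group label; the function name and the patient IDs are hypothetical.

```python
import random


def group_split(sample_groups, test_fraction=0.2, seed=0):
    """Split sample indices so every group lands wholly in train or wholly in test."""
    groups = sorted(set(sample_groups))
    rng = random.Random(seed)  # fixed seed so the split can be reproduced later
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(sample_groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(sample_groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical example: ten samples drawn from five patients.
patient_ids = ["p1", "p1", "p2", "p2", "p2", "p3", "p4", "p4", "p5", "p5"]
train_idx, test_idx = group_split(patient_ids)
print("train:", train_idx, "test:", test_idx)
```

Splitting by group rather than by row does not remove batch effects or confounding, but it does keep near-duplicate samples from inflating the reported test score.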

Computer Science — AI is the discipline itself. The gold- and silver-prize papers of the past three editions are almost all in LLMs, generative models, agent systems, and multimodal learning (the 2025 silver paper CraftMesh uses Poisson fusion for generative 3D-mesh manipulation; among the bronze papers is one on LLM-based multi-agent games). Judges assume you can call the OpenAI/HuggingFace APIs, use PyTorch, and do fine-tuning. In this subject, daring to use AI is not a bonus but the price of admission. The real bonus is making an independent contribution beyond the existing large-model ecosystem — a new training paradigm, a new evaluation benchmark, a new explanation of model behavior.

Economics & Financial Modeling — AI is reasonable in data processing and modeling, but dangerous in economic reasoning. In economics & financial modeling, using machine learning to predict stock returns, NLP to process financial-report text, and reinforcement learning for trading strategies has become routine — the 2025 gold paper Firm-Level Impacts of Artificial Intelligence on Labor Demand takes AI itself as the research object. But judges intensely dislike one kind of AI use: letting an LLM "reason" about economic mechanisms for the student (as in "let GPT explain why this variable is correlated with that one"). The core of economics is causal identification, which LLMs currently cannot do at all; the moment a judge asks what your instrumental variable is or whether the parallel-trends assumption holds, the AI's answer falls apart.
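As a reminder of what the parallel-trends question refers to, here is a minimal sketch of a two-group, two-period difference-in-differences calculation with a descriptive look at the pre-treatment gap. It assumes simple group means with no covariates and reports no standard errors; the function names and every number are invented for illustration.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


def pre_trend_gap_changes(treated_by_period, control_by_period):
    """How the treated-minus-control gap moves across pre-treatment periods."""
    gaps = [mean(t) - mean(c) for t, c in zip(treated_by_period, control_by_period)]
    # Values near zero are consistent with parallel trends; they do not prove them.
    return [later - earlier for earlier, later in zip(gaps, gaps[1:])]


# Made-up outcome data.
effect = did_estimate(
    treated_pre=[10, 12, 11], treated_post=[15, 17, 16],
    control_pre=[9, 10, 11], control_post=[11, 12, 13],
)
gap_changes = pre_trend_gap_changes(
    treated_by_period=[[8, 9, 10], [10, 12, 11]],
    control_by_period=[[7, 8, 9], [9, 10, 11]],
)
print("DID estimate:", effect, "pre-period gap changes:", gap_changes)
```

The arithmetic is trivial; the judgment about whether the control group is a credible counterfactual is the part that has to come from the student.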

The AI Tool Stack Across the Research Process

Below we recommend tools stage by stage, following the standard Yau research process — topic selection, literature review, experiment/algorithm implementation, data analysis, and paper writing. The recommendations are based on feedback from 2024–2025 winners and the hands-on experience of the Thinker Education research team.

Stage 1: Topic Selection and Feasibility Assessment

  • Perplexity (perplexity.ai): compared with traditional search engines, Perplexity gives answers with cited sources, good for quickly judging "has this idea of mine already been done in the literature?" Note: the quality of Perplexity's citations is uneven and must be verified one by one.

  • Elicit (elicit.com): an AI tool focused on academic literature; enter a research question and it returns a list of relevant papers and a summary of each paper's key findings — very useful for quickly judging how mature a direction is.

  • Consensus (consensus.app): used to find evidence for and against a specific scientific proposition (such as "is Omega-3 effective for cognitive improvement?"). Especially handy when choosing topics in economics & financial modeling and biology.

  • ChatGPT / Claude for brainstorming: you can describe your interests to an LLM, have it list 10 possible research questions, then filter. But remember: the LLM's suggestions are only a starting point; the final topic must be argued out repeatedly with your advisor and cannot be used as is.

Advice: at the topic-selection stage, AI's role is to help you diverge, not to decide for you. Treat the LLM as a tireless discussion partner, but settle on a candidate topic only after you have assessed its feasibility yourself (equipment, time, knowledge base).

Stage 2: Literature Review

  • Connected Papers (connectedpapers.com): enter a seed paper and it automatically draws a network of related papers — very efficient for discovering "key literature along the same research thread."

  • Semantic Scholar / Google Scholar: traditional but still the most authoritative academic search tools. Semantic Scholar offers AI-generated TLDR summaries that greatly improve screening efficiency.

  • Notion AI / Obsidian + the Smart Connections plugin: import your close-reading notes into Notion or Obsidian and use AI to do semantic search across your own note library, helping you quickly locate "which paper I read earlier said something similar" while writing.

  • NotebookLM (Google): you can upload several PDF documents and have the LLM answer questions only within the corpus you set. Far more reliable than having ChatGPT answer "bare," with clear citations that are easy to check.

Advice: the core goal of the literature review is "letting the judges see that you really read these papers." For any work you cite, you should be able to restate its core method, key conclusions, and limitations in the defense. Using AI tools to speed up screening is fine, but the close-reading step has no shortcut.

Stage 3: Experiment / Algorithm Implementation

  • Cursor / Windsurf (AI code editors): with the code-completion and chat abilities of GPT-4/Claude built in, good for building a complete project from scratch (data preprocessing, model training, visualization).

  • GitHub Copilot: line- and function-level completion, a clear speed-up for students familiar with a given stack (PyTorch, scikit-learn, etc.).

  • Claude / ChatGPT for debugging: when you hit a hard-to-locate error, pasting the stack trace to an LLM for analysis is more efficient than staring at the screen alone. But actively verify the LLM's explanation — it sometimes confidently gives the wrong attribution.

  • HuggingFace Transformers / Diffusers: for computer-science and biology projects that need pretrained large models, HuggingFace is the de facto standard source of models and tooling.

  • Wolfram Alpha / Mathematica: the standard tools for symbolic computation in mathematics and physics. Judges accept these completely.

Advice: you must understand every line of all AI-generated code — especially the model definition, loss function, and training loop. In the defense, a judge may point to a piece of pseudocode in your slides and ask "what does this step do"; failing to answer is tantamount to declaring that the code is not yours.

Stage 4: Data Analysis

  • Pandas + Matplotlib + Seaborn: the core stack for data cleaning and visualization. You can have ChatGPT write the plotting code, but the statistical meaning of every chart must be interpreted by you.

  • Statsmodels / R + tidyverse: for econometric regression and significance testing in economics & financial modeling, prefer statsmodels over sklearn — the former outputs standard errors, p-values, and robustness diagnostics, which the judges care about.

  • ChatGPT Code Interpreter (Advanced Data Analysis): you can upload a CSV and have ChatGPT run Python analysis directly. Good for exploratory data analysis (EDA), but any statistical result that goes into the paper must be reproduced in a local script — never copy ChatGPT's output directly.

  • Weights & Biases (wandb.ai): for training logs and hyperparameter comparison in deep-learning experiments. Not an AI tool, but the best evidence in the defense that you did systematic tuning.

Advice: the most common AI abuse at the data-analysis stage is having the LLM "explain what these numbers mean." Judges are very sensitive to such "after-the-fact rationalization" — you need to form your hypothesis before doing the analysis, then use the data to confirm or refute it. Handing statistical inference to AI means abandoning the soul of research.
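To see what the standard errors mentioned in the statsmodels bullet actually are, it can help to compute them once by hand. The sketch below does this for a single-regressor ordinary least squares fit, assuming homoskedastic errors; it is a teaching aid, not a replacement for statsmodels or R, and the data are made up.

```python
from math import sqrt


def simple_ols(x, y):
    """Slope, intercept, slope standard error and t-statistic for y = a + b*x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    slope_se = sqrt(ssr / (n - 2) / sxx)
    return {
        "slope": slope,
        "intercept": intercept,
        "slope_se": slope_se,
        "t_stat": slope / slope_se,
    }


# Made-up data for illustration only.
fit = simple_ols([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(fit)
```

A student who can reproduce these quantities for a toy case is in a much better position to explain a full regression table when a judge asks where a standard error or t-statistic came from.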

Stage 5: Paper Writing and English Polishing

  • Grammarly / LanguageTool: for grammar, spelling, and sentence checks. This layer of AI use is uncontroversial, and judges will not object to it.

  • DeepL / Google Translate: native Chinese speakers can write a first draft in Chinese, then translate with DeepL and revise themselves. Never rely on machine translation from start to finish — the English quality of a Yau paper is an important basis for the judges' assessment of a student's overall competence.

  • ChatGPT / Claude for polishing expression: you can have an LLM polish the style of a passage, but you must not have it rewrite or reorganize the paragraph structure. A safe prompt is: "Please keep my argument structure and terminology; make only minimal changes to grammar and fluency."

  • Paperpal / Writefull: polishing tools trained specifically on academic English, more attentive to academic context (tense, passive-voice conventions) than a general LLM.

Advice: the paper's core argument, chapter framework, and key sentences (especially the last sentence of the abstract, the gap-positioning in the introduction, and the "our contribution is…" in the conclusion) must come from you. AI can help you "write it better," but it must never "think up what to write" for you.

Disclosure and Integrity: How to State AI Use in the Yau Award

Although the Yau Award currently has no mandatory disclosure clause, we strongly advise entrants to disclose AI use in the paper on their own initiative, following the disclosure standard ISEF introduced in 2024. Doing so is a sign of academic integrity, and it also means you have already set the terms of the conversation when a judge probes. Specific recommendations for placement and wording follow.

Acknowledgements section: disclose all non-core AI assistance together. Recommended wording example:

We acknowledge the use of ChatGPT (GPT-4, OpenAI) and Claude
(Anthropic) for language polishing and code-level debugging
assistance throughout this project. All algorithmic design,
experimental decisions, data analysis, and core conclusions
reported in this paper are the work of the authors. Generative
AI was not used to formulate research questions, generate
data, or derive theoretical results.

Methods section: when an AI tool is a substantive part of the research method, the Methods section must specify the model, its version, the training hyperparameters, and the evaluation metrics. For example:

For semantic classification, we fine-tuned the BERT-base-uncased
model (Devlin et al., 2019) on a custom dataset of 4,217
manually annotated samples. Training used the AdamW optimizer
(learning rate 2e-5, batch size 16) for 5 epochs on an
NVIDIA A100 GPU. Model performance was evaluated using
five-fold cross-validation with macro-F1 as the primary
metric.
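For readers who want to check that they can explain what "five-fold cross-validation with macro-F1" in the sample wording means, the following is a minimal standard-library sketch. It is not the pipeline of any actual paper: `k_fold_indices`, `macro_f1`, `cross_validate`, and the `majority_class` stand-in model are hypothetical names, and a real project would plug in its own training routine (or use an established library) in place of `majority_class`.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices once, then yield (train, validation) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, over the classes present in y_true.

    Per-class F1 = 2*TP / (2*TP + FP + FN); a class with no TP, FP, or FN scores 0.
    """
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_validate(features, labels, fit_predict, k=5, seed=0):
    """Return one macro-F1 score per fold; the model never sees its validation fold."""
    scores = []
    for train, val in k_fold_indices(len(labels), k, seed):
        preds = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in val],
        )
        scores.append(macro_f1([labels[i] for i in val], preds))
    return scores


def majority_class(train_x, train_y, val_x):
    """Hypothetical stand-in model: always predict the most frequent training label."""
    most_common = max(sorted(set(train_y)), key=train_y.count)
    return [most_common] * len(val_x)


if __name__ == "__main__":
    labels = ["pos"] * 6 + ["neg"] * 4       # toy labels, hypothetical
    features = list(range(len(labels)))      # placeholder features
    fold_scores = cross_validate(features, labels, majority_class)
    print("per-fold macro-F1:", [round(s, 3) for s in fold_scores])
    print("mean macro-F1:", round(sum(fold_scores) / len(fold_scores), 3))
```

Reporting the per-fold scores alongside the mean, as the last two lines do, is one simple way to show a judge how stable the metric is across splits.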

How to respond in the defense to "Did you use AI?": avoid two extremes — panicked denial (easily seen through) and over-confession (reporting every trivial thing, giving the judges the impression that "most of your work was done by AI"). The recommended response framework is "layered disclosure + emphasizing your independent contribution":

Yes, in step X I used {tool} to {specific task}, mainly because {reason}. However, {core insight / decision} was made by me after {specific reasoning process}, and AI played no decisive role in this step. If you'd like the details of this part, I can derive/demonstrate it on the spot.

The point of this script is to steer the conversation back to the details you know best, so the judges see that you could carry out your own work without AI at any moment.

Typical Probing Questions and How to Respond (a War-Game)

Below are the AI-related questions that judges probed most often across the three grand finals of 2023–2025. Each comes with a recommended direction and the trap to avoid; where one wrong answer is especially common, we also list what not to say. We advise entrants to rehearse each one against their own project before the competition.

  1. Q: "Did you use ChatGPT to write your paper?"

    • What not to say: "No, I never used it at all." (Once the judges catch LLM features in your prose, you will be judged dishonest outright.)

    • Recommended direction: admit you used it for grammar and expression polishing, make clear that the research design, literature review, and results discussion are all your own, and proactively explain that you verified every change the LLM made.

    • Trap: dressing up "I had GPT revise my abstract" as "I didn't use AI"; such a small matter, once exposed, drags down your overall credibility.

  2. Q: "Why does your model give this result?"

    • What not to say: "The model just learned it on its own."

    • Recommended direction: give a mechanistic explanation at three levels — the model architecture (e.g., how the attention mechanism captures long-range dependencies), the features of the training data, and the loss-function design — aided where necessary by an attention map or saliency visualization.

    • Trap: treating "black box" as the answer. The judge's implied question is: do you really understand what you trained?

  3. Q: "If you weren't allowed to use AI, could you still complete this research?"

    • What not to say: "No," or "Absolutely." (The former belittles yourself; the latter looks like a lie.)

    • Recommended direction: candidly point out which specific steps AI greatly compressed in time (e.g., code implementation, literature screening), but that the core ideas of the research — problem definition and method selection — were done by you independently; without AI the time cost would rise significantly, but the essential conclusions would not change.

    • Trap: an answer that implies the core ideas themselves depended on AI. Judges read this as over-reliance.

  4. Q: "Where did your training data come from? How did you obtain the labels?"

    • What not to say: "I just downloaded it from the internet."

    • Recommended direction: explain the dataset's source (for public datasets, cite the original paper and license; for self-collected data, explain the collection protocol and sample size), how labels were obtained (for manual annotation, state the number of annotators and the agreement measure; for weak supervision, state the heuristic rules), and proactively discuss data bias and the limits of coverage.

    • Trap: ignoring data compliance and privacy. When human data is involved, ethics approval is the point judges most often probe.

  5. Q: "What is your baseline? Why did you choose these baselines?"

    • Recommended direction: give at least three levels of baseline: a naive method (e.g., random guessing or linear regression), a strong traditional baseline (e.g., XGBoost, SVM), and a recent SOTA in the field. Explain that you chose these three to prove from different angles that your method's improvement does not come from tuning or chance.

    • Trap: comparing only against one weak baseline, or against an unfair baseline (e.g., letting your method use a large model while the baseline uses a small one).

  6. Q: "How do you verify that the AI didn't make a mistake?"

    • Recommended direction: answer from three sides — internal validation (cross-validation, ablation studies), external validation (an independent test set or external dataset), and mechanistic validation (whether attention/feature attribution accords with domain common sense).

    • Trap: reporting only training-set accuracy. The judges will know at once that you never really did a generalization test.

  7. Q: "If I moved your method to {a related but different field}, would it still work?"

    • Recommended direction: analyze under which assumptions your method holds (data distribution, sample size, feature type), and honestly point out the changes a new field might require (re-pretraining, adjusting the architecture, introducing domain knowledge).

    • Trap: blindly claiming "it works for all fields." This is a classic signal of overconfidence.

  8. Q: "Can you summarize the core contribution of this paper you cited in one sentence?"

    • Recommended direction: for any paper you have read closely, you should be able to state its problem, method, and conclusion in 15 seconds. Every cited work must meet this bar.

    • Trap: having the LLM screen and cite a great many papers you never read. A judge can pick one at random and expose you.

  9. Q (mathematics only): "Can you rewrite the proof of this lemma/theorem on the whiteboard?"

    • Recommended direction: pick up the pen and derive it. Do not hesitate, do not refuse.

    • Trap: vaguely saying "the proof is long, I can't remember it all right now." This amounts to declaring the work is not yours.

  10. Q (economics & financial modeling only): "What is your causal-identification strategy? Does your instrumental variable satisfy the exclusion restriction?"

    • Recommended direction: clearly state the identification strategy: difference-in-differences (DID), instrumental variables (IV), regression discontinuity design (RDD), or a randomized experiment. Then set out its key assumptions and give robustness checks for them (placebo tests, bandwidth sensitivity, etc.).

    • Trap: interpreting the correlation of a simple regression as causation. This is the most frequent point-loser in the economics & financial modeling award.
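As a worked illustration for question 10, the sketch below computes a two-group, two-period difference-in-differences estimate and the matching placebo check in plain Python. It assumes the parallel-trends assumption holds and returns a point estimate only, with no standard errors; `did_estimate` and all of the numbers are hypothetical.

```python
from statistics import mean


def did_estimate(rows):
    """2x2 difference-in-differences point estimate.

    rows: iterable of (treated: bool, post: bool, outcome: float).
    Returns (treated post - treated pre) - (control post - control pre).
    Every one of the four cells must contain at least one observation.
    """
    cells = {(t, p): [] for t in (False, True) for p in (False, True)}
    for treated, post, outcome in rows:
        cells[(treated, post)].append(outcome)
    m = {key: mean(values) for key, values in cells.items()}
    treated_change = m[(True, True)] - m[(True, False)]
    control_change = m[(False, True)] - m[(False, False)]
    return treated_change - control_change


if __name__ == "__main__":
    # Toy outcomes (hypothetical): policy takes effect between "pre" and "post".
    main_rows = [
        (False, False, 9.8), (False, False, 10.2),   # control, pre
        (False, True, 11.9), (False, True, 12.1),    # control, post
        (True, False, 10.9), (True, False, 11.1),    # treated, pre
        (True, True, 15.8), (True, True, 16.2),      # treated, post
    ]
    # Placebo: two periods that are BOTH before the policy, with the later one
    # labelled "post". An estimate near zero supports parallel pre-trends.
    placebo_rows = [
        (False, False, 7.9), (False, False, 8.1),
        (False, True, 9.9), (False, True, 10.1),
        (True, False, 8.9), (True, False, 9.1),
        (True, True, 10.9), (True, True, 11.1),
    ]
    print("DID estimate:", round(did_estimate(main_rows), 3))
    print("placebo estimate:", round(did_estimate(placebo_rows), 3))
```

A regression with group and period fixed effects and clustered standard errors would be the usual next step in a real paper; the arithmetic above is only the core comparison that the identification strategy rests on.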

AI Application Paradigms from Past Winning Papers

To make the abstract advice above concrete, we select several representative AI applications from the 2024 and 2025 winning papers for brief analysis. The analysis below is based only on paper titles and public abstracts; we make no conjecture about undisclosed method details.

Case 1: 2024 computer-science gold prize, LLM Mathematical Reasoning Grounded with Formal Verification (BASIS International School Park Lane Harbour; He Kunlang; advisor: Junchi Yan, Shanghai Jiao Tong University). This is a paradigm-setting AI project: the researcher did not simply "call GPT to do math problems," but combined the LLM's reasoning with a formal-proof system such as Lean, having every reasoning step the LLM gives undergo machine verification. This architecture of "LLM proposes candidates + a formal system filters" is exactly the frontier direction in academia since 2023 (such as DeepMind's AlphaProof). The lesson: do not treat the LLM as a black-box endpoint; treating it as an intermediate component that needs supervision is a common feature of high-scoring projects.

Case 2: 2025 computer-science gold prize, Beyond Reactive Assistance: PV-Care Using Low-Density EEG and AI (Shanghai High School International Division). This project trains a model on low-density EEG signals to proactively identify the state of patients with mild cognitive impairment and intervene early. The core contribution is not "I used a transformer," but the coupling of EEG signal processing, AI reasoning, and a specific clinical scenario (proactive assistance for MCI patients). The lesson: what judges value most is that the AI solves a specific, socially meaningful problem, not the complexity of the technology itself.

Case 3: 2025 computer-science silver prize, CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion (Hefei Anson School). This project combines a classic graphics method (Poisson fusion) with generative AI to solve the seam problem in 3D-mesh editing. The lesson: combining domain knowledge (here, geometry processing) with generative AI is often more favored by judges than a pure-AI method — because it shows the student understands both traditional methods and frontier models.

Case 4: 2025 physics silver prize, Development of a High-Efficiency Objective-Prism Stellar Spectrograph And Construction of its Dedicated AI Classification Model (the Experimental High School Attached to Beijing Normal University). This project is a model of "hardware + AI" — the student built an objective-prism spectrograph (hardcore physics engineering) and trained an AI classification model to automatically classify the stellar spectra collected. The lesson: pure-AI projects rarely win in physics, but when AI serves as a tool that "multiplies the speed of your physics experiment," it becomes a huge bonus.

Case 5: 2024 biology merit prize, AI-Guided Design and Preliminary Validation of Anti-Tuberculosis Subunit Vaccine. This project demonstrates the standard paradigm of AI-assisted drug/vaccine design: use AI (protein-structure prediction, epitope prediction, affinity calculation) to screen candidates, then validate several top candidates by wet-lab experiment. The lesson: judges greatly value the closed loop of "AI screening + real experimental validation" — computation without experiment is seen as armchair theorizing; experiment without computation is seen as failing to use the methodology of the AI era.

Case 6: 2025 economics & financial modeling gold prize, Firm-Level Impacts of Artificial Intelligence on Labor Demand: Evidence from Online Job Postings (Shanghai Xinghe Bay Bilingual School). This project takes AI itself as the research object, building data from web-scraped job-posting text, using an NLP model to identify the AI-relevance of positions, and then doing econometric identification. The lesson: in economics & financial modeling, AI as the research object plus econometric identification as the method is the high-scoring combination recognized over the past two years.

Common Misconceptions

Misconception 1: the more AI you use, the more "modern" your paper. Judges have seen far too many papers piling on transformers, diffusion models, and reinforcement learning. The "modernity" of the model itself earns no bonus; whether the research question is worth doing and whether the AI solved the problem are the criteria. A paper using logistic regression to solve a real problem is far better than one showing off a ten-billion-parameter model without solving anything.

Misconception 2: the judges don't understand AI, so you can bluff. The Yau advisory and review committees include many professors with backgrounds in computer science, statistics, and bioinformatics, whose understanding of AI far exceeds that of almost any high-school student. Even the pure-theory judges in mathematics and physics encounter AI in their daily research far more often than students imagine. Trying to bluff judges with a pile of jargon or deliberate complexity only leads to deeper, harder follow-up questions.

Misconception 3: having an LLM write my Chinese/English abstract — no one can tell. LLM-generated text has telltale features: overly symmetric sentence structures, vague adjectives ("significant," "novel," "remarkable" appearing frequently), unnatural transition words ("moreover," "furthermore" used back-to-back), and a tendency to avoid concrete numbers. Judges — especially scholars who supervise doctoral students year-round — are very sensitive to these features. The moment your abstract's English style is clearly out of step with your spoken level in the defense, suspicion has already arisen.

Misconception 4: using an open-source model is automatically compliant. Open-source does not mean unconditional use, and licenses differ enormously: the LLaMA series carries license conditions on commercial use; many models on Hugging Face forbid use in high-risk scenarios such as medicine or law; some datasets (such as ImageNet) are restricted to non-commercial use. Judges will not check every license, but if publication or follow-up promotion is involved, license issues become a hidden hazard.

Misconception 5: AI-tool use need not be cited in the references. For AI tools that substantively participate in the research (a specific pretrained model, a particular data-analysis platform), you should cite them in the references like an academic paper. For example, using the BERT model should cite Devlin et al. (2019); using GPT-4 can cite OpenAI's technical report. For general polishing (grammar checks, etc.), a combined note in the acknowledgements suffices.

Misconception 6: the statistical results AI gives me can go straight into the paper. ChatGPT's Code Interpreter and Claude's analysis tool can output statistical results, but these may be affected by floating-point precision, library-version differences, or preprocessing steps the tool did not disclose. Any statistical number that goes into the paper should be reproduced independently in your local Python/R script, and a complete runnable notebook should be archived in case a judge probes.
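One way to make a locally reproduced number defensible is to archive, next to the result, everything needed to regenerate it. The sketch below is a minimal standard-library illustration of that idea, not a prescribed format: `bootstrap_mean_ci`, `analysis_record`, the field names, and the inline CSV are all hypothetical.

```python
import hashlib
import json
import platform
import random
import statistics


def bootstrap_mean_ci(values, n_boot=2000, seed=0):
    """Percentile bootstrap 95% interval for the mean, with a fixed seed."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    low_index = n_boot * 25 // 1000          # 2.5th percentile
    high_index = n_boot * 975 // 1000 - 1    # 97.5th percentile
    return means[low_index], means[high_index]


def analysis_record(raw_bytes, values, seed=0):
    """Bundle the result with the data fingerprint, seed, and Python version."""
    low, high = bootstrap_mean_ci(values, seed=seed)
    return {
        "data_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "python_version": platform.python_version(),
        "seed": seed,
        "n": len(values),
        "mean": statistics.fmean(values),
        "ci95_low": low,
        "ci95_high": high,
    }


if __name__ == "__main__":
    # Hypothetical inline CSV; in practice read the bytes of your real data file.
    raw = b"value\n4.1\n3.9\n4.4\n4.0\n4.3\n4.2\n"
    values = [float(line) for line in raw.decode().splitlines()[1:]]
    print(json.dumps(analysis_record(raw, values), indent=2, sort_keys=True))
```

Saving the printed JSON alongside the script gives you a record to compare against if a judge asks you to rerun the analysis: on the same Python version, the same data hash and seed should give the same interval.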

Checklist: AI-Use Compliance Self-Check Before Submitting Your Paper

Before you finally submit your paper to the Yau Award, check the following list item by item. Any item you cannot tick should send you back to fix it.

  1. I can explain every passage in the paper that AI assisted, and can restate its content in my own words without AI.

  2. I can derive all the key formulas/algorithm steps in the paper on the whiteboard without consulting anything.

  3. I can present the research question, method, results, and conclusion in full without slides, by speaking alone.

  4. I can state precisely the source, size, cleaning rules, label-acquisition method, and potential biases of the training data.

  5. I can state precisely the model's architecture, parameter scale, training hyperparameters (learning rate, batch size, epochs), and training hardware.

  6. I did at least one complete cross-validation or independent-test-set evaluation, the results match the metrics reported in the paper, and I kept the complete training logs (e.g., a Weights & Biases link or a local CSV).

  7. I compared against at least three baselines of differing strength (naive, strong traditional, recent SOTA) and can explain in the defense why I chose them.

  8. I clearly disclosed the scope of AI-tool use in the paper's acknowledgements and/or methods (down to the tool name, version, and purpose).

  9. I cited the original papers of the pretrained models/datasets/algorithms I used, without pasting in AI-generated citations.

  10. I have prepared a response framework for open-ended judge questions such as "what role did AI play in your research" and "could you have done it without AI."

  11. I reproduced locally any AI-generated chart or number in the paper and saved the runnable script.

  12. The key sentences of my abstract, the opening of my introduction, and my conclusion were written by my own hand, without having AI rewrite or reorganize the structure.
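For checklist items 6 and 7, a question that often follows is whether the gain over a baseline is larger than chance on your test set. The sketch below shows one common way to probe that, a paired bootstrap over per-example correctness, using only the standard library. `paired_bootstrap` and the toy numbers are hypothetical, and the returned fraction is a rough indicator rather than a formal p-value.

```python
import random


def paired_bootstrap(correct_model, correct_baseline, n_boot=2000, seed=0):
    """Compare two systems evaluated on the SAME test examples.

    correct_model / correct_baseline: lists of 0 or 1, one entry per test
    example, in the same order. Returns (observed accuracy gap, fraction of
    bootstrap resamples in which the model does not beat the baseline).
    """
    if len(correct_model) != len(correct_baseline):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    n = len(correct_model)
    diffs = [m - b for m, b in zip(correct_model, correct_baseline)]
    observed_gap = sum(diffs) / n
    not_better = 0
    for _ in range(n_boot):
        resample = rng.choices(diffs, k=n)   # resample examples with replacement
        if sum(resample) <= 0:
            not_better += 1
    return observed_gap, not_better / n_boot


if __name__ == "__main__":
    # Toy results on 100 shared test examples (hypothetical).
    model = [1] * 90 + [0] * 10
    baseline = [1] * 70 + [0] * 30
    gap, frac = paired_bootstrap(model, baseline)
    print(f"accuracy gap={gap:.2f}, resamples where model is not better={frac:.3f}")
```

Pairing matters here: resampling the per-example differences keeps each example's difficulty shared between the two systems, which is why both must be scored on the same test set.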

In closing. AI's role in the Yau Award is shifting from "needing to explain why you used it" to "needing to explain why you didn't." Even so, the judges' core concerns have never changed: whether you posed a question worth answering, whether you truly understand the methods you used, and whether some part of the contribution, however small, is original and genuinely your own. The tools are new; the standard is old. This is precisely why the S.T. Yau Award can still tell real research apart in the AI era.


  1. The AI-disclosure requirement added by ISEF from 2024 is detailed on the Society for Science website.

  2. Records of the 2024 Yau science forum and defense sessions are available at https://www.yau-awards.com/show-89-5.html and https://www.yau-awards.com/show-89-6.html.