Reading list:
Hinton & Sejnowski (1983).
Optimal Perceptual Inference (Boltzmann machines). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 448–453.
Rumelhart, Hinton & Williams (1986).
Learning representations by back-propagating errors (Backpropagation). Nature, Vol. 323, pp. 533–536.
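To make the core idea concrete, here is a minimal NumPy sketch of backpropagation through a one-hidden-layer network (function names, shapes, and the squared-error loss are my own illustrative choices, not taken verbatim from the paper):

```python
import numpy as np

def forward(x, W1, W2):
    """One-hidden-layer net: h = tanh(W1 x), y = W2 h."""
    h = np.tanh(W1 @ x)
    return W2 @ h, h

def backward(x, W1, W2, y, h, target):
    """Gradients of L = 0.5 * ||y - target||^2 via the chain rule."""
    dy = y - target               # dL/dy
    dW2 = np.outer(dy, h)         # dL/dW2
    dh = W2.T @ dy                # error propagated back to the hidden layer
    dz = dh * (1.0 - h ** 2)      # through tanh'(z) = 1 - tanh(z)^2
    dW1 = np.outer(dz, x)         # dL/dW1
    return dW1, dW2
```

The analytic gradients can be sanity-checked against finite differences of the loss.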
LeCun, Bottou, Bengio & Haffner (1998).
Gradient-based learning applied to document recognition (LeNet-5). Proceedings of the IEEE, Vol. 86, pp. 2278–2324.
Bengio, Ducharme, Vincent & Jauvin (2003).
A neural probabilistic language model (Bengio NPLM -- Distributed embeddings). J. Mach. Learn. Res., Vol. 3, pp. 1137–1155.
Krizhevsky, Sutskever & Hinton (2012).
ImageNet Classification with Deep Convolutional Neural Networks (AlexNet). In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS), Vol. 1, pp. 1097–1105.
Kingma & Welling (2013).
Auto-Encoding Variational Bayes (VAE). CoRR, Vol. abs/1312.6114.
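The paper's reparameterization trick and the closed-form KL term for a diagonal Gaussian posterior, as a NumPy sketch (function names are mine; a real VAE would wrap these around encoder/decoder networks):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I), so sampling stays
    differentiable with respect to mu and log_var."""
    eps = rng.normal(size=np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian, in closed form."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)
```

With mu = 0 and log_var = 0 the posterior equals the prior and the KL term vanishes.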
Bahdanau, Cho & Bengio (2014).
Neural Machine Translation by Jointly Learning to Align and Translate (Attention in seq2seq models). CoRR, Vol. abs/1409.0473.
Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville & Bengio (2014).
Generative adversarial nets [Survey 1] [Survey 2] [Image to image] [Transformers + GANs] (GANs). In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, pp. 2672–2680.
Kingma & Ba (2014).
Adam: A Method for Stochastic Optimization (Adam). CoRR, Vol. abs/1412.6980.
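The update rule (Algorithm 1 of the paper) fits in a few lines of NumPy; this is a sketch with my own function signature, keeping the paper's default hyperparameters:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: bias-corrected first/second moment estimates,
    then a step scaled by the second moment. t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (scale)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Applied to a toy objective like f(x) = x^2, repeated steps drive the parameter toward the minimum.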
Simonyan & Zisserman (2014).
Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG-16). CoRR, Vol. abs/1409.1556.
Sutskever, Vinyals & Le (2014).
Sequence to sequence learning with neural networks (Encoder-decoder / seq2seq). In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, pp. 3104–3112.
Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke & Rabinovich (2014).
Going deeper with convolutions (Inception / GoogLeNet). 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-9.
Redmon, Divvala, Girshick & Farhadi (2015).
You Only Look Once: Unified, Real-Time Object Detection [v2] [v3] [v4+] (YOLO algorithm). 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779-788.
Ronneberger, Fischer & Brox (2015).
U-Net: Convolutional Networks for Biomedical Image Segmentation (U-net). In Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2015, pp. 234–241.
Schroff, Kalenichenko & Philbin (2015).
FaceNet: A Unified Embedding for Face Recognition and Clustering (FaceNet). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
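The triplet loss this paper popularized, as a NumPy sketch (names are mine; the paper applies it to L2-normalized embeddings and pairs it with online triplet mining):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin): pull the anchor
    toward the positive and push it at least `margin` past the negative."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)
```

When the negative is already far enough away, the hinge makes the loss exactly zero, so such "easy" triplets contribute no gradient.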
Sohl-Dickstein, Weiss, Maheswaranathan & Ganguli (2015).
Deep unsupervised learning using nonequilibrium thermodynamics (Early diffusion models). In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pp. 2256–2265.
Ganin, Ustinova, Ajakan, Germain, Larochelle, Laviolette, Marchand & Lempitsky (2016).
Domain-adversarial training of neural networks (DANN). J. Mach. Learn. Res., Vol. 17, pp. 1–35.
He, Zhang, Ren & Sun (2016).
Deep Residual Learning for Image Recognition (ResNet). In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.
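The structural idea — an identity shortcut around a small stack of layers, so each block learns a residual F(x) rather than a full mapping — in a fully-connected NumPy sketch (the paper uses convolutions and batch normalization; this toy version omits both):

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = relu(x + F(x)) with F(x) = W2 relu(W1 x).
    The identity shortcut lets gradients flow through unchanged,
    which is what makes very deep stacks trainable."""
    fx = W2 @ np.maximum(0.0, W1 @ x)
    return np.maximum(0.0, x + fx)
```

With F initialized to zero the block reduces to the identity (up to the final ReLU), which is why stacking many such blocks does not degrade the signal.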
Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto & Adam (2017).
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (MobileNet v1). CoRR, Vol. abs/1704.04861.
Kaiser, Gomez, Shazeer, Vaswani, Parmar, Jones & Uszkoreit (2017).
One Model To Learn Them All (?). CoRR, Vol. abs/1706.05137.
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin (2017).
Attention is all you need (Transformers). In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010.
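The scaled dot-product attention at the heart of this paper, as a single-head, unbatched NumPy sketch (no masking or multi-head projections; names are mine):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # rows sum to 1
    return weights @ V, weights
```

Each output row is a convex combination of the value vectors, weighted by query-key similarity.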
Radford, Narasimhan, Salimans & Sutskever (2018).
Improving Language Understanding by Generative Pre-Training (GPT1, OpenAI / autoregressive pretraining).
Sandler, Howard, Zhu, Zhmoginov & Chen (2018).
MobileNetV2: Inverted Residuals and Linear Bottlenecks (MobileNet v2). 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510-4520.
Devlin, Chang, Lee & Toutanova (2019).
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (BERT / Masked Language Modeling / pretraining -> fine-tuning). In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 4171–4186.
Tan & Le (2019).
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (EfficientNet). In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 6105–6114.
Ho, Jain & Abbeel (2020).
Denoising Diffusion Probabilistic Models (DDPM / Diffusion models / VAE). In Advances in Neural Information Processing Systems, Vol. 33, pp. 6840–6851.
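The closed-form forward (noising) process from this paper — x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps — as a NumPy sketch (names are mine; the learned part of DDPM, the denoising network, is not shown):

```python
import numpy as np

def diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) directly, without iterating t steps:
    alpha_bar_t is the cumulative product of (1 - beta_s) up to step t."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps
```

With the paper's linear beta schedule, the signal coefficient sqrt(alpha_bar_t) is nearly zero at the final step, so x_T is approximately standard Gaussian noise.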
Bank, Koenigstein & Giryes (2021).
Autoencoders.
Dao, Fu, Ermon, Rudra & Ré (2022).
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (FlashAttention - Fast algorithm to compute attention). In Advances in Neural Information Processing Systems, Vol. 35.
Michelucci (2022).
An Introduction to Autoencoders.
Gu & Dao (2024).
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Mamba - selective state-space model, a recurrent alternative to attention).
Behrouz, Razaviyayn, Zhong & Mirrokni (2025).
Nested Learning: The Illusion of Deep Learning Architectures [More papers by A. Behrouz] (Nested Learning / Hope). In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
Chong (2025).
Attention Is Not What You Need (Grassmann flows).
Agarwal, Dalal & Misra (2026).
The Bayesian Geometry of Transformer Attention.