COMPSCI 4NL3: Natural Language Processing
McMaster University
Winter 2025
Overview: Natural language processing with a focus on recent developments: basics of data processing and machine learning in the context of NLP, text classification and sequence labeling, vector semantics and embeddings, language modeling, translation, question answering, parsing, dialogue and conversational systems, discourse processing, and large language models. Established practices and recent developments are both covered, with practical exercises weighted toward recent neural network approaches.
Rationale: Artificial intelligence has grown explosively over the past decade, and the processing of natural language has been central to this growth. This course covers the fundamentals of natural language processing while focusing on recent developments, primarily the deep neural network methods that have enabled large language models. The course will be useful for students with some background in programming and data analytics who are interested in how language is used in artificial intelligence.
Books: The following books are relevant to the course; the Jurafsky & Martin book is the most directly applicable to the material we cover.
- Dan Jurafsky and James H. Martin. Speech and Language Processing.
- Yoav Goldberg. A Primer on Neural Network Models for Natural Language Processing.
- Jacob Eisenstein. Natural Language Processing.
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.
Relation to Other Courses: This course does not cover machine learning in depth. Assignments and lectures use the Python language and require a programming background. COMPSCI 4O03 or 4TE3 may help students understand some of the methods more deeply, but the basics of experimentation and model development in natural language processing are covered in early lectures to set the foundation without much mathematical depth. This background overlaps with some material from COMPSCI 4ML3, though much more briefly, as a refresher, and from the perspective of NLP problems specifically. COMPSCI 4AL3 introduces preliminary NLP concepts but focuses more generally on applied machine learning practice, whereas this course explores the breadth of NLP topics in greater detail, since NLP is the focus of the entire course.
Tentative Schedule: The lectures for the course cover the following topics; the schedule is subject to change. Two short code sketches illustrating early topics (W1/W2 and W3) follow the list.
- W1: Intro to Natural Language Processing: Text processing basics, regular expressions
- W2: Supervised Text Classification: Logistic regression, naive Bayes, introduction to neural networks
- W3: Language Models: N-gram models, neural models, machine translation
- W4: Word Vectors: Latent semantic analysis, word2vec, GloVe, contextual embeddings, morphology
- W5: Sequence Labeling: Hidden Markov models, part-of-speech tagging, LSTM networks
- W6: Syntax: Context-free parsing, dependency and constituency parsing, semantic parsing
- W7: Data Collection: Annotation, crowdsourcing, bias, ground truth, subjectivity
- W8: Unsupervised Learning: Clustering, expectation maximization, topic modeling
- W9: Neural Networks: BERT, Transformers, reinforcement learning from human feedback, safety, efficiency
- W10: Discourse: Narrative, chatbots, co-reference
- W11: Lexical Semantics: Word sense disambiguation, entity linking and detection, semantic roles and frames, diachronic change
- W12: Application Areas: Sociolinguistics, social NLP, information extraction, inference, question answering
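For a flavor of the exercises in the early weeks, the following is a minimal sketch pairing W1-style regex tokenization with a W2-style naive Bayes classifier. It assumes Python with scikit-learn (the syllabus names only Python) and uses a made-up toy dataset.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, for illustration only.
texts = [
    "I loved this movie, great acting",
    "Fantastic plot and wonderful cast",
    "Terrible film, a waste of time",
    "I hated the ending, very boring",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

def tokenize(text):
    """W1-style text processing: lowercase, then extract word-like spans."""
    return re.findall(r"[a-z']+", text.lower())

# Bag-of-words features built with the regex tokenizer above.
vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None)
X = vectorizer.fit_transform(texts)

# W2-style supervised classification: multinomial naive Bayes.
classifier = MultinomialNB()
classifier.fit(X, labels)

print(classifier.predict(vectorizer.transform(["great acting and wonderful plot"])))  # [1]
```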
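In the same spirit, here is a minimal sketch of a W3-style bigram language model with add-one (Laplace) smoothing, in plain Python over a toy corpus.

```python
from collections import Counter

# Hypothetical toy corpus, pre-tokenized by whitespace.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigram_counts)

def bigram_prob(prev, word):
    """P(word | prev) estimated with add-one (Laplace) smoothing."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

print(bigram_prob("the", "cat"))  # seen bigram: (1 + 1) / (4 + 8) ~= 0.167
print(bigram_prob("the", "sat"))  # unseen bigram: (0 + 1) / (4 + 8) ~= 0.083
```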
Course Work and Grading: Your grade is composed of the following.
- Four programming assignments
- Class project in four stages: annotation task design, annotation, baseline design, and modeling for a challenge competition
- Midterm Exam
- Final Exam