Course syllabus
Welcome to this course on the fundamentals of machine learning for natural language processing!
This course is divided into 5 modules. The first four modules, each consisting of three lectures and one assignment, will guide you from simple text encoding up to neural models. The last module, consisting of one lecture and a text seminar, gives an introduction to ethics in machine learning. Each lecture will typically consist of 45-60 minutes of theory, a mini-lab, and finally a presentation/discussion of the mini-lab results during the last 5-15 minutes. The mini-labs are there to anchor newly acquired theoretical knowledge in a real task, but also to incrementally build up a code base for future experiments. Each mini-lab usually involves several design/data choices. Note that there is no expectation of "finishing" a mini-lab during the lecture; please experiment with them outside of class as an exercise. Course examination is done through four assignments and a text seminar.
The course has two tracks for readings. The fundamental track gives you an understanding of the course content while also reviewing basics from earlier courses. The advanced track goes beyond the material covered in the lectures. These extra readings are completely voluntary: they are not required for passing the course and will not be discussed in depth in the lectures. You are, however, welcome to ask questions about them.
Lectures
All lectures will be given both on campus and by video link (by invitation). Each module has its own lecture plan with more detailed descriptions of the lectures, along with reading material, slides, and code.
Module 1: Fundamentals of modelling
In the first part of the course, we will discuss the basics of modelling. The core concepts are: encoding text as vectors, basic classification and regression models, and how to choose parameters for these models. Some themes in this module are: [lecture plan]
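To give a concrete flavour of what "encoding text as vectors" means, here is a minimal sketch using sklearn's CountVectorizer (the sentences are toy examples, not course data):

```python
# A minimal sketch of encoding text as count vectors with sklearn.
# The documents are toy examples for illustration only.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "cats and dogs"]
vec = CountVectorizer()
X = vec.fit_transform(docs)          # sparse (n_docs, vocab_size) matrix

print(vec.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                   # each row: one document as word counts
```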
A visualization of "Alice in Wonderland" using a t-SNE embedding of GloVe vectors. Blobs are coloured by POS tag and scaled in proportion to word frequency. [code]
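For the curious, a pipeline along these lines could be sketched as follows; the GloVe source, the file name "alice.txt", and the plotting details are assumptions here and may differ from the linked code:

```python
# A rough sketch of how a figure like this might be produced (assumptions:
# gensim's downloader for GloVe, NLTK for tokenization and POS tags; the
# linked [code] may do this differently). Requires nltk.download("punkt")
# and nltk.download("averaged_perceptron_tagger").
import gensim.downloader as api
import matplotlib.pyplot as plt
import nltk
from collections import Counter
from sklearn.manifold import TSNE

glove = api.load("glove-wiki-gigaword-50")      # pre-trained GloVe vectors

with open("alice.txt") as f:                    # hypothetical local copy
    tokens = nltk.word_tokenize(f.read().lower())

counts = Counter(tokens)
words = [w for w in counts if w in glove]
tags = dict(nltk.pos_tag(words))                # one tag per word type
                                                # (out of context; fine for a sketch)
coords = TSNE(n_components=2).fit_transform(glove[words])

colours = [hash(tags[w]) % 20 for w in words]   # colour by POS tag
sizes = [counts[w] for w in words]              # size by word frequency
plt.scatter(coords[:, 0], coords[:, 1], c=colours, s=sizes, cmap="tab20")
plt.show()
```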
Module 2: Model Selection & (Un-)Supervised Learning
The main focus of this part of the course is to go deeper into different types of models and ways of learning. We will talk about how to create a model without labelled data and about model parameter sensitivity. Maybe most importantly, we will try out several types of classifiers (nonlinear, structured prediction, etc.) that have been successful in NLP. Some themes in this module are: [lecture plan]
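As a preview of the unlabelled-data theme, here is a minimal sketch of clustering documents without any labels (the documents and cluster count are made up):

```python
# A minimal sketch of learning without labels: k-means over TF-IDF
# document vectors (toy documents, illustrative cluster count).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stocks fell sharply today",
    "markets rallied after the news",
]
X = TfidfVectorizer().fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # a cluster index per document, learned without labels
```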
Module 3: Fundamentals of Neural Networks
This part will be about the core components of neural networks. We will talk about designing and training small networks to solve NLP problems such as POS tagging. Some themes in this module are: [lecture plan]
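As a taste of what a small network for a problem like POS tagging can look like, here is a toy sketch of a window-based tag classifier in PyTorch; all sizes and names are illustrative, not the course's reference implementation:

```python
# A toy sketch of a small tagging network in PyTorch: embed a window of
# word ids, flatten, and score the possible tags (all sizes are made up).
import torch
import torch.nn as nn

class TaggerMLP(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden, n_tags, window=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(window * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_tags),
        )

    def forward(self, word_ids):        # word_ids: (batch, window)
        e = self.emb(word_ids)          # (batch, window, emb_dim)
        return self.ff(e.flatten(1))    # (batch, n_tags) tag scores

model = TaggerMLP(vocab_size=10_000, emb_dim=50, hidden=100, n_tags=17)
scores = model(torch.randint(0, 10_000, (8, 3)))   # dummy batch
print(scores.shape)                                # torch.Size([8, 17])
```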
Module 4: Machine Learning Applications
As this course doesn't presuppose any knowledge of machine learning, the first three modules are dedicated to grasping the many fundamental concepts of ML. With a good grasp of the fundamentals, we can now focus more on applications of, primarily, neural models. Some themes in this module are: [lecture plan]
Module 5: Ethics and ML
Teaching modern machine learning without some insight into ethics could be considered unethical. Also, "AI ethics" is increasingly being written about and researched. In this part, you will be given a very brief introduction to ethical philosophy. Focus will be on how to think, not what to think. [lecture plan]
Literature
For the first two parts (half of the course), we will be using the book "An Introduction to Statistical Learning: with Applications in R" as the main literature. It will be abbreviated as ST in the reading instructions. The implementations in the book are written in R, a programming language that is especially popular among statisticians and bioinformaticians. We will, however, keep using Python, as it is much more popular in computational linguistics (among other fields). You can find reimplementations of the book's examples in Python here (I haven't looked at them properly, so I can't vouch for their quality). The book is available for free from the authors' website.

In addition, you will be given several reading items per lecture. These can be found under "literature" for each lecture/assignment. All reading items are split into a fundamentals track and an advanced track. The advanced-track items are marked with "(A)" and are voluntary; the course is designed so that you can pass (G) without having read them. Finally, we will make heavy use of the manuals for the respective software packages used throughout the course.
ST: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). "An Introduction to Statistical Learning: with Applications in R." Springer, New York. https://statlearning.com/
Intended Learning Outcomes
The course syllabus states five learning outcomes. Here follows a short description of their relation to the course material.
1. apply basic principles of machine learning to natural language data;
The majority of the data we will be working on is natural language. Starting from the first lecture, several ways of encoding this type of data will be discussed, ranging from binary bags-of-words (lectures 1 & 3, assignment 1) to different types of embeddings (lectures 1 & 8+, assignments 3 & 4). Basic methodology like cross-validation also falls under this learning outcome.
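For instance, cross-validating a classifier over a binary bag-of-words encoding could look like this (toy data; the assignments use real corpora):

```python
# A small illustration of two things named above: binary bag-of-words
# features and cross-validation (toy data, not the assignment corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

texts = ["good film", "great acting", "bad plot", "terrible pacing"] * 10
labels = [1, 1, 0, 0] * 10

X = CountVectorizer(binary=True).fit_transform(texts)  # 0/1 word presence
scores = cross_val_score(LogisticRegression(), X, labels, cv=5)
print(scores.mean())    # average held-out accuracy over 5 folds
```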
2. apply probability theory and statistical inference on linguistic data;
Basic n-gram models and word statistics have been used in earlier courses. Here, these concepts will be extended both to vector spaces and to input for probabilistic classifiers (e.g., Naive Bayes). Several lectures, starting with lecture 2, will discuss probabilistic perspectives on both feature and parameter spaces.
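As a small example of the probabilistic-classifier side of this outcome, a Naive Bayes model over word counts might be used like this (toy training data):

```python
# A minimal Naive Bayes example: word counts in, predicted class out
# (toy training data for illustration only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = ["offer free money now", "meeting at noon",
         "claim your free prize", "agenda for monday"]
y = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train), y)
print(clf.predict(vec.transform(["free money meeting"])))
```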
3. use standard software packages for machine learning;
In order to work more with the core functionality of models, the course does not involve too many ML frameworks; avoiding black boxes is important when studying the basics. Several external libraries will nevertheless be introduced, and most modelling will be done in sklearn and PyTorch. This learning outcome also includes code quality: using standard software packages includes writing understandable code with Python and numpy.
4. implement linear models for classification;
This is introduced in lecture three and will be expanded upon throughout the course (including SVMs, feedforward networks, etc.). It is the main theme of assignment 1 and will come back in modified form as smaller parts of the other assignments.
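To make "implement" concrete, here is a bare-bones perceptron in numpy; it is a sketch of the idea, not the required solution for assignment 1:

```python
# A bare-bones linear classifier implemented from scratch: the perceptron
# update rule in numpy (toy, linearly separable data).
import numpy as np

def perceptron(X, y, epochs=10):
    """X: (n, d) feature matrix; y: labels in {-1, +1}. Returns weights."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:   # misclassified (or on the boundary)
                w += yi * xi         # nudge the hyperplane towards xi
    return w

X = np.array([[1, 1], [2, 1], [-1, -1], [-2, -1]], dtype=float)
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))   # predictions should match y
```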
5. design simple neural nets using some standard library.
This is the core of the second half of the course. We will be using PyTorch.
Examination
The course is examined by four assignments (handed in through Studium) and a seminar. To pass the course ("Godkänt", G), you must pass all four assignments and the seminar. To pass with distinction ("Väl godkänt", VG), at least three of the individual assignments must be passed with distinction. All assignments will be distributed as IPython notebooks, which is also the hand-in format in Studium. Note that the course includes additional ungraded mini-labs and exercises, which are not part of the examination.
| | First deadline | Second deadline |
|---|---|---|
| Assignment 1: Sentiment Polarity for Movie Reviews | 22 April | 13 June |
| Assignment 2: Probabilistic Document Classification | 29 April (peer review), 6 May (final) | 13 June |
| Assignment 3: Recurrent Networks for Part-of-speech Tagging | 20 May | 13 June |
| Assignment 4: Gendered Directions in Embeddings | 3 June | 20 June |
| Ethics seminar | | |
If you miss a submission deadline or do not pass an assignment, you can re-submit your work up to the resubmission deadline. Please also take note of our general course assessment and examination policy. If there are special circumstances that make a regular submission impossible, please inform us in good time before the deadline.