While I am still affiliated with The University of Melbourne and actively
engaged with various ongoing activities, I'm currently based
at MBZUAI and unable to take on any new
students or projects at Melbourne.
Prospective research students/interns, please read this before contacting
me.
Major Projects
Present
- AI for Legal Problem Diagnosis in the Diverse Language of Australians (ARC Linkage Project, 2023–2026)
Past
- Fairness in Natural Language Processing (ARC Discovery
Project, 2020–2023)
- ARC Centre in Cognitive Computing for Medical Technologies (ARC
ITRP, 2018–2023)
- Biochemical text mining for advancing chemical and pharmaceutical
knowledge (ARC Linkage Project, 2018–2022)
- Whopping Volta GPU Cluster (ARC LIEF Project, 2020)
- Making Computers Understand Common Language about Place (ARC
Discovery Project, 2017–2019)
- User-Adaptive Search and Evaluation for Complex Information-Seeking
Tasks (ARC Linkage Project, 2016–2019)
- Personalised Topic Modelling and Sentiment Analysis for Enhanced
Information Discovery over Document Streams (ARC Linkage Project,
2013–2017)
- VetCompass: Big Data and Real-time Surveillance for Veterinary
Science (ARC LIEF Project, 2016)
- Information access through web-scale question-answer pair
finding, ranking and matching (ARC Future Fellowship,
2013–2016)
- Talking about Place — Tapping Human Knowledge to Enrich
National Spatial Data Sets (ARC Linkage Project,
2011–2014)
- Principles, Practice, and Pragmatics of Measurement in Experimental
Computer Science (ARC Discovery Project, 2011–2014)
- NICTA
biomedical text mining (with Verspoor, Cavedon, Zobel, Moffat et al.)
- OLE (ARC Discovery Project, with Bird: Online Linguistic Exploration: Deeper,
Faster, Broader Language Documentation, 2009–2011)
- Kubadji (ARC Discovery Project, with
Zukerman, Sonenberg, Balbo and Bird: Personalised Content Delivery
for Assisted Navigation of Information Rich, Physical Environments such
as a Museum, 2007–2010)
- Web-scale Language Identification: All Languages Great and Small
(Google Research Award, 2008–2009)
- Multilingual Unsupervised Parse Selection (Microsoft Research Asia Research
Award, 2009–2010)
- Web User Forum Text Analysis (Microsoft Research Asia Research
Award, 2008–2009)
- Information Delivery from Segmented Textual Data Streams (ARC
Discovery Project: 2006–2008)
- Scalable Language Understanding for Japanese (joint research project
with NTT Communication Science Labs., 2006–2008)
- Interactive Information Discovery and Delivery (NICTA project, with
Cavedon, Stokes, Bird, Moffat, et al., 2005–2007)
- An Intelligent Search Infrastructure for Language Resources on the
Web (ARC e-Research Special Research Initiative, with Bird and Hughes, 2006)
- Feature-rich Word Sense Disambiguation and Unknown Word
Bootstrapping (joint research project with NTT Communication Science
Labs., 2004–2006)
Publications
See my publications page for a
reasonably up-to-date list of my papers (with links to most papers). My Google
Scholar profile and Semantic
Scholar profile are also a reasonably accurate snapshot of my
publication output.
Talks
Slides from recent(ish) talks:
Resources
Old Software (not updated since 2016; more recent repositories
are linked from associated papers)
- pigeo: an
automatic geotagging tool, based on text- and graph-based methods
[developed in collaboration with Afshin Rahimi and Trevor Cohn]
- LexSemTM: code for
training topic models to estimate word sense distributions at scale
[developed in collaboration with Andrew Bennett, Jey Han Lau, Francis Bond
and Diana McCarthy]
- polyglot: a language
identification toolkit for multilingual documents [developed in
collaboration with Marco Lui and Jey Han Lau]
- Toolkit for
evaluating topic coherence and topic model quality: toolkit for
evaluating the semantic coherence of individual topics and overall topic
models, as described in our EACL 2014 paper [developed in
collaboration with Jey Han Lau and Dave Newman]
- Twitter user
geolocator: trained models and full code to replicate the Twitter user
geolocation experiments published in our ACL 2013 demo paper [developed in
collaboration with Bo Han and Paul Cook]
- HDP-based word sense
induction system: toolkit for inducing word senses based on a
Hierarchical Dirichlet Process (HDP) [developed in collaboration with Jey
Han Lau and Paul Cook]
- On-line Topic Modeller:
implementation of an on-line topic modeller for trend analysis [developed
in collaboration with Jey Han Lau]
- langid.py: fast, accurate standalone
language identification toolkit; also versions of the pre-trained
identifier in C and Javascript [developed in
collaboration with Marco Lui]
- SiteScraper:
automatically scrapes data from websites based on a handful of sample
URLs and strings of interest [developed in collaboration with Richard
Penman and David Martinez]
- Hydrat:
Python library for text categorisation/language identification
[developed in collaboration with Marco Lui]
- Malay
tokeniser/lemmatiser: lex/perl tools for tokenising and lemmatising
Malay text
Old Models/Datasets (up until 2016; for more recent work, see
the associated paper(s) for links)
- Pre-trained doc2vec models
for English (described in Lau and Baldwin, 2016)
- LexSemTM
(largely) all-vocabulary trained topic models for English (described
in Bennett et al., 2017)
- CQADupStack
dataset of duplicate questions from StackExchange (described in Hoogeveen et al., 2015)
- Financial agreement named entity
dataset (described in Salinas et al., 2015)
- City label set for user
geolocation (described in Han et al.,
2014; with thanks to Mark Dredze)
- W-NUT 2015
Shared Task on Lexical Normalisation for English Tweets (described in
Baldwin et al., 2015)
- Locative Expressions in Social
Media Text dataset (described in Liu et al., 2014)
- Novel sense dataset
(described in Cook et
al., 2014)
- Twitter and Web lexical sample
sense annotations (described in Gella et al.,
2014)
- Twituser language
identification dataset (described in Lui and Baldwin, 2014)
- Topics
annotated for observed coherence (described in Lau et al., 2014)
- Multilingual language
identification dataset (described in Lui et
al., 2014)
- Lexical normalisation
dictionary (described in Han et al., 2012)
- Japanese
SemCor (described in Bond et al., 2012)
- Multi-domain language
identification dataset (from Lui and Baldwin, 2011)
- Topic
label dataset (described in Lau et al., 2011)
- Lexical normalisation dataset
(described in Han and
Baldwin, 2011, incorporating corrections from Jacob Eisenstein); (old
version: v1.1)
- Multilingual
language identification dataset (as used in the ALTA-2010
Shared Task, and described in Baldwin
and Lui, 2010)
- Web
user forum thread and post structure dataset (described in Kim et al., 2010 and
Wang
et al., 2010)
- Topic
coherence topics and human judgements (described in Newman et al., 2012)
- Language
identification dataset (described in Baldwin and Lui, 2010)
- Case
and punctuation restoration dataset (described in Baldwin and Joseph, 2009)
- Satire
document collection (described in Burfoot and
Baldwin, 2009)
- Tagalog
predicate-argument parsing dataset (described in Mistica and
Baldwin, 2009)
- Pooled kanji
similarity dataset (described in Yencken and Baldwin,
2008)
- Noun-noun
compound semantic relations (described in Kim and
Baldwin, 2008)
- Compound
nominalisation interpretation (described in Nicholson
and Baldwin, 2008)
- Deep
lexical acquisition of English verb-particle constructions (described
in Baldwin,
2008)
- Parsing and WSD dataset (described in Agirre et
al., 2008) — email me for access details
- Kanji
similarity dataset (described in Yencken
and Baldwin, 2006)
- Japanese
grapheme-phoneme alignment data (described in Baldwin and
Tanaka, 1999)
Miscellaneous
Teaching
Past
- COMP10001 Foundations of Computing (Semester 1,
2012—2021; variously co-lectured with Andrew Turpin, Egemen
Tanin, Nic Geard, and Marion Zalk)
- Lecture series
on User
Generated Content presented as part of
the first Advanced Language Processing
School (ALPS) (2021)
- COMP30027 Machine Learning (Semester 1, 2017 and 2019; co-lectured with Karin
Verspoor, Afshin Rahimi, and Jeremy Nicholson)
- Lecture series on Social
Media and Text Analytics presented as part of the International Summer School on
Web Science and Technology (2016)
- Series of guest lectures on Text Analysis of Social Media
presented as part of Language Technology II (Saarland University, August,
2014)
- COMP90051 Statistical and Evolutionary Learning (Semester 2, 2011;
co-lectured with Michael Kirley)
- COMP30018 Knowledge Technologies (2010—2012)
- INFO10001 Informatics 1 (2008—2011)
- Empirical
Approaches to Multilingual Lexical Acquisition (Saarland University,
Winter, 2008; taught as part of the Erasmus Mundus LCT Masters program)
- 433-352 Data on the Web (2006—2009)
- 433-484/684 Machine Learning (2006–2008)
- ESSLLI 2006 course on
Data-Driven Methods for Acquiring Linguistic Information (with Aline
Villavicencio, Anna Korhonen and Valia Kordoni)
- Lexical semantics course convener and co-lecturer for ACL/HCSNet Advanced
Program in Natural Language Processing (2006)
- 433-253 Algorithms and Data Structures (2006—2007; co-lectured
with Linda Stern)
- 433-395 Advanced Topic in Computer Science (2005)
- 433-680 Machine Learning (2005)
- An Introduction to Computational Word Learning (Stanford
University, Fall Quarter, 2003)
Staff
Present
- Xudong Han (Research Fellow 2023—, MBZUAI)
- Masahiro Kaneko (Research Fellow 2023—, MBZUAI)
- Kemal Kurniawan (Research Fellow 2023—, The University of Melbourne)
- Haonan Li (Research Fellow 2022—, MBZUAI)
- Artem Shelmanov (Research Fellow 2022—, MBZUAI)
- Zeerak Talat (Research Fellow 2023—, MBZUAI)
Past
- Fajri Koto (Research Fellow 2022—2024, MBZUAI)
- Simon Šuster (Research Fellow 2020—2024, The University of Melbourne)
- Biaoyan Fang (Research Fellow 2022—2023)
- Victor Fedyashov (Research Fellow 2019—2022)
- Aili Shen (Research Fellow 2020—2022)
- Meladel Mistica (Research Fellow 2019—2021)
- Shivashankar Subramanian (Research Fellow 2020—2021)
- Afshin Rahimi (Research Fellow 2018—2019)
- Bahar Salehi (Research Fellow 2017—2019)
- Julian Brooke (McKenzie Postdoctoral Fellow 2015—2017)
- Andrew Bennett (Research Associate 2016—2017)
- Huizhi Liang (Research Fellow 2014—2016)
- Joel Nothman (Research Fellow 2015—2016)
- Angelos Molfetas (Research Fellow 2014)
- Yvette Graham (Research Fellow 2012—2014)
- Paul Cook (McKenzie Postdoctoral Fellow 2011—2014)
- Jey Han Lau (Research Fellow 2013)
- Rebecca Dridan (Research Fellow working on OLE 2009—2011)
- Gintarė Grigonytė (Visiting Research Fellow 2011)
- Su Nam Kim
(Research Fellow working on LangID and ILIAD 2009—2010)
- Patrick Ye (Research Fellow working on Kubadji 2009—2010)
- David Martinez (Research Fellow working on ILIAD 2007—2009)
- Marco Lui (Research Assistant working on ILIAD and LangID 2009—2010)
- Richard Penman (Research Assistant working on ILIAD 2008—2009)
- Shlomo Berkovsky (Research Fellow 2007—2008)
- Kapil Gupta (Research Fellow 2009)
Students
Present
- Sayantan Dasgupta (PhD student at The University of Melbourne; co-supervised with Trevor Cohn)
- Junjie Gao (MSc student at MBZUAI)
- Inoue Go (PhD student at MBZUAI; co-supervised with Nizar Habash)
- Anirudh Joshi (PhD student at The University of Melbourne; co-supervised with Richard Sinnott and
Cecile Paris)
- Mark Summerfield (PhD student at The University of Melbourne; co-supervised with Andrew Christie)
- Hung Thinh Truong (PhD student at The University of Melbourne; co-supervised with Karin
Verspoor and Trevor Cohn)
- Gisela Vallejo (PhD student at The University of Melbourne; co-supervised with Lea Frermann)
- Dalin Wang (PhD student at The University of Melbourne; co-supervised with Ed Hovy)
- Renxi Wang (MSc student at MBZUAI)
- Rui Xing (PhD student at The University of Melbourne; co-supervised with Jey Han Lau)
- Jinrui Yang (PhD student at The University of Melbourne; co-supervised with Trevor Cohn)
- Xiyuan (Emma) Zhang (PhD student at The University of Melbourne; co-supervised with Noel Faux
and Ben Goudey)
Past
- Yichen Huang (MSc student at MBZUAI)
- Qisheng Liao (MSc student at MBZUAI)
- Shraey Bhatia (PhD student at The University of Melbourne; co-supervised with Jey Han Lau)
- Takashi Wada (PhD student at The University of Melbourne; co-supervised with Jey Han Lau)
- Yulia Otmakhova (PhD student at The University of Melbourne; co-supervised with Karin Verspoor
and Jey Han Lau)
- Xudong Han (PhD student; co-supervised with Trevor Cohn)
- Yuxia Wang (PhD student; co-supervised with Karin Verspoor)
- Haonan Li (PhD student; co-supervised with Martin Tomko and Maria Vasardani)
- Fajri Koto (PhD student; co-supervised with Jey Han Lau)
- Biaoyan Fang (PhD student; co-supervised with Karin Verspoor)
- Brian Hur (PhD student; co-supervised with James Gilkerson, Laura
Hardefeldt, and Karin Verspoor)
- Yingrui Zhang (MDSc student; co-supervised with Victor Fedyashov and Ben Goudey)
- Andrew Shen (undergrad student; co-supervised with Jey Han Lau and
Fajri Koto)
- Nitika Mathur (PhD student; co-supervised with Trevor Cohn)
- Qinyu Bai (MDSc student; co-supervised with Victor Fedyashov and Ben Goudey)
- Siyang Wang (MSc(CS) student; co-supervised with Simon Šuster)
- Chenbang Huang (MSc(CS) student; co-supervised with Aili Shen)
- Qian Sun (MSc(CS) student; co-supervised with Aili Shen)
- Wayan Oger Vihikan (MIT student; co-supervised with Meladel Mistica)
- Fan Ye (MSc(CS) student; co-supervised with Simon Šuster)
- Shuanglong You (MSc(CS) student; co-supervised with Victor Fedyashov)
- Aili Shen (PhD student; co-supervised with Jianzhong Qi and
Bahar Salehi)
- Shivashankar Subramanian (PhD student; co-supervised with Trevor
Cohn)
- Saumya Pandey (MSc(CS) student; co-supervised with Lea Frermann)
- Haowen Tang (MSc(CS) student)
- Gaurav Arora (MSc(CS) student; co-supervised with Afshin Rahimi)
- Yitong Li (PhD student; co-supervised with Trevor Cohn)
- Fei Liu (PhD student; co-supervised with Trevor Cohn)
- Adel Foda (PhD student; co-supervised with Jey Han Lau)
- Tatsuya Aoki (visiting PhD student from Tokyo Institute of Technology)
- Leo Bouillet (MSc(CS) student)
- Jun Wang (MSc(CS) student; co-supervised with Graeme Gange)
- Karen Qu (MIT student; co-supervised with Afshin Rahimi)
- Ekaterina Vylomova (PhD student; co-supervised with Trevor Cohn)
- Navnita Nandakumar (MSc(CS) student; co-supervised with Bahar Salehi)
- Qianji Di (MIT student; co-supervised with Ekaterina Vylomova)
- Jinxiang Wang (MSc(CS) student)
- Doris Hoogeveen (PhD student; co-supervised with Karin Verspoor)
- Afshin Rahimi (PhD student; co-supervised with Trevor Cohn)
- Ned Letcher (PhD student; co-supervised with Emily Bender)
- Jingyuan Zhang (MIT student)
- Jim Breen (PhD
student; co-supervised with Francis Bond)
- Steven Xu (MSc(CS) student; co-supervised with Jey Han Lau)
- Katharine Cheng (MSc(CS) student; co-supervised with Karin Verspoor)
- Richard Fothergill (PhD student — currently working at
rome2rio)
- Viet Nguyen (MSc(CS) student; co-supervised with Julian Brooke)
- Shraey Bhatia (MSc(CS) student; co-supervised with Jey Han Lau
— currently a PhD student at The University of Melbourne)
- King Chan (completed PGDip studies in 2017; co-supervised with Julian Brooke)
- Ionut-Teodor Sorodoc (visiting Masters student in 2016; co-supervised with
Jey Han Lau)
- Bahar Salehi (completed PhD 2016; co-supervised with Paul Cook
— currently working at Go1)
- Andrew Bennett (completed MSc(CS) 2016; co-supervised with Jey Han Lau,
Francis Bond, Diana McCarthy and Paul Cook — currently a PhD student
at Cornell University)
- Liang Han (completed MIT 2016)
- Michael Niemann
(completed PhD 2015; co-supervised with Henry Linger — currently
working at Monash University)
- Julio Salinas (completed MIT 2015; co-supervised with Karin Verspoor)
- Marco Lui (completed PhD 2015 — currently working at
rome2rio)
- Li Wang (completed PhD 2015;
co-supervised with Su Nam Kim — currently working at Dropbox)
- Nitika Mathur (completed MSc(CS) 2014; co-supervised with Yvette Graham)
- Bo
Han (completed PhD 2014; co-supervised with Paul Cook)
- Xiwei Wang (completed MSc(CS) 2014; co-supervised with Yvette
Graham — currently working at Alibaba)
- Andrew Chester (completed MSc(CS) 2014; co-supervised with Tony
Wirth)
- Jared Willett (completed MSc(CS) 2012; co-supervised with David Martinez
and Angus Webb)
- Siming Wang (completed PGDip 2013; co-supervised with Alistair
Moffat)
- Meladel Mistica (completed PhD 2013; external supervisor —
currently working at The University of Melbourne)
- Spandana Gella (completed MSc(CS) 2013; co-supervised with Paul
Cook — currently working at Amazon)
- Jey Han Lau (completed PhD 2013; co-supervised with Dave Newman —
currently working at The University of Melbourne)
- Clint Burford (completed PhD 2013; co-supervised with Steven Bird
— currently working at Apple)
- Willy Yap (completed PhD 2013; co-supervised with Tara McIntosh —
currently working at Sportsbet)
- Luke Parkinson (MSc(CS); co-supervised with Paul Cook)
- Matěj Korvas (completed MSc(CS) 2012)
- Igor Tytyk (completed MSc(CS) 2012 —
currently working at Grammarly)
- Andrew MacKinlay (completed PhD 2012 —
currently working at Culture Amp)
- Karl Grieser (completed PhD 2012 —
currently working at Redbubble)
- Ned Letcher (completed BSc(Hons) 2010)
- Lars Yencken
(completed PhD 2010)
- Marco Lui (completed BCS(Hons) 2009 —
currently working at Rome2rio)
- Ben White (completed MIT 2009)
- Li Wang (completed MIT 2009)
- Patrick Ye (completed PhD 2009 —
currently working at Amazon)
- Lejoe Kuriakose (completed MEDC 2008)
- Paul Joseph (completed MSSE 2008)
- Su Nam Kim (completed PhD 2008)
- Michael Yang (completed BCS(Hons) 2007)
- Sumukh Ghodke (completed MSSE 2007)
- Phil Blunsom (completed PhD 2007 —
currently working at Oxford University/Cohere AI)
- Edward Ivanovic (completed MPhil 2007)
- Aidan Furlan (completed BCS(Hons) 2006)
- Karl Grieser (completed BSc(Hons) 2006)
- Rebecca Dridan (completed MPhil 2006)
- Jeremy Nicholson (completed BCS(Hons) 2005)
Interested in pursuing natural language processing research at The
University of Melbourne? Contact me directly, making sure to include a CV
and description of your research interests.
Professional Activities
Present
- Permanent Member of the International Committee on Computational Linguistics (2014—)
- Advisory Board for ACL
SIGLEX (Special Interest Group on the Lexicon) (2014—)
- Advisory Board for ACL
SIGDAT (Special Interest Group for linguistic data and corpus-based approaches to NLP) (2014—)
- Editorial board of Transactions of the Association for Computational
Linguistics (2015—) and ACL Rolling Review (2021—)
Past (highlights)
Random Miscellanea
In the media (up until 2021; not updated since moving to MBZUAI):
- Ingenium: Building
fairness into AI from the ground-up (12/8/2021)
- TOPBOTS: GPT-3 & Beyond: 10 NLP Research Papers You Should Read (17/11/2020)
- The
Australian: Engineering
& Computer Science Australia’s Research Field Leaders
(23/9/2020)
- ABC News: Donald
Trump, QAnon and the limit of Twitter's crackdown on conspiracies (2/9/2020)
- Slator: And
the Winner Is ... ACL 2020 Announces Best Paper Awards (9/7/2020)
- IEEE Spectrum: This
AI Poet Mastered Rhythm, Rhyme, and Natural Language to Write Like
Shakespeare (30/4/2020)
- The Australian: Engineering & Computer Science Australia’s Research Field Leaders (10/6/2019)
- Australian Financial
Review: IBM
to build $10 million AI centre with Melbourne Uni (10/6/2019)
- New
Scientist News, The
Times, Daily Mail, Digital Trends, NVIDIA, InfoSurHoy, la Repubblica, BBC Radio 4: Deep-speare — A joint neural model of
poetic language, meter and rhyme (7/2018)
- ABC
News: Can we Replace Red Symons with a Robot? (3/10/2017)
- Crikey:
How can You Tell if a Tweet is Credible? (6/3/2017)
- Farrago:
The Revolution Will Be Computerised (29/8/2016)
- Tech
Exec: The Fourth Revolution: Artificial Intelligence
(29/1/2016)
- MIT
Technology Review: King – Man + Woman = Queen: The Marvelous
Mathematics of Computational Linguistics (17/9/2015)
- NCI
News: Real-time Twitter Mining (30/9/2014)
- The
Age: The Rise of Artificial Intelligence (23/1/2014)
- Oregonian:
Tweet Talk (24/6/2011)
- Sydney
Morning Herald: Big Brains Coming back to Melbourne (6/12/2005)
- UniNews:
Reversing the Brain Drain (14/11/2005)
In a moment of weakness, I signed up for LinkedIn.
For the trivia lovers, here is my (almost certainly outdated) full CV.