How do I find datasets for my research?
Answer
You can access datasets for research through open data platforms, AI-focused repositories, and library subscriptions such as the Linguistic Data Consortium (LDC).
Open Access Dataset Sources
- Google Dataset Search: Find datasets hosted across the web in multiple disciplines.
- Kaggle: Large collection of datasets for machine learning and data science.
- Zenodo: Research datasets shared by individuals and institutions.
- Papers with Code: Browse datasets linked to research papers in AI and ML.
- UCI Machine Learning Repository: Classic source for structured ML datasets.
MBZUAI Library Subscription: Linguistic Data Consortium (LDC)
The MBZUAI Library provides access to the Linguistic Data Consortium (LDC), a leading source of high-quality datasets used in:
- Natural Language Processing (NLP)
- Speech recognition
- Language annotation
- Machine translation
- Computational linguistics research
MBZUAI students and faculty have access to the LDC corpora from 2021 onward. Click here to learn how to access LDC resources.
If you need a dataset that is not available through our subscription, please email the details (dataset name and your research purpose) to libraryservices@mbzuai.ac.ae.