7 Global Repositories Where You Can Upload Your Resources and Datasets
Introduction
The NLP community has grown rapidly over the past decade, with especially notable advances in the last five years. Open science, collaboration, and accessible research have driven the development of tools and solutions such as chatbots and translation systems that are now part of everyday life. As research outputs multiply, international repositories have become key spaces for sharing papers, datasets, and tools with a global audience. They not only provide visibility and ease of use but also offer standardised metadata and identifiers (such as DOIs) that ensure interoperability, reliable tracking, and better discoverability across search engines. For researchers working in diverse and distributed teams, these repositories are increasingly the go-to choice for hosting and preserving NLP resources.
Purpose of the guide
This guide highlights popular international repositories such as Zenodo and Hugging Face Hub, helping researchers, technologists, and academics make informed decisions about where to host their resources.
While these platforms provide broad visibility and access to a global audience, they may not always cater to the specific needs or community support required for Africa-focused research outputs.
Some popular global repositories
1. Zenodo
Zenodo is an open-access research data repository developed by CERN. It allows researchers to share, preserve, and cite datasets, software, and publications, and it assigns a DOI to each upload.
Target Audience: Researchers, academics, data curators, and institutions looking for long-term storage and citable references for their work.
Key Features
Open Access: Free repository for data, publications, and software.
DOI Assignment: Provides permanent identifiers for shared work.
Data Preservation: Long-term storage and accessibility guaranteed.
Integration: Supports GitHub integration for archiving releases.
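As a rough sketch of what preparing a Zenodo upload can look like, the snippet below builds the JSON metadata for a new deposition. It assumes Zenodo's REST deposit endpoint and its `metadata` fields (`title`, `upload_type`, `description`, `creators`). The helper name `build_zenodo_metadata` and the example record are hypothetical, and the current Zenodo API documentation should be checked before relying on any of this.

```python
import json

# Assumed endpoint for creating a new deposition (check Zenodo's current API docs).
ZENODO_DEPOSIT_URL = "https://zenodo.org/api/deposit/depositions"


def build_zenodo_metadata(title, description, creators, upload_type="dataset"):
    """Return the JSON-serialisable body for a new Zenodo deposition.

    `creators` is a list of (name, affiliation) tuples. This helper is
    hypothetical; only the field names follow Zenodo's deposit metadata.
    """
    if not title or not creators:
        raise ValueError("a title and at least one creator are required")
    return {
        "metadata": {
            "title": title,
            "upload_type": upload_type,
            "description": description,
            "creators": [
                {"name": name, "affiliation": affiliation}
                for name, affiliation in creators
            ],
        }
    }


# Example with made-up values; nothing is sent over the network here.
payload = build_zenodo_metadata(
    title="Example speech corpus",
    description="Illustrative record only.",
    creators=[("Doe, Jane", "Example University")],
)
print(json.dumps(payload, indent=2))
```

Sending this payload in an authenticated POST request to the deposit endpoint would create a draft record. The request itself is left out because it needs a personal access token.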
2. Hugging Face
Hugging Face is a specialised AI/ML platform that began with a focus on NLP and has since grown into a central hub for state-of-the-art machine learning models, datasets, and tools across domains like vision, audio, and multimodal learning.
Target Audience: Machine learning engineers, data scientists, researchers, developers, and organisations seeking to build, fine-tune, or deploy AI models.
Key Features
Model & Dataset Hub: 100,000+ pre-trained models and thousands of datasets.
Popular Libraries: Transformers, Datasets, and Tokenizers.
Community & Collaboration: Share, document, and co-develop models and data.
Deployment & Inference: Easy-to-use APIs, inference endpoints, and integrations for production.
Model Documentation: Emphasis on transparency through “model cards” and versioning.
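To make the "model card" idea concrete, here is a minimal sketch that renders one as text: a YAML front-matter block followed by a short Markdown body, which is the layout the Hub reads from a repository's `README.md`. The function name `build_model_card` and the example values are hypothetical. The metadata keys shown (`license`, `language`, `tags`) are common ones, not an exhaustive list.

```python
def build_model_card(model_name, language_codes, license_id, summary, tags=()):
    """Render a minimal model card: YAML front matter plus a short body."""
    lines = ["---", f"license: {license_id}", "language:"]
    lines += [f"- {code}" for code in language_codes]
    if tags:
        lines.append("tags:")
        lines += [f"- {tag}" for tag in tags]
    # Close the front matter, then add a title and a one-paragraph summary.
    lines += ["---", "", f"# {model_name}", "", summary, ""]
    return "\n".join(lines)


# Made-up example values for illustration only.
card = build_model_card(
    model_name="example-asr-model",
    language_codes=["sw", "yo"],
    license_id="apache-2.0",
    summary="Illustrative card only; describe training data, intended use, and limitations here.",
    tags=["automatic-speech-recognition"],
)
print(card)
```

The resulting text would be saved as `README.md` at the root of the model repository. Uploading it, for example with the `huggingface_hub` client library, requires authentication and is not shown.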
3. arXiv
arXiv is an open-access preprint repository for scholarly articles in fields like computer science, mathematics, physics, and more. It is widely used for sharing cutting-edge research before peer review.
Target Audience: Researchers, academics, and practitioners seeking to share or stay updated on the latest scientific advancements.
Key Features
Preprint Hosting: Upload and access manuscripts before formal publication.
Wide Coverage: Includes computer science, AI/ML, physics, and other STEM domains.
Open Access: Free to upload and download.
Version Control: Allows updates with clear version tracking.
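The version tracking mentioned above is visible in arXiv identifiers themselves: a new-style identifier has the form `YYMM.NNNNN` and may carry a `vN` suffix naming a specific version. The sketch below splits such an identifier into its base and version. The function name `split_arxiv_id` and the identifier used are purely illustrative, and older category-prefixed identifiers are deliberately not handled.

```python
import re

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
_ARXIV_ID = re.compile(r"^(\d{4}\.\d{4,5})(?:v(\d+))?$")


def split_arxiv_id(identifier):
    """Split an identifier into (base_id, version); version is None if absent."""
    match = _ARXIV_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a new-style arXiv identifier: {identifier!r}")
    base_id, version = match.groups()
    return base_id, int(version) if version else None


# Illustrative identifier, with and without a version suffix.
print(split_arxiv_id("2301.01234v2"))
print(split_arxiv_id("2301.01234"))
```

Citing the versioned form pins a reference to the exact revision that was read, while the bare identifier refers to the paper without naming a version.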
4. GitHub
GitHub is the world’s largest code hosting and collaboration platform, built around Git version control. It enables developers to manage, share, and collaborate on code and projects.
Target Audience: Software engineers, open-source contributors, research groups, and organisations managing codebases.
Key Features
Code Hosting: Public and private repositories with Git version control.
Collaboration: Pull requests, issues, and project boards.
Integrations: CI/CD workflows with GitHub Actions.
Community: Open-source contributions, discussions, and templates.
5. Kaggle
Kaggle is a data science platform owned by Google, offering datasets, competitions, and collaborative notebooks for building and sharing machine learning projects.
Target Audience: Data scientists, machine learning enthusiasts, students, and organisations sharing datasets or hosting challenges.
Key Features
Datasets: Thousands of public datasets for ML experimentation.
Competitions: Machine learning contests with real-world problems.
Kaggle Notebooks: Free cloud-based coding environment (Python/R).
Community: Active forums, tutorials, and kernels for knowledge sharing.
6. Mendeley Data
Mendeley Data is a secure cloud-based repository by Elsevier that allows researchers to share, preserve, and cite datasets across disciplines.
Target Audience: Researchers, academics, and institutions that need to publish, share, or manage datasets.
Key Features
Data Hosting: Store and share datasets in any format (up to 10GB per dataset, as listed in the comparison table).
DOI Assignment: Each dataset receives a unique DOI for citation.
Versioning: Track dataset updates with clear version history.
Access Control: Choose between public sharing or private collaboration.
7. OpenSLR
OpenSLR (Speech and Language Resources) is a specialised repository for open-source datasets, lexicons, and tools in speech and language processing.
Target Audience: Researchers, developers, and organisations working in speech recognition, TTS, and NLP with speech components.
Key Features
Speech Datasets: Large open-source speech corpora for training models.
Lexicons & Tools: Language-specific lexicons and processing tools.
Open Access: Freely available for research and development.
Community Contribution: Encourages dataset sharing and expansion.
Comparison Table
The comparison table below covers the seven repositories described above and adds a few other popular ones: Common Voice, the Open Science Framework (OSF), and Figshare.
| Repository | Primary purpose | Target audience | Size limits | Versioning | Community |
| --- | --- | --- | --- | --- | --- |
| Hugging Face | Hosting & sharing ML models, datasets, tools | ML engineers, researchers, developers | ~5GB per file (larger via Git LFS) | Yes | Very large, active AI community |
| Zenodo | Research data, software, and publication archiving | Researchers, academics, institutions | Up to 50GB per dataset | Yes | Large academic user base |
| GitHub | Code hosting, collaboration, version control | Developers, organizations, OSS communities | ~2GB per repo (100MB/file soft limit) | Yes (Git built-in) | Largest dev community worldwide |
| arXiv | Preprint repository for academic papers | Researchers, academics, students | PDFs & LaTeX (no strict GB limit) | Versioned updates | Very large, global academic base |
| Kaggle | Datasets, ML competitions, collaborative notebooks | Data scientists, students, researchers | 20GB dataset upload limit | Yes (dataset updates) | Very large, global data science community |
| Mendeley Data | Dataset storage & citation with DOIs | Researchers, academics | Up to 10GB per dataset | Yes | Medium academic base |
| OpenSLR | Speech & language resources (datasets, lexicons, tools) | Speech/NLP researchers, developers | No strict stated limit | Limited | Niche but important in speech |
| Common Voice (Mozilla) | Crowdsourced speech dataset collection for ASR | Researchers, developers, language communities | Large dataset downloads (100s GB) | Versioned releases | Large, global open-source community |
| Open Science Framework (OSF) | Project management & data sharing platform | Researchers, institutions, labs | 5GB per file (unlimited projects) | Yes | Large, growing academic network |
| Figshare | Research data & publication repository with DOIs | Researchers, universities, funders | Up to 5GB per file (larger via institutional accounts) | Yes | Large global research community |
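One practical way to use the size limits in the comparison is as a first filter when a dataset is large. The sketch below is only an illustration: the names `STATED_LIMITS_GB` and `repositories_that_fit` are hypothetical, the figures are copied from the table as stated and may be out of date, and the limits are not strictly comparable because some apply per file and others per dataset or per repository.

```python
# Size limits in GB as stated in the comparison table; the scope
# (per file, per dataset, per repo) differs between platforms.
STATED_LIMITS_GB = {
    "Hugging Face": (5, "per file"),
    "Zenodo": (50, "per dataset"),
    "GitHub": (2, "per repo"),
    "Kaggle": (20, "per dataset"),
    "Mendeley Data": (10, "per dataset"),
    "Open Science Framework (OSF)": (5, "per file"),
    "Figshare": (5, "per file"),
}


def repositories_that_fit(size_gb):
    """Return names of repositories whose stated limit is at least `size_gb`."""
    return sorted(
        name
        for name, (limit_gb, _scope) in STATED_LIMITS_GB.items()
        if size_gb <= limit_gb
    )


# Which platforms list a limit of at least 8 GB?
print(repositories_that_fit(8))
```

Repositories without a numeric limit in the table (arXiv, OpenSLR, Common Voice) are left out of the lookup, so a result from this filter is a starting point for checking each platform's current policy, not a final answer.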
Choosing the right repository is important for maximising the visibility and impact of your research. While international platforms like Zenodo, Hugging Face, and GitHub offer broad reach, standardised citation, and valuable integration options, they might not be the best choice for Africa-focused research and community building.
About Lanfrica
Lanfrica is an open, centralised catalogue dedicated to African language technology resources. We link papers, datasets, benchmarks, models, and research projects, making it significantly easier for researchers, academics, technologists, and policy makers to discover and reuse them. Instead of searching across multiple scattered repositories, users can browse our platform to find African-language models, corpora, or evaluation sets, accelerating collaboration and innovation across the continent.