7 Global Repositories Where You Can Upload Your Resources and Datasets

Introduction

The NLP community has grown rapidly over the past decade, with especially notable advances in the last five years. Open science, collaboration, and accessible research have driven the development of tools and solutions such as chatbots and translation systems that are now part of everyday life. As research outputs multiply, international repositories have become key spaces for sharing papers, datasets, and tools with a global audience. They not only provide visibility and ease of use but also offer standardised metadata and identifiers (such as DOIs) that ensure interoperability, reliable tracking, and better discoverability across search engines. For researchers working in diverse and distributed teams, these repositories are increasingly the go-to choice for hosting and preserving NLP resources.

Purpose of the guide

This guide highlights popular international repositories such as Zenodo and Hugging Face Hub, helping researchers, technologists, and academics make informed decisions about where to host their resources.

While these platforms provide broad visibility and access to a global audience, they may not always cater to the specific needs or community support required for Africa-focused research outputs.

Some popular global repositories

Zenodo

Zenodo is an open-access research data repository developed by CERN. It allows researchers to share, preserve, and cite datasets, software, and publications with DOI assignments.

Target Audience: Researchers, academics, data curators, and institutions looking for long-term storage and citable references for their work.

Key Features

Open Access: Free repository for data, publications, and software.
DOI Assignment: Provides permanent identifiers for shared work.
Data Preservation: Long-term storage and accessibility guaranteed.
Integration: Supports GitHub integration for archiving releases.
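As a rough illustration of what preparing a citable upload can involve, the sketch below assembles the metadata for a new Zenodo deposit. It assumes Zenodo's REST deposit API and its field names (`title`, `upload_type`, `description`, `creators`); the helper name `build_deposit_metadata` and all sample values are hypothetical, and the network call is left as a comment because it needs a personal access token.

```python
import json

# Assumed endpoint of Zenodo's REST deposit API; check the current docs before use.
ZENODO_DEPOSIT_URL = "https://zenodo.org/api/deposit/depositions"


def build_deposit_metadata(title, description, creators, upload_type="dataset"):
    """Return the JSON-ready body for a new deposit (hypothetical helper)."""
    if not creators:
        raise ValueError("at least one creator is required")
    return {
        "metadata": {
            "title": title,
            "upload_type": upload_type,
            "description": description,
            "creators": [{"name": name} for name in creators],
        }
    }


payload = build_deposit_metadata(
    title="Example parallel corpus",  # placeholder title
    description="Sentence pairs for machine translation.",
    creators=["Doe, Jane"],  # placeholder author
)
print(json.dumps(payload, indent=2))

# Sending it would look roughly like this (needs the third-party `requests`
# package and a personal access token stored in TOKEN):
# requests.post(ZENODO_DEPOSIT_URL, params={"access_token": TOKEN}, json=payload)
```

Keeping the metadata builder separate from the HTTP call makes it easy to review the record before anything is published.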

Hugging Face Hub

Hugging Face is a specialized AI/ML platform that began with a focus on NLP and has since grown into a central hub for state-of-the-art machine learning models, datasets, and tools across domains like vision, audio, and multimodal learning.

Target Audience: Machine learning engineers, data scientists, researchers, developers, and organisations seeking to build, fine-tune, or deploy AI models.

Key Features

Model & Dataset Hub: 100,000+ pre-trained models and thousands of datasets.
Popular Libraries: Transformers, Datasets, and Tokenizers.
Community & Collaboration: Share, document, and co-develop models and data.
Deployment & Inference: Easy-to-use APIs, inference endpoints, and integrations for production.
Model Documentation: Emphasis on transparency through “model cards” and versioning.
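To make the documentation point concrete, here is a minimal sketch of generating a dataset card (a README.md with YAML front matter) and pushing it to a dataset repository. The function names and sample values are hypothetical. The upload step assumes the `huggingface_hub` package and an authenticated session, so it is wrapped in a function that is defined but not called here.

```python
def build_dataset_card(pretty_name, languages, license_id, summary):
    """Return README.md text: YAML front matter followed by a short description."""
    lines = ["---", f"pretty_name: {pretty_name}", f"license: {license_id}", "language:"]
    lines += [f"- {code}" for code in languages]
    lines += ["---", "", f"# {pretty_name}", "", summary, ""]
    return "\n".join(lines)


def upload_card(card_text, repo_id):
    """Push the card to a dataset repo; needs `pip install huggingface_hub` and a login."""
    from huggingface_hub import HfApi  # imported lazily so the builder works offline

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=card_text.encode("utf-8"),
        path_in_repo="README.md",
        repo_id=repo_id,
        repo_type="dataset",
    )


card = build_dataset_card(
    pretty_name="Example Swahili Corpus",  # placeholder name
    languages=["sw"],  # ISO 639-1 code for Swahili
    license_id="cc-by-4.0",
    summary="A small illustrative text corpus.",
)
print(card)
```

A call such as `upload_card(card, "your-username/example-swahili-corpus")` would publish the card, where the repository name is again a placeholder.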

arXiv

arXiv is an open-access preprint repository for scholarly articles in fields like computer science, mathematics, physics, and more. It is widely used for sharing cutting-edge research before peer review.

Target Audience: Researchers, academics, and practitioners seeking to share or stay updated on the latest scientific advancements.

Key Features

Preprint Hosting: Upload and access manuscripts before formal publication.
Wide Coverage: Includes computer science, AI/ML, physics, and other STEM domains.
Open Access: Free to upload and download.
Version Control: Allows updates with clear version tracking.
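Version tracking is visible in the identifiers themselves: a revised submission keeps its identifier and gains a suffix such as `v2`. The sketch below splits an identifier into its base and version. It assumes the new-style identifier format (four digits, a dot, then four or five digits); the function name `split_arxiv_id` is hypothetical.

```python
import re

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN,
# optionally followed by a version suffix such as v2.
ARXIV_ID = re.compile(r"^(?P<base>\d{4}\.\d{4,5})(?:v(?P<version>\d+))?$")


def split_arxiv_id(identifier):
    """Return (base_id, version); version is None when no suffix is given."""
    match = ARXIV_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a new-style arXiv identifier: {identifier!r}")
    version = match.group("version")
    return match.group("base"), int(version) if version else None


print(split_arxiv_id("1706.03762v5"))
print(split_arxiv_id("1706.03762"))
```

Citing the versioned form pins a reference to one specific revision, while the bare identifier points to the latest one.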

GitHub

GitHub is the world’s largest code hosting and collaboration platform, built around Git version control. It enables developers to manage, share, and collaborate on code and projects.

Target Audience: Software engineers, open-source contributors, research groups, and organizations managing codebases.

Key Features

Code Hosting: Public and private repositories with Git version control.
Collaboration: Pull requests, issues, and project boards.
Integrations: CI/CD workflows with GitHub Actions.
Community: Open-source contributions, discussions, and templates.

Kaggle

Kaggle is a data science platform owned by Google, offering datasets, competitions, and collaborative notebooks for building and sharing machine learning projects.

Target Audience: Data scientists, machine learning enthusiasts, students, and organizations sharing datasets or hosting challenges.

Key Features

Datasets: Thousands of public datasets for ML experimentation.
Competitions: Machine learning contests with real-world problems.
Kaggle Notebooks: Free cloud-based coding environment (Python/R).
Community: Active forums, tutorials, and kernels for knowledge sharing.
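For readers who prefer scripted uploads, the sketch below writes the `dataset-metadata.json` file that the Kaggle command-line tool expects next to the data files. The field layout (`title`, `id`, `licenses`) is an assumption based on the template the Kaggle CLI generates; the function name, owner, and slug are placeholders.

```python
import json
import os
import tempfile


def write_kaggle_metadata(folder, owner, slug, title, license_name="CC0-1.0"):
    """Write dataset-metadata.json into `folder` and return its path."""
    metadata = {
        "title": title,
        "id": f"{owner}/{slug}",
        "licenses": [{"name": license_name}],
    }
    path = os.path.join(folder, "dataset-metadata.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2)
    return path


# Demonstration in a temporary folder so nothing is left on disk.
with tempfile.TemporaryDirectory() as folder:
    metadata_path = write_kaggle_metadata(
        folder, "example-user", "example-corpus", "Example Corpus"
    )
    with open(metadata_path, encoding="utf-8") as handle:
        written = json.load(handle)

print(written)
# The folder could then be published with the Kaggle CLI:
#   kaggle datasets create -p <folder>
```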

Mendeley Data

Mendeley Data is a secure cloud-based repository by Elsevier that allows researchers to share, preserve, and cite datasets across disciplines.

Target Audience: Researchers, academics, and institutions that need to publish, share, or manage datasets.

Key Features

Data Hosting: Store and share datasets in a wide range of formats, subject to a per-dataset size limit (see the comparison table).
DOI Assignment: Each dataset receives a unique DOI for citation.
Versioning: Track dataset updates with clear version history.
Access Control: Choose between public sharing or private collaboration.
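A DOI becomes a clickable, persistent link once it is placed behind the `https://doi.org/` resolver. The sketch below normalises a DOI written in a few common styles and returns that link. The function name `doi_to_url` is hypothetical, the pattern is a loose check rather than a full validator, and the sample DOIs are made up.

```python
import re

# Loose shape check: "10.", a numeric registrant prefix, a slash, then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi):
    """Normalise a DOI string and return its https://doi.org resolver link."""
    cleaned = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return f"https://doi.org/{cleaned}"


print(doi_to_url("doi:10.1234/example.dataset.1"))
```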

OpenSLR

OpenSLR (Speech and Language Resources) is a specialized repository for open-source datasets, lexicons, and tools in speech and language processing.

Target Audience: Researchers, developers, and organizations working in speech recognition, TTS, and NLP with speech components.

Key Features

Speech Datasets: Large open-source speech corpora for training models.
Lexicons & Tools: Language-specific lexicons and processing tools.
Open Access: Freely available for research and development.
Community Contribution: Encourages dataset sharing and expansion.

Comparison Table

We have added a few other popular repositories, which you can find in the comparison table below.

| Repository | Primary use | Target audience | Size limits | Versioning | Community |
| --- | --- | --- | --- | --- | --- |
| Hugging Face | Hosting & sharing ML models, datasets, tools | ML engineers, researchers, developers | ~5GB per file (larger via Git LFS) | Yes | Very large, active AI community |
| Zenodo | Research data, software, and publication archiving | Researchers, academics, institutions | Up to 50GB per dataset | Yes | Large academic user base |
| GitHub | Code hosting, collaboration, version control | Developers, organizations, OSS communities | ~2GB per repo (100MB/file soft limit) | Yes (Git built-in) | Largest dev community worldwide |
| arXiv | Preprint repository for academic papers | Researchers, academics, students | PDFs & LaTeX (no strict GB limit) | Versioned updates | Very large, global academic base |
| Kaggle | Datasets, ML competitions, collaborative notebooks | Data scientists, students, researchers | 20GB dataset upload limit | Yes (dataset updates) | Very large, global data science community |
| Mendeley Data | Dataset storage & citation with DOIs | Researchers, academics | Up to 10GB per dataset | Yes | Medium academic base |
| OpenSLR | Speech & language resources (datasets, lexicons, tools) | Speech/NLP researchers, developers | No strict stated limit | Limited | Niche but important in speech |
| Common Voice (Mozilla) | Crowdsourced speech dataset collection for ASR | Researchers, developers, language communities | Large dataset downloads (100s GB) | Versioned releases | Large, global open-source community |
| Open Science Framework (OSF) | Project management & data sharing platform | Researchers, institutions, labs | 5GB per file (unlimited projects) | Yes | Large, growing academic network |
| Figshare | Research data & publication repository with DOIs | Researchers, universities, funders | Up to 5GB per file (larger via institutional accounts) | Yes | Large global research community |
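One way to use the size limits in the comparison table is to shortlist repositories for a dataset of a given size. The sketch below copies the numeric limits from the table (they may be out of date, so check each platform before uploading) and filters on them. Entries without a clear numeric limit are left out, and the function name `candidates` is hypothetical.

```python
# Limits in GB copied from the comparison table; "dataset" limits apply to the
# whole upload, "file" limits apply to the largest single file.
LIMITS_GB = {
    "Zenodo": ("dataset", 50),
    "Kaggle": ("dataset", 20),
    "Mendeley Data": ("dataset", 10),
    "Hugging Face": ("file", 5),
    "Open Science Framework (OSF)": ("file", 5),
    "Figshare": ("file", 5),
}


def candidates(total_gb, largest_file_gb):
    """Return repositories whose listed limit accommodates the dataset."""
    fits = []
    for name, (scope, limit) in LIMITS_GB.items():
        size = total_gb if scope == "dataset" else largest_file_gb
        if size <= limit:
            fits.append(name)
    return fits


# A 30GB dataset whose largest single file is 4GB.
print(candidates(total_gb=30, largest_file_gb=4))
```

Size is only one criterion; audience, DOI support, and community fit from the other columns still matter for the final choice.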

Choosing the right repository is important for maximising the visibility and impact of your research. International platforms like Zenodo, Hugging Face, and GitHub offer broad reach, standardised citation, and valuable integration options, but they might not be the best choice for Africa-focused research and community building.

About Lanfrica

Lanfrica is an open, centralised catalogue dedicated to African language technology resources. We link papers, datasets, benchmarks, models, and research projects, making it significantly easier for researchers, academics, technologists, and policy makers to discover and reuse these resources. Instead of searching across multiple scattered repositories, users can browse our platform to find African-language models, corpora, or evaluation sets, accelerating collaboration and innovation across the continent.
