
I’m a researcher focused on improving data practices for fair and trustworthy machine learning. I’m interested in the intersection of data engineering and society, and in how new AI regulation will affect the way we develop ML applications. I’m currently involved in the Croissant Responsible AI extension, a community project led by the MLCommons consortium to standardize data-sharing practices for machine learning. I have taught, created course content, and coordinated practicums for web and software development courses at the UAB and UOC universities. Previously, I worked as a software architect for international companies. I was also elected to the Catalan Parliament (2015-2018), where I served as spokesperson for the Science and Technology Committee.
Publications:
- A Standardized Machine-readable Dataset Documentation Format for Responsible AI. Nitisha Jain, Mubashara Akhtar, Joan Giner-Miguelez, and others, working paper.
- Croissant: A Metadata Format for ML-Ready Datasets. Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Joan Giner-Miguelez, and others, NeurIPS 2024.
- On the Readiness of Scientific Data for a Fair and Transparent Use in Machine Learning. Scientific Data. Nature. January 2025.
- Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning. arXiv preprint, April 2024.
- DataDoc Analyzer, a tool for analyzing the documentation of scientific datasets. In the 32nd ACM International Conference on Information and Knowledge Management (CIKM), October 2023.
- A domain-specific language for describing machine learning datasets. Journal of Computer Languages, April 2023, 101209, ISSN 2590-1184.
- DescribeML: A dataset description tool for machine learning. Science of Computer Programming, 103030, November 2022.
- Enabling Content Management Systems as an Information Source in Model-Driven Projects. In International Conference on Research Challenges in Information Science (RCIS), pp. 513-528. Springer, Cham. January 2022.