Add Alpaca Persian Dataset #3633

pourmand1376 · 2023-08-04T09:45:25Z

Hi,
Over the last two days, I have been working on translating Alpaca into Persian (Farsi), and this is the result. I have reviewed the translations and, in my opinion, they are pretty good.

The translation is still running on Kaggle and should finish in a couple of days. I will update the datasets accordingly when it is complete.

I have added two datasets: one is instruction-based and the other is an Orca-style dataset. I knew how to add the first one, but I don't know how to add the Orca-style dataset to your datasets.

Thank you for your attention.

stefangrotz · 2023-08-04T10:31:24Z

Hey, great work! I always wanted to translate this dataset into German or Esperanto. The main problem here is that the license of Alpaca isn't usable for open-source LLMs, because ChatGPT does not allow its output to be used to train other models. Because of that, it cannot be used for Open Assistant or for any commercial project.

However, having this dataset is surely useful for training experimental systems and for science projects.

BTW, do you know about the Alpaca Data Cleaned project? It fixed a lot of the errors in the dataset, like wrong calculations: https://github.com/gururise/AlpacaDataCleaned

pourmand1376 · 2023-08-04T10:59:47Z

Hi, thanks for your comment.

Yes, I have used the cleaned version.

Sadly, I didn't know about the license restrictions. The dataset itself (Alpaca) is published under Apache 2.0. I have also published my dataset under Apache 2.0.

Isn't that good enough?

stefangrotz · 2023-08-04T11:52:40Z

Unfortunately not, see https://github.com/gururise/AlpacaDataCleaned#license
This is one of the main reasons why OA started to build a crowdsourced conversational dataset.

Maybe you can translate the English and Spanish parts of the Open Assistant dataset instead? Both are quite big.
https://huggingface.co/datasets/OpenAssistant/oasst1

pourmand1376 added 3 commits August 4, 2023 09:34

add alpaca

ef46dbf

add alpaca

3d269e4

add alpaca multi

c2166e4

pourmand1376 requested review from Vechtomov, bitplane, huu4ontocord, olliestanley, sedthh, theblackcat102, sanagno, dvruette, andreaskoepf, yk, jordiclive and shahules786 as code owners August 4, 2023 09:45

andreaskoepf added the data label Aug 5, 2023