ABSTRACT
Recent advances in artificial intelligence have been driven by exponential growth in digitised data. Natural language processing, in particular, has been transformed by machine learning models such as OpenAI’s GPT-3, which generates human-like text so realistic that its developers have warned of the dangers of its misuse. In recent months OpenAI released Codex, a new deep learning model trained on Python code from more than 50 million GitHub repositories. Provided with a natural language description of a programming problem as input, Codex generates solution code as output. It can also explain input code in English, translate code between programming languages, and more. In this work, we explore how Codex performs on typical introductory programming problems. We report its performance on real questions taken from introductory programming exams and compare it to results from students who took these same exams under normal conditions, demonstrating that Codex outscores most students. We then explore how Codex handles subtle variations in problem wording using several published variants of the well-known “Rainfall Problem” along with one unpublished variant we have used in our teaching. We find the model passes many test cases for all variants. We also explore how much variation there is in the Codex-generated solutions, observing that an identical input prompt frequently leads to very different solutions in terms of algorithmic approach and code length. Finally, we discuss the implications that such technology will have for computing education as it continues to evolve, including both challenges and opportunities.
The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming