The University of Pittsburgh English Language Institute Corpus (PELIC)

Welcome to the homepage for the University of Pittsburgh English Language Institute Corpus (PELIC). Here you will find information about the people involved with PELIC and the research using these data.

If you are interested in learning more about PELIC or downloading the dataset, please visit the PELIC-dataset GitHub repository, which contains the publicly available data, a detailed description of the corpus, frequency statistics, and a set of lexical tools and tutorials.

PELIC overview
People
- ELI students and teachers
- ELI Data Mining Group
Acknowledgments
Selected publications and presentations based on PELIC data
Other research based on ELI students
PELIC used elsewhere

1. PELIC overview

The University of Pittsburgh English Language Institute Corpus (PELIC) is a large learner corpus of written and spoken texts. These texts were collected in an English for Academic Purposes (EAP) context over seven years in the University of Pittsburgh’s Intensive English Program, and were produced by students with a wide range of linguistic backgrounds and proficiency levels. PELIC is longitudinal, offering greater opportunities for tracking development in a natural classroom setting.

This webpage provides information about the people involved with and the research resulting from PELIC. Where possible, PDFs of the research are provided.

2. People

ELI students and teachers

First and foremost, we wish to thank and acknowledge the students of the ELI for their spoken and written contributions to PELIC, and for graciously allowing us to publicly share these data. We hope that the research stemming from their work will improve the quality of learning and teaching for other students in similar contexts who are striving to attain academic readiness.

We also wish to thank the teachers and administrators of the ELI for their assistance in diligently collecting the data from the students.

ELI Data Mining Group

The ELI Data Mining Group is a research group in the Department of Linguistics at the University of Pittsburgh. Their focus is on applying computational methods to the PELIC dataset.

Current ELI Data Mining Group:

Na-Rae Han - Principal Investigator
Alan Juffs - Principal Investigator
Ben Naismith - Pitt Linguistics Alumnus

ELI Data Mining Group associate members:

Joey Livorno - Undergraduate Student
Sean Steinle - Undergraduate Student
Eva Bacas - Former Undergraduate Programmer
Jamison Cooper-Leavitt - Former Postdoctoral Researcher
Elena Cimino - Former MA Student
Noriyasu Harada Li - Former PhD Student
Brianna Hill - Former Undergraduate Programmer
Cassandra Maz - Former Undergraduate Programmer
John Starr - Former Undergraduate Programmer
Daniel Zheng - Former Undergraduate Programmer

3. Acknowledgments

We would like to thank the National Science Foundation for their grant via the Pittsburgh Science of Learning Center, funded under award number SBE-0836012 (previously NSF award number SBE-0354420).

There have been numerous people involved throughout the process of compiling and creating PELIC, assisting with data collection, coding, etc. In particular, we wish to acknowledge the following programmers for their important roles: Ben Madore, Shanwen Yu, and Michael Nugent.

4. Selected publications and presentations based on PELIC data

Naismith, B., Juffs, A., Han, N.-R., & Zheng, D. (2022). Handle it in-house? Learner corpora frequency lists and lexical sophistication. International Journal of Corpus Linguistics.
Naismith, B., Han, N.-R., & Juffs, A. (2022). The University of Pittsburgh English Language Institute Corpus (PELIC). International Journal of Learner Corpus Research, 8(1), pp. 121-138.
Naismith, B., & Juffs, A. (2021). Finding the sweet spot: Learners’ productive knowledge of mid-frequency lexical items. Language Teaching Research.
Vercellotti, M. L., Juffs, A., & Naismith, B. (2021). Multiword sequences in L2 English language learners’ speech: The relationship between trigrams and lexical variety across development. System.
Juffs, A. (2020). Aspects of Language Development in an Intensive English Program. New York: Routledge. Hardback: 9781138048362.
Naismith, B., & Kanwit, M. (2020). A Corpus Study of the English Suffixes -ness and -acy: Productivity, Genre, and Implications for L2 Learning. Canadian Journal of Applied Linguistics, 24(1), pp. 115-137.
Juffs, A. (2019). Lexical development in the writing of English Language Program Students. In R. M. DeKeyser & G. P. Botana (Eds.), Reconciling methodological demands with pedagogical applicability (pp. 179-200). Amsterdam: John Benjamins.
Naismith, B. (2019). Lexical Sophistication Measurements: Applications in Teaching and Assessment. TESOL 2019 International Convention, March 13, 2019, Atlanta, GA.
Juffs, A., & Han, N.-R. (2019). Combining Formal and Usage-Based Theories with Data Science Techniques in Measuring the Development of Syntactic Complexity in Written Production. American Association of Applied Linguistics, International Conference, March 12, 2019, Atlanta, GA.
Naismith, B., Han, N.-R., Juffs, A., Hill, B. L., & Zheng, D. (2018). Accurate Measurement of Lexical Sophistication with Reference to ESL Learner Data. Educational Data Mining Conference 2018, Buffalo, NY. July 15-18, 2018.
Juffs, A. (2017). Moving generative SLA from knowledge of constraints to production data in educational settings. Journal of the Japanese Second Language Association, 16, 19-38.
Vercellotti, M. L. (2017). The development of complexity, accuracy and fluency in second language performance. Applied Linguistics, 38, 90-111.
Vercellotti, M. L., & Packer, J. (2016). Shifting structural complexity: The production of clause types in speeches given by English for academic purposes students. Journal of English for Academic Purposes, 22, 179-190.
Li, N., & Juffs, A. (2015). The influence of moraic structure on English L2 syllable final consonants. 2014 Annual Meeting on Phonology.
McCormick, D. E., & Vercellotti, M. L. (2013). Examining the Impact of Self-Correction Notes on Grammatical Accuracy in Speaking. TESOL Quarterly, 47(2), 410-420.
Spinner, P. (2011). Second language assessment and morphosyntactic development. Studies in Second Language Acquisition, 33, 529-561.

5. Other research based on ELI students

In addition to publications based on PELIC data, a number of studies and datasets have been published based on other data from this same population, i.e., students at the ELI:

Papers based on the Reading Tutor REAP in the English Language Institute

Eskenazi, M., & Juffs, A. (2012). Information retrieval for reading tutors. The Encyclopedia of Applied Linguistics.
Heilman, M., Collins-Thompson, K., Callan, J., Eskenazi, M., Juffs, A., & Wilson, L. (2010). Personalization of reading passages improves vocabulary acquisition. International Journal of Artificial Intelligence in Education, 20(1), 73-98.
Heilman, M., Juffs, A., & Eskenazi, M. (2007). Choosing Reading Passages for Vocabulary Learning by Topic to Increase Intrinsic Motivation. Frontiers in Artificial Intelligence and Applications, 157, 566-568.
Juffs, A., & Friedline, B. E. (2014). Sociocultural influences on the use of a web-based tool for learning English vocabulary. System, 42(2), 137-166.

Spoken data posted online by Vercellotti (2016) and available for download and analysis in CLAN

https://slabank.talkbank.org/access/English/Vercellotti.html

Published and unpublished MA theses and PhD dissertations written with support from LearnLab.org and the English Language Institute

List of unpublished and published work

6. PELIC used elsewhere

PELIC data can also be found in the following locations:

The IRIS digital repository, a digital repository of instruments and materials for research into second languages
English Grammar Pro by John Boz which analyzes the grammatical complexity of texts and provides authentic examples of grammar structures from expert and learner corpora
re3data.org, a registry of research data repositories