The HKU-CAES learner corpus: Selecting error and rhetorical criteria for corpus annotation
Dr Crosthwaite, Peter Robert (Principal investigator)
Corpus linguistics, Learner corpus annotation, L2 writing, Second language acquisition, EAP, Discourse analysis
English Languages and Literature, Language Development, Second Language Acquisition, Audiology
Block Grant Earmarked for Research (104)
HKU Project Code
Seed Fund for Basic Research
The proposed research is a pilot study conducted as part of the construction of a proposed 6,000,000-word ‘HKU-CAES’ learner corpus of second language (L2) academic English (EAP) essays and reports, obtained from freshman students enrolled on HKU’s largest current academic program, the CAES1000 ‘Core University English’ (CUE) academic skills course. (The main study is subject to a pending RGC funding application under the 2014-2015 Early Career Scheme [ECS] exercise, Ref: 27602515.) This pilot study calls for the development of a practical methodology, both theoretical and technical, for the coding and annotation procedure involved in building a large-scale learner corpus. In particular, the study aims to determine the procedure and categories required for annotating L2 errors in EAP essays and reports, and to establish the feasibility of, and a methodology for, annotating L2 data for EAP rhetorical structure. Students at HKU’s Centre for Applied English Studies (CAES), who comprise 80% of all freshman undergraduates at HKU, generate over 3,000,000 words of written data each semester, and these data are broadly representative of the writing of students taking compulsory writing programs across HK universities. The EAP course in question (CUE) covers general lexico-syntactic features, including 'vocabulary' and 'grammar', alongside 'skills' designed to promote both awareness of and practice in the use of rhetorical discourse features. These features include the development and presentation of 'stance', the use of hedging to express claims with appropriate caution, the use of counter-arguments and rebuttals for critical analysis, the need for and use of topic sentences, the use of academic tone, and the effects of various cohesive devices used for reference and linking.
For example, a typical statement of stance in L1 English academic discourse is generally found at the beginning of a paragraph, and may include the use of hedging (following Hyland, 1998) to make the statement appear cautious (and more difficult to counter), alongside the use of academic tone to remove personal, subjective statements. In pre-training L2 academic discourse, however, student production of stance is often overly personal and can easily be countered (e.g. ‘embryo selection is mostly ethically unacceptable’ [L1] vs. ‘I think that embryo selection is always a terrible idea’ [L2]). As the course progresses, the linguistic and organisational differences between the pre- and post-training expressions of stance and other rhetorical features should become apparent in the corpus data, and can be teased out through cross-corpus analysis. The difficulty lies, however, first in isolating expressions of stance in the L2 data, and then in devising an appropriate method for automatically annotating errors and rhetorical structure in a large learner corpus. Given the large word counts involved, it is vital to ascertain the frequency and type of errors and rhetorical features found in L2 EAP production by HKU students, so that a specialised set of annotation categories can be developed specifically for this dataset. In addition, while the accuracy of automatic natural language processing (NLP) tools is now generally very high for native language (L1) data (approaching 95% according to Geertzen, Alexopoulou & Korhonen), L2 data, which contain numerous grammatical errors and pragmatic infelicities (e.g. Carrio Pastor & Mestre Mestre, 2013; Crosthwaite, 2014), have traditionally been deemed considerably more difficult and labour-intensive to analyse in multi-million-word collections.
Thus, the proposed study aims to assess the practicalities of annotating a corpus for errors and rhetorical structure when dealing with 'messy' L2 data. In particular, strategies for improving the speed and accuracy of L2 error and rhetorical structure annotation are to be developed, including a method for selecting and isolating error and rhetorical categories from the data, the trialling of data annotation software, and the development of a data-driven approach to the semi-automatic annotation of large-scale corpus data (if no existing methodology is suitable for the proposed dataset). In addition, some initial suggestions will be presented regarding the usefulness of learner corpora for assessing the impact of the EAP course described above on L2 rhetorical development (on the basis of this small pilot collection of data, at least). It is hoped that the findings from the data, and the subsequent discussion, will inform the future development of the proposed 6,000,000-word longitudinal learner corpus of L2 academic English essays and reports mentioned above, as well as a submitted TGD proposal (Measuring and assessing student learning through corpora: The HKU Corpus of English for Dentistry) focusing on the building of an English for Dentistry corpus, on which the author is the Co-Investigator (CI). The research questions are summarised as follows:
RQ1: Based on the collected pilot data, which categories of L2 error and EAP rhetorical structure can and should be included in the final annotation scheme for a large annotated corpus?
RQ2: Can an existing method be adopted, or a data-driven method developed, for the semi-automatic annotation of L2 error and EAP rhetorical structure that would be suitable for a large annotated corpus?
RQ3: After annotating the pilot data, what are the general trends regarding L2 error and EAP rhetorical structure found in HKU L2 learner EAP essays and reports?