A CORPUS-BASED MACRO-SYNTACTIC STUDY OF NAIJA (NIGERIAN PIDGIN)
1. STATE OF THE ART AND OBJECTIVES
Nigeria, with 160+ million inhabitants, is a huge and complex multilingual community with over 500 different languages (Lewis, Simons & Fennig 2013) used within the public and private social spaces. Among those, Naija, a creole that is also deceptively known as Nigerian Pidgin, is spoken as a first language (L1) by 5 million people, while over 70 million people use it as a second language (L2) or as an interethnic means of communication in Nigeria and in Nigerian Diaspora communities. Since the independence of Nigeria in 1960, Naija has been rapidly expanding from its original niche in the Niger Delta area to cover two-thirds of the country, up to Kaduna and Jos, and is now deeply rooted in the vast Lagos conurbation of over 20 million people. Apart from its original location, and one Lagos district, where it is learnt as a first language and can be used as a single language (Elugbe & Omamor 1991), Naija is learnt alongside and not instead of other Nigerian languages. Over the last 30 years, Naija has become the most important, most widely spread, and perhaps the most ethnically neutral lingua franca used in the country today.
The origin of Naija is generally described as a development out of an English-lexified jargon attested in the 18th century in the coastal area of the Niger Delta (Rivers State), with some lexical influence from Krio through the activities of missionaries from Sierra Leone (Faraclas 1996). Today, the heartland of Naija is the Niger Delta, with Lagos and Calabar as secondary extensions. But a new development has taken place over the last 50 years whereby Naija has escaped from its original geographical niche, where it functioned as an auxiliary medium of communication used in restricted informal contexts by uneducated people (Deuber 2005), and is now commonly used all over Nigeria by the educated in informal conversations, and in formal domains, viz. radio, politics, advertising, Christian religious activities, etc. In terms of functional status, English is Nigeria’s official language, and it is dominant in the education system and in written usages (literature, press, etc.). However, Naija has made considerable progress in formal contexts such as information transmission by government and non-government agencies and Christian religious practices, and although it is still excluded from the educational system, it is used unofficially in multilingual schools in southern Nigeria. Naija is a lingua franca in public informal communication in the south and to a certain extent in the north, and it is noticeably popular among university students and among educated speakers in private informal communication (Egbokhare 2004). And, last but not least, its use is an identifying feature of Nollywood, the prosperous Nigerian film industry now known all over the world.
At the same time as it grows in terms of status and functions, Naija expands geographically, and it is exposed to vernacular languages belonging to different genetic and typological groups (such as Yoruba in the Southwest; Igbo, etc. in the Southeast; Hausa further North). Through this process, does it undergo some degree of contact-induced variation beyond the odd word borrowed from those vernacular languages, or on the contrary, does one standard variety emerge through the influence of modern mass-media such as radio, television and video?
In its functional expansion, Naija is subject to extensive contact and influence from its original lexifier, i.e. English, which is the dominant formal and official language in Nigeria. A question arises as to the extent of this influence, today, and what can be deduced of the future of Naija. Does Naija, despite the influence of English (and the indigenous languages of Nigeria) maintain its existence as a discrete language (Deuber, op.cit.) or is it undergoing “decreolization”, resulting in what has been described as a post-creole (P/C) continuum (Rickford 1987)? In such a process, a whole range of “mesolectal” varieties create a continuum between the “basilect” (viz., in our case, Delta Naija) and the acrolectal varieties deeply influenced by the original lexifier and its local variant (viz. the Nigerian variety of English). Is Agheyisi (1984) right when she states that “the possibility of a systematic mesolectal variety emerging in the Nigerian situation is rather remote” (p. 230)? Deuber (2005) convincingly argues that the Naija variety spoken by educated speakers in Lagos is a discrete language, distinct and separate from English, and “the more competent a speaker is in both languages, the better he/she is able to keep them apart” (p. 203). However the question remains whether this situation applies to Naija outside Lagos where Naija is further influenced by local native languages (e.g. Yoruba, Igbo, Hausa).
The influence of written Nigerian English on Naija needs special consideration. The extension of Naija to formal usages such as radio news reports, political and information podcast blogging, Bible translation and short story writing exposes the language to the influence of written Nigerian English. News reports on radio are generally translated from press releases issued in English by news agencies. Podcast blogs are read from written texts. This new dimension is bound to influence the structure of the language. Whereas in oral Naija utterances and information units are mainly structured by Information Structure (IS), in written Naija the structure of sentences tends to be informed by microsyntax. One of the aims of the project will be to evaluate the relative importance of IS and microsyntax in the various functions of Naija, in contrast to the Nigerian variety of English. This will be done through the comparison between a Reference Corpus of Naija and the International Corpus of English for Nigeria (ICE-Nigeria). This implies the development of an adequate annotation system and statistical methods. The results will contribute to the evaluation of the discreteness and independence of Naija in relation to Nigerian English. The following hypotheses will be tested: the priority of IS over microsyntax in the mapping between sound and meaning is (a) an identifying feature of oral Naija; (b) more important in oral Naija than in oral Nigerian English; (c) less important in formal than in informal Naija.
The problems facing any programme of an exhaustive and in-depth study of Naija are many. The first one is related to the popular view of Naija as a protean, ever changing, informal medium that has no unity, and varies with every place and situation where it is spoken. The present study is based on the assumption that it is a discrete language with a strong unity that accommodates a certain range of variation.
The second one is related to the success of the Bickerton-DeCamp theory of the creolization-decreolization cycle (cf. above) informing the work of researchers such as Faraclas (1996), Elugbe and Omamor (1991) who approach the study of Naija with a purist attitude, for whom the only form of Naija worth studying is the Warri-Sapele “creole” variety spoken in Delta State, and who consider other varieties at best as degraded forms working as a lingua franca commonly called Broken, at worst as “pseudo-pidgins” invading the press and media. Their descriptions of Naija are monolectal, based on their intuition as speakers of the language. The present study intends to be data driven and multilectal.
The third one is the difficulty of combining a structural approach with a sociological one. The structural approach, best illustrated by Faraclas (1996) concentrates on the grammar and vocabulary of the language as revealing the inner mechanisms responsible for the birth and evolution of creoles, pidgins, and languages in general. The sociological approach mostly favoured by Nigerian scholars centres on the study of language usage and representation among speakers, but the link with the nature and structure of the language is often absent (Ajibade, Adeyemi & Awopetu 2012). The present study intends to bridge the gap between mental representations of Naija and actual linguistic usages. It plans to use the most recent trends of data-driven, corpus-based sociolinguistics, combining qualitative and quantitative corpus methods with functional and structural analyses.
The fourth one is the difficulty attached to the mere size of the language, and the challenge it represents for a study to account for the geographical and functional variations of a language with 70 million speakers. This calls for careful corpus planning and well organised team work.
The general aim of this project is to take an exhaustive and in-depth look at the nature and functions of Naija (Nigerian Pidgin) in Nigeria today, in order to establish the link between change in structure and change in language use and function. It will make use of the most advanced developments in corpus studies and natural language processing which will combine with a sociolinguistic and geographical study of variation according to formal/informal uses, gender and education of speakers. The corpus will study natural (non-elicited) speech in order to evaluate the distance between Naija and Nigerian English through the study of intonation, information structure, morphology, micro- and macro-syntax.
The distinction between micro- and macrosyntax was first proposed by Blanche-Benveniste et al. (1990), Berrendonner (1990), and Cresti (2000) (but see also (Andersen & Nølke 2002) for an overview). These studies put forward macrosyntax as a level of linguistic description capable of accounting for a number of cohesion mechanisms that are particularly frequent in spontaneous spoken language – especially in spoken French – which cannot be simply regarded as microsyntactic government phenomena, such as, for example, the “paratactic” construction in (1):
(1) [ceux qui sont en location] [la moyenne] [c’est environ trois ans] (Rhaps-D0004 CFPP2000)
[those who are on a lease] [the average] [it’s about three years]
‘Those who are on a lease stay three years on average’
The same type of phenomenon is frequent in Naija too, e.g. in (2):
(2) [you carry your children go] [you go still buy food] (Deuber 2005)
[you bring your children] [you will still buy food]
‘Even if you bring your children, you will still have to buy food.’
While the different macrosyntactic models acknowledge that sequences such as (1) and (2) have to be considered as forming a cohesive unit at some level of linguistic description, they diverge slightly as far as the characterization of the nature of this cohesion is concerned. Macrosyntactic models characterize some major linguistic units that go beyond government proper and are usually described in the literature from a pragmatic perspective that focuses on their illocutionary or rhetorical values. Macrosyntax, instead, focuses on the span and the form of macrosyntactic units, using syntactic and distributional criteria (such as suppressions, insertions, commutations) to identify and delimit them. For all the macrosyntactic models, the main identifying criterion of a macrosyntactic unit is the possibility that this unit has to constitute an autonomous utterance.
In NaijaSynCor, since our practical objective is to create a corpus that allows us to study the interface between prosody and syntax, we need to clearly separate these two levels of analysis. Following the methodology first used in Rhapsodie, we have decided not to rely on prosodic criteria to define macrosyntactic units. Therefore we do not follow the prosodic definition of macrosyntactic units proposed by Berrendonner (2011) who describes the maximal extension of a macrosyntactic unit in terms of the presence of a conclusive intoneme; nor could we strictly follow the Florence school’s approach (Cresti 2000) that characterizes macrosyntactic units as sequences of prosodic, rather than syntactic, units.
Rather, we consider that macrosyntax describes the whole set of relations holding between the microsyntactic units that make up one and only one illocutionary act, although microsyntax can sometimes go beyond macrosyntactic units. This definition combines the syntactic model proposed by the Aix school (Blanche-Benveniste et al. 1990), according to which the minimal units that compose a macrosyntactic unit are syntactic in nature, and the pragmatic model developed by the Florence school (Cresti 2000), according to which the maximal extension of a macrosyntactic unit is defined in terms of illocution.
Such a choice led us to call the maximal macrosyntactic units Illocutionary Units (henceforth IUs) and to provide, in our work, an account and an annotation for the syntactic rather than the prosodic units that compose an IU.
The corpus on which the study will be done is designed so that the sampling, in its geographical and sociological dimensions, produces relevant data comparable to the corpora available for Lagos Naija (i.e. Deuber 2005) and Nigerian English (i.e. ICE Nigeria).
The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Each ICE corpus consists of one million words of spoken and written English produced after 1989. The corpus contains samples of speech and writing by both males and females, and it includes a wide range of age groups. The corpora in ICE are being annotated at various levels to enhance their value in linguistic research. These levels are: Textual Mark-up; Word class Tagging; Syntactic Parsing. ICE-Nigeria was produced in April 2014 by the University of Augsburg (Germany) under the supervision of Ulrike Gut (Gut 2014). The spoken data for the Nigerian corpus is transcribed and time-aligned, with minimal annotation.
Within this general aim the project sets up several descriptive, methodological and theoretical objectives:
- Building a 500,000-word oral reference corpus (the Reference Naija Corpus, RNC), collected at 8 different survey points in the country, with a deeply annotated sub-section of 100,000 words (the Naija prosodic and syntactic Treebank, NTB). Annotated corpora are rare for most of the languages of the world, all the more so if one considers depth of annotation (part-of-speech tagging, syntactic parsing, prosodic annotation). This synchronic picture of Naija, documenting its geographic and demographic variation, is a rare opportunity to study the evolution of a fast emerging, vast new language spoken by tens of millions. This corpus is expected to provide the basis for the standardisation and development of the language.
- Comparing the RNC with the Nigerian International Corpus of English (ICE Nigeria), both qualitatively and quantitatively. Naija has been shown to be, in the use of the educated Nigerians living in Lagos, a discrete language that is developing and keeping its own distinctive identity and status separate from English (Deuber 2005). This study aims to assess whether this holds true in the other parts of Nigeria where it is spoken. This comparison aims at evaluating the discreteness and independence of Naija in relation to Nigerian English, and at testing the correlation of potential variations with sociological/functional factors.
- Achieving a better understanding of the variations of Naija along the formal-informal functional scale through the study of its use on university campuses and in the media, and more specifically on the radio (news reporting, editorials, information, etc.). The following hypotheses will be tested: (i) educated Naija is more standardized possibly due to the geographical and social mobility of its speakers; (ii) it reveals a greater influence from English with more borrowing and more syntactic restructuring; (iii) scripted oral Naija reveals an even stronger influence from English than unscripted oral Naija. It is expected to provide the basis for the standardisation and development of the language. This assessment of the role and impact of new media in relation with the change of attitude of speakers concerning an emerging language is an unprecedented endeavour.
- Understanding the patterns observed in the prosody of emerging languages, and linking the prosodic description of Naija to that of its grammatical and information structures through the use of NLP tools. The aim is threefold: (i) to produce the first complete prosodic description of an underdescribed language in Africa based on instrumental analyses and validated with a resynthesis tool; (ii) to provide the Naija Treebank (NTB) with an in-depth integrated annotation for part-of-speech (POS), Intonation (INT), micro- and macro-syntactic structures (SYN) and Information Structure (IS), thus producing a large gold-standard benchmark treebank, a first for an emerging language; (iii) to develop Natural Language Processing (NLP) tools for Naija, namely a part-of-speech (POS) tagger, an English glosser, and a syntactic parser integrating a treatment of macrosyntactic constructions (dislocation, clefting …) and phenomena specific to spoken languages (disfluency, reformulation, discourse markers). The integration of macrosyntax into a syntactic parser is a ground-breaking endeavour where high-gain results are expected for the development of NLP tools. Another expected high-profit gain of this part of the study is showing, with the help of NLP tools, that prosody is essential in the mapping of meaning onto speech utterances. By changing the settings of the tools, it will be possible to evaluate the input of prosody in the parsing of syntactic structures by testing whether the addition of prosodic information improves the efficiency of the syntactic parser, especially in the recognition of macrosyntactic patterns. This part of the project is the most challenging, and the most promising from a theoretical and technological point of view.
2. SCIENTIFIC PROGRAMME AND METHODOLOGY
The project is a collaborative effort of two research units that have proved their expertise in corpus annotation in previous programmes: Llacan, on lesser-described languages (Corpafroas and Cortypo); Modyco, on the interaction of prosody and syntax in French (ANR Rhapsodie and Orféo) and the development of large treebanks (ANR Orféo); and of two leading Nigerian experts on Naija (F. Egbokhare & C. Ofulue). The macrosyntactic framework developed in the ANR Rhapsodie project (Lacheret, Pietrandrea & Tchobanov 2014) has proved to be particularly efficient in dealing with the specificities of oral corpora, e.g. pile stacking, disfluencies, repetitions, discourse markers, overlaps, co-enunciation, false starts, self-repairs and truncations. This method is data-driven, inductive (the relevant units are identified through annotation) and modular.
NaijaSynCor is a highly integrated programme divided into 4 work packages (WP), which are interdependent, going from fieldwork and data collection (WP1) to the final characterization of Naija through the study of the annotated corpus (WP4).
- WP1 produces the RNC (Reference Naija Corpus). WP1 will be conducted in Nigeria based on the input of data collected during fieldwork. The RNC files will be constantly uploaded into the main database run by the research team at Llacan in Villejuif and made available to other WPs.
- WP2 will turn 100 out of the 500 Kw of the RNC into a Treebank of deep and fine-grained macro- and microsyntactic annotations (NTB).
- WP3 will conduct an instrumental acoustic analysis of the prosodic features of Naija in relation with its Information Structure.
- WP4, the final analysis of the corpus in terms of relationship between Naija, Nigerian English and Vernacular languages, will be a collaborative effort between all the members of the project, culminating in a final reporting conference. The aim of WP4 is (i) to run a study of the intonosyntax of Naija; (ii) to establish the identity of Naija through its diachronic, diatopic, diaphasic, diastratic and gender variation (Coseriu 1981).
Coordination and time keeping are essential for the success of the NaijaSynCor project. Consequently, a tight schedule has been devised, including reviewing meetings, to make sure that the tasks are completed within the 42 months of the project.
During the first two weeks the members of the project will meet to fine-tune and agree on procedures, standards and workflow. Deliverables will be prepared in the form of training material for language assistants: an orthography guide for transcription, and its companion lexicon of grammatical and common Naija words; a fieldwork questionnaire and guide. Then, for the following three months, while WP1 sets up its working space, trains its team of language assistants, and runs the first round of field trips, WP2 & 3 will test their tools and start working on the ICE Nigeria and (Deuber 2005) corpora (INC & D05C resp.). After this initial three-month trial period, a general meeting will gather all the members and experts to evaluate the results and review the procedures of the different WPs. More precisely, the quality of the annotated corpus samples (INC, D05C & RNC) will be evaluated and recommendations made for the continuing project. At this stage, a number of deliverables should be ready: 5 samples from the Ibadan training sessions will have been aligned, transcribed and marked for macrosyntax, ready for the work of WP2 & 3. If necessary, the 5 samples can be sent back to WP1 to be corrected before use. The orthography guide and lexicon, the fieldwork questionnaire and guide will be revised, ready for use by WP1. Full-scale work will then be ready to start in WP 1, 2 & 3.
WP1. RNC BUILDING AND ANNOTATION
The objective of WP1 is to provide a 500,000 word corpus (i) comparable to the ICE-Nigeria corpus (INC) and the (Deuber 2005) corpus (D05C), (ii) available for fine-grained annotation by WP2 and WP3; and for analysis by all the other WPs.
Methodology and input: In order to evaluate the nativisation of Naija, a total of 384 samples averaging 6 min each will be collected so as to represent the widest scope of functions and locations of Naija in the country. Radio broadcasts will be collected from various local stations, e.g. Wazobia FM (http://wazobiafm.com/lagos). Internet blogs in the form of podcasts (e.g. the blog Reason am! (Meyer 2014)), and other digital oral resources (e.g. Nollywood films) will be collected. On 8 university campuses (Lagos, Abuja, Ibadan, Port-Harcourt, Calabar, Benin City, Kaduna, Jos -- in red on map above), when speakers are found to speak Naija, audio recordings of narrations, interviews, conversations (maximum 3 speakers), etc. will be made. Male and female speakers will be recorded for each genre and place of interview (see Figure 4 for sampling details). The consents of the contributors, either recorded in oral form, or written and signed, will be kept in custody of the RNC repository, and their names will be removed from the public version of the corpus. To ensure comparability between RNC and ICE-Nigeria, the initial sampling of the ICE corpora will be retained, with minor adaptations to the functional specifications of Naija (see table: Sampling of the Naija Corpus). WP1 will produce files that meet the requirements of WP 2, 3 and 4 for their analyses.
- WP2 needs texts that are tokenised into words, and transcribed in a script that allows comparison with the ICE Nigeria and (Deuber 2005) corpuses.
- WP3 needs audio recordings whose quality is good enough to allow automatic acoustic analysis by trained programmes (e.g. Prosogram, Analor). All recordings will be done using professional digital recorders and wireless microphones and/or headsets.
- WP4 needs texts that are equipped with metadata providing the relevant information about the speakers: time, place and conditions of interview; sex, age, education, professional activity, geographic origin, linguistic background and history. A questionnaire will be administered and recorded. The information will be entered into an IMDI database using Arbil (https://tla.mpi.nl/tools/tla-tools/arbil/), an Elan-compatible programme.
Elan (https://tla.mpi.nl/tools/tla-tools/elan/) will be used to annotate the files by providing time alignment, transcription, tokenization into words, semantic word-level glosses and translation into Standard English. Alignment will be done automatically by the program on the basis of 300 ms pauses. Compatibility of transcription will be ensured by using the orthography that D. Deuber developed in her 2005 thesis on Lagos Nigerian Pidgin. This etymological orthography (adapted from the lexifier language orthography, i.e. English) was chosen by D. Deuber in preference to the phonological script used by linguists (e.g. Faraclas, Elugbe, etc.) as it is spontaneously used by educated Nigerians, and thus easier to teach to transcribers. Codeswitched sections will be identified by dedicated boundaries, e.g. curly brackets. Transcriptions and translations will be double checked for the sake of consistency. A macrosyntactic punctuation will mark macrosyntactic boundaries (i.e. illocutionary units and their main components: nucleus, prenuclei and postnuclei, including discourse markers) and limits between pile layers (disfluencies, reformulation, coordination). All these boundaries are marked by punctuation in written texts. Glossing will be done semi-automatically by bootstrapping an automatic glosser (see WP2) and making manual corrections with Elan.
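The pause-based segmentation mentioned above can be illustrated with a minimal sketch. It does not reproduce Elan's own segmentation; it assumes that speech intervals (in milliseconds) have already been detected by some voice-activity step, and it merges intervals separated by a silence shorter than the 300 ms threshold. The function name, the input format and the timings are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms) of a stretch of detected speech


def merge_speech_intervals(intervals: List[Interval],
                           min_pause_ms: int = 300) -> List[Interval]:
    """Merge speech intervals separated by a silence shorter than min_pause_ms.

    A silence of min_pause_ms or longer is treated as a segment boundary.
    """
    segments: List[Interval] = []
    for start, end in sorted(intervals):
        if segments and start - segments[-1][1] < min_pause_ms:
            # Silence too short to count as a pause: extend the current segment.
            previous_start, previous_end = segments[-1]
            segments[-1] = (previous_start, max(previous_end, end))
        else:
            # First interval, or a pause of at least min_pause_ms: new segment.
            segments.append((start, end))
    return segments


# Invented timings: silences of 70, 400, 50 and 500 ms between speech stretches.
speech = [(0, 450), (520, 1250), (1650, 2900), (2950, 3400), (3900, 4600)]
segments = merge_speech_intervals(speech)
print(segments)
```

With these invented timings, only the 400 ms and 500 ms silences act as boundaries, so the five speech stretches are grouped into three segments that transcribers can then check and correct by hand.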
At the onset of the programme, the members of the project will fine-tune and agree on procedures, standards and workflow. Deliverables will be prepared in the form of training material for language assistants: a fieldwork questionnaire and guide; an orthography guide for transcription, and its companion lexicon of grammatical and common Naija words; a guide for macrosyntactic annotation. The first two months of the project will be dedicated to the recruitment and training of language assistants for fieldwork, transcription and annotation. The team of language assistants will be trained, managed and directly supervised by Prof. Ofulue and the Principal Investigator in Ibadan. Training sessions will be organised in Ibadan with C. Chanard for Elan and P. Pietrandrea for macrosyntax. For convenience's sake, the logistics of WP1 will be subcontracted to IFRA-Nigeria (UMIFRE 24) or any such well-established institution selected by the PI. At the end of this period, the language assistants will be able to run a survey using the fieldwork guide and questionnaire, and annotate the files using the orthography guide and its companion lexicon. During the training period, a pilot survey will be done on the University Campus of Ibadan, and 5 samples will be annotated by the WP1 team, and sent to Paris for evaluation. After evaluation and reviewing of the protocols by the project members and experts during their early 2017 meeting, the guides and questionnaire will be revised and produced for the full-scale fieldwork and transcription phase. During this phase, after selection by the Principal Investigator, Prof. Ofulue and D. Esizimetor, the .wav (audio) files will be incorporated into the RNC with their metadata. The audio quality of the files will be evaluated by WP3 before the transcription starts. Once the alignment, transcription, glossing and translation are done, the text grid will be uploaded into the database. Finally, the macrosyntactic annotation will be added to the text grid.
After double-checking, the first 100 Kw section of the corpus should be produced by WP1 within 4 months (mid-2017), and the whole corpus 10 months later (beginning 2018) (see diagram in Figure 2). The files will be fed into the database continuously during the process, and the stable version of the first files will be available to other work packages as soon as mid-2017. C. Chanard will be in charge of setting up and managing the database that will keep track of the workflow through constant updating of the metadata of each file, both for sociolinguistic purposes (IMDI files) and for its annotation status (e.g. whether a particular file has been transcribed, annotated, treated and for what by WP 2 & 3, with what result – quality of annotation, etc.). Once the corpus has been tokenised, glossed and double-checked, Elan will enable us to extract a corpus-based dictionary of Naija. This will be a valuable deliverable since no such dictionary has been published.
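As an illustration of what extracting a corpus-based dictionary involves, the following minimal sketch builds entries from tokenised and glossed sentences. It does not use Elan's export functions; the input format (a list of sentences, each a list of form/gloss pairs), the function name and the entry layout are assumptions made for the example, and the word-level glosses are simplified from the translation of example (2).

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (Naija form, English gloss); hypothetical input format


def build_dictionary(sentences: List[List[Token]],
                     max_examples: int = 2) -> Dict[str, dict]:
    """Collect, for each word form, its glosses, its frequency and a few examples."""
    entries: Dict[str, dict] = {}
    for sentence in sentences:
        text = " ".join(form for form, _ in sentence)
        for form, gloss in sentence:
            entry = entries.setdefault(
                form.lower(), {"glosses": [], "examples": [], "count": 0}
            )
            entry["count"] += 1
            if gloss not in entry["glosses"]:
                entry["glosses"].append(gloss)
            # Keep a small number of distinct corpus examples per entry.
            if text not in entry["examples"] and len(entry["examples"]) < max_examples:
                entry["examples"].append(text)
    return entries


# Example (2) from Deuber (2005), with simplified, illustrative glosses.
corpus = [
    [("you", "you"), ("carry", "carry"), ("your", "your"),
     ("children", "children"), ("go", "go")],
    [("you", "you"), ("go", "will"), ("still", "still"),
     ("buy", "buy"), ("food", "food")],
]
dictionary = build_dictionary(corpus)
print(dictionary["go"])
```

The entry for "go" ends up with two glosses, one for the serial verb use and one for the future marker, each backed by a corpus example. POS information, which the dictionary deliverable also includes, could be added as a third field in the same way.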
This method of building the corpus is backed up by the PI’s experience in creating the IFRA Naija Corpus and the available descriptions of Naija grammatical structure, e.g. (Faraclas 2013).
After 18 months, the deliverables of WP1 are: (1) a guide for orthography; (2) a guide for glossing; (3) a guide for macrosyntactic annotation; (4) a 500 Kw corpus that is aligned, transcribed, marked for illocutionary units and disfluencies, glossed, and translated; (5) a dictionary of Naija (with POS, gloss and corpus-based examples).
Task Force: B. Caron, C. Ofulue & D. Esizimetor; 4 language assistants (5,000 hours; subcontracted to IFRA-Nigeria on a service basis); Macrosyntactic correction and supervision: S. Kahane & P. Pietrandrea; Corpus manager: C. Chanard.
WP2. ANNOTATION FOR MICRO- AND MACRO-SYNTAX
The main objective of WP2 is to tag, gloss, and parse the 500 Kw corpus produced by WP1, using state of the art NLP tools. In the process, a 100 Kw gold-standard, manually corrected treebank (NTB) will be produced and a dependency parser (MATE, (Bohnet 2010)) will be trained in order to analyse the remaining 400 Kw. Meanwhile, the ICE-Nigeria corpus will be parsed using freely available parsers such as the Stanford parser (de Marneffe et al. 2014) or MaltParser (Hall 2006; Nilsson, Nivre & Hall 2006). Both treebanks will be provided with the same tagset (for instance Universal Dependencies, http://universaldependencies.org) in order to allow a contrastive comparison between Naija and Nigerian English, and the evaluation of diachronic, diatopic, diaphasic, diastratic and gender variation in Naija (see WP4). The dependency-based parsing and its tagset will include microsyntactic relations (subject, object, etc.), as well as macrosyntactic relations (prenucleus, discourse marker, etc.).
State of the art parsers trained on manually corrected treebanks of written English have an accuracy of 85 to 93% for unlabelled attachment score (or UAS, i.e. the proportion of tokens attached to the correct head) and 80 to 90% for labelled attachment score (or LAS, i.e. the proportion of tokens attached to the correct head with the correct grammatical function), according to the text domain (McClosky et al. 2012). These scores depend on i) the relevance of the grammatical choices, ii) the size of the training treebank, iii) the quality of the annotations. In this regard, a critical step will be to create an annotation guideline, which means writing a simplified grammar of Naija. In doing this, we will rely primarily on Faraclas’ seminal description of Naija (Faraclas 1996), which will be challenged by our data. Naija syntax shares with its substrate languages Serial Verb Constructions that correspond either to single verb lexical equivalents (e.g. Naija carry come vs. Eng. bring), or prepositional constructions (e.g. Naija take knife cut X vs. Eng. cut X with a knife) (see Figure 5). Its isolating morphology includes various markers such as plural dem (vs. Eng. suffix -s) or incompletive dey (vs. Eng. be … -ing). Finally, sentence structure in Naija includes frequent use of left-dislocation, whether for topicalisation or clefting, much more than what is expected in oral English.
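The two attachment scores can be made concrete with a minimal sketch. Each token is represented by the index of its head (0 for the root) and a relation label. UAS counts tokens with the correct head, and LAS counts tokens with both the correct head and the correct label. The dependency analysis of "you carry your children go" and the relation labels used below are illustrative assumptions, not the project's tagset.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, relation label) for one token; head 0 = root


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as proportions of tokens, given aligned gold and predicted arcs."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    n = len(gold)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    return correct_heads / n, correct_arcs / n


# Tokens: 1 you, 2 carry, 3 your, 4 children, 5 go (hypothetical analysis).
gold = [(2, "subj"), (0, "root"), (4, "det"), (2, "obj"), (2, "compound:svc")]
# Hypothetical parser output: wrong label on "children", wrong head on "go".
predicted = [(2, "subj"), (0, "root"), (4, "det"), (2, "subj"), (4, "compound:svc")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

In this toy case four of the five heads are correct and three of the five arcs are fully correct, which shows why LAS can never exceed UAS for the same output.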
In order to compare the two languages, Naija and Nigerian English, WP2 will use state of the art NLP tools to tag and parse the 500 Kw corpus produced by WP1, and apply the same method to comparable samples of the ICE Nigeria corpus, for an evaluation of the distance between Naija and its lexifier language (as an indicator of putative decreolisation), and to the (Deuber 2005) corpus for diachronic variation, as an indicator of the speed of this decreolisation.
The Deuber (2005) corpus will provide the starting data in the first three months of the project to experiment with, test and determine the tagset to be used for the syntactic annotation. First, a 500-sentence (= 10 Kw) sample (transcribed in etymological orthography and translated into standard English) will be analysed and fully annotated by the expert team, using Arborator (Gerdes 2013) for syntactic annotation and ELAN (Chanard 2014) for glossing and lexicon building. For the sake of economy, these data can first be analysed automatically by the English Stanford parser. This is made possible by the fact that the common orthography of Naija is etymological and based on its lexifier language, i.e. English. Nevertheless, the parser output will need several alterations to fit Naija grammar.
The ICE Nigeria English Corpus will also be parsed with the English Stanford parser (or another equivalent parser). 500 sentences will be manually corrected to provide a test corpus for evaluating the parser on this particular variety of English and for comparison with our results on Naija at the end of the project. A new parser can be trained if notable divergences between Nigerian English and Standard American English are noted. Some parts of the Naija treebank can be added for training, especially if code switching between Naija and English is noticed in ICE Nigeria.
The results of the initial tests will be evaluated in a scientific meeting with the whole team of the project to assess the elements that need to be adapted or rewritten in the programme to provide the information necessary for comparison and articulation with the subpart of the project analysing intonation and information structure.
At that point, we will have written a guideline for glossing, tagging (= morphosyntactic annotation) and syntactic annotation, and we will have produced an initial treebank that will enable us to bootstrap the final treebank. This will be achieved by training statistical tools on our data. This training will be repeated whenever new data have been manually corrected. Double-blind annotation will be used until we achieve sufficient inter-annotator agreement among non-expert annotators. In the process, the guidelines will be improved and become more user-friendly. This stage will last 11 months, following which the 100 Kw gold-standard treebank (NTB) will be available. Automatic rules will then be used to check the coherence of the treebank, and some additional manual corrections will be made.
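The text does not say how inter-annotator agreement will be measured. As one possible illustration, the sketch below computes Cohen's kappa over the labels that two annotators assigned to the same tokens. The choice of kappa, the function name `cohen_kappa` and the example labels are assumptions made for this example only.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one and the same label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical function labels given by two annotators to six tokens.
annotator_1 = ["subj", "obj", "subj", "root", "obj", "subj"]
annotator_2 = ["subj", "obj", "obj", "root", "obj", "subj"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

For dependency annotation, agreement can also be reported as UAS and LAS between the two annotators; kappa is shown here only because it corrects for chance agreement on the labels.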
For tagging, several tools can be trained, such as CRFs (Conditional Random Fields), which give state-of-the-art results (Tellier et al. 2010). Glossing for Naija appears to be particularly simple since the language is isolating and most lemmas are identical to the word to be lemmatised (considering that we use the English orthography for lexical words). Only grammatical words and faux amis must be translated. Some words are ambiguous, such as the complementizer say ‘that’ or the future auxiliary go (you go think say na fresh meat ‘you'll think it's fresh meat’). We do not know of any previous experiments in automatic glossing, but the task can be compared to lemmatisation and we already have several ideas of how to deal with it.
Parsing will be done with the MATE parser (Bohnet 2010), a freely available dependency parser that can be trained on any dependency treebank and which we have already used for Spoken French (ANR Orfeo, unpublished) and Old French (Guibon et al. 2015). Part of the ICE Nigeria corpus will be tentatively added to the training data, in order to increase the lexical coverage of the parser (and to handle frequent code switching).
Finally, the remaining 400 Kw data will be analysed by the parser trained on the gold-standard treebank, providing the 500 Kw treebank to be compared with ICE-Nigeria and more generally available for use by the other WPs. The input of the parser is the transcription provided by WP1, manually segmented into illocutionary units.
The experience in annotating spoken French acquired in the ANR projects Rhapsodie and Orfeo will constitute valuable expertise both for the syntactic analysis of spoken language (disfluencies, reformulations, discourse markers, dislocation, etc.) and for the workflow management. As for the grammar of Naija and its possibly challenging aspects, we will be able to count on our competence in some of the substrate languages of Naija such as Yoruba (Aubry 2009; 2010) and Hausa (Caron 2015a). Finally, this work package includes practical hands-on language engineering, which should provide a good base for the development of similar resources for other languages, as well as the development of new methods that include machine learning technology itself in the improvement cycles of the linguistically motivated corpus annotation process.
- 100 Kw gold-standard treebank for Naija (manual correction);
- 400 Kw treebank for Naija (automatic annotation);
- Syntactic annotation guideline for Naija;
- Tagger and glosser for Naija;
- Dependency parser for Naija (MATE trained on our gold-standard treebank);
- 500 Kw treebank for Nigerian English (ICE Nigeria analysed with the English Stanford parser).
The whole process is expected to deliver the annotated corpora to WP4 by mid-2018.
S. Kahane, P. Pietrandrea; N. Aubry, K. Gerdes; I. Tellier; language assistants for annotation (400 hours). One engineer (CDD, IE level, 10 person-months) will be recruited to adapt the Stanford Tagger and Parser to the specificities of the Naija corpus.
WP3. AN ANALYSIS OF NAIJA PROSODY
The main objective of WP3 is to add a prosodic level to the description of Naija. It will produce (i) an analysis of its prosodic units and their nature, with a description of their precise acoustic correlates, based on an instrumental analysis and validated with a speech synthesis tool, using methods developed in the project “Discourse and Prosody across languages family boundaries” (Schultze-Berndt, Simard & Wegener 2011), hereafter DPLFB; (ii) a version of the 100 Kw corpus collected in WP1 annotated for prosody, adapting schemes developed in the Rhapsodie treebank for French, based on perceptual and acoustic cues, and developed independently from the micro/macro-syntactic parsing and labelling of WP2, which will serve in the functional analyses of WP4. It will answer questions pertaining to the prosodic system of Naija, including phenomena such as speech rhythm, tonal structure, intonation and stress, e.g.
- Tone: How best to account for the tonal patterns of Naija, in order to categorise it in the typology as 'tone language', 'pitch-accent language' or 'stress language' (Hyman 2006) within the context of African tone languages where differences in pitch are used to convey lexical and grammatical distinctions (Clements 2000); and in view of the existing descriptions, for example Faraclas (1996) who describes a fairly complex tonal system, which in turn, is rejected by Elugbe (2004).
- Interplay of tone and intonation: for those languages that are shown to have tone, it becomes crucial to delimit how pitch can be exploited by both, as highlighted in Caron’s study of Zaar (2015b).
And more widely to the theoretical questions investigated in WP4, e.g.
- Intonation: How utterances are divided into smaller chunks (i.e. how junctures are manifested), and how (if at all) prominences correlate with discourse information structure.
- Whether the prosodic and syntactic structures interact, and to what extent.
- How cogent is prosody in the creation of discourse and informational units.
Finally, our analyses will contribute to one important goal of prosodic typology: identifying the precise inventory of prosodic categories, and comparing their phonetic properties across languages.
A bottom-up methodology is applied, making use of, and adapting, existing (semi-)automatic NLP tools for prosodic cleaning, segmentation and labelling such as Analor (Avanzi, Lacheret & Victorri 2008), Prosogram (Mertens 2004), and associated protocols developed in the project Rhapsodie, used to create a prosodic-syntactic treebank for spoken French (Lacheret, Pietrandrea & Tchobanov 2014), a Praat script, ProsodyPro (Xu 2013), to gather acoustic measurements, and the speech resynthesis tool PENTATRAINER (Xu & Prom-on 2014), which were used to analyse the prosodic features of four languages in the DPLFB project (Schultze-Berndt, Simard & Wegener 2011) (http://www.uni-bielefeld.de/lili/projekte/discourse_and_prosody/).
As in WP2, a 500-sentence sample from the Deuber (2005) corpus (transcribed in etymological orthography and translated into standard English) will provide the starting data in the first three months of the project, allowing the expert team to test the interoperability and compatibility of the tools Prosogram (Mertens 2004), Analor and PENTATRAINER following the workflow detailed in steps 1-6 below. The dataset will include utterances of different modalities (declarative, interrogative) and of varying length. A workflow will be devised, and guidelines for the annotation of prosodic prominences (inspired by those developed in Rhapsodie) will be written for the WP’s research assistants.
The results of the initial tests will be evaluated in a scientific meeting with the whole project team to ensure that this WP provides the information necessary for comparison and articulation with the subpart of the project analysing macrosyntax in WP2 relevant for WP4.
Research assistants will receive training in marking prominences, in a 1-day session after this first scientific meeting. The 500 sentences will be used for this initial training, where the expert team will supervise the annotation cycle, which will always involve two persons: one marking and one checking, in order to avoid idiosyncratic decisions.
1. F0 CLEANING
Following this initial phase, the prosodic analysis and the creation of the annotated corpus will start in earnest. This part of the project will use the NTB 100 Kw corpus, collected, time-aligned and glossed in WP1. Firstly, the selected tokens will be converted into Praat textgrid format (Boersma & Weenink 2016), where the first tier will coincide with the orthographic transcript. Before embarking on any analyses, we will need to correct the pitch tracks (the fundamental frequency contours extracted from the recorded speech signal), otherwise the acoustic analyses could be rendered invalid. Instead of making these corrections manually, which is extremely time-consuming, the tool Analor will be used to clean the F0 curves in the corpus in order to ensure the correct segmentation of prosodic units, prominence labelling and measurements (Figure 6) in the subsequent steps. The conversion and pitch track correction will be carried out by research assistants and verified by the expert team.
2. SYLLABIC SEGMENTATION
Secondly, the tool Prosogram will segment the transcribed and cleaned speech data provided by Analor into syllables, which will be the unit of analysis for all instrumental acoustic measurements. Prosogram was developed for semi-automatic prosodic transcription; it computes and draws a stylized pitch contour based on a tonal perception model which proceeds in two steps: (i) the algorithm finds the vocalic nuclei with the help of the intensity and voicing parameters within every syllable; (ii) it stylizes the intonation curve on the nucleus into a static or dynamic tone, on the basis of a perceptual glissando approach. Only the first step (i) will be exploited for the NTB 100 Kw corpus. The segmentation will be verified by research assistants.
3. AUTOMATIC SEGMENTATION INTO MACROPROSODIC UNITS
Analor will then be used to generate larger prosodic units, which we will refer to as ‘macroprosodic units’. Analor is an ‘informant’ which only has access to prosodic material; its method of analysis relies on the acoustic segmentation of the melodic line and the analysis of pause duration. Its parameterization is transparent, based on the following criteria: (i) Pause: Occurrence of a pause of at least 300 ms; (ii) Pitch lowering: Detection of an F0 pitch movement reaching a certain amplitude, defined as the difference in height between the last F0 extremum and the mean F0 over the entire portion of the signal preceding the pause; (iii) Pitch reset: Detection of a ‘jump’, defined as the difference in height between the last F0 extremum preceding the pause and the first F0 value following the pause. See Figure 7 for an example.
The tool has been used experimentally not only with French but also with two Afroasiatic languages: Kabyle (Berber, spoken primarily in Kabylie, Northern Algeria) and Hebrew (Semitic, spoken primarily in Israel) (Mettouchi et al. 2007). This approach will be used for Naija.
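The three boundary criteria listed above for Analor can be sketched as follows. This is not Analor's code: the function and parameter names are hypothetical, pitch values are assumed to be in semitones, and only the 300 ms pause threshold comes from the text. The two amplitude thresholds are placeholders, and the sketch deliberately reports each cue separately because the text does not state how the cues are combined.

```python
from dataclasses import dataclass


@dataclass
class BoundaryCues:
    pause: bool           # criterion (i)
    pitch_movement: bool  # criterion (ii)
    pitch_reset: bool     # criterion (iii)


def boundary_cues(
    pause_ms: float,
    last_extremum_st: float,   # last F0 extremum before the pause
    mean_before_st: float,     # mean F0 over the signal preceding the pause
    first_after_st: float,     # first F0 value after the pause
    min_pause_ms: float = 300.0,   # threshold given in the text
    min_movement_st: float = 4.0,  # placeholder value (assumption)
    min_reset_st: float = 3.0,     # placeholder value (assumption)
) -> BoundaryCues:
    """Evaluate the three cues at one candidate boundary."""
    # (ii) amplitude of the movement relative to the preceding mean F0
    movement = abs(last_extremum_st - mean_before_st)
    # (iii) size of the jump across the pause
    reset = abs(first_after_st - last_extremum_st)
    return BoundaryCues(
        pause=pause_ms >= min_pause_ms,
        pitch_movement=movement >= min_movement_st,
        pitch_reset=reset >= min_reset_st,
    )


strong = boundary_cues(pause_ms=350, last_extremum_st=-5.0,
                       mean_before_st=0.5, first_after_st=1.0)
weak = boundary_cues(pause_ms=120, last_extremum_st=0.0,
                     mean_before_st=0.5, first_after_st=0.5)
print(strong)
print(weak)
```

Keeping the thresholds as explicit parameters mirrors the point made above that the parameterization is transparent and may need adapting for Naija.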
4. PROMINENCES ANNOTATION AND GENERATION OF THE PROSODIC CONSTITUENTS WITHIN THE MACROPROSODIC UNIT
While it is generally recognised that speech is organised into units of various sizes which group together to make larger units expressed mostly through prosody (Lehiste, Olive & Streeter 1976), making it a plausible language universal, it remains to be decided whether languages share the same set of prosodic constituents, differing in whether and how those constituents are cued in the phonetics (see (Hyman 2011) for related discussion). The issue of the nature of these thresholds/boundaries is still unresolved, bringing us to consider that adaptations will be necessary for an accurate segmentation of the speech flow in Naija. Our hypotheses are based on widely described units such as the prosodic word and the intonation unit, and on Faraclas (1996) who describes for Nigerian Pidgin a system in which individual words are either pitch accented or tonal. In this description, all accented words bear an overt (or underlying) word final low tone; and a ‘phrase stress group’, defined syntactically as being made of a main verb and an adverbial or a non-subject noun phrase which is assigned a single stress, signalled by a falling pitch contour if the final tone is high, and the reverse if it is low. Thus, inside the macroprosodic unit, we hypothesise the following constituents for Naija:
- Prosodic word, typically characterized as being the domain of word stress, phonotactics and segmental word-level rules, here the marking of tone;
- Intonation unit, widely used as the basic unit of analysis. According to Chafe (1994) the IU is a speech unit closely associated with a "coherent intonation contour". The most frequent criteria suggested for the delimitation of an IU are: (1) pause; (2) final syllable lengthening or slow speech rate at the end of an IU, and a following (3) fast speech rate at the beginning of the next IU; (4) pitch reset (corresponding to Faraclas’ “phrase stress group”).
The segmentation into prosodic words and intonation units will be based on the annotation of prominences. The annotation of prominences, its limits and perspectives, has been amply discussed in linguistic corpus studies in the last few years. Three possible pitfalls are identified by researchers (see Wagner et al. 2015 for a review): (i) the subjectivity of the annotation; (ii) the nature of the annotation itself, whether it be categorical, scalar or continuous; (iii) the scale of the annotation, ranging from a manual annotation to a wide-scale automatic annotation of a large volume of data.
To limit annotation subjectivity, prosodic prominence will be annotated manually by two research assistants in a cyclical manner and verified by at least one of the team’s experts, following a strict protocol derived from the one developed for the annotation of prominences in French in the Rhapsodie project (http://www.projet-rhapsodie.fr/tuto/Codage%20Prosodique.pdf). Annotations will be indicated on an independent tier according to a two-level scale: a syllable can be strongly prominent (labelled ‘S’) or weakly prominent (labelled ‘W’). Finally, the robustness of the automatic labelling of prominence will be tested by comparing the automatic annotation provided by Analor to the manual annotation. Another tier in the annotation will concern those disfluencies which are marked prosodically (‘um’, syllabic extra-lengthening, filled pauses, etc.).
5. TONAL ANNOTATION
Finally, stylized melodic contours and tonal annotation will be computed automatically for each constituent in NTB 100 Kw (Rhapsodie Treebank: (Lacheret, Pietrandrea & Tchobanov 2014)). The F0 contour is represented by a set of five acoustic values for each given unit: (i) the initial value of the F0 on the unit, (ii) the final value of the F0 on the unit, (iii) the main saliency, i.e. the value corresponding to the most salient F0 peak – if one exists, (iv) the main saliency position, i.e. the time position of the main saliency, relative to the boundaries of the unit, and (v) the local register which corresponds to the mean F0 over the unit. All frequency values are expressed in semi-tones, with respect to the overall mean F0 of the speaker. Frequency values are represented with respect to 5 pitch levels covering the whole F0 range of the speaker: H (extreme high), h (high), m (medium), l (low), and L (extreme low). Each pitch level covers a range of 4 semi-tones centered on the average F0 value of the speaker. This final step will create time-aligned tiers containing information about the prosodic units of different levels as well as their corresponding pitch contours which will constitute the deliverable part of this WP: a prosodic level to the Naija corpus.
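The semitone normalisation and the five pitch levels described above can be sketched as follows. The band boundaries are an assumption: the text gives five levels of 4 semitones centred on the speaker's mean F0, which is read here as a middle band of ±2 semitones with open-ended extreme bands. The function names are hypothetical and this is not the Rhapsodie implementation.

```python
import math


def hz_to_semitones(f0_hz: float, speaker_mean_hz: float) -> float:
    """Express an F0 value in semitones relative to the speaker's mean F0."""
    if f0_hz <= 0 or speaker_mean_hz <= 0:
        raise ValueError("F0 values must be positive")
    return 12.0 * math.log2(f0_hz / speaker_mean_hz)


def pitch_level(semitones: float) -> str:
    """Map a semitone value to one of the five levels H, h, m, l, L.

    Assumed bands (4 semitones each, 'm' centred on the speaker mean):
    H >= +6, h in [+2, +6), m in [-2, +2), l in [-6, -2), L < -6.
    """
    if semitones >= 6.0:
        return "H"
    if semitones >= 2.0:
        return "h"
    if semitones >= -2.0:
        return "m"
    if semitones >= -6.0:
        return "l"
    return "L"


# Hypothetical speaker with a mean F0 of 200 Hz.
for f0 in (400.0, 252.0, 200.0, 160.0, 100.0):
    st = hz_to_semitones(f0, 200.0)
    print(f"{f0:6.1f} Hz -> {st:+6.2f} st -> {pitch_level(st)}")
```

Working in semitones relative to each speaker's mean makes the levels comparable across speakers with different registers.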
6. PHONETIC MEASUREMENTS AND ANALYSIS BY SYNTHESIS
The next step will use the annotated Praat textgrids to conduct an instrumental analysis in order to determine the melodic primitives of the ‘encoding schemes’ in Naija prosody and verify the structures and prominences discovered in step 4. A Praat script, ProsodyPro (Xu 2013) will extract measurements of the following acoustic correlates for each syllable: mean F0 (based on 10 evenly spaced F0 points from each labelled interval), min and max F0, final F0, pitch excursion, duration, mean intensity, and velocity (rate of change of F0).
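To make the list of acoustic correlates concrete, the sketch below derives them from a series of F0 samples taken inside one labelled syllable. It is a simplified illustration, not ProsodyPro's algorithm: the function name, the dictionary keys, the use of semitones for excursion and velocity, and the reduction of velocity to its peak value are assumptions. Intensity is left out because it requires the audio signal.

```python
import math
from typing import Dict, List


def syllable_measures(times_s: List[float], f0_hz: List[float]) -> Dict[str, float]:
    """Compute simple F0 measurements from samples within one syllable."""
    if len(times_s) != len(f0_hz) or len(f0_hz) < 2:
        raise ValueError("need at least two time-aligned F0 samples")
    f0_min, f0_max = min(f0_hz), max(f0_hz)
    # Rate of F0 change between consecutive samples, in semitones per second.
    velocities = [
        12.0 * math.log2(f0_hz[i + 1] / f0_hz[i]) / (times_s[i + 1] - times_s[i])
        for i in range(len(f0_hz) - 1)
    ]
    return {
        "mean_f0_hz": sum(f0_hz) / len(f0_hz),
        "min_f0_hz": f0_min,
        "max_f0_hz": f0_max,
        "final_f0_hz": f0_hz[-1],
        # Pitch excursion: span between lowest and highest F0, in semitones.
        "excursion_st": 12.0 * math.log2(f0_max / f0_min),
        "duration_s": times_s[-1] - times_s[0],
        # Velocity with the largest magnitude, sign preserved.
        "peak_velocity_st_per_s": max(velocities, key=abs),
    }


# Hypothetical rising syllable sampled at three points.
measures = syllable_measures([0.0, 0.1, 0.2], [100.0, 200.0, 200.0])
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```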
Statistical tests will evaluate the reliability of the measurements, which will be validated by remodelling the F0 contours of the tokens with the speech synthesis tool PENTATRAINER 2 (Xu & Prom-on 2014). The software extracts the optimal parametric values for the tested communicative functions, in this case the prosodic units and location of prominences, through analysis by synthesis controlled by simulated annealing. These results are then used to generate F0 contours and compare them with individual real utterances in the corpus. The results will confirm the validity of the perceptual prosodic annotation in step 4 and of the structures generated. The focus on an acoustic analysis based solely on the most primitive functions of prosody, that is, ‘chunking’ the flow of speech into units and highlighting certain parts of the speech uttered (prominence), will test the tool, which has been used successfully to establish the stress patterns in Savosavo (Papuan, Central Solomons) (Simard et al. 2014). It is possible that the resynthesis cycle will need to be repeated a number of times and that fine-tuning the annotation schemes will prove necessary for Naija.
The testing of the tools and preparation of guidelines for the research assistants will be conducted by the expert team, together with the Ph.D. candidate, during the first three months of the project (with technical support from the computational experts in WP2). Once the final methodologies and workflows have been agreed with all researchers (Team Project Meeting), the second phase will take place, i.e. the data analysis proper. Steps 1 and 2, the F0 cleaning and the segmentation into syllables with Prosogram, will be conducted by the research assistants. The expert team together with the Ph.D. candidate will check the segmentation and exclude or correct erroneous parses before the data is processed with Analor. Step 3 will be supervised by the expert team and the Ph.D. candidate, while in step 4, the annotation of prosodic prominences and disfluencies will first be done by research assistants who will work in teams of two, checking each other’s work. The expert team will verify the annotations and discuss any irregularities. Step 5 will bring together the expert team and the Ph.D. candidate. Step 6 will be conducted by Simard and the Ph.D. candidate. The Ph.D. candidate’s research project will be a description of the prosodic system of Naija, and he/she is expected to participate in all stages of the analysis up to step 6. The average time needed for each step is presented in the Table below, based on a five-minute long speech file. NB: steps (3) and (4) can be run simultaneously.
|1.||Conversion and correction of pitch track||45 min|
|2.||Segmentation into syllables with Prosogram, verification||30 min|
|3.||Automatic segmentation into macroprosodic units with verification||30 min|
|4.||Annotation of prominences||60 min|
|5.||Annotation of disfluencies||30 min|
|6.||Automatic generation of prosodic structure within the macroprosodic unit||Real time|
|7.||Stylized melodic contours||Real time|
|8.||Instrumental analysis||120 min|
As in WP2, the experience in annotating spoken French (disfluencies, prominences, etc.) acquired in the ANR project Rhapsodie and in the DOBES project DPLFB will constitute valuable expertise for the analysis of prosody and for the workflow management.
The integration of tools used in the projects mentioned above into a single workflow will strengthen the existing methodologies for the prosodic analyses of the world’s languages, by automating the more time-consuming tasks such as the correction of pitch tracks, and by including an instrumental acoustic analysis of the encoding at the phonetic level, thus supporting the results of Analor with those of PENTATRAINER.
- Manual: Annotation guidelines for the annotation of prosody in Naija.
- 100 Kw treebank for Naija with time-aligned tiers containing information about the prosodic units of different levels as well as their corresponding pitch contours; distributions of prominences and disfluencies.
- Database containing all tokens and the measurements of their prosodic correlates (mean F0, intensity, pitch excursion, velocity, duration, etc.), comprising both continuous data like time-normalized F0 contours and F0 velocity profiles suitable for graphical analysis, and discrete measurements suitable for statistical analysis.
A. Lacheret-Dujour; 1 doctoral student (36 months, full-time); language assistants for annotation (400 hours); 1 researcher (C. Simard; 36 months @ 35% = 13 person-months).
WP4. CORPUS ANALYSIS
WP4 brings together the results obtained in WP2 and WP3 to achieve a global review of Naija. The aim of this work package is twofold: (i) to explore how the syntactic and intonational components of Naija interact to carry out the task of information packaging, and question the supremacy of syntax in the structure of prosodic units; (ii) to study variation and change in Naija.
Sub-section 4.2 of the work package will be devoted to the intonosyntactic study of Naija. The NTB 100 Kw corpus collected, time-aligned and glossed in WP1, annotated for morphology and syntax in WP2 and for prosody in WP3, will be arranged in a tabular format so as to be queried with Trameur, a powerful textometric tool developed by Serge Fleury (Fleury & Zimina 2014), in order to show how intonosyntactic structures result from compromises and exchanges between internal construction principles (e.g. metrical constraints or internal rules for syntactic linearization) and communicative needs. The same Trameur will be used by sub-section 4.3 for the sociolinguistic study of variation.
Trameur (Fleury 2015) (http://www.tal.univ-paris3.fr/trameur/bases/rhapsodie2trameur-v8.pdf) (from Fr. trame ‘screen, framework’) is a statistical tool for analysing the behaviour of a given linguistic unit or feature in a corpus. It was initially developed to study word distribution in textometry and discourse analysis (Née et al. 2012). It can be applied to any tokenized text with labels on tokens and has been applied to Rhapsodie’s tabular format (Fleury & Zimina 2014).
A tabular format is particularly suited to encoding dependency analyses (see the CoNLL format; Nilsson, Riedel & Yuret 2007). In such a format, each word (or token) is on a separate line and the information attached to the word is distributed in columns (morphosyntactic features, governor identifier, grammatical function, time alignment, etc.). Any segmentation of the text (prosodic, macrosyntactic, etc.) can be encoded by indicating the initial and final words of segments (see Wang & Bawden 2015 for a description of the Rhapsodie tabular format).
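A minimal sketch of such a one-token-per-line table is given below. The column set, the tab separator, the tags and the time values are hypothetical and only illustrate the principle; they do not reproduce the Rhapsodie or CoNLL column layout.

```python
from typing import Dict, List

# Hypothetical column layout: one token per line, tab-separated.
COLUMNS = ["id", "form", "pos", "governor", "function", "start_ms", "end_ms"]

SAMPLE = (
    "1\tyou\tPRON\t3\tsubj\t0\t180\n"
    "2\tgo\tAUX\t3\taux\t180\t310\n"
    "3\tthink\tVERB\t0\troot\t310\t620\n"
)


def read_table(text: str) -> List[Dict[str, str]]:
    """Parse the tabular text into one dictionary per token."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        values = line.split("\t")
        if len(values) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
        rows.append(dict(zip(COLUMNS, values)))
    return rows


def dependents(rows: List[Dict[str, str]], governor_id: str) -> List[str]:
    """Return the forms of all tokens governed by the given token id."""
    return [row["form"] for row in rows if row["governor"] == governor_id]


rows = read_table(SAMPLE)
print(dependents(rows, "3"))
```

Because syntactic and time-alignment columns sit on the same line, a query can combine both kinds of information for each token without any further alignment step.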
Trameur allows us to cut the corpus into sections and to study a particular feature or combination of features based on diachronic, diatopic, diaphasic, diastratic, or genre metadata. Trameur evaluates the specificity of the frequency of a given form in a section according to the distribution of this form in the whole corpus and its occurrence probability in the section (see Lebart & Salem 1994 for the computation of specificity indices). Non-specific forms, that is, forms whose specificity does not exceed a given threshold, are said to be banal and characterize the corpus as a whole and not a particular section. It is thus possible to compute the forms that are over-employed (positive specific forms), as well as the forms that are under-employed (negative specific forms). The specificity of a section can then be identified by its characteristic forms, as well as the forms that do not characterize it at all (Lacheret, Kahane & Pietrandrea s.p.).
Furthermore, Trameur enables us to cut the sections into sub-sections. Samples can thus be cut down into macroprosodic units and/or into illocutionary units. It provides an elegant display of the distribution of a given feature in the sub-sections of a text, for instance the distribution of discourse markers inside illocutionary units in a sample (see Figure 8).
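The specificity computation referred to above is usually presented as a hypergeometric model: a section is treated as a random sample of the corpus, and the observed frequency of a form is compared with what that sampling would make likely. The sketch below follows this reading. It is not Trameur's code; the function names, the 0.05 threshold and the counts in the example are assumptions.

```python
from math import comb


def hypergeom_pmf(k: int, corpus_size: int, form_total: int, section_size: int) -> float:
    """P(exactly k occurrences of the form in a section drawn at random)."""
    return (
        comb(form_total, k)
        * comb(corpus_size - form_total, section_size - k)
        / comb(corpus_size, section_size)
    )


def specificity(form_in_section: int, corpus_size: int, form_total: int,
                section_size: int, threshold: float = 0.05) -> str:
    """Classify a form as 'positive', 'negative' or 'banal' in a section."""
    lowest = max(0, section_size - (corpus_size - form_total))
    highest = min(form_total, section_size)
    if not lowest <= form_in_section <= highest:
        raise ValueError("impossible frequency for this section")
    # Probability of seeing at least / at most the observed frequency.
    p_at_least = sum(hypergeom_pmf(k, corpus_size, form_total, section_size)
                     for k in range(form_in_section, highest + 1))
    p_at_most = sum(hypergeom_pmf(k, corpus_size, form_total, section_size)
                    for k in range(lowest, form_in_section + 1))
    if p_at_least < threshold:
        return "positive"  # over-employed in the section
    if p_at_most < threshold:
        return "negative"  # under-employed in the section
    return "banal"


# Hypothetical counts: a 1,000-token corpus and a 100-token section.
print(specificity(10, 1000, 20, 100))  # 10 of 20 occurrences fall in the section
print(specificity(2, 1000, 20, 100))   # 2 of 20, close to the expected share
print(specificity(0, 1000, 50, 100))   # none of 50 occurrences in the section
```

Exact binomial coefficients are practical only for small counts like these; a corpus of 500 Kw would call for log-probabilities or a statistics library.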
WP4.2 NAIJA INTONOSYNTAX
This task aims at:
- evaluating the diatopic variation affecting Naija through the comparison of the samples collected on campuses and on radios (WP1);
- assessing the degree of creolization (here intended as a social process of nativisation) of Naija as well as at analyzing its possible outcomes in terms of grammatical stabilization and expansion;
- evaluating the functional homogeneity of Naija through the analysis of its diaphasic and diamesic variation in different communicative settings ranging from informal conversations among friends, to Christian services and scripted radio programmes;
- comparing Naija and Nigerian English so as to evaluate the structural discreteness of the two languages.
More precisely, point (1) will take into consideration the diatopic variation due both to internal developments (i.e. grammaticalization and analogy) and contact-induced changes (i.e. borrowing and calquing) induced by the interference of different adstratal languages (e.g. Yoruba, Igbo, Hausa). Conversely, point (2) will focus exclusively on the internal changes affecting Naija in order to evaluate its degree of morphosyntactic complexification when compared to its lexifier language. Points (3) and (4), for their part, will deal with the issue of the post-creole continuum in an attempt to define a spectrum of mesolectal features between the more basilectal and acrolectal varieties of Naija. In this regard, it should be stressed that decreolisation still remains an uncertain notion, insufficiently distinguished from other processes of contact-induced change (Aceto 1999; Siegel 2010), and that we still lack in-depth analyses of the linguistic influence exerted by English on Naija. In view of the above, WP4 can potentially contribute to sharpening the structural criteria usually adopted for analysing a post-creole continuum with new data from Naija. Furthermore, given that contact between Naija and English may imply both unintentional morphosyntactic restructuring due to decreolisation and intentional codeswitching toward the lexifier language, WP4 also aims at identifying the syntactic and prosodic constraints on codeswitching and at differentiating it from the process of decreolisation on the basis of the instrumental analyses proposed by WP2 and WP3 (cf. Manfredi & Petrollino 2013).
Three hypotheses will be tested:
- Hypothesis #1: Educated Naija is more stabilized due to the geographical and social mobility of its speakers;
- Hypothesis #2: Educated Naija will reveal a greater influence from English with more borrowing and more syntactic restructuring.
- Hypothesis #3: Scripted oral Naija will reveal an even stronger influence from English, with more borrowing and syntactic restructuring than in unscripted oral Naija.
In order to test the above hypotheses, the data collected and integrated into RNC will be analyzed from both a qualitative and a quantitative perspective. On the one hand, the qualitative analysis aims at identifying the more salient features in the corpus. The following list gives a few non-exclusive directions of research: (i) vocabulary: loanwords from English and vernacular languages; (ii) semantics: tense and aspect marking; (iii) syntax: Serial Verb Constructions vs. prepositions; (iv) information structure: relative importance of topical vs. thematic constructions. On the other hand, the quantitative analysis will focus on the variation affecting the salient features that have been identified by the Trameur statistical module. Through logistic regression, using the rms package for R (Harrell 2015), they will be quantitatively correlated in the corpus with the geographical and social variables identified from the demographic sociolinguistic questionnaires on the one hand, and with the scripted/non-scripted factor on the other hand.
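The logic of the planned logistic regression can be illustrated with a small self-contained sketch. It is written in Python with a plain gradient-ascent fit purely for illustration and does not reproduce the rms package for R named above. The data are invented: the predictor is a speaker variable (1 = educated) and the outcome is the choice of a prepositional construction (1) over a Serial Verb Construction (0).

```python
import math
from typing import List, Sequence


def fit_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                 learning_rate: float = 0.5, epochs: int = 5000) -> List[float]:
    """Fit intercept + coefficients by gradient ascent on the log-likelihood."""
    weights = [0.0] * (len(X[0]) + 1)  # weights[0] is the intercept
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, outcome in zip(X, y):
            z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = outcome - predicted
            gradient[0] += error
            for j, x in enumerate(features):
                gradient[j + 1] += error * x
        weights = [w + learning_rate * g / len(X) for w, g in zip(weights, gradient)]
    return weights


# Invented counts: 7 of 10 tokens from educated speakers are prepositional,
# against 3 of 10 tokens from the other speakers.
X = [[1.0]] * 10 + [[0.0]] * 10
y = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 7
intercept, educated_coef = fit_logistic(X, y)
print(f"intercept = {intercept:.3f}, coefficient for 'educated' = {educated_coef:.3f}")
print(f"odds ratio = {math.exp(educated_coef):.2f}")
```

A positive coefficient means that the odds of the prepositional variant are higher for educated speakers in this toy data set, which is the kind of correlation Hypothesis #2 would predict.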
All the members of the project will be involved in the assessment of the corpus, under the responsibility of S. Fleury for WP4.1, B. Caron for WP4.2 and S. Manfredi for WP4.3. A. Mettouchi will participate as an expert in WP4.2; C. Ofulue and F. Egbokhare in WP4.3.
3. PROMOTION AND IMPACT OF THE PROJECT
The 500,000 word RNC and its derived products (tree banks, prosodic representations) will be archived on the project website housed by the Huma-Num TGIR. The resource (audio and annotation) files will be placed on a platform for permanent archiving (CINES-IN2P3) where their metadata will be harvested by the Isidore query platform. All our resources will be distributed in open access with a Creative Commons licence.
An open source web search engine, e.g. Annis (corpus-tools.org), will enable researchers to query the different layers of annotation corresponding to the prosodic and syntactic structures, and view the results graphically, textually and in connection with the sound files. A similar platform is under development within the context of ANR Orféo that will give access to the 10 Gw treebank displayed by the project: http://ortolang107.inist.fr/annis-gui/ (work in progress, not for public use). The corpus will also be made accessible by a simpler interface with a search by keywords and a faceted search for metadata. See the second interface for Orfeo based on Solr here: http://ortolang107.inist.fr.
The results of WP2 and WP3 will be submitted to the relevant peer-reviewed international conferences and scientific journals: for WP3, e.g. Journal of Phonetics, Phonetica, Speech Communication; for WP2, mostly NLP conferences, e.g. SLTU (International Workshop on Spoken Language Technologies for Under-resourced Languages, http://mica.edu.vn/sltu2016/), LAW (Linguistic Annotation Workshop), TLT (Treebanks and Linguistic Theories), ACL (Association for Computational Linguistics), COLING (International Conference on Computational Linguistics) and LREC (Language Resources and Evaluation Conference), but also NLP journals (e.g. TAL, published by Atala). Likewise, the sociolinguistic results of the project will be submitted to conferences, e.g. SPCL (Society for Pidgin and Creole Linguistics) and NWAV (New Ways of Analyzing Variation), and to journals, e.g. the Journal of Pidgin and Creole Languages.
The final reporting conference on Naija, organised in Ibadan in conjunction with IFRA-Nigeria (UMIFRE 24) and the Institute of African Studies of the University of Ibadan, will gather specialists in corpus studies, sociolinguistics, pidgins and creoles, prosody and NLP to assess the results of the project. These results will be published, together with the NLP tools developed to produce them, in the form of a reference book on Naija.
This innovative approach to the dynamics of contact and change in the areas of human behaviour and the sociology of language is expected to have a strong impact on the methodology and technology of research on emerging languages. It is ground-breaking in that, for the first time, new NLP tools integrating syntax, intonation and information structure will be applied to a large, deeply annotated corpus to build a gold-standard benchmarking database.
Last but not least, it is hoped that the project will provide the annotated data and the NLP tools necessary to produce speech recognition systems that can be implemented in smartphones. This would open up wide development prospects in a country of over 160 million inhabitants, where a large part of the population is illiterate yet has access to modern communication tools, used for example in dematerialized banking operations via smartphones.
A. Tchobanov, S. Fleury, K. Gerdes, C. Chanard, M. Aouini.
4. THE PARTNERS: RELEVANCE AND COMPLEMENTARITY
The NaijaSynCor project brings together the members of UMR 8135, Llacan (“Langage, Langues et Cultures d’Afrique Noire”; Inalco-CNRS, Paris-Villejuif), of UMR 7114, Modyco (“Modèles, Dynamiques, Corpus”; Université de Paris Ouest Nanterre La Défense, EHESS, CNRS, Ile-de-France Ouest et Nord), and leading specialists of Naija (Profs. Egbokhare and Ofulue and David Esizimetor for Nigeria; Prof. Deuber for Germany), who will act as experts to advise on and evaluate the project.
Research at Llacan is based on primary data collected during extended periods of immersion fieldwork in Africa. Among the strong assets of Llacan are the analysis of linguistic policies, the promotion of African languages, and reflection on the use of modern technologies for digitising and processing linguistic and literary data of African languages. Llacan has coordinated four projects along those lines that have been or are currently financed by the ANR. The Llacan members of this team have acquired solid expertise in fieldwork, data collection and corpus building, as demonstrated by the CorpAfroAs and Cortypo projects coordinated by Amina Mettouchi. The Coordinator and PI of NaijaSynCor, B. Caron, formerly Professor of Hausa at Inalco and a specialist of Chadic languages spoken in Northern Nigeria, has used the expertise he acquired as a member of these projects, together with the knowledge of Nigeria and Naija he gained through numerous field trips and extended secondments to IFRA (1990-1992; 2006-2011), to build a pilot corpus of Naija (Caron 2012). This pilot corpus gave birth to the present project, which will capitalize on the PI’s experience and on the engineering expertise of C. Chanard and M. Aouini in the use and development of Elan-Corpa (Chanard 2014). N. Aubry, a senior lecturer in Yoruba at Inalco, brings to NaijaSynCor his knowledge of and expertise in NLP and the typology of Southern Nigerian languages. S. Manfredi, a member of Sedyl (UMR 8202) who has worked with Llacan as a post-doc on the CorpAfroAs project, is a sociolinguist who has specialised in African pidgins, creoles and emerging languages. He will be responsible in WP4 for the study of variation in the Naija corpus, while the PI will coordinate the syntactic / intonational / informational component. Finally, C. Simard will join the Llacan team, bringing her expertise on the prosody of lesser-described languages, especially in relation to language contact and variation (Schultze-Berndt, Simard & Wegener 2011). While she will continue teaching general linguistics as a Lecturing Fellow at SOAS, she will work for the project as a part-time researcher (35%), collaborating with A. Lacheret of Modyco on the intonation of Naija in WP3.
The Modyco team is composed of 4 researchers from Modyco (Kahane, Lacheret, Pietrandrea, Tchobanov) and 3 researchers from Université Paris 3 Sorbonne Nouvelle (Fleury, Gerdes, and Tellier). These researchers are used to working together. They have all been associated with the ANR projects Rhapsodie and/or Orfeo and have already co-authored dozens of publications. Moreover, Kahane and the 3 researchers from Paris 3 are involved in the same NLP master's programme (www.plurital.org), run jointly by Paris Ouest Nanterre and Paris 3.
Modyco (Modèles, Dynamiques, Corpus) has extensive experience in the development of corpora of Spoken French: PFC (Phonologie du Français Contemporain, www.projet-pdfc.net), Colaje (Communication langagière chez le jeune enfant, colaje.scicog.fr), CIEL-F (Corpus International Ecologique de la Langue française, www.ciel-f.org). Modyco is part of the Equipex Ortolang, whose goal is to promote existing corpora and provide tools for experimental linguistics.
The two teams, Modyco and Llacan, are familiar with each other’s research and methodology. Mettouchi, Lacheret, and Pietrandrea have jointly run a working group of the IRCOM consortium (Consortium Corpus Oraux et Multimodaux; http://ircom.huma-num.fr/site/p.php?p=groupetravail1). Caron has already applied the Rhapsodie macrosyntax model to Zaar (Caron, Aouini & Chanard 2015). Kahane has supervised a PhD on an African language (Olivier Bondélle, Polysémie du wolof, 2015) and was a reviewer of Aubry's PhD on Yoruba.
The two teams are complementary: Llacan specializes in African languages and thus has strong experience in the development of resources for under-described languages, while Modyco has strong experience in the development of syntactic and prosodic annotations on Spoken French corpora (ANR Rhapsodie and Orfeo). Both teams have developed tools for corpus development (ELAN for transcription and glossing, Arborator for dependency annotation, Analor for prosodic annotation, as well as a platform for Orfeo distribution) and are fully operational for the WPs they are in charge of.