We now have over 100,000 words in the database. Data entry and source acquisition have continued.

Things will go into a bit of a hiatus while I move the grant from Rice to Yale.

I see it’s been a while since I posted an update.

  • Data entry continues. We have over 50,000 records now.
  • I’ve given several talks on the Karnic subgroup work over the last few months, at UCSD, Yale and the LSA, and will be giving another at Rice on Thursday.
  • Source acquisition has been going well.
  • I’ll be writing an interim report for submission by the end of the month. I’ll post that here.
  • Data entry and source material gathering continue.
  • We have begun entering body part and basic vocabulary data and work continues on the proofing of Curr vocabulary items. Over three quarters of the Curr lists are done now.
  • Some database maintenance was needed. That was done. Backing up from the entry database now occurs when a particular language has been entered, rather than at the end of the session. That has made it much easier for me to keep track of what has been entered into the master database.
  • I have added a number of Karnic wordlists to the ‘overall’ database, from my sources and more recent work.
  • We received several donations of electronic data and several further promises, which are greatly appreciated.
  • There are now two off-site backup locations [in Washington, DC and Rochester, NY].

(The blog has had 463 hits and the website 163.)

Here is the first set of information regarding the database structure I am using for the comparative database. Links will follow in a separate post.

Tables:

* Sources
* Languages
* Data [Data I've imported, only Karnic at this stage]
* Reconstructions
* Personnel (not used in the end)
* Labor
* Notes (not really used much)
* BasicData [Basic vocab lists which RAs are typing]
* CurrData [Curr wordlists]
* CurrSources [Source information from Curr]
* BodyData [Body part data]
* CurrEnglishList [list of English glosses in the Curr lists to facilitate sorting]

TABLE_Languages

* LangRecordNumber [autogenerated]
* Variety [equivalent to doculect; the name used by the source]
* Language [standardised name]
* Notes (data quality and orthography notes. Not filled in at present)
* Subgroup
* Group
* Family [None of these are used at present; to be filled in as evidence emerges]
* Attested [to allow disambiguation of proto-languages from attested languages, since we'll be entering reconstructions from other sources]
* AIATSIS_Code
* Source [Links to source database, with bibliographic and other details]

TABLE_variety

* Variety [linked to Language Table, must exist in TABLE_Languages]
* RecordNumber
* OriginalForm [word as given in source]
* PhonemicisedForm [standardised; generated periodically from script for recent sources, then checked]
* PartofSpeech
* ParadigmNote
* Gloss [gloss as given in source]
* SemanticField [standardised; I have about 15 at the moment]
* GeneralisedGloss ['cover' gloss, e.g. neutralising 'belly' vs 'stomach', 'angry' vs 'cheeky'. To be filled in by a script at some point]
* CognateCode [links to Reconstructions database]
* LoanCode [links to Reconstructions database]
* LoanSource
* EtymologicalNote
* Source [Links to source database]
* +housekeeping fields involving record creation and modification

The Curr, Body Part and Basic data tables are all structured identically.
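As a rough illustration of how the GeneralisedGloss field might eventually be filled in by a script, here is a minimal Python sketch. The lookup table, the function name and the choice of which member of each pair acts as the cover term are all assumptions made for the example; only the two gloss pairs themselves come from the field description above.

```python
# Hypothetical cover-gloss table: keys are glosses as found in sources,
# values are the 'cover' gloss they collapse to. Which member of each pair
# serves as the cover term is an assumption made for this sketch.
COVER_GLOSSES = {
    "stomach": "belly",
    "cheeky": "angry",
}


def generalise_gloss(gloss: str) -> str:
    """Return the cover gloss for a source gloss.

    Glosses without an entry in the table are returned unchanged apart
    from trimming whitespace and lower-casing.
    """
    key = gloss.strip().lower()
    return COVER_GLOSSES.get(key, key)


# Records are plain dicts here; Gloss and GeneralisedGloss are the field
# names from the table description above.
records = [
    {"Gloss": "Stomach"},
    {"Gloss": "belly"},
    {"Gloss": "cheeky"},
]
for record in records:
    record["GeneralisedGloss"] = generalise_gloss(record["Gloss"])
```

The original Gloss is left untouched, so the source's wording is still available after the cover gloss has been added.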

TABLE_Reconstructions

* RecRecordNumber [Unique Identifier]
* ReconstructionLevel [Proto-language. Note that this isn't linked to anything at present]
* Form
* Gloss
* PartofSpeech
* SemanticField
* Status
* LoanCode [housekeeping field; probably unnecessary]
* Note
* other housekeeping fields
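
For readers who think more easily in SQL than in Filemaker layouts, the links between the tables can be sketched roughly as below. This is only an approximation in SQLite (via Python), not the actual Filemaker setup: the column types, the cut-down field lists and the use of foreign keys to express "must exist in TABLE_Languages" and "links to Reconstructions database" are assumptions, and the inserted values are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce links by default

# Cut-down versions of three tables; most fields are omitted for brevity.
conn.executescript("""
CREATE TABLE Languages (
    LangRecordNumber INTEGER PRIMARY KEY,
    Variety          TEXT NOT NULL UNIQUE,  -- name used by the source
    Language         TEXT,                  -- standardised name
    Attested         INTEGER                -- 1 = attested, 0 = proto-language
);
CREATE TABLE Reconstructions (
    RecRecordNumber     INTEGER PRIMARY KEY,
    ReconstructionLevel TEXT,
    Form                TEXT,
    Gloss               TEXT
);
CREATE TABLE Data (
    RecordNumber INTEGER PRIMARY KEY,
    Variety      TEXT NOT NULL REFERENCES Languages(Variety),
    OriginalForm TEXT,
    Gloss        TEXT,
    CognateCode  INTEGER REFERENCES Reconstructions(RecRecordNumber)
);
""")

# Placeholder values only; these are not real varieties or forms.
conn.execute(
    "INSERT INTO Languages (Variety, Language, Attested) VALUES (?, ?, ?)",
    ("VarietyA", "LanguageA", 1),
)
conn.execute(
    "INSERT INTO Data (Variety, OriginalForm, Gloss) VALUES (?, ?, ?)",
    ("VarietyA", "form1", "gloss1"),
)


def insert_rejected(variety: str) -> bool:
    """Return True if a Data row for this variety is refused by the link check."""
    try:
        conn.execute(
            "INSERT INTO Data (Variety, OriginalForm, Gloss) VALUES (?, ?, ?)",
            (variety, "form2", "gloss2"),
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

In this sketch a word can only be entered for a variety that already has a row in Languages, which is the constraint described for the Variety field above.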

If you’ve downloaded my paper on modelling Karnic using NeighborNet, please discard it. There’s a fairly major error in the data coding which skewed everything. It’s now fixed and I’ll repost the paper with the corrected data soon.

Please find below a list of the Curr (1886) vocabulary lists which have been processed as part of this grant. Most of the lists were already typed up and appear in ASEDA. We have proofread the lists.

We will be adding more information about the modern names of the varieties described here over the next few months.

Claire was on fieldwork in October:

  • Yolngu fieldwork and sociolinguistic observation.
  • Did a lot more work on Yan-nhangu.
    • Dictionary work
    • Translation of examples into Dhuwal
    • Community work
    • Learner’s guide and dictionary publication negotiations
  • Made inquiries about who was in Milingimbi and what language groups were represented
  • More interviews about who speaks what, which languages are closest to Yan-nhangu
  • Informal education for literacy and oral history

Back in Houston:

  • Curr typing
  • Body part and basic vocab data entry
  • That is, continued data entry

To do:

  • the backup system could do with some refinement.
  • an off-site backup system/schedule is urgently needed (and will be fixed by today)

Dictionary:

  • translated most of the headwords into Djambarrpuyŋu/Dhuwal
  • translated all of the example sentences into Djambarrpuyŋu/Dhuwal
  • checked all the questionable entries
  • compiled cultural information for the majority of the flora and fauna entries
  • consulted about picture for front cover

Grammar:

  • Continued with grammatical elicitation, although this was less a priority than dictionary work on this trip

Learner’s guide:

  • Discussed format, contents and other information with people
  • Discussed the front cover

Yolngu Dialectology:

  • Conducted a number of informal interviews about language attitudes
  • Made contact with a speaker of Gorlpa (closest related language to Yan-nhangu) to discuss the possibility of doing some work.
  • Other interviews were left until the next field trip, when it will be possible to travel to other communities. The intervention has made accommodation difficult and there wasn’t enough time to get the appropriate permissions from all concerned.

This update’s a bit early, because I’ll be in Adelaide for ALS next week.

  • Data entry started in earnest. Data entry is proceeding for the Curr wordlists (cut short by the discovery that they’ve already been mostly keyboarded and are in ASEDA, so focus has moved to checking and preparing the data for importing)
  • Entry of body part data and basic vocab has started.
  • We’ve been obtaining data left, right and centre (my guess would be more than 100 individual works).
  • We’ve had four promises of data for the morphology database.
  • The PI is preparing for a field trip to Milingimbi in October.
  • The PI will be in Adelaide for the ALS/ILC conference and will be giving a brief presentation about the project at the ILC conference, as well as talking about work from the project in the ALS main session.
  • The first off-site data backup of the main database was done.

To do:

  • a list of the Curr vocabularies, along with modern language names (where known), will appear on the web site as soon as possible.
  • The processed data will be redeposited with ASEDA.

Rationale

In short, it would be useful to have a searchable morpheme list for Pama-Nyungan languages. Therefore, I am compiling one as part of the NSF CAREER grant “Pama-Nyungan and the Prehistory of Australia”. Such a database is a major undertaking, though, so I’m putting out a general request for data contributions.

I’m well aware of the problems of reconstructing morphology in isolation, so this will not be a reconstructions database per se. On the other hand, when reconstructing, say, Karnic, it’s very useful to know whether other languages (in the region or elsewhere) have an ablative morpheme -mu. Currently there’s no easy way to find out this sort of thing.

What the database will contain

I will be distributing an Excel file (and other formats, if requested). The file will be an export from my main Filemaker database, which further includes reconstructions, source lists, language/variety/doculect lists, and other information. A sample of what the morphology database will look like is available here.

Exported database fields:

RecordNumber
DateEntered
Contributor
Variety
Source
OriginalForm
StandardisedForm
OriginalGloss
StandardisedGloss
Environment
PartOfSpeech
OtherNote
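
To give a feel for how the exported file could be searched once it is saved as CSV, here is a small Python sketch that looks for the kind of thing mentioned in the rationale (which varieties have an ablative -mu?). The field names are the ones listed above; everything else, including the sample rows, the contributor and variety names and the helper functions, is invented purely for illustration and is not real data.

```python
import csv
import io

FIELDS = [
    "RecordNumber", "DateEntered", "Contributor", "Variety", "Source",
    "OriginalForm", "StandardisedForm", "OriginalGloss", "StandardisedGloss",
    "Environment", "PartOfSpeech", "OtherNote",
]

# Invented placeholder rows in FIELDS order; not real data from any language.
SAMPLE_ROWS = [
    ["1", "", "ContributorX", "VarietyA", "SourceA", "-mu", "-mu", "from", "ablative", "", "suffix", ""],
    ["2", "", "ContributorX", "VarietyB", "SourceB", "-mu", "-mu", "with", "comitative", "", "suffix", ""],
    ["3", "", "ContributorY", "VarietyC", "SourceC", "-nga", "-ŋa", "from", "ablative", "", "suffix", ""],
]


def build_export(rows):
    """Write the rows as CSV text with a header line, standing in for the export."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buffer.getvalue()


def find_morphemes(export_text, form=None, gloss=None):
    """Return rows whose standardised form and/or gloss match exactly."""
    matches = []
    for row in csv.DictReader(io.StringIO(export_text)):
        if form is not None and row["StandardisedForm"] != form:
            continue
        if gloss is not None and row["StandardisedGloss"] != gloss:
            continue
        matches.append(row)
    return matches


export_text = build_export(SAMPLE_ROWS)
ablative_mu = find_morphemes(export_text, form="-mu", gloss="ablative")
varieties = [row["Variety"] for row in ablative_mu]
```

Searching on the standardised fields rather than the original ones is what makes a query like this work across sources with different orthographies and glossing habits.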

Who will have access?

The database will be on a password-protected site. Anyone who has contributed data will be given the password. If you need access to the database and aren’t in a position to contribute data, please send me an email outlining why you want access to the database.

A database like this will be updated frequently. Therefore, things will get very confusing if there is more than one source for the database file. Furthermore, I need to track usage and download statistics as part of the grant conditions. I have decided to make this downloadable (rather than queryable online) because I assume that will be of more use to users. You may not pass on the database (or the password) to any third party. Please refer interested parties to me instead.

How will updates work?

Updates will be available regularly throughout the project — at this stage, I anticipate that updates will be released approximately 4 times per year, although more frequent updates may also be considered, depending on how much data we are able to include.

Contributing data

What languages are needed?

Any and all Pama-Nyungan languages. A list of languages for which data has been contributed (along with any notes about completeness and the source of the data) will appear on the download page. The utility of the database depends to a great extent on how many languages we can include.

Published or unpublished data can be contributed. However, if the data are unpublished and you are not the collector, I need some sort of statement that the collector gives permission to pass on the data. We do not have the resources to check this sort of thing ourselves. If the collector is not in a position to give permission (e.g. because they’re no longer with us), we’ll need some other indication that we are not going against their wishes (or violating anyone’s copyright) by including the data.

Is there a due date for contributions?

No, although the bulk of the project will run for the next three years, and the sooner we get the data, the sooner it can be included.

What format does the data need to be in?

We will accept data in just about any electronic format. That includes examples and tables cut and pasted from Word documents, text files, lexical databases, and so on. However, you will make our lives vastly easier if you send us structured data (e.g. Shoebox backslash-coded data, Excel spreadsheets, Filemaker data, XML data, etc.).

Here is a template for data entry, in Excel. A sample with Yolŋu data can be found here.

Please avoid abbreviations (e.g. we would like to avoid situations where it’s impossible to tell if IMP means ‘imperative’ or ‘imperfect’). If you do use abbreviations, please also include a key.
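
To show why a key matters, here is a minimal Python sketch of how abbreviated glosses could be expanded from a contributor's key, with anything missing from the key flagged rather than guessed. The function name, the rule that an abbreviation is a run of two or more capital letters, and the example glosses are all assumptions for the sake of illustration.

```python
import re


def expand_abbreviations(gloss, key):
    """Expand capitalised abbreviations in a gloss using the contributor's key.

    Returns the expanded gloss and a list of abbreviations that were not in
    the key; those are left as they are instead of being guessed at.
    """
    missing = []

    def replace(match):
        abbreviation = match.group(0)
        if abbreviation in key:
            return key[abbreviation]
        missing.append(abbreviation)
        return abbreviation

    # Assumption: an abbreviation is a standalone run of 2+ capital letters.
    expanded = re.sub(r"\b[A-Z]{2,}\b", replace, gloss)
    return expanded, missing


# Hypothetical key supplied with a contribution.
key = {"IMP": "imperative"}

expanded, missing = expand_abbreviations("go-IMP", key)
unresolved_gloss, unresolved = expand_abbreviations("go-IMPF", key)
```

With the key, IMP is expanded unambiguously; without an entry, IMPF is reported back so that someone can ask the contributor what it stands for.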

Does the data need to be complete?

No. However, the more comprehensive the data, the better the database will be. You can also send us data in installments, but please don’t resend earlier data that’s already been included. We will do our best to avoid duplication.

What if I see an error in the database?

If you find an error, please email us a copy of the relevant entries, along with a description of the nature of the error (typo, wrong language, wrong source, etc.) and we’ll fix it.

What will be done with the data once it’s submitted?

We will add your data to the database, standardise the orthography (while retaining the orthography of the original in a separate field), and convert the structure and glosses to a format which allows for standard searching across the database.
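
As a rough idea of what standardising the orthography while keeping the original involves, here is a minimal Python sketch. The single replacement rule (ng to ŋ) and the function names are assumptions for illustration only; real conversion rules differ from source to source, and a naive rule like this one would wrongly merge genuine n+g sequences, so the output would still need checking by hand.

```python
# Hypothetical replacement rules for one source: (original spelling, standard).
RULES = [("ng", "ŋ")]


def standardise(form, rules=RULES):
    """Apply replacement rules to a form, longest original spelling first."""
    for old, new in sorted(rules, key=lambda rule: len(rule[0]), reverse=True):
        form = form.replace(old, new)
    return form


def add_standardised_form(record):
    """Return a copy of the record with StandardisedForm added.

    OriginalForm is left untouched so the source's spelling is preserved.
    """
    updated = dict(record)
    updated["StandardisedForm"] = standardise(record["OriginalForm"])
    return updated


entry = add_standardised_form({"OriginalForm": "Yolngu"})
```

Keeping both fields means a search can run on the standardised spelling while the form exactly as the source gave it remains available for checking.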

We will be working on this project as time permits (in addition to the PI, there are about 5 research assistants).

How should the files be submitted?

Please email files to proto.pama.nyungan@gmail.com.

Quoting from the database:

I’d be grateful if you acknowledged the database in your work. Please quote the following:

Bowern, Claire (compiler) 2007. Pama-Nyungan morphological database. Version X. URL: http://www.owlnet.rice.edu/~ppny/morphologydatabase.htm

The version information will appear in the file name of the database and on the main database page.

What about lexical information?

A lexical database is also in progress. It will contain body parts and basic vocab and is part of a long-term reconstruction project. While the morphology database is being compiled as a side-project, the lexicon project is a much bigger undertaking and we hope the full database will be published at the end of the project, along with reconstructions. Further information will be available shortly. We will be humbugging specialists in different areas, but we’d also be happy to accept digital data donations. Please contact me at the above email address for more information.
