Quantitative methods in historical linguistics

Undergraduate seminar given at the University of Konstanz, Summer semester 2018.

1. Course description

Recent years have witnessed a surge of formal and computational methods in the traditionally qualitative field of historical linguistics: these methods include quantitative tools for establishing language family relationships, describing patterns of change in historical corpora, and modelling change in the framework of dynamical systems theory. In this course, we take a hands-on look at some of these techniques: the goal is not just to learn about them, but to learn to apply them in practice. To this end, the course includes plenty of computer exercises, both on empirical data and on theoretical models. No previous computer programming experience is required.

2. Slides

  1. Introduction to the course (18 April 2018)
  2. The comparative method (2 May 2018)
  3. Subgrouping and trees (9 May 2018)
  4. Computational phylogenetics 1 (16 May 2018)
  5. Computational phylogenetics 2 (23 May 2018)
  6. Databases and corpora 1 (6 June 2018)
  7. Databases and corpora 2 (13 June 2018)
  8. Databases and corpora 3 (27 June 2018)
  9. Dynamical systems 1 (4 July 2018)
  10. Dynamical systems 2 (11 July 2018)
  11. Dynamical systems 3 (18 July 2018)

3. Tutorials

  1. Introduction to R (25 April 2018)
  2. Basic plots in R (27 June 2018)

4. Portfolio exercises

  1. Portfolio exercises (all combined)

5. Datasets

  1. asjp-small.csv (copyright)
  2. ellegard.csv (copyright)
  3. ellegard_full.csv (copyright)
  4. german_fortition.csv (copyright)
  5. wals_IE.csv (copyright)

6. Scripts

6.1 cretest.R – likelihood ratio test for constant rate effects

# cretest.R / Henri Kauhanen 2018
#
# A likelihood ratio test for constant rate effects (pitting a CRE
# model against a null model with a fixed slope across contexts),
# using Wilks' theorem.

cretest <- function(alt,
                    null) {
  RSS_alt <- 0
  pars_alt <- 0
  n_alt <- 0
  for (a in alt) {
    RSS_alt <- RSS_alt + deviance(a)
    pars_alt <- pars_alt + nrow(summary(a)$parameters)
    n_alt <- n_alt + length(summary(a)$residuals)
  }
  RSS_null <- 0
  pars_null <- 0
  n_null <- 0
  for (a in null) {
    RSS_null <- RSS_null + deviance(a)
    pars_null <- pars_null + nrow(summary(a)$parameters)
    n_null <- n_null + length(summary(a)$residuals)
  }
  if (n_alt != n_null) {
    stop("unequal numbers of data points in the two model sets")
  }
  LR <- 0.5*n_alt*log(RSS_null/RSS_alt)
  chi <- 2*LR
  df <- pars_alt - pars_null
  p <- pchisq(chi, df=df, lower.tail=FALSE)
  cat("Likelihood ratio test\n\n")
  cat(paste("     L-ratio:", round(LR, 3), "\n"))
  cat(paste("  chi-square:", round(chi, 3), "\n"))
  cat(paste("          df:", df, "\n"))
  cat(paste("     p-value:", round(p, 3), "\n"))
  invisible(list(chisquare=chi, LR=LR, df=df, p=p))
}
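
cretest() expects two lists of fitted nls models covering the same data points: typically one logistic fit per context for the alternative, and a single shared-slope fit for the null. Under Gaussian errors, twice the log of the likelihood ratio of two nested least-squares models equals n*log(RSS_null/RSS_alt), which by Wilks' theorem is approximately chi-square distributed with degrees of freedom equal to the difference in parameter counts; this is what the function computes. A minimal usage sketch on simulated data (all variable names below are hypothetical):

# Simulate two contexts following logistic time courses with a
# shared slope (so the null model is true)
set.seed(1)
t <- seq(-5, 5, by=0.5)
d <- data.frame(t = rep(t, 2),
                ctx = factor(rep(c("a", "b"), each=length(t))))
d$p <- 1/(1 + exp(-(d$t + ifelse(d$ctx == "a", 0, 1)))) +
  rnorm(nrow(d), sd=0.03)

# Alternative model: each context has its own slope and intercept
alt <- lapply(split(d, d$ctx), function(dd)
  nls(p ~ 1/(1 + exp(-(s*t + k))), data=dd, start=list(s=1, k=0)))

# Null model: one slope shared across contexts, with
# context-specific intercepts
null <- list(nls(p ~ 1/(1 + exp(-(s*t + k[ctx]))), data=d,
                 start=list(s=1, k=c(0, 0))))

cretest(alt, null)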

6.2 hdist.R – compute Hamming distance matrices

# hdist.R / Henri Kauhanen 2018
#
# Computes a Hamming distance matrix for a set of languages. Input is
# a matrix (or data frame) of language vectors, with each row
# corresponding to a language and each column to a feature. Distances
# are normalized (divided by the number of features) if requested.

hdist <- function(df,
                  normalize = TRUE) {
  distmtx <- matrix(NA, nrow=nrow(df), ncol=nrow(df))
  languages <- rownames(df)
  for (i in 1:nrow(df)) {
    for (j in 1:nrow(df)) {
      # we only need the lower triangle, since distances are symmetric
      if (i > j) {
        x <- df[languages[i], ]
        y <- df[languages[j], ]
        if (length(x) != length(y)) {
          stop("x and y not of same length!")
        }
        if (normalize) {
          N <- length(x)
        } else {
          N <- 1
        }
        hamm <- sum(x != y)
        distmtx[i,j] <- (1/N)*hamm
      }
    }
  }
  colnames(distmtx) <- languages
  rownames(distmtx) <- languages
  distmtx
}
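
A quick illustration on made-up data (three languages scored on four binary features; all names are hypothetical):

# Three languages, four binary features
feats <- matrix(c(1, 0, 1, 1,
                  1, 1, 1, 0,
                  0, 1, 0, 0),
                nrow=3, byrow=TRUE,
                dimnames=list(c("lang1", "lang2", "lang3"),
                              paste0("f", 1:4)))

# With the default normalize=TRUE, each entry is the proportion
# of features on which the two languages differ
hdist(feats)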

6.3 ldist.R – compute Levenshtein distance matrices

# ldist.R / Henri Kauhanen 2018
#
# Computes a Levenshtein distance matrix for a set of languages. Input is
# a matrix (or data frame) of language vectors, with each row
# corresponding to a language and each column to a word/cognate. Distances
# are normalized (divided by the number of words) and length-corrected
# (divided by the longer of the two words in each comparison) if requested.

ldist <- function(df,
                  normalize = FALSE,
                  correct = FALSE) {
  distmtx <- matrix(NA, nrow=nrow(df), ncol=nrow(df))
  languages <- rownames(df)
  for (i in 1:nrow(df)) {
    for (j in 1:nrow(df)) {
      # we only need the lower triangle, since distances are symmetric
      if (i > j) {
        x <- df[languages[i], ]
        y <- df[languages[j], ]
        if (length(x) != length(y)) {
          stop("x and y not of same length!")
        }
        if (normalize) {
          N <- length(x)
        } else {
          N <- 1
        }
        leven <- 0
        for (k in 1:length(x)) {
          a <- x[k]
          b <- y[k]
          if (correct) {
            C <- max(nchar(a), nchar(b))
          } else {
            C <- 1
          }
          leven <- leven + adist(a, b)/C
        }
        distmtx[i,j] <- (1/N)*leven
      }
    }
  }
  colnames(distmtx) <- languages
  rownames(distmtx) <- languages
  distmtx
}
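
A quick illustration on a toy word list (hypothetical data; rows are languages, columns are concepts):

# Two languages, two concepts, as transcribed word forms
words <- matrix(c("hand", "water",
                  "main", "eau"),
                nrow=2, byrow=TRUE,
                dimnames=list(c("English", "French"),
                              c("HAND", "WATER")))

# Length-corrected edit distances, averaged over the word list
ldist(words, normalize=TRUE, correct=TRUE)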

7. Animations

7.1 NLS fit of logistic function
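
The fit shown in the animation can be recreated in a few lines of R; a minimal sketch on simulated data:

# Fit a logistic function p(t) = 1/(1 + exp(-(s*t + k))) by
# nonlinear least squares
set.seed(2)
t <- seq(-6, 6, by=0.5)
p <- 1/(1 + exp(-(0.8*t + 0.5))) + rnorm(length(t), sd=0.04)
fit <- nls(p ~ 1/(1 + exp(-(s*t + k))), start=list(s=1, k=0))
summary(fit)

# Plot the data together with the fitted curve
plot(t, p)
lines(t, fitted(fit))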

7.2 Error landscape of the fit
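
The error landscape can be visualized by evaluating the residual sum of squares of the logistic model over a grid of parameter values; a minimal sketch, redefining the simulated data from the previous sketch so it runs on its own:

# Simulated logistic data, as in the previous sketch
set.seed(2)
t <- seq(-6, 6, by=0.5)
p <- 1/(1 + exp(-(0.8*t + 0.5))) + rnorm(length(t), sd=0.04)

# Residual sum of squares as a function of slope (s) and intercept (k)
rss <- function(s, k) sum((p - 1/(1 + exp(-(s*t + k))))^2)
s_grid <- seq(0.2, 1.6, length.out=60)
k_grid <- seq(-1.5, 2.5, length.out=60)
z <- outer(s_grid, k_grid, Vectorize(rss))

# Contour plot of the landscape; its minimum sits at the NLS estimate
contour(s_grid, k_grid, z, xlab="s (slope)", ylab="k (intercept)")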