Skip to content

Word Field Analysis

Nils Reiter edited this page May 25, 2018 · 25 revisions

2.1.0

This guide is also available as a vignette in the R console: vignette(Word-Field-Analysis).


A word field is basically of a list of words that belong to a common theme / topic / semantic group.

Word Field Definition

For demo purposes (this is really a toy example), we will define the word field of Love as containing the words Liebe and Herz. In R, we can put them in a character vector:

wf_love <- c("liebe", "herz")

We will test this word field on Emilia Galotti, which should be about love.

data("rksp.0")

Single word field

The core of the word field analysis collecting statistics about a dictionary. Therefore, we use the function called dictionaryStatisticsSingle() (single, because we only want to analyse a single word field):

dstat <- dictionaryStatisticsSingle(
  rksp.0$mtext,         # the text we want to process
  wordfield=wf_love,   # the word field
  column="Token.lemma" # we count lemmas instead of surface forms
)

summary(dstat)
##   corpus      drama                figure        x     
##  test:13   rksp.0:13   angelo         :1   Min.   : 0  
##                        appiani        :1   1st Qu.: 0  
##                        battista       :1   Median : 2  
##                        camillo_rota   :1   Mean   : 3  
##                        claudia_galotti:1   3rd Qu.: 5  
##                        conti          :1   Max.   :10  
##                        (Other)        :7

We can visualise these counts in a simple bar chart:

# remove figures not using these words at all
dstat <- dstat[dstat$x>0,] 

barplot(dstat$x,                   # what to plot
        names.arg = dstat$figure,  # x axis labels
        las=3,                     # turn axis labels
        cex.names=0.6,             # smaller font on x axis
        col=qd.colors              # colors
      )

Apparently, the prince is mentioning these words a lot.

Obviously we would want to normalize according to the total number of spoken words by a figure:

dstat <- dictionaryStatisticsSingle(
  rksp.0$mtext,       # the text we want to process
  wordfield=wf_love, # the word field
  normalizeByFigure = TRUE,   # apply normalization
  column = "Token.lemma"
)

# remove figures not using these words at all
dstat <- dstat[dstat$x>0,]

barplot(dstat$x, 
        names.arg = dstat$figure,  # x axis labels
        las=3,             # turn axis labels
        cex.names=0.8,     # smaller font on x axis
        col=qd.colors
      )

Multiple Word Fields

The function dictionaryStatistics() can be used to analyse multiple dictionaries at once. To this end, dictionaries are represented as lists of character vectors. The (named) outer list contains the keywords, the vectors are just words associated with the keyword.

New dictionaries can be easily like this:

wf <- list(Liebe=c("Liebe","Herz","Schmerz"), Hass=c("Hass","hassen"))

This dictionary contains the two entries Liebe (love) and Hass (hate), with 3 respective 2 entries. Dictionaries can be created in code, like shown above. In addition, the function loadFields() can be used to download dictionary content from a URL or a directory. By default, the function loads this dictionary from GitHub (that we used in publications), for the keywords Liebe and Familie (family).

wf <- loadFields()
names(wf)
## [1] "Liebe"   "Familie"

The function loadFields() offers parameters to load from different URLs via http or to load from plain text files that are stored locally. The latter can be achieved by specifying the directory as baseurl. Entries for each keyword should then be stored in a file named like the keyword, and ending with txt (by default, can be overwritten). See ?loadFields for details. Some of the options can also be specified through dictionaryStatistics(), as exemplified below.

The following examples use the baseurl https://raw.githubusercontent.com/quadrama/metadata/ec8ae3ddd32fa71be4ea0e7f6c634002a002823d/fields/, i.e., a specific version of the fields we have been using in QuaDramA.

baseUrl <- "https://raw.githubusercontent.com/quadrama/metadata/ec8ae3ddd32fa71be4ea0e7f6c634002a002823d/fields/"
wfields <- loadFields(fieldnames = c("Liebe", "Krieg", "Familie", "Ratio", "Religion"), 
                      baseurl = baseUrl)
dstat <- dictionaryStatistics(
  rksp.0$mtext,  # the text
  fields = wfields,
  names = TRUE,                 # use figure names (instead of ids)
  normalizeByFigure = TRUE,   # normalization by figure
  normalizeByField = TRUE,    # normalization by field
  column = "Token.lemma"        # lemma-based stats
)
colnames(dstat)
## [1] "corpus"   "drama"    "figure"   "Liebe"    "Krieg"    "Familie" 
## [7] "Ratio"    "Religion"

The variable dstat now contains multiple columns, one for each word field.

Bar plot by figure

We can now display the distribution of the words of a single figure over these word fields:

barplot(as.matrix(dstat[9,4:8]), # we select Emilia's line
        main="Emilia's speech",    # plot title
        col=qd.colors             # more colors
        )

Bar plot by field

… but we can also analyse who uses words of a certain field how often

barplot(as.matrix(dstat$Liebe),
        main="Use of love words", # title for plot
        beside = TRUE,            # not stacked
        names.arg = dstat[[3]],  # x axis labels
        las=3,                    # rotation for labels
        col=qd.colors,            # more colors
        cex.names = 0.7           # font size
        )

Dictionary Based Distance

Technically, the output of dictionaryStatistics() is a data.frame. This is suitable for most uses. In some cases, however, it’s more suited to work with a matrix that only contains the raw numbers (i.e., number of family words). Calculating character distance based on dictionaries, for instance. For these cases, the function offers a parameter asList. If it is set to TRUE, the function returns a list with four to six components. The first three components (corpus, drama, figure, Number.Act, Number.Scene) contain meta data, while the fourth component (mat) contains the values as a matrix.

You typically want to assign rownames to this matrix. If you’re studying individual characters, their name might be a good idea.

dsl <- dictionaryStatistics(rksp.0$mtext, 
                            fields = wfields,
                            normalizeByFigure=TRUE,
                            asList=TRUE)
dsl$mat
##                        Liebe       Krieg     Familie       Ratio
## angelo           0.001474926 0.002949853 0.002949853 0.007374631
## appiani          0.006178288 0.004413063 0.009708738 0.006178288
## battista         0.000000000 0.000000000 0.009852217 0.000000000
## camillo_rota     0.009433962 0.009433962 0.000000000 0.047169811
## claudia_galotti  0.005147403 0.004679457 0.013570426 0.004211511
## conti            0.013089005 0.000000000 0.005235602 0.002617801
## der_kammerdiener 0.000000000 0.000000000 0.000000000 0.000000000
## der_prinz        0.005222402 0.002881325 0.005042319 0.003781740
## emilia           0.006347863 0.003385527 0.023698688 0.004655099
## marinelli        0.006537102 0.003180212 0.009363958 0.004063604
## odoardo_galotti  0.003825780 0.004414361 0.015303119 0.006180106
## orsina           0.005064146 0.002700878 0.005739365 0.004726536
## pirro            0.000000000 0.000000000 0.008196721 0.002732240
##                      Religion
## angelo           0.0014749263
## appiani          0.0000000000
## battista         0.0000000000
## camillo_rota     0.0000000000
## claudia_galotti  0.0028076743
## conti            0.0000000000
## der_kammerdiener 0.0000000000
## der_prinz        0.0019809112
## emilia           0.0038087177
## marinelli        0.0010600707
## odoardo_galotti  0.0029429076
## orsina           0.0006752194
## pirro            0.0027322404

Every character is now represent with five numbers, which can be interpreted as a vector in five-dimensional space. This means, we can easily apply distance metrics supplied by the function dist() (from the default package stats). By default, dist() calculates Euclidean distance. The data structure here is now similar to the one in the weighted configuration matrix, which means everything from this vignette can be applied here.

cdist <- dist(dsl$mat)
cdist
##                       angelo     appiani    battista camillo_rota
## appiani          0.008576233                                     
## battista         0.010727547 0.009789698                         
## camillo_rota     0.041230126 0.042548483 0.050000566             
## claudia_galotti  0.011876731 0.005272340 0.009372194  0.045589856
## conti            0.013176341 0.009995957 0.014124025  0.045985347
## der_kammerdiener 0.008725781 0.013786849 0.009852217  0.049020306
## der_prinz        0.005620639 0.005890912 0.008771328  0.044413377
## emilia           0.021616936 0.014616078 0.016722984  0.049292418
## marinelli        0.008829181 0.002713518 0.008409666  0.044659307
## odoardo_galotti  0.012800639 0.006744784 0.010520928  0.044493260
## orsina           0.005327082 0.004742606 0.008523610  0.043580472
## pirro            0.007844699 0.008903535 0.004203682  0.047194792
##                  claudia_galotti       conti der_kammerdiener   der_prinz
## appiani                                                                  
## battista                                                                 
## camillo_rota                                                             
## claudia_galotti                                                          
## conti                0.012839727                                         
## der_kammerdiener     0.016067651 0.014338287                             
## der_prinz            0.008765600 0.008689166      0.008900903            
## emilia               0.010339048 0.020407039      0.025486492 0.018806504
## marinelli            0.004994935 0.008561460      0.012576478 0.004628360
## odoardo_galotti      0.002951752 0.015099654      0.017752832 0.010782503
## orsina               0.008370175 0.008766389      0.009416830 0.001772273
## pirro                0.008914242 0.013695566      0.009061816 0.006869623
##                       emilia   marinelli odoardo_galotti      orsina
## appiani                                                             
## battista                                                            
## camillo_rota                                                        
## claudia_galotti                                                     
## conti                                                               
## der_kammerdiener                                                    
## der_prinz                                                           
## emilia                                                              
## marinelli        0.014610523                                        
## odoardo_galotti  0.008998903 0.007223067                            
## orsina           0.018288736 0.004015547     0.010158260            
## pirro            0.017231492 0.007666719     0.009826303 0.006869313

In particular, we can show these distances in a network:

require(igraph)
## Loading required package: igraph

## Warning: package 'igraph' was built under R version 3.4.4

## Loading required package: methods

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union
g <- graph_from_adjacency_matrix(as.matrix(cdist), 
                                 weighted=TRUE,     # weighted graph
                                 mode="undirected", # no direction
                                 diag=FALSE         # no looping edges
                                )

# Now we plot
plot.igraph(g, 
            layout=layout_with_fr,       # how to lay out the graph
            main="Dictionary distance network",  # title
            vertex.label.cex=0.6,         # label size
            vertex.label.color="black",   # font color
            vertex.color=qd.colors[5],    # vertex color
            vertex.frame.color=NA,        # no vertex border
            edge.width=E(g)$weight*100    # scale edges according to their weight 
                                          # (since the distances are so small, we multiply)
            )  

Development over the course of the play

Since version 2.0 of this package, we can also count word fields by segment, i.e., by act or scene in a play. To do that, we add the parameter segment="Act" when we call the function dictionaryStatistics(). With the function individualize(), we can then re-organize this output so that each character has its own list.

dsl <- dictionaryStatistics(rksp.0$mtext, 
                            fields = wfields,
                            normalizeByFigure=TRUE,
                            asList=TRUE,
                            segment="Scene")

dsl <- regroup(dsl)

We can now easily plot the theme progress over the course of the play:

matplot(dsl$marinelli$mat,type="l",col="black")
legend(x="topleft",legend=colnames(dsl$marinelli$mat),lty=1:5)

Or cumulatively add up the numbers:

mat <- apply(dsl$marinelli$mat,2,cumsum)
matplot(mat,type="l",col="black")

# Add act lines
al <- rle(dsl$marinelli$Number.Act)
abline(v=cumsum(al$lengths[-1]))

legend(x="topleft",legend=colnames(mat),lty=1:5)