Complex Segment Learner

Observed/Expected Counting Utility

This utility calculates the Observed/Expected (O/E) statistic (Trubetzkoy 1939) for sequences of segments in a learning data file. The calculation can be restricted to local (adjacent) sequences, or extended to nonlocal (non-adjacent) sequences. For example, if you are interested in the co-occurrence of [p, t, k], the local calculation counts only one sequence in [p a t k a], namely "t k". The nonlocal calculation counts "p, t" and "t, k" as well as "p, k". The calculation is as follows:

  1. OBSERVED: this is the simple count of how often a sequence occurs in your data.
  2. EXPECTED: multiplies the count of the first segment (S1) occurring as the first member of a pair by the count of the second segment (S2) occurring as the second member of a pair, each counted over all segments under consideration, and divides the product by the total number of attested pairs of segments under consideration.
  3. In short: O/E = N(S1S2) / (N(S1)*N(S2)/N(all pairs))
  4. Note that O/E is a relative measure, not an absolute one. For example, the values for pairs drawn from [p t k] can change if the segments [b d g] are added to the set you are counting, because the counts of attested pairs change.
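
The calculation described above can be sketched in Python. This is an illustration only, not the utility's own code. It assumes that each word in the learning data is a sequence of space-separated segments, that N(S1) counts pairs with S1 as the first member, and that N(S2) counts pairs with S2 as the second member. The names count_pairs and observed_over_expected, and the sample words, are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_pairs(words, targets, adjacent_only=True):
    """Count ordered pairs (s1, s2) of target segments within each word.

    Assumption: a word is a list of segment strings.
    """
    pairs = Counter()
    for word in words:
        if adjacent_only:
            # Local: the two segments must be string-adjacent.
            candidates = zip(word, word[1:])
        else:
            # Nonlocal: every pair of positions i < j, in word order.
            candidates = combinations(word, 2)
        for s1, s2 in candidates:
            if s1 in targets and s2 in targets:
                pairs[(s1, s2)] += 1
    return pairs


def observed_over_expected(pairs):
    """Return O/E = N(S1S2) / (N(S1) * N(S2) / N(all pairs)) per pair."""
    total = sum(pairs.values())
    first = Counter()   # how often each segment is the first member
    second = Counter()  # how often each segment is the second member
    for (s1, s2), n in pairs.items():
        first[s1] += n
        second[s2] += n
    return {
        (s1, s2): observed / (first[s1] * second[s2] / total)
        for (s1, s2), observed in pairs.items()
    }


# Hypothetical learning data: one word per string, segments separated by spaces.
words = [w.split() for w in ["p a t k a", "t a p a", "k a t a", "p a k a"]]
targets = {"p", "t", "k"}

local_pairs = count_pairs(words, targets, adjacent_only=True)
nonlocal_pairs = count_pairs(words, targets, adjacent_only=False)

for pair, value in sorted(observed_over_expected(nonlocal_pairs).items()):
    print(pair, round(value, 3))
```

Because every count is taken over the pairs formed by the target set, rerunning the sketch with a larger set (for example, adding b, d, g to targets) changes the totals and can therefore change the O/E value for a pair such as "p, k".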

The requirements for the input file are the same as for the other utilities on this site; see Help.

Language name

Learning data file

Sequences to count

Select if you want only adjacent segments to be counted.