2. Data management, model description and testing

The formula-data interface is a critical advantage of the R software. It provides a practical way to describe the model to be estimated and to store data. However, the usual interface is not flexible enough to deal correctly with random utility models. Therefore, mlogit provides tools to construct richer data frames and formulas.

Data management

It comes with several data sets that we’ll use to illustrate the features of the library. Data sets used for multinomial logit estimation concern some individuals, that make one or a sequential choice of one alternative among a set of mutually exclusive alternatives. The determinants of these choices are covariates that can depend on the alternative and the choice situation, only on the alternative or only on the choice situation.

To illustrate this typology of the covariates, consider the case of repeated choices of destinations for vacations by families:

The unit of observation is therefore the choice situation, and it is also the individual if there is only one choice situation per individual observed, which is often the case. Such data have therefore a specific structure that can be characterized by three indexes: the alternative, the choice situation and the individual. These three indexes will be denoted alt, chid and id. Note that the distinction between chid and id is only relevant if we have repeated observations for the same individual.

Data sets can have two different shapes: a wide shape (one row for each choice situation) or a long shape (one row for each alternative and, therefore, as many rows as there are alternatives for each choice situation). mlogit deals with both format. It depends on the dfidx function of the dfidx package which takes as first argument a data frame and returns a dfidx object, which is a data frame in “long” format with a special data frame column which contains the indexes.

Wide format

data("Train", package = "mlogit")
Train$choiceid <- 1:nrow(Train)
head(Train, 3)

  choiceid id choice price_A price_B time_A time_B change_A change_B
1        1  1      A    2400    4000    150    150        0        0
2        2  1      A    2400    3200    150    130        0        0
3        3  1      A    2400    4000    115    115        0        0
  comfort_A comfort_B
1         1         1
2         1         1
3         1         0

This data set contains data about a stated preference survey in Netherlands. Each individual has responded to several (up to 16) scenarios. For every scenario, two train trips are proposed to the user, with different combinations of four attributes: price (the price in cents of guilders), time (travel time in minutes), change (the number of changes) and comfort (the class of comfort, 0, 1 or 2, 0 being the most comfortable class). This “wide” format is suitable to store choice situation (or individual specific) variables because, in this case, they are stored only once in the data. Otherwise, it is cumbersome for alternative specific variables because there are as many columns for such variables that there are alternatives.

For such a wide data set, the shape argument of dfidx is mandatory, as its default value is "long". The alternative specific variables are indicated with the varying argument which is a numeric vector that indicates their position in the data frame.² This argument is then passed to stats::reshape that coerced the original data frame in “long” format. Further arguments may be passed to stats::reshape. For example, as the names of the variables are of the form price_A, one must add sep = "_" (the default value being "."). The choice argument is also mandatory because the response has to be transformed into a logical value in the long format. In “wide” format, there is no alternative index. The choice situation situation index is not mandatory as tere is one line for each choice situation. In this data set, there is a choice situation index called id and it is nested in the individual index called choiceid. To take the panel dimension into account, idx is a named vector of lenght one, the value being the choice situation index and the name a the individual index. The idnames argument is used to give relevant name for the second index, the NA in the first position indicating that the name of the first index should be unchanged.

Tr <- dfidx(Train, shape = "wide", varying = 4:11, sep = "_",
            choice = "choice",
            idx = c(id = "choiceid"), idnames = c(NA, "alt"),
            opposite = c("price", "comfort", "time", "change"))

Note the use of the opposite argument for the 4 covariates: we expect negative coefficients for all of them, taking the opposite of the covariates will lead to expected positive coefficients. We next convert price and time in more meaningful unities, hours and euros (1 euro was $2.20371$ guilders):

print(Tr, n = 3)

~~~~~~~
 first 3 observations out of 5858 
~~~~~~~
  choice price time change comfort idx
1   TRUE   -10   -2      0      -1 1:A
2  FALSE   -18   -2      0      -1 1:B
3   TRUE   -10   -2      0      -1 2:A

~~~ indexes ~~~~
  choiceid id alt
1        1  1   A
2        1  1   B
3        2  1   A
indexes:  1, 1, 2

An idx column is added to the data, which contains the three relevant indexes: choiceid is the choice situation index, alt the alternative index and id is the individual index. This column can be extracted using the idx funtion:

head(idx(Tr), 3)

  choiceid id alt
1        1  1   A
2        1  1   B
3        2  1   A
indexes:  1, 1, 2

Long format

ModeCanada,³ is an example of a data set in long format. It presents the choice of individuals for a transport mode for the Ontario-Quebec corridor:

data("ModeCanada", package = "mlogit")
head(ModeCanada, 3)

  case   alt choice dist  cost ivt ovt freq income urban noalt
1    1 train      0   83 28.25  50  66    4     45     0     2
2    1   car      1   83 15.77  61   0    0     45     0     2
3    2 train      0   83 28.25  50  66    4     25     0     2

There are four transport modes (air, train, bus and car) and most of the variable are alternative specific (cost for monetary cost, ivt for in vehicle time, ovt for out of vehicle time, freq for frequency). The only choice situation specific variables are dist (the distance of the trip), income (household income), urban (a dummy for trips which have a large city at the origin or the destination) and noalt the number of available alternatives. The advantage of this shape is that there are much fewer columns than in the wide format, the caveat being that values of dist, income and urban are repeated four times. For data in “long” format, the shape and the choice arguments are no more mandatory.

To replicate published results later in the text, we’ll use only a subset of the choice situations, namely those for which the 4 alternatives are available. This can be done using the subset function with the subset argument set to noalt == 4 while estimating the model. This can also be done within dfidx, using the subset argument.

The information about the structure of the data can be explicitly indicated using choice situations and alternative indexes (respectively case and alt in this data set) or, in part, guessed by the dfidx function. Here, after subsetting, we have 2779 choice situations with 4 alternatives, and the rows are ordered first by choice situation and then by alternative (train, air, bus and car in this order).

The first way to read correctly this data frame is to ignore completely the two index variables. In this case, the only supplementary argument to provide is the alt.levels argument which is a character vector that contains the name of the alternatives in their order of appearance:

Note that this can only be used if the data set is “balanced”, which means than the same set of alternatives is available for all choice situations. It is also possible to provide an argument idx which indicates the name of the variable that contains the names of the two indexes (choice situation and alternatives in this order). If only the the second index is indicated, the first element of the vector should be set to NA:

The name of the variable that contains the information about the choice situations can be indicated through the idx argument:

More simply, as the two indexes are stored in the first two columns of the original data frame, the idx argument can be unset:

and the indexes can be kept as stand alone series if the drop.index argument is set to FALSE:

MC <- dfidx(ModeCanada, subset = noalt == 4, idx = c("case", "alt"),
            drop.index = FALSE)
print(MC, n = 3)

~~~~~~~
 first 3 observations out of 11116 
~~~~~~~
  case   alt choice dist   cost ivt ovt freq income urban noalt
1  109 train      0  377  58.25 215  74    4     45     0     4
2  109   air      1  377 142.80  56  85    9     45     0     4
3  109   bus      0  377  27.52 301  63    8     45     0     4
       idx
1 109:rain
2  109:air
3  109:bus

~~~ indexes ~~~~
  case   alt
1  109 train
2  109   air
3  109   bus
indexes:  1, 2

Model description

Standard formulas are not very practical to describe random utility models, as these models may use different sets of covariates. Actually, working with random utility models, one has to consider at most four sets of covariates:

The first three sets of covariates enter the observable part of the utility which can be written, alternative $j$:

\[ V_{ij}=\alpha_j + \beta x_{ij} + \nu t_j + \gamma_j z_i + \delta_j w_{ij} . \]

As the absolute value of utility is irrelevant, only utility differences are useful to modelise the choice for one alternative. For two alternatives $j$ and $k$, we obtain:

\[ V_{ij}-V_{ik}=(\alpha_j-\alpha_k) + \beta (x_{ij}-x_{ik}) + (\gamma_j-\gamma_k) z_i + (\delta_j w_{ij} - \delta_k w_{ik}) + \nu(t_j - t_k). \]

It is clear from the previous expression that coefficients of choice situation specific variables (the intercept being one of those) should be alternative specific, otherwise they would disappear in the differentiation. Moreover, only differences of these coefficients are relevant and can be identified. For example, with three alternatives 1, 2 and 3, the three coefficients $\gamma_1, \gamma_2, \gamma_3$ associated to a choice situation specific variable cannot be identified, but only two linear combinations of them. Therefore, one has to make a choice of normalization and the simplest one is just to set $\gamma_1 = 0$.

Coefficients for alternative and choice situation specific variables may (or may not) be alternative specific. For example, transport time is alternative specific, but 10 mn in public transport may not have the same impact on utility than 10 mn in a car. In this case, alternative specific coefficients are relevant. Monetary cost is also alternative specific, but in this case, one can consider than 1$ is 1$ whatever it is spent for the use of a car or in public transports. In this case, a generic coefficient is relevant.

The treatment of alternative specific variables doesn’t differ much from the alternative and choice situation specific variables with a generic coefficient. However, if some of these variables are introduced, the $\nu$ parameter can only be estimated in a model without intercepts to avoid perfect multicolinearity.

Individual-related heteroscedasticity (see Swait and Louviere 1993) can be addressed by writing the utility of choosing $j$ for individual $i$: $U_{ij}=V_{ij} + \sigma_i \epsilon_{ij}$, where $\epsilon$ has a variance that doesn’t depend on $i$ and $j$ and $\sigma_i^2 = f(v_i)$ is a parametric function of some individual-specific covariates. Note that this specification induces choice situation heteroscedasticity, also denoted scale heterogeneity.⁴. As the overall scale of utility is irrelevant, the utility can also be writen as: $U_{ij}^* = U_{ij} / \sigma_i = V_{ij}/\sigma_i + \epsilon_{ij}$, i.e., with homoscedastic errors. if $V_{ij}$ is a linear combination of covariates, the associated coefficients are then divided by $\sigma_i$.

A logit model with only choice situation specific variables is sometimes called a multinomial logit model, one with only alternative specific variables a conditional logit model and one with both kind of variables a mixed logit model. This is seriously misleading: conditional logit model is also a logit model for longitudinal data in the statistical literature and mixed logit is one of the names of a logit model with random parameters. Therefore, in what follows, we’ll use the name multinomial logit model for the model we’ve just described whatever the nature of the explanatory variables used.

To describle the model to be estimated, mlogit uses Formula objects provided by the Formula package.⁵ The Formula package provides richer formulas, which accept multiple responses (a feature not used here) and multiple set of covariates. It has in particular specific model.frame and model.matrix methods which can be used with one or several sets of covariates.

To illustrate the use of Formula objects, we use again the ModeCanada data set and consider three sets of covariates that will be indicated in a three-part formula, which refers to the first three items of the four points list at start of this section.

Some parts of the formula may be omitted when there is no ambiguity. For example, the following sets of formulas are identical:

By default, an intercept is added to the model, it can be removed by using + 0 or - 1 in the second part.

A model.frame method is provided for dfidx objects. It differs from the formula method by the fact that the returned object is an object of class dfidx and not an ordinary data frame, which means that the information about the structure of the data is not lost. Defining a specific model.frame method for dfidx objects implies that the first argument of the function should be a dfidx object, which results in an unusual order of the arguments in the function (the data first, and then the formula). Moreover, as the model matrix for random utility models has specific features, we add a supplementary argument called pkg to the dfidx function so that the returned object has a specific class (and inherits the dfidx class):

MC <- dfidx(ModeCanada, subset = noalt == 4,
            alt.levels = c("train", "air", "bus", "car"),
            pkg = "mlogit")
class(MC)

[1] "dfidx_mlogit" "tbl_dfidx"    "dfidx"        "tbl_df"      
[5] "tbl"          "data.frame"

f <- Formula(choice ~ cost | income  | ivt)
mf <- model.frame(MC, f)
class(mf)

[1] "dfidx_mlogit" "tbl_dfidx"    "dfidx"        "tbl_df"      
[5] "tbl"          "data.frame"

Using mf as the argument of model.matrix enables the construction of the relevant model matrix for random utility model as a specific model.matrix method for dfidx_mlogit objects is provided.

print(model.matrix(mf), n = 3)

# [11116 x 11]
  (Intercept):air (Intercept):bus (Intercept):car   cost income:air
1               0               0               0  58.25          0
2               1               0               0 142.80         45
3               0               1               0  27.52          0
  income:bus income:car ivt:train ivt:air ivt:bus ivt:car
1          0          0       215       0       0       0
2          0          0         0      56       0       0
3         45          0         0       0     301       0

The model matrix contains $J-1$ columns for every choice situation specific variables (income and the intercept), which means that the coefficient associated to the first alternative (train) is set to 0. It contains only one column for cost because we want a generic coefficient for this variable. It contains $J$ columns for ivt, because it is an alternative specific variable for which we want alternative specific coefficients.

Testing

As for all models estimated by maximum likelihood, three testing procedures may be applied to test hypothesis about models fitted using mlogit. The set of hypothesis tested defines two models: the unconstrained model that doesn’t take these hypothesis into account and the constrained model that impose these hypothesis.

This in turns define three principles of tests: the Wald test, based only on the unconstrained model, the Lagrange multiplier test (or score test), based only on the constrained model and the likelihood ratio test, based on the comparison of both models.

Two of these three tests are implemented in the lmtest package (Zeileis and Hothorn 2002): waldtest and lrtest. The Wald test is also implemented as linearHypothesis (or lht) in package car (Fox and Weisberg 2019), with a fairly different syntax. We provide special methods of waldtest and lrtest for mlogit objects and we also provide a function for the Lagrange multiplier (or score) test called scoretest.

We’ll see later that the score test is especially useful for mlogit objects when one is interested in extending the basic multinomial logit model because, in this case, the unconstrained model may be difficult to estimate.

Bibliography

Ben-Akiva, M., D. Bolduc, and M. Bradley. 1993. “Estimation of Travel Choice Models with Randomly Distributed Values of Time.” Papers 9303. Laval - Recherche en Energie. https://ideas.repec.org/p/fth/lavaen/9303.html.

Bhat, Chandra R. 1995. “A Heteroscedastic Extreme Value Model of Intercity Travel Mode Choice.” Transportation Research Part B: Methodological 29 (6): 471–83. https://doi.org/10.1016/0191-2615(95)00015-6.

Forinash, C. V., and F. S. Koppleman. 1993. “Application and Interpretation of Nested Logit Models and Intercity Mode Choice.” Transportation Record 1413: 98–106.

Fox, John, and Sandford Weisberg. 2019. An R Companion to Applied Regression. Third. Thousand Oaks CA: Sage.

Koppelman, Franck S., and Chieh-Hua Wen. 1998. “Alternative Nested Logit Models: Structure, Properties and Estimation.” Transportation Research B 32 (5): 289–98.

Koppelman, Frank S., and Chieh-Hua Wen. 2000. “The Paired Combinatorial Logit Model: Properties, Estimation and Application.” Transportation Research Part B: Methodological 34 (2): 75–89. https://doi.org/10.1016/S0191-2615(99)00012-0.

Meijer, Erik, and Jan Rouwendal. 2006. “Measuring Welfare Effects in Models with Random Coefficients.” Journal of Applied Econometrics 21 (2): 227–44. https://doi.org/10.1002/jae.841.

Swait, J., and J. Louviere. 1993. “The Role of the Scale Parameter in the Estimation and Use of Multinomial Logit Models.” Journal of Marketing Research 30.

Zeileis, Achim, and Yves Croissant. 2010. “Extended Model Formulas in R: Multiple Parts and Multiple Responses.” Journal of Statistical Software 34 (1): 1–13. https://doi.org/10.18637/jss.v034.i01.

Zeileis, Achim, and Torsten Hothorn. 2002. “Diagnostic Checking in Regression Relationships.” R News 2 (3): 7–10.

Used by Ben-Akiva, Bolduc, and Bradley (1993) and Meijer and Rouwendal (2006).↩︎
Actually the shape argument is not mandatory if the varying argument is set, as this latter argument is only relevant for data sets in “wide” format.↩︎
Used in particular by (Forinash and Koppleman 1993), Bhat (1995), Franck S. Koppelman and Wen (1998) and Frank S. Koppelman and Wen (2000).↩︎
This kind of heteroscedasticity shouldn’t be confused with alternative heteroscedasticity ($\sigma^2_j \neq \sigma^2_k$) which is introduced in the heteroskedastic logit model described in vignette relaxing the iid hypothesis ↩︎
See Zeileis and Croissant (2010) for a description of the Formula package.↩︎