Little Lost Lambda: How Any Randomized Controlled Trial or Opinion Poll Could Be Misleading


Yves here. Terry Flynn, a regular in the comments section and an expert in statistical analysis, has let us publish a critique of a fundamental issue in many regressions, which is omitting lambda. This sort of thing is over my pay grade, so I hope my introduction is not off the mark.

Some esteemed academics, such as economics “Nobel” prize winner Daniel McFadden, have tried explaining the conundrum that Terry tackles below, but they seem to have lost patience. Terry has likened his position to that of an early MMTer, back when that discipline had not yet figured out how to communicate its constructs well to a broader audience. Here is one important paper from 1993: The Role of the Scale Parameter in the Estimation and Comparison of Multinomial Logit Models. This one gives another window into the issues: Combining sources of preference data.

One can infer, given how long a few key academics have been breaking their picks on this rock, that the mainstream is unwilling to consider something this geeky and prefers to stick to simpler, if cruder, methods. If you read ECONNED, you may recall that Benoit Mandelbrot determined in the 1960s that financial markets didn’t exhibit the tidy randomness of normal distributions, but instead followed Lévy distributions, which were not tractable. There were kinda-sorta less bad approximations, but no less than William Sharpe, the father of the capital asset pricing model, conceded that the wild randomness of actual market behavior meant his models collapsed. But then he and others went back to touting them because everyone preferred having a deceptive model to none.

You can now understand why astrology is so popular among money managers.

It took Nassim Nicholas Taleb, first in Fooled by Randomness and later in The Black Swan, to communicate one of the big failures of this and many other modeling attempts, the “fat tails” problem, in a way that led to some public recognition. But the bad old ways are very much intact.

In other words, it should come as no surprise that inconvenient findings have not yet gotten much of a following. I hope mathematically literate readers can give Terry some feedback and increase mainstream understanding of this major shortcoming. An issue that affects a lot of randomized clinical trials ought to be of serious concern.

This short overview may help:

Lambda is indeed weird: it is simply the inverse of sigma, the standard deviation of the “noise” in a person’s responses.

So the pseudo betas from logit or probit are true betas divided by sigma (the standard deviation). Early choice modellers decided to talk about the inverse of sigma – how consistent a person is in choosing an item – probably so they didn’t have to keep reminding themselves that the small pseudo betas we see among some groups (historically the elderly) could reflect either a genuinely small beta or a large sigma. Flipping things around to talk about consistency rather than inconsistency was just one of those framings that many humans can get their heads around more quickly.

By Terry Flynn, a former professor of Health Services Research specializing in quantifying human/patient preferences. Originally published at his website

A medical expert on NakedCapitalism recently mentioned an example of his kid’s teacher setting a single mathematical equation to estimate two unknowns [a]: something most numerate people will recognise as impossible. While one teacher’s mistake isn’t scary, the fact that our statistics programs simply choose an arbitrary solution when they encounter this phenomenon – which they do regularly, in fields as diverse as Randomised Controlled Trials (RCTs) and political polling – is very concerning.

To help maximise engagement, I will illustrate this using a hypothetical political opinion poll. However, I will go on to explain how people in “hard sciences” should be equally worried, if not more so, since the problem is less well known among the RCT community, and will offer guidance to help practitioners.

I was once told “explain what you’re going to teach, teach it, then explain what you taught” so I’ll try to broadly follow that.

Here is the fundamental issue:

A pattern of “responded/not responded” (two percentages), or a pattern of votes across five political parties (a+b+c+d+e=100), is explainable by an infinite number of combinations of “mean affinities” (often interpreted as party loyalties, or in a trials sense the “average response on some unobservable scale”, which is the “true beta”) and “variances” (how variable you might be in giving that core response).

Just so we are on the same page: sigma (the standard deviation, which squared gives the variance) is what usually appears in the equations, and what the equation actually spits out is beta/sigma. For reasons lost to the mists of time, choice modellers didn’t like quoting it that way: thinking of a large quoted value as possibly a large beta or possibly a small sigma seemed to confuse some early people, so they decided to work with the INVERSE of sigma, lambda, interpretable as “how consistent you are”. Small sigma (small variance) means high lambda (high consistency in response).

All of this matters because the large parameter the program gives you might be due to a large beta or a small sigma, and you have NO IDEA which: high affinity (true beta) or high consistency (true lambda). For instance, a reported value of 1.2 could be a true beta of 1.2 with lambda equal to 1, a true beta of 2.4 with lambda equal to 0.5, or a true beta of 0.6 with lambda equal to 2. The program gives you one times the other and you have NO IDEA what is driving the value.

So, when choosing a political party, each beta is in fact a perfect mix (“confound”) of two things:

  1. How strong is the support for each candidate?
  2. How certain/consistent is an individual person’s support for each candidate?

These are multiplied together, so you have literally no way to split them back up into their two parts. In other words, the first is a measure of “how much you identify with a candidate” (the mean – the “true beta”, somewhat confusingly typically labelled V in the maths, since V is often used to represent utility or a proxy for it), whilst the second is a measure of “how often you’ll stick with the candidate”, which mathematically we represent as lambda. As I said above, lambda is the INVERSE of sigma (the standard deviation), so it measures CONSISTENCY, NOT VARIANCE – and high consistency (low variance) is generally GOOD.

Your preferred statistical package cannot separate beta from lambda – it can only give beta multiplied by lambda when it uses the (log)likelihood [b]. So it simply assumes that lambda is one: all the reported effects are implied to be true betas when they are not – they are true betas multiplied by lambdas. This holds for every person and every choice/medical intervention. To see how this might give you a very misleading picture of “what is going on” I’ll use a UK constituency election example. However, “medical brain trust” people, bear with me, as I’ll go on to show how dangerous this statistical practice can be in the context of clinical studies, including possibly the rush to get mRNA vaccines to market.
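
To make this concrete, here is a minimal Python sketch (with purely hypothetical numbers of my own, not from any real poll or trial) showing that “strong views held inconsistently” and “weaker views held very consistently” generate exactly the same choice probabilities, so no dataset of this kind can tell them apart:

    from math import exp

    def choice_probs(affinities, lam):
        # Multinomial logit: P(i) = exp(lam * V_i) / sum_j exp(lam * V_j)
        weights = [exp(lam * v) for v in affinities]
        total = sum(weights)
        return [round(w / total, 4) for w in weights]

    # Hypothetical affinity scores ("true betas") for five parties.
    print(choice_probs([2.0, 1.0, 0.5, 0.0, -0.5], lam=0.5))   # strong betas, low consistency
    print(choice_probs([1.0, 0.5, 0.25, 0.0, -0.25], lam=1.0)) # half the betas, double the lambda
    # Both lines print the same five probabilities: only the product lambda * V is identified.

The two calls differ only in how the product is split between beta and lambda, and that split is precisely what the likelihood never sees.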

What’s the problem with the likelihood function?

The likelihood function for logit models (and the one for probit models [c]) is an equation whose “beta” estimates are NOT in fact the true betas (measures of how strongly each variable drives your observed outcome). This is where the classic “one equation, two unknowns” problem comes in.
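
To see exactly where the confound enters, here is the binary logit case written out in generic notation (a sketch; the multi-party case later on works the same way):

    Prob(y = 1) = exp(lambda × beta × x) / [1 + exp(lambda × beta × x)]

The (log)likelihood is built entirely from probabilities of this form, and beta and lambda only ever appear through their product. So, for example, (beta = 2, lambda = 0.5) and (beta = 1, lambda = 1) fit every conceivable dataset equally well: one estimable quantity, two unknowns.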


An example using a very “middle England” constituency

Here are some hypothetical numbers that look like results from a UK opinion poll conducted in a “middle England” constituency. However, the reader should try to keep in mind that they are actually the “internal tendency to support each party” for one individual, called Jo. Jo simply said “Conservative”, but these are her internal percentages. The reasons for this will become clear. Until mid 2024 the Conservatives were the main right-wing party of government (somewhat like the US Republicans), with Reform UK attacking them from the right on issues like keeping out of the EU and DEI (MAGA-type messages). Labour was supposedly centre-left (US Democrats) with a very mixed attitude toward the EU, whilst the Lib Dems and Greens positioned themselves as centrist or left-leaning forces who wanted back into the EU, some of them quite libertarian, but with a stronger left-wing tilt amongst the Greens.

These percentages are effectively our “y” dependent variables in a logit or probit model based on likelihoods: what set of betas (our independent variables indicating “level of party affinity”) is most likely to give these percentages? Yet unless you read the appendices on these models in the manuals of a program like Stata, or work in one of only a few fields that encourage you to think about what the individual is doing (mathematical psychology or n-of-1 RCTs), you won’t realise that what you are getting are not betas, but betas-times-lambdas! The program has decided “OK, set lambda to be one”, encouraging you to think that all of these levels of support came from innate strength of support for each party, rather than from any consistency (or lack thereof) in support. Putting out misinformation about rivals to reduce the consistency of support for them is a great way to artificially boost your percentage of the vote if you realise you can’t easily boost your “real beta” – strength of support.

A brief delve into the weeds: the mathematics of “how an individual responds”

I won’t cause people’s eyes to glaze over with a complete discussion of the likelihood function that translates “proportions” into “pseudo-betas” (pseudo because they’re confounded with lambdas). Somewhat surprisingly, it wasn’t until the mid 1980s that the theoretical proof that the likelihood function does this was published, by Yatchew and Griliches [1][c].

I’ll simply use the crucial bit of the likelihood exploited by the field of choice modelling (and by clinical trials, but I’ll come to those shortly). In areas like academic marketing, random utility theory is used [2]. This was developed by Thurstone in 1927 and was at its heart a signal-to-noise way of conceptualising human choice (probably why mainstream economists generally dislike it: people are meant to conform to things like transitivity, so the idea that they might make mistakes is anathema). However, Thurstone was actually thinking at the level of the individual participant (think of our voter called Jo): how often they choose item i from some set of J items tells you how much they value i relative to the other J-1 members of the set. Hence the core equation:

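In standard multinomial logit notation, with V_i the affinity for party i and lambda the consistency (scale) parameter, it reads:

    Prob(choose party i) = exp(lambda × V_i) / [exp(lambda × V_1) + exp(lambda × V_2) + exp(lambda × V_3) + exp(lambda × V_4) + exp(lambda × V_5)]
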
This is not as scary as it looks. The probabilities (observed frequencies) are the left hand side, and the numbers your favourite stats program gives you for your (in this case five) parties are the lambda-times-Vs. Perhaps you’ve spotted the problem. It cannot separate V from lambda, so it sets lambda to be one (it “normalises” it). So ALL the variation in those percentages above is explained by “betas” which are in fact a perfect mix of V (the true affinity/strength) and lambda (consistency). Clinical trials don’t work exactly like this, but the problem at their heart is identical, as described in reference [1].

In short, the “vote shares” (probabilities) in Table 1 below are the left hand side. These need to be explained by five “lambda-Vs”. The V is the utility (the “true affinity score”). So if all affinity scores were zero (“meh to every party”) then each probability would be exp(0)=1 divided by the sum of the five exponentials (each being one), so 1/5 = 20%. So that figures. As the V (true affinity score) for a given party increases, its contribution to the total (that party’s numerator over the shared denominator) increases. Here’s what Stata et al do:

Table 1: How Stata interprets percentages to give “betas”

The reader who wants to check can simply calculate the exponential of (affinity score times lambda) for each party to give the figure in column 4. The sum of the figures in column 4 is 28.09, and 13.46 is 48% of 28.09, which is the Conservative percentage, and so on. Note, once again, that the stats package would not normally know these affinity scores (the “true betas”). It would use the logit function to translate the column 2 percentages into the column 4 figures but, by assuming lambda=1 for every party, would get the affinity scores in the final column.
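
A minimal Python check of that arithmetic (only the 13.46 and 28.09 figures come from the text; lambda is set to 1, exactly as Stata assumes):

    from math import log

    conservative_col4 = 13.46   # exp(affinity score * lambda) quoted for the Conservative row
    column4_total = 28.09       # sum of column 4 across all five parties

    print(round(conservative_col4 / column4_total, 2))  # 0.48 -- the 48% Conservative share
    print(round(log(conservative_col4), 2))             # 2.6  -- the affinity score implied if lambda = 1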

So, according to pollsters, we have numbers purporting to show Jo’s level of affinity with each party. As the raw “percentages” suggest, she feels most aligned with the Conservatives, then Labour, with Reform UK, the Lib Dems and Greens in that order a distant 3rd, 4th and 5th. However, this could just as easily be a potentially deeply misleading characterisation of her views.

Table 2 shows an alternative set of numbers that produces the EXACT SAME PERCENTAGES.

Table 2: Re-interpreting percentages as if Jo were “Old School anti-EU Labour”


Again, we use Jo’s “internal pseudo-percentage support levels” in the standard logit equation to work out what her “V-times-lambdas” are. Again, I’m “being God” and know what her levels of consistency (her lambdas) are for every party. Again, I can get “correct” levels of her support (affinity scores) for each party: I simply use the “percentages”, together with my “god-like knowledge of her certainty/consistency”, in the logit equation to solve for the affinity scores. These are our “real betas”, presented in the last column. THESE are her true levels of “intrinsic support” for each party. It turns out she is like many people who swung to Labour in 2024. Jo’s low internal percentage support for Reform might simply reflect that she’s far more certain about the establishment Conservatives continuing to deliver on BREXIT than about the somewhat populist Reform UK.

Table 3: Re-interpreting percentages as if Jo were “Conservative with unclear EU views”

So Stata and other programs assumed that the contents of Table 1 represented Jo’s thought process. I’ve made up two completely different scenarios (Tables 2 and 3) that give the same percentages. If that makes you worried then you should be. In the tables above I “played God” by knowing the true split between V and lambda, but pollsters DO NOT.
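
For readers who want to replicate the trick without the tables, here is a small Python sketch using made-up numbers of my own (not the figures in Tables 1 to 3): two profoundly different combinations of per-party affinity and consistency that generate identical percentages.

    from math import exp

    def implied_shares(affinities, lambdas):
        # Vote shares implied by per-party affinity (V) and per-party consistency (lambda).
        weights = [exp(lam * v) for v, lam in zip(affinities, lambdas)]
        total = sum(weights)
        return [round(100 * w / total, 1) for w in weights]

    # Scenario A: keenest on party 1, equally consistent about every party.
    print(implied_shares([2.6, 2.0, 1.4, 0.7, 0.1], [1.0, 1.0, 1.0, 1.0, 1.0]))

    # Scenario B: actually keenest on party 2, but far more certain about the other four.
    print(implied_shares([1.3, 4.0, 0.7, 0.35, 0.05], [2.0, 0.5, 2.0, 2.0, 2.0]))

    # Both lines print identical shares: the percentages cannot tell the two stories apart.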

These percentages conceal some crucial aspects of Jo’s thinking

So… “Jo types” could, via different lambdas, be the classic “Blue Wall Conservatives” who could have voted either way in the BREXIT referendum and who would only have truck with Labour, the Conservatives and very possibly the Lib Dems. Or the same data could equally describe someone resident in part of the “broken-and-now-rebuilt red wall” in the Midlands who is actually “old school Labour” and who lent support to the right because she thought Labour were in the pockets of the EU and she wanted to kick the establishment in the teeth. Both explanations are possible from these data.

So far I’ve shown how a logit model (the workhorse of political voting models) interprets observed percentage levels of support. It should be noted that reputable pollsters apply sampling weights to ensure that they have interviewed sufficient supporters of every party of interest. Under-representing certain parties, or those who will actually go out and vote come election day, will immediately lead to a bad prediction.

Note that a clever psephologist who either had a SECOND dataset containing some function of the affinity scores, or who used a lot of qualitative insights into the local constituency, might alter the lambdas to get “more correct” affinity scores and thereby tease out what is really going on. If you hear the term “Multilevel Regression and Post-stratification” (MRP), that’s their fancy way of saying they’re drawing on auxiliary information in order to try to avoid the traps I’ve described above.


What does this all mean for our psephologist trying to tell us what is going on out there?

In an era of increasingly sophisticated marketing and targeting and Artificial Intelligence being used to confuse people, the ones in the first group – supposedly “strong” Tories – might not turn out to vote if their levels of uncertainty are ramped up via social media lies. Conversely, other parties can reduce the Tory vote if they know “Tory tribalism is low” and the Tories are doing well merely because these other parties are not producing clear messages that cause voters’ certainty regarding policy to increase.

Polling implications

  • I cheated by “knowing” how much our hypothetical voter Jo identified with each party. However, I showed that some realistic values, given UK experience, could mean her “relative levels of party support” are consistent with various profoundly different types of constituency result in the UK.
  • This should make psephologists and statisticians very wary. “Turnout” can no longer be used as their “get out of jail free card” when they predict wrongly.
  • Researchers in discrete choice modelling already know full well that they cannot aggregate Jo’s data with other survey participants’ data UNTIL AND UNLESS THEY HAVE NETTED OUT ANY DIFFERENCES IN CONSISTENCY (VARIANCES). Crucially, the Central Limit Theorem does NOT apply here [1].
  • Researchers must own up to the fact that EVERY opinion poll is an equation with two unknowns and therefore insoluble.
  • YouGov grasped the nettle in 2017 by administering a second survey to try to get a handle on “intrinsic attitudes”. That “alternative model” was the only “official model” to correctly predict that Prime Minister Theresa May was about to lose her overall majority in Parliament, having called a surprise General Election. (By chance I had a political national survey in field at that time and also predicted this – I made money at the bookies but no media were interested.)

So how might this play out in Medicine?

As with all “first attempts” at illustrating the intuition behind some quite heavy duty concepts, I’ve still had to make some simplifications: the “random utility equation” is merely a (I hope) fairly easily understood subset of the likelihood function for logit models. I hope that what I’ve ended up writing at least helps the readers who have asked me for “the intuition” to feel that they get a better handle on why “one-shot” percentages can be so dangerous when it comes to interpreting “what’s going on at the fundamental level of THE INDIVIDUAL HUMAN”.

YOU SEE A JPEG WHEN YOU REALLY NEED TO SEE AN MPEG.

The paradigm I worked in for 20 years – Random Utility Theory – was all about modelling an individual human. Treating the “errors” not as some “bad thing” but as simply a characteristic of our decisions can be incredibly empowering when it comes to quantifying what we really value. To be colloquial – signal vs noise tells us a LOT and gives us proper NUMBERS.

First of all, it is important to recognise that “variance” in responses quoted from RCTs etc refers to “variability between individuals”, NOT “variability within a given individual” (since, as I have shown above, a “one shot” study – like an RCT – cannot shed light on this). So let me give you something to think about. Suppose two patients both respond in a one-shot RCT to the active treatment. However, suppose, had we done the trial 16 times (8 times receiving the active treatment, 8 times the placebo, with adequate washout periods between rounds), we observed that patient A responded all 8 times whilst patient B responded 5 out of 8 times. You’d probably want to know if there is something going on to explain this difference.

Patients in RCTs typically are heavily screened to ensure nobody with a potentially confounding condition might be enrolled which could compromise the results. But suppose there is an unobserved difference between patients that causes the 8/8 people to be mixed in with the 5/8 people? Yatchew and Griliches proved that you CANNOT aggregate these two groups to get an unbiased estimate of the “true beta – strength of effect” unless you “net out” the differences in consistency first. But if your painstaking attempts to ensure a very homogeneous sample do not, in fact, do anything to address this then your work is all in vain.
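
Here is a purely illustrative Python sketch of that aggregation problem (my own stylised numbers, not Yatchew and Griliches’ derivation; the intercept is set to zero and the trial sample is assumed to be a 50/50 mix of the two types): every patient shares the same true beta, yet the pooled one-shot comparison recovers neither that beta nor either group’s reported beta-times-lambda.

    from math import exp, log

    def one_shot_response_prob(beta, lam, treated):
        # Logit response probability with the intercept fixed at 0 purely for illustration.
        z = lam * beta * (1.0 if treated else 0.0)
        return exp(z) / (1 + exp(z))

    beta = 1.0              # same hypothetical true treatment effect for every patient
    lam_consistent = 3.0    # the "8 out of 8" type: very consistent responders
    lam_inconsistent = 1/3  # the "5 out of 8" type: same beta, far noisier

    p_treated = (0.5 * one_shot_response_prob(beta, lam_consistent, True)
                 + 0.5 * one_shot_response_prob(beta, lam_inconsistent, True))
    p_control = (0.5 * one_shot_response_prob(beta, lam_consistent, False)
                 + 0.5 * one_shot_response_prob(beta, lam_inconsistent, False))

    pooled_effect = log(p_treated / (1 - p_treated)) - log(p_control / (1 - p_control))
    print(round(pooled_effect, 2))  # ~1.19: not the true beta of 1.0 that every single patient shares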

I’m no geneticist but I can hypothesise reasons why the 5/8 people end up aggregated with the 8/8 people if you’ve rushed the early stages of drug development. Suppose a combination of 3 genes ensures you’ll ALWAYS respond to the treatment, but having only 1 or 2 of the 3 gives you only a partial response (5 out of 8 times). This touches upon the issue of the fast rollout of the mRNA vaccines: not that they are necessarily bad, merely that you should never rush things. It is the early stage trials with genetic tests that offer one potential way to separate the “5/8” people from the “8/8” people and might give you insights into WHY. Because you really don’t want to “move fast and break things” and have no idea why some people consistently respond to something whilst others, for reasons unknown, have inconsistent responses over time (or worse, suffer specific side effects).

The solution?

When I was doing my PhD (in understanding ways to analyse cluster RCTs – where you must randomise, for example schools, or whole hospitals, to avoid contamination of the two arms) I had to learn Fortran if I were to get the simulations done in 3 years. A colleague learnt Fortran with me: but for her, it was to enable simulations “going in the opposite direction” – learning how the individual might display inconsistency in response. Her work contributed to the field of the “n-of-1 trial”. This is a kind of trial that recognises that consistency might vary across patients and you must adjust for it. Unfortunately that type of trial never really took off: those trials are resource intensive and obviously expensive and when the statistical establishment (with some exceptions) don’t even recognise the weakness in their models, you’re not going to change the paradigm.

But to any of the “medical brain trust” who made it this far: if you see stuff that “seems off” given what evidence based medicine has taught you, maybe the problem is with the trial, not with you. Judgment can be our best friend. I spent 15 years accumulating it by recognising patterns in the “betas” spat out by Stata and working out why they “did not compute”.

What did I just try to teach?

An RCT or poll is a jpeg. You need an mpeg. Just knowing relative positions is NOT enough. You need to know “intrinsic ability” together with “consistency”. Certain studies will tell you “there are segments with different levels of ability” and/or “consistency”. However, if you are to ACT upon this you really must have done enough work to ensure you understand what is DRIVING the (in)consistency, otherwise you might get a nasty shock when you run a study in a new sample.

NOTES:

[a] https://www.nakedcapitalism.com/2025/05/thinking-being-offloaded-to-ai-even-in-elite-medical-programs.html#comment-4219558

[b] Daniel McFadden won the “pseudo Economics Nobel” and is one of only about 3 people I think deserved it. Whether he knew already that TWO datasets were required to design the BART in the Bay Area or just wanted a 2nd “real usage” dataset to calibrate his model, he clearly solved the problem by having two equations for two unknowns.

[c] Probit models are difficult when the number of outcomes is above 2 so I’ll gloss over these but the principle is the same. Since the probit function, when graphed, is so similar to the logit, many practitioners just use the logit to avoid the estimation complications of the probit.

Bibliography

[1] Yatchew, A. and Griliches, Z. “Specification Error in Probit Models.” The Review of Economics and Statistics, Vol. 67, No. 1 (Feb. 1985), pp. 134-139. https://www.jstor.org/stable/1928444?origin=crossref

[2] Thurstone LL. A law of comparative judgment. Psychological Review, 34, 273-286 (1927).


27 comments

  1. diptherio

    Proofreading note (burn after reading): The “What’s the problem with the likelihood function?” section is repeated twice.

    Reply
  2. PlutoniumKun

    I read this on Terry’s blog and reading it again, it still makes my brain hurt! It’s beyond my very limited stats knowledge to comment, but the general ‘learning’ principle holds true from my experience – a lot of statistical methods are utilised by people who don’t have a full understanding of what the figures are really telling them.

    Reply
    1. Terry Flynn

      Yeah, each time I teach it I gain new ideas as to how to give the intuition. The latest I’ve had is that if we observe a numerical scale directly we see the magnitude and how far a point is from the line.

      In a binomial outcome all we see is 0 or 1. We use an inferred curve (function) to decide where a success or failure sits and how far it is spread. You can stretch or squash how quickly it asymptotes to [0,1] but you can’t possibly do both. That’s why this time round I showed how the percentages from the function could be explained in multiple ways.

      Back in the days of just “left/right”, 70 years ago, you generally didn’t go far wrong in 2-dimensional political space, but with 3+ dimensions today things go south quickly. It’s kinda like those YouTuber videos trying to explain 4 spatial dimensions.

      Another way to look at it is to present two sets of 8 A-vs-B frequencies. One set has them clustered around 0.6 vs 0.4, the other around 0.9 vs 0.1. They actually represent the same preferences; type one just had more error in their choices. Look at the PATTERNS. Traditionally type two were younger and/or more educated, but in politics these days (strong views, high consistency) such rules of thumb are increasingly dodgy. Underlying attitudes are increasingly dominating.

      Reply
  3. Samuel Conner

    The equation reminds me of the canonical partition function in statistical mechanics (with lambda proportional to the inverse of the temperature of the system). I can see that that could not work well as a political prediction tool — not every voter is interacting with the same system at a known equilibrium “temperature”.

    I suppose we can be glad that political pollsters are unwilling to invest in repeatedly re-polling the same survey respondents to estimate true variances (and, I suppose, how these true variances themselves might be time-variable in different ways for different people). That would make it even easier for political parties to manipulate the voting public.


    Reply
    1. Terry Flynn

      Unfortunately they don’t even have to do that. Every time you hear simplistic statements that play to attitudes like MAGA or “stop the boats”, that’s more likely boosting lambda, not offering the more nuanced beta representing a relative preference :(

      We don’t “do” nuance anymore.

      Reply
  4. Adam1

    The focus on consistency is very intriguing relative to mRNA vaccine research. It was pointed out to me by a medical researcher that one of the concerning things he sees with this approach is that there is no way to control the antigen dosage of the recipient. You can control the mRNA dosage, but it is the body that then produces the antigen and that can vary from person to person and from each separate vaccination time.

    Reply
    1. GramSci

      This takes me back to the old days when I was taught to prefer T-tests and ANOVAs with “repeated measures”. As I moved on to multivariate regression tricks, the repeated measures got lost somewhere…

      Insofar as political pollsters repeatedly sample the same population, they should be able to control this. Perhaps they do–in a sense–expressing it in a confidence interval?

      Reply
      1. vao

        “Insofar as political pollsters repeatedly sample the same population, they should be able to control this.”

        Terry Flynn might want to comment, but I suspect that this repeated polling introduces another element — variability of the beta or lambda across time.

        Reply
        1. Terry Flynn

          Yeah, time variability is real. Ideally we’d identify variables known to be stable, or conduct two studies together with overlapping variables. McFadden combined stated preference survey data with “revealed” preference data (actual traffic use data). This gave him lambda and hence genuine preferences.

          Reply
    2. KLG

      An excursion into the weeds…25 years ago microarrays were all the rage in profiling gene expression patterns in breast cancer and other conditions. The red/green heatmaps were everywhere and they were very impressive at identifying genes involved in cancer progression. There was a lot of talk of a Nobel Prize for their originator and two of his collaborators. Except measuring mRNA levels is not the same as measuring levels of the cognate protein that is the actual worker in the cell. It also became clear after only a few years that reproducibility from day-to-day and lab-to-lab was a problem. One time the result was x, another y, another z. Biological systems are very large and stochastic, even if they are as small as a single cell or a single small tumor.

      The same problem exists with an mRNA vaccine. One dose of mRNA will not produce a consistent amount of cognate antigen every time in each patient. If the antigen is not well understood, it may have deleterious effects in and of itself. Still, my intuition, for what it may be worth, tells me that mRNA-based therapeutics will work when properly developed. Personalized cancer “vaccines” treating various cancers that are currently intractable have worked and are likely to mature as therapies. Expensive? Yes. So what? That all depends on what we want to spend our money on, a Golden Dome for Missile Defense or human health and well being.

      The scientist most responsible for microarrays is the founder of Impossible Foods.

      Reply
  5. Unironic Pangloss

    for anyone who wants to dive into the rabbit hole…. datacolada.org

    a pro-“show me the data” website/blog about data science

    Reply
    1. Jokerstein

      Sometimes there is more exciting stuff, like them being sued by prominent Harvard professor Francesca Gino when they exposed fraud in her research. Read the fascinating story.

      Of course, a celeb prof like Gino could be expected to wriggle and blame someone else – which she did.

      Reply
      1. Jokerstein

        Latest update: Gino was fired by Harvard on Tuesday for ethics violations. This is the first time in 80 years that Harvard has fired a tenured professor.

        Bottom line: don’t mess with data nerds: she would’ve gotten away with it if it wasn’t for those meddling kids. Or maybe just don’t falsify data?

        Reply
      2. Unironic Pangloss

        The most fascinating/concerning thing about Gino-Gate is that it was the most blatantly lazy form of data manipulation, akin to an eight-year-old raiding a dark pantry, leaving a very muddy trail and just hoping that a light is never turned on.

        Gino is an ultra incompetent fluke and/or falsification is so rampant, peeps don’t put a modicum of thought into it; and the good ones do

        Reply
  6. tyaresun

    I worked on questionnaires in the 80s and I thought they would ask the same question multiple times in a few different ways to determine consistency. I don’t see this approach these days. Maybe I am wrong and don’t fully understand this article.

    Reply
    1. tyaresun

      Apologies for posting an AI summary by Google.

      Asking the same question in multiple ways, or rephrasing it, in questionnaires is a common practice used to improve the validity and reliability of survey results. This approach helps to ensure respondents understand the question, reduces bias, and can provide more robust data for analysis.
      Here’s a more detailed explanation of the benefits and reasons for rephrasing questions:
      1. Improved Understanding and Reduced Ambiguity:

      By presenting the same concept in different ways, researchers can ensure respondents have a clear understanding of what is being asked, even if they don’t grasp the initial phrasing.
      Variations in wording can help account for different interpretations and cognitive styles among respondents.
      This is especially important when dealing with complex or abstract concepts.

      2. Response Bias Reduction:

      People may answer questions differently depending on the wording, even if they have the same underlying opinion.
      Rephrasing can help identify and mitigate response biases, such as acquiescence bias (the tendency to agree with statements).
      By presenting the same concept in different ways, researchers can see if responses are consistent, regardless of the wording.

      3. Increased Reliability and Validity:

      When questions are rephrased and provide similar results, it increases the confidence in the reliability and validity of the data.
      Reliability refers to the consistency of the results over time, while validity refers to whether the questions are measuring what they are intended to measure.
      Rephrasing helps to establish internal consistency, meaning that the different versions of the question are measuring the same underlying construct.

      4. Detecting Inconsistent or Uncareful Responses:

      In some cases, rephrasing can help identify respondents who are not paying attention or are just randomly filling out the survey.
      If responses to different versions of the same question are inconsistent, it can raise a red flag about the validity of their answers.

      Examples of Rephrasing:

      Instead of just asking “Are you satisfied with your job?”, researchers might also ask “Do you feel fulfilled in your current position?” or “How happy are you with your work?”
      Instead of asking “Do you agree with the statement…”, researchers might also ask “To what extent do you support the idea of…?” or “How much would you favor…”.

      In summary, rephrasing questions in surveys is a valuable technique that can improve the quality and reliability of research data. It helps to ensure that respondents understand the questions, reduces the impact of biases, and allows for a more thorough assessment of the responses.

      Reply
      1. Terry Flynn

        I may add more later, but fatigue etc. is something that affects lambda, and this is different from how you ask the question. Without bombarding people with too much more jargon, our tool was the discrete choice experiment. You’d use statistical design theory to tell you “to value these 8 items you must ask 16 questions with the 4 options per set presented as follows”.

        You simply ask the respondent to select the most preferred option from each set of 4. You do NOT have one version of the questionnaire: you use a Latin Square design. So across your sample “question 1” is “set 1 for some people, set 2 for others…”. In fact you can estimate the survey purely from question 1, or purely from question 2, etc. When you look at choice frequencies from question 1 they’re usually closer to 25/25/25/25 than you would think. By about question 4 they’ve become far more skewed as people get the hang of the task (higher lambda). They start to revert closer to 1/4 around question 12 (fatigue or boredom).

        This move to discrete choices occurred in the mid 1980s, when key academic marketers and others realised the old “conjoint analysis” (rate this particular TV on a 0 to 5 scale) was nonsense. You buy a TV. You don’t give the clerk a number! Math psych and some transport economists had warned them to quit using conjoint, but also warned that they’d be losing the (artificial and in practice never true) power of a continuous number scale and would be getting a different problem instead (the mean-variance confound). Too few listen to this day.

        Reply
    2. Adam1

      I did a ton of survey research in the late ’90s and early 2000s; it was the golden age of survey research, where you could use random dialing of the telephone to get great sampling. Inherent to that was the usage of professional interviewers who were trained to keep people engaged in the survey, which allowed you to ask more questions (for a cost, of course). In today’s world few surveys are done with trained interviewers, so it’s hard to coax people along on a survey as survey fatigue sets in. The solution is to just keep the surveys short and sweet, but at the expense of making it harder to understand whether your respondent is consistent in their beliefs or not (including whether they are just making it up as they go).

      That said I think what Terry is referring to is more of consistency if you asked the same question on a different day. Most survey oriented “consistency” questioning is done to try and weed out respondents who are just bad respondents for one of many reasons (i.e. not really paying attention and or don’t really care about the survey questions). I believe Terry is referring to consistency longitudinally or more likely from moment to moment.

      As Terry said the standard assumption is that lambda is 1, but as Yves pointed out we all know about black swans (where distributions are not bell shaped & I’d assume the assumption that lambda=1 is an assumption that any variance to lambda in a population is bell shaped (but is it proven)). What if lambda is not 1? How far away from one can be meaningful, but who’s even trying to measure lambda and what are the consequences of that?

      Reply
    1. Plant

      I think yes and no.
      I haven’t worked out the details mathematically, but what I see happening is that it might help with tackling the part where we are restricted to relative modelling, since having a zero is helpful in terms of information that no pseudo-beta overcame the participation threshold. It could make sampling more complicated, but I do not have a lot of knowledge in that area.

      However, the highlighted issue of having ‘two unknowns and one equation’ would still remain.

      Reply
      1. Terry Flynn

        Correct. Choose none is still a choice. [0,1] in the binary case has lost you key information about the individual. So (to answer the comment elsewhere) nobody reads the technical note in Stata manual that a latent (unobserved) scale is used to interpret (say 60 out of 100 successes) in a certain way when each of those 60 could have responded for any of an infinite number of reasons.

        Reply
  7. Plant

    I just graduated with a degree in Statistics (hoping to find work in this area — fingers crossed), but I wanted to chime in. In my view, this isn’t primarily a model issue, the model itself is fine. The real problem lies in how it’s commonly interpreted. It needs to be made clear to practitioners that assuming λ = 1 is arbitrary, not a normalizing step grounded in data. The analogy of “two equations for two unknowns” nails it. These models should be treated as two-step exercises, where the second step helps recover or adjust for the consistency (λ) if possible.

    Honestly, I think a lot of the blame sits with software like Stata. A simple warning or note of caution when fitting these models could help users realize they’re only getting V × λ, not a clean estimate of either on its own. Even just a prompt encouraging people to think twice before interpreting the results as “true affinities” would go a long way. And maybe having a trained statistician review the results before decisions are made could act as a useful guardrail.

    Just my take, hope it’s helpful!

    Reply
    1. Adam1

      LOL!!! God bless you! In today’s world I suspect you have lots of opportunity ahead of you as a statistician.

      I think what Terry is invariably trying to point out isn’t the math or the statistics framework, but the results being reported by users of statistics who may have glossed over or forgotten ALL of those important assumptions that are baked into the output.

      Yes, it would be nice if Stata or any stats package reminded us of ALL of those assumptions, but in the end do they get included in the final publications/reports even if they are known or were flagged? In my experience the answer is NO, because they raise too many questions for which we either have no definitive answer (or were not paid sufficiently to address in the research), or because it becomes too theoretically complex for the client to understand and therefore they might not be able to approve (pay for) the work.

      Reply
  8. Bazarov

    This is excellent, Terry. I’m a complete layman, having not had any stats lessons in a very long time, but I had no trouble following your lesson and found it extremely interesting. I keep coming back to NC for arcana like this. You’re a wizard!

    Reply
  9. Unironic Pangloss

    More of this kind of content, please—-whether from Terry or anyone else who wants an audience to “show how the sausage is made”.

    Reply
