# Sampling from a product of two Multivariate Gaussians

I have a multivariate Gaussian defined as follows:
\$\$
p(x) = \omega(x)\gamma(x)
\$\$
where \$\omega\$ and \$\gamma\$ are both multivariate Gaussians that I can sample from very efficiently thanks to special structure in their covariances: one covariance is diagonal, and for the other I know the diagonalization, which is very sparse.

I wonder if there is a way to sample from \$p\$, given samples from \$\omega\$ and \$\gamma\$. The standard approach, a Cholesky factorization of the joint covariance, is not efficient enough.
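
For reference, a sketch of the standard product identity, under the assumption (the notation is not in the question) that \$\omega = \mathcal{N}(\mu_1, \Sigma_1)\$ and \$\gamma = \mathcal{N}(\mu_2, \Sigma_2)\$ with invertible covariances. The product is then proportional to a Gaussian density whose precision is the sum of the two precisions:
\$\$
\omega(x)\gamma(x) \propto \mathcal{N}(x;\, \mu, \Sigma), \qquad \Sigma^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1}, \qquad \mu = \Sigma\left(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2\right)
\$\$
Under the same assumptions, if \$y_1 \sim \omega\$ and \$y_2 \sim \gamma\$ are independent draws, then
\$\$
x = \Sigma\left(\Sigma_1^{-1} y_1 + \Sigma_2^{-1} y_2\right)
\$\$
has mean \$\mu\$ and covariance \$\Sigma\left(\Sigma_1^{-1} + \Sigma_2^{-1}\right)\Sigma = \Sigma\$. This replaces the Cholesky factorization with one linear solve against \$\Sigma_1^{-1} + \Sigma_2^{-1}\$; whether that is cheaper depends on the structure of the two covariances.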

# Dependent predictors with converse effects on the target

I am trying to create a predictive model for marketing in the natural gas field. The model is supposed to estimate how probable it is to win a contract in a particular building, given a range of internal and sociodemographic information. For the sake of simplicity I tend to use decision trees. The target is a binary variable “contract won”.

The number of existing customers in the building has a strong effect on the target variable, as expected. If there are already some customers there, their neighbors are likelier to sign a contract as well. I also expect that the number of problem tickets should have a negative effect. So I added the input variable ticket rate = # of problem tickets / # of customers.

The problem is that if there are no customers in the building, the ticket rate is missing. So my model infers that the probability of a contract is high whenever the ticket rate is not missing. If I ignore the observations with missing values, then I have too few observations. What would be the best strategy here?
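
To make the issue concrete, here is a minimal R sketch on made-up data. The data frame `buildings` and its column names are hypothetical and only illustrate the definition above. It shows that the rate is undefined exactly when there are no customers, so "rate is missing" and "no customers" carry the same information:

```r
# Hypothetical toy data: one row per building (values are made up).
buildings <- data.frame(
  n_customers = c(0, 0, 2, 5, 1),
  n_tickets   = c(0, 0, 1, 0, 3)
)

# Ticket rate = # of problem tickets / # of customers.
# 0 / 0 evaluates to NaN in R, so buildings without customers have no rate.
buildings[["ticket_rate"]] <- buildings[["n_tickets"]] / buildings[["n_customers"]]

# An explicit indicator for "has at least one customer".
buildings[["has_customers"]] <- buildings[["n_customers"]] > 0

# Cross-tabulate missingness of the rate against the indicator.
print(buildings)
print(table(missing_rate  = is.na(buildings[["ticket_rate"]]),
            has_customers = buildings[["has_customers"]]))
```

In this toy example the cross-table has no off-diagonal counts, which is the confounding described above: a tree that splits on "ticket rate is missing" is really splitting on "no existing customers".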

# Good methods for density plots of non-negative variables in R?

    plot(density(rexp(100)))

Obviously all density to the left of zero represents bias.

I’m looking to summarize some data for non-statisticians, and I want to avoid questions about why non-negative data has density to the left of zero. The plots are for randomization checking; I want to show the distributions of variables by treatment and control groups. The distributions are often exponential-ish. Histograms are tricky for various reasons.

A quick Google search turns up work by statisticians on non-negative kernels, e.g. this.

But has any of it been implemented in R? Of implemented methods, are any of them “best” in some way for descriptive statistics?

EDIT: even if the from argument of density can solve my current problem, it’d be nice to know whether anyone has implemented kernels based on the literature on non-negative density estimation.
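
For illustration, a minimal base-R sketch of two simple options, neither of which is a purpose-built non-negative kernel. The simulated data and the object names `d_cut` and `d_ref` are assumptions made for this example. Option (a) only truncates the evaluation grid with `from = 0`; option (b) is the reflection method, which folds the mass that leaked below zero back onto the positive side.

```r
# Sketch: two base-R ways to keep a kernel density plot on [0, Inf).
set.seed(1)
x  <- rexp(100)      # simulated non-negative data, as in the question
bw <- bw.nrd0(x)     # default bandwidth rule, computed on the original data

# (a) Truncate the evaluation grid at zero. The estimate itself is unchanged,
#     so the mass that leaked below zero is simply not drawn.
d_cut <- density(x, bw = bw, from = 0)

# (b) Reflection: estimate on the data mirrored about zero, keep x >= 0,
#     and double the height so the curve integrates to about 1 on [0, Inf).
d_ref <- density(c(x, -x), bw = bw, from = 0)
d_ref[["y"]] <- 2 * d_ref[["y"]]

plot(d_ref, main = "Density of a non-negative variable")
lines(d_cut, lty = 2)
legend("topright", legend = c("reflection", "from = 0 only"), lty = c(1, 2))
```

The reflected curve has zero slope at the origin by construction, which can understate a density that peaks at zero (as the exponential does), so this is a descriptive convenience rather than a "best" estimator.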

# model average predictions in MuMIn

This code is taken from the MuMIn package:

    # Example from Burnham and Anderson (2002), page 100:
    library(MuMIn)
    options(na.action = "na.fail")  # dredge() requires na.action = "na.fail"

    data(Cement)
    fm1 <- lm(y ~ X1 + X2 + X3 + X4, data = Cement)

    ms1 <- dredge(fm1)
    confset.95p <- get.models(ms1, subset = cumsum(weight) <= .95)
    avgm <- model.avg(confset.95p)

    nseq <- function(x, len = length(x)) seq(min(x, na.rm = TRUE),
        max(x, na.rm = TRUE), length = len)

    # New predictors: X1 along the range of original data, other
    # variables held constant at their means
    newdata <- as.data.frame(lapply(lapply(Cement[1:4], mean), rep, 25))
    newdata\$X1 <- nseq(Cement\$X1, nrow(newdata))

    n <- length(confset.95p)

    # Predictions from each of the models in a set, and with averaged coefficients
    pred <- data.frame(
        model = sapply(confset.95p, predict, newdata = newdata),
        averaged.subset = predict(avgm, newdata, full = FALSE),
        averaged.full = predict(avgm, newdata, full = TRUE)
    )

It can be used to calculate model-averaged predictions from a set of models. I tried to apply this code to my own data as follows:

    dat2 <- dat
    colnames(dat2) <- c("y","x1","x2","x3","x4")
    require(mgcv)

    fm1 <- gam(y ~ s(x1) + x2 + x3 + x4,
        data = dat2, family = Gamma(link = "log"))

    ms1 <- dredge(fm1)
    confset.95p <- get.models(ms1, delta < 4)
    avgm <- model.avg(confset.95p)

    nseq <- function(x, len = length(x)) seq(min(x, na.rm = TRUE),
        max(x, na.rm = TRUE), length = len)

    # New predictors: x1 along the range of original data, other
    # variables held constant at their means
    newdata <- as.data.frame(lapply(lapply(dat2[2:5], mean), rep, 25))
    newdata\$x1 <- nseq(dat2\$x1, nrow(newdata))

    n <- length(confset.95p)

    # Predictions from each of the models in a set, and with averaged coefficients
    pred <- data.frame(
        model = sapply(confset.95p, predict, newdata = newdata),
        averaged.subset = predict(avgm, newdata, se.fit = TRUE,
            type = "link", backtransform = TRUE,
            full = FALSE)
    )

with the following data:

    > dat2
                y       x1       x2         x3    x4
    1   1.5762030 6.403121 39.57020 19.7898699 263.0
    4   0.6055921 6.724276 54.65370 57.6949810  69.0
    5   0.8590718 5.000000 54.39610 52.7292424  45.0
    6   3.5639570 4.026652 46.23046  0.6096747 483.0
    7   1.0913056 5.278754 54.50630 73.3446956 161.0
    10  0.8867792 5.378398 54.42850 60.8625755 252.0
    11  0.3268744 5.968483 54.53120 85.2143789 100.0
    12  1.2894085 5.348394 55.93400 12.2456428  28.0
    14  1.1676040 4.973128 56.49910 52.2045777 153.0
    15  0.6223934 6.672098 54.33810 78.6627861  44.0
    16  0.4755159 5.602060 52.99930 58.2748252 454.0
    17  0.9886397 5.413174 46.21046 44.1313399 512.0
    18  0.4495683 5.531479 46.00200 68.3861409 502.0
    19  4.8335367 3.742254 46.00760  3.1745636 503.0
    20  1.6357879 4.755875 52.79470 75.4007517 285.0
    23  1.6499405 5.230449 54.42720 50.1576069  53.0
    24  0.5430869 6.477121 54.52220 74.8263568 113.0
    25  0.3030517 7.374275 59.84600 67.0320046  11.0
    26  1.0962220 6.000000 54.35830 57.1209064  66.0
    27  0.4717264 6.522126 53.94300 38.2892886  14.0
    28  0.9813975 5.146128 57.54240  0.8229747  40.0
    32  1.3658448 5.041393 56.01810 83.5270211  82.0
    33  1.0920776 5.806180 54.44840 67.7056874  62.0
    34  1.4969232 5.001236 55.98500 10.0258844  50.0
    37  1.0955896 6.180047 45.37960 52.7292424 327.0
    40  4.1778522 3.895536 46.24369  0.3345965 516.0
    41  0.9372727 7.784617 45.72400 61.2626394 185.0
    42  8.3438712 3.392345 45.99453 14.9568619 499.0
    43  0.6453018 8.212901 32.81700 60.6530660   0.0
    44  1.7566555 4.707570 42.44085 58.8604970 274.0
    45  1.6449275 4.707570 53.01450 70.9739596 380.0
    46  0.5644657 6.024075 56.21300 57.4735034  96.0
    47  0.8434520 4.968483 57.49210 57.4735034 236.0
    48  0.8897043 6.046885 55.00210 44.9328964 214.0
    49  0.5134841 4.995635 56.95900 82.7394490 788.0
    50  1.0667866 5.045323 56.22730 64.6382635 418.0
    51  0.3824371 7.851258 56.10500 58.8604970   7.6
    52  0.6976290 5.778151 54.58250 60.6530660 121.0
    54  0.7800599 7.582999 43.10500 44.0431655 259.0
    56  2.7085846 4.769514 46.25745  7.2078462 506.0
    57  3.5114572 4.577871 45.99800  8.7160851 494.0
    58  0.8788488 6.033424 37.95440  2.7323722  33.0
    59  6.4662790 3.665018 46.00500  7.8866400 497.0
    61  0.8933495 5.986772 53.12800 61.8783392 316.0
    62  5.6335367 4.000000 54.37210 23.6927759  66.0
    63  0.6069901 5.698970 53.35400 42.7414932  27.0
    64  0.6707361 7.539076 38.03700 58.2748252 279.0
    65  0.3666130 7.902305 38.08200 53.2591801 280.0
    66  1.2659074 5.103804 55.09330 67.7606062 298.0
    68  0.4532456 7.296665 47.60000 71.1770323   9.0
    69  1.5443280 4.550228 57.27850 54.8811636 540.0
    70  0.8443011 5.002166 57.59120 87.8095431 600.0
    71  2.2252419 4.984077 57.51680 85.2143789 670.0
    73  1.3738172 5.492760 57.71260 77.8800783 570.0
    74  2.0682796 4.877947 57.49550 48.6752256 520.0
    75  3.7001075 4.729165 57.45240 69.7676326 770.0
    76  2.2255645 5.334655 57.68130 77.8800783 540.0
    77  1.2043548 4.748188 57.80980 67.7056874 690.0
    78  1.0832527 5.160168 56.88500 42.7414932 790.0
    80  1.0840667 4.633468 54.48120 84.3664817 602.0
    83  0.4719416 7.173722 46.02900 67.7056874 495.0
    84  0.7971067 5.791307 46.00800 67.7056874 497.0
    88  0.6787417 7.210824 43.39100 67.7056874 333.0
    89  0.5859315 7.612784 38.19800 92.3116346 298.0
    90  0.9309800 6.612784 52.88850 55.9898367 164.0
    92  6.6170833 4.020278 46.04100  5.5576213 495.0
    93  0.8570780 6.949390 54.57610 72.6149037 145.0
    94  0.4920395 6.371068 47.63900 71.8923733   5.0
    95  0.4367456 7.732394 38.76800 87.8095431 600.0
    96  0.6576827 7.944483 47.61800 81.8730753   0.0
    97  2.4264054 4.687151 46.23562 20.6387461 513.0
    98  0.5619723 6.826075 54.35000 63.7628152  39.0
    99  1.1319117 6.113943 43.05370  4.0762204 257.0
    100 1.8263940 5.214844 42.39800 27.2531793 275.0

I receive a warning message:

    Warning message:
    In predict.averaging(avgm, newdata, se.fit = TRUE, type = "link",  :
      argument 'full' ignored

which I take to mean that the function is not computing the model-averaged predictions from only the top set of models, but from all models. How can I fix this? Does anyone know why this warning appears, i.e. why can’t it do what I’m asking?

# Bayesian Updating Process, 3 signals and 2 states of the world

Suppose Nature chooses a state \$\omega \in \{X,Y\}\$ at \$t=0\$. Long-lived agents observe a signal \$s_t\$ in every period \$t\$, where \$s_t \in \{x,y,z\}\$. Agents all hold a common prior \$\mu_0 \in (0,1)\$ when starting their life.

Signals are independent over time. Furthermore, suppose a signal \$z\$ is observed with probability \$\alpha \in [0,1)\$ independent of the state \$\omega\$, and signals \$x\$ or \$y\$ are observed with probability \$1-\alpha\$. Finally, let the probability of observing a signal \$x\$ (\$y\$, respectively) when the state is \$X\$ (\$Y\$, resp.) be \$\beta > \frac{1}{2}\$.

QUESTION: If a Bayesian agent observes \$z\$ at \$t\$, how will she update her posterior, assuming Nature chose the state \$X\$?

To simplify the computations, say that the posterior of the agent at \$t-1\$ is \$\mu_{t-1} = \frac{3}{4}\$, \$\alpha = \frac{1}{4}\$ and \$\beta = \frac{2}{3}\$.
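
A sketch of the update via Bayes' rule, assuming (this is not stated in the question) that \$\mu_t\$ denotes the agent's probability that the state is \$X\$. Because \$z\$ arrives with probability \$\alpha\$ in both states, its likelihood cancels:
\$\$
\mu_t = \frac{\alpha\,\mu_{t-1}}{\alpha\,\mu_{t-1} + \alpha\,(1-\mu_{t-1})} = \mu_{t-1} = \frac{3}{4}
\$\$
For contrast, under the further assumption that \$\beta\$ is the probability of \$x\$ in state \$X\$ conditional on the signal not being \$z\$, observing \$x\$ instead would give
\$\$
\mu_t = \frac{(1-\alpha)\beta\,\mu_{t-1}}{(1-\alpha)\beta\,\mu_{t-1} + (1-\alpha)(1-\beta)(1-\mu_{t-1})} = \frac{\frac{2}{3}\cdot\frac{3}{4}}{\frac{2}{3}\cdot\frac{3}{4} + \frac{1}{3}\cdot\frac{1}{4}} = \frac{6}{7}
\$\$
In both cases the update depends only on the observed signal, not on which state Nature actually chose.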

# Null hypothesis of probit model Wald test

Say I estimate the following probit model:

\$\$ins = \Phi(\alpha + \beta_1 age + \beta_2 educ + \beta_3 hg + \beta_4 chronic + \beta_5 hisp + \beta_6 lin) + u\$\$

where:

\$ins = 1\$ for any individual who has private health insurance, \$0\$ otherwise.

age = age in years.

educ = years of schooling.

\$hg=1\$ if health status self-assessed as good, 0 otherwise.

chronic = number of chronic conditions an individual has.

\$hisp=1\$ if Hispanic, 0 otherwise.

lin = natural log of household income.

I was wondering: if I were to perform a Wald test of the hypothesis that, for an individual with otherwise median characteristics, the marginal effect of age is unaffected by the number of chronic conditions the individual has, how do I derive the null hypothesis? (P.S. I can conduct the Wald test quite easily in Stata, but my question is how to state the null hypothesis.)
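
One possible way to write the null, offered as a sketch and not as the definitive formulation: assume \$P(ins = 1 \mid x) = \Phi(v)\$ with index \$v = \alpha + \beta_1 age + \dots + \beta_6 lin\$, treat chronic as continuous, and let \$\tilde{v}\$ be the index evaluated at the median characteristics (the symbols \$v\$ and \$\tilde{v}\$ are introduced here for illustration only). With \$\phi\$ the standard normal density and \$\phi'(v) = -v\,\phi(v)\$:
\$\$
\frac{\partial P}{\partial age} = \beta_1\,\phi(v), \qquad \frac{\partial^2 P}{\partial age\,\partial chronic} = -\beta_1 \beta_4\, v\,\phi(v)
\$\$
\$\$
H_0:\; \beta_1 \beta_4\, \tilde{v}\,\phi(\tilde{v}) = 0
\$\$
This is a nonlinear restriction on the coefficients. Because chronic is a count, a discrete alternative would compare \$\beta_1\,\phi(v)\$ evaluated at chronic \$= c\$ and chronic \$= c+1\$ and test that the difference is zero.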

# how one independent variable affects other variable in multiple regression

I have two independent predictors in a regression equation, “X1” and “X2”. X2 is categorical, so its allowed values are 1 and 0. I have to find how X1 is affected by the value of X2 in the following regression equation:

\$\$ Y = 101.00 + 0.40 \cdot X1 + 0.10 \cdot X2 \$\$

I have the SPSS coefficient table, in which

\$\$ R^2 = 0.65 \$\$

X1

• Zero-order correlation: 0.81
• Squared partial correlation: 0.66 (if I keep X2 constant, this is its contribution to Y)
• Part correlation: 0.619 (its unique contribution)

X2

• Zero-order correlation: 0.24
• Squared partial correlation: 0.064 (if I keep X1 constant, this is its contribution to Y)
• Part correlation: 0.022 (its unique contribution)

Following a YouTube tutorial, if I subtract the sum of the part values from R-squared, I get the overlap region, but how should I interpret that region? How can I derive any relationship between X1 and X2 from the given data?
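
As a worked check of that subtraction, and assuming (as the subtraction implies) that 0.619 and 0.022 are squared part correlations, the overlap would be
\$\$
R^2 - (0.619 + 0.022) = 0.65 - 0.641 = 0.009
\$\$
Separately, because the fitted equation has no X1-by-X2 interaction term, substituting the two allowed values of X2 gives two parallel lines:
\$\$
X2 = 0:\; Y = 101.00 + 0.40 \cdot X1, \qquad X2 = 1:\; Y = 101.10 + 0.40 \cdot X1
\$\$
so in this equation the slope on X1 is 0.40 at either value of X2, and only the intercept shifts by 0.10.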

# Unbalanced groups in a Hierarchical Linear Model

I would appreciate some practical and/or conceptual advice on sample sizes in a two level hierarchical linear model.

A lot of the material on sample sizes and HLM is about the number of level 2 groups. I am more interested in sample sizes for level 1.

The dataset I am working with is about building developments (level 1) that are situated in a Statistical Area One (SA1) (level 2). An SA1 is a geographical area used by the Australian Bureau of Statistics.

The nominal sample sizes are
- 6,861 SA1s
- 54,295 building redevelopments

The aim of the study is to understand why building redevelopments with certain dwelling yields occur.

I’m using an existing (secondary) dataset for my study. As far as I can see, my main alternatives are to sample from the dataset in some way, or to change the level of aggregation (i.e. use SA1 as level 1 and SA2 as level 2, etc.).

On average, there are 7.9 redevelopments per SA1; however, the distribution of redevelopments among SA1s is highly skewed. The number of redevelopments per SA1 ranges from 1 to 895. The median number of redevelopments is 4, and the mode is 1, with 1,506 SA1s having one redevelopment.

The relatively few SA1s with high numbers are what the study is all about, and I’m contrasting these SA1s with SA1s with low numbers of redevelopments to understand how the SA1 characteristics differ. As SA1s are geographical units, I’m also accounting for spatial constraints.

From my reading on the subject, most of the commentary concerns the size of the level 2 sample, and it is generally assumed that the level 1 samples are balanced.

The questions I’d like to ask are:

• What is the impact on sample size / power if the level 1 and level 2 units are coterminous to the extent they are in this dataset, where many SA1s have only one redevelopment? Bickel is the only author I found who commented on this possibility (p 272), but he gives no advice apart from suggesting it is not good.
• Is there a minimum level 1 sample size that should be followed?
• How can I best address the severely unbalanced groups (with groups having 1 to 895 redevelopments)? The only comment I could find on this was by Browne, who ran a simulation test on balanced, unbalanced and severely unbalanced level 1 samples, and found “strange behaviour” for the unbalanced samples. He commented that the unbalanced designs are really estimating the effects of the large group instead of the global mean.

Generally, how can one best approach a severely unbalanced hierarchical dataset such as this?

References

Robert Bickel, Multilevel Analysis for Applied Research, 2007, The Guilford Press.

William Browne, PowerPoint slides: “Sample Size calculations in multilevel modeling”

# how to estimate heterogeneous effects?

I have a dataset where I run regression discontinuity with the following code:

    xi: reg work post i.post|m i.post|m2 age age2 age3 immig primary hsgrad univ sib pleave i.q if m>-10 & m<9, robust

Here primary, hsgrad and univ are regressors referring to educational status. How can I run the same regression but only for subjects who have completed university? (univ is a dummy variable equal to 1 if the subject has a university degree.) Do I need to remove the other regressors referring to education from the equation?

# Practical advantage of requiring VPN for SSH

I have some VPS servers which I manage through SSH. However, there are many ports which I want open during development, so I also run OpenVPN on them.

Now I am wondering whether it would make sense to require an OpenVPN connection for SSH access. Is there any practical security advantage to an SSH connection through a VPN (on the same server) compared with direct SSH access?