Causal Discovery with Cascade Nonlinear Additive Noise Models

1 Introduction

Understanding causal relationships is a fundamental problem in various scientific disciplines, and identifying the causal direction is an essential issue in causality studies. It is well known that using randomized experiments to identify causal influences often raises ethical issues or incurs substantial expense. Fortunately, inferring causal relations from pure observations, also known as causal discovery from observational data, has demonstrated its power in empirical studies and has been a focus of causality research.

Various methods have been proposed to infer the causal direction by exploring properly constrained forms of functional causal models (FCMs). A functional causal model represents the effect $Y$ as a function of its direct causes $X$ and independent noise $E$, i.e., $Y = f(X, E)$. Without constraints on $f$, for any two variables one can always express one of them as a function of the other and independent noise [Zhang et al.2015]. However, it is interesting to note that with properly constrained FCMs, the causal direction between $X$ and $Y$ is identifiable, because the independence condition between the noise and cause holds only for the true causal direction and is violated for the wrong direction. Such FCMs include the Linear, Non-Gaussian, Acyclic Model (LiNGAM) [Shimizu et al.2006], in which $Y = bX + E$ with linear coefficient $b$; the nonlinear additive noise model (ANM) [Hoyer et al.2009], in which $Y = f(X) + E$; and the post-nonlinear (PNL) causal model [Zhang and Hyvärinen2009], which also considers possible nonlinear sensor or measurement distortion in the causal process: $Y = g(f(X) + E)$. It has been shown that in the generic case, for data generated by the above FCMs, the reverse direction does not admit the same FCM class with independent noise. One can then find the causal direction by estimating the FCM and testing for independence between the hypothetical cause and the estimated noise [Hoyer et al.2009, Zhang and Hyvärinen2009].
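To make this procedure concrete, the following is a minimal sketch of ANM-based direction inference. The regressor, the Gaussian-kernel bandwidth, and the biased HSIC estimator are illustrative choices of ours, not a reference implementation:

```python
# Minimal sketch of ANM-based direction inference: fit y = f(x) + e by
# nonparametric regression, then measure the dependence between the
# hypothetical cause and the residual with a (biased) HSIC statistic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def hsic(a, b, sigma=1.0):
    """Biased HSIC estimate with Gaussian kernels; smaller = more independent.
    The bandwidth sigma is a heuristic choice here."""
    n = len(a)
    K = np.exp(-np.subtract.outer(a, a) ** 2 / (2 * sigma ** 2))
    L = np.exp(-np.subtract.outer(b, b) ** 2 / (2 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def residual_dependence(cause, effect):
    reg = GradientBoostingRegressor().fit(cause.reshape(-1, 1), effect)
    residual = effect - reg.predict(cause.reshape(-1, 1))
    return hsic(cause, residual)

def anm_direction(x, y):
    # The direction whose residual is more independent of the input wins.
    return "x->y" if residual_dependence(x, y) < residual_dependence(y, x) else "y->x"
```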

In reality, we can usually record only a subset of all variables that are causally related. If some variable is the direct cause of only one measured variable and is itself not measured, it is considered part of the omitted factors, or noise. If a hidden variable is a direct cause of two measured variables, it is a confounder, and causal discovery in the presence of confounders is challenging, although there exist methods with asymptotic correctness guarantees, such as the FCI algorithm [Spirtes et al.2000]. In this paper, we are concerned with unmeasured intermediate causal variables. Suppose $X \to Z \to Y$, with $Z$ unmeasured, and that each direct causal influence can be represented by an FCM in a certain class. If the direct causal relations are linear with additive noise, then the causal influence from $X$ to $Y$ still follows a linear model with additive noise. However, if each direct causal influence follows the ANM, the causal influence from $X$ to $Y$ does not necessarily follow the same model class. Fig. 1 illustrates this phenomenon of "non-transitivity of nonlinear causal model classes," in which $Z = f_1(X) + E_1$ and $Y = f_2(Z) + E_2$, with $X$, $E_1$, and $E_2$ mutually independent and each following a uniform distribution. As seen from the heterogeneity of the noise in $Y$ relative to $X$, given in Fig. 1(c), the causal influence from $X$ to $Y$ clearly does not admit a nonlinear model with additive noise. Hence, even for the correct causal direction, which is from $X$ to $Y$, the independent noise condition is violated, and existing methods that determine the causal direction by checking whether the regression residual is independent from the hypothetical cause may fail. The PNL model is more general than the additive noise model; in this example, if $E_2$ is zero, then $Y$ follows a PNL model in $X$. However, the PNL model class is also non-transitive.


Figure 1: Illustration of non-transitivity of nonlinear causal model classes, in which $X \to Z \to Y$ and each direct causal influence follows a nonlinear model with additive noise. Panels (a), (b), and (c) show the scatter plots of $X$ and $Z$, of $Z$ and $Y$, and of $X$ and $Y$, respectively.
Figure 2: Illustration of the CANM, where the causal chain from $X$ to $Y$ consists of three unmeasured intermediate variables $Z_1, Z_2, Z_3$ with their associated noises.
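To make the phenomenon concrete, the following small simulation generates a two-link cascade in the spirit of Fig. 1. The particular nonlinearities and noise scales are illustrative assumptions of ours, not the functions used to produce the figure:

```python
# Simulation in the spirit of Fig. 1: each link X->Z and Z->Y follows a
# nonlinear ANM, yet the composite X->Y does not, because the effective
# noise on Y given X is heteroscedastic.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1, 1, n)
e1 = rng.uniform(-1, 1, n)
e2 = rng.uniform(-1, 1, n)
z = np.tanh(3 * x) + e1          # first ANM link:  Z = f1(X) + E1
y = np.exp(-z ** 2) + 0.1 * e2   # second ANM link: Y = f2(Z) + E2

# Crude check: the spread of Y around its mean within each X-bin varies
# with X, so the residual of any regression of Y on X cannot be
# independent of X.
bins = np.digitize(x, np.linspace(-1, 1, 10))
for b in range(1, 10):
    sel = bins == b
    print(f"bin {b}: std of Y within bin = {y[sel].std():.3f}")
```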

This paper deals with such indirect, nonlinear causal relations, which seem to be ubiquitous in practice. Finding the causal direction for such relations has recently been posed as an open problem [Spirtes and Zhang2016]. In particular, we aim to find the causal direction between $X$ and $Y$ generated according to the process given in Fig. 2, in which there may be a number of unmeasured intermediate causal variables $Z_i$ between $X$ and $Y$, and each direct causal influence, e.g., that from $X$ on $Z_1$ or from $Z_i$ on $Z_{i+1}$, follows the ANM. We name the causal model from $X$ to $Y$ given in Fig. 2 a Cascade Additive Noise Model (CANM). We note that the considered problem is different from causal discovery in the presence of confounders, for which there have been a number of studies, including the FCI [Spirtes et al.2000], RFCI [Colombo et al.2012], and M3B [Yu et al.2018] algorithms, and methods relying on stronger assumptions [Janzing et al.2009, Zhang et al.2010]. [Kocaoglu et al.2018] propose an algorithm to search for the latent variable along the causal path, but they consider only discrete random variables.

To the best of our knowledge, this is the first study of finding the causal direction between indirectly and nonlinearly related variables. The considered causal model can be seen as a cascade of processes, each of which follows the ANM, with the intermediate variables unmeasured. Intuitively, the independence between the noise and cause is still helpful in finding the causal direction: in the generic case the wrong direction will not satisfy the independent noise condition, allowing us to correctly identify the causal direction. This will be supported by our theoretical and empirical results in subsequent sections.

2 Cascade Additive Noise Model

Without loss of generality, let $X$ be the cause of the effect $Y$ ($X \to Y$), with unmeasured intermediate variables between them, as shown in Fig. 2. We further assume that there is no confounder in the mechanism and that the data generation follows the nonlinear additive noise assumption. Then such an indirect causal mechanism can be formalized by the CANM in the following definition.

Definition 1 .

A CANM for cause $X$ and effect $Y$ posits a sequence of unmeasured intermediate variables $Z_1, \ldots, Z_L$ between $X$ and $Y$ such that no later variable in the sequence is a cause of an earlier one:

$$Z_1 = f_1(X) + N_1, \qquad Z_i = f_i(\mathrm{PA}_{Z_i}) + N_i \ (i = 2, \ldots, L), \qquad Y = f_{L+1}(\mathrm{PA}_Y) + N_{L+1},$$

where $X, N_1, \ldots, N_{L+1}$ are mutually independent, $L$ denotes the depth of the chain, and $\mathrm{PA}_{Z_i}$ and $\mathrm{PA}_Y$ denote the parents of $Z_i$ and $Y$, respectively. To ensure the cascade structure, the causal relations among $X, Z_1, \ldots, Z_L, Y$ are recursive. Let $\mathbf{f} = \{f_1, \ldots, f_{L+1}\}$ and $\mathbf{N} = \{N_1, \ldots, N_{L+1}\}$ denote the set of nonlinear functions and the corresponding additive noises at each depth of the chain, respectively. Naturally, the direct cause $X$ and the noises are independent from each other.
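For intuition, here is a minimal forward sampler for a chain-structured CANM; the nonlinearities and noise scales below are illustrative assumptions:

```python
# Minimal forward sampler for a chain-structured CANM of depth L:
# Z_1 = f_1(X) + N_1, Z_i = f_i(Z_{i-1}) + N_i, Y = f_{L+1}(Z_L) + N_{L+1}.
import numpy as np

def sample_canm(n, depth, rng=np.random.default_rng(0)):
    funcs = [np.tanh, np.sin, lambda t: t ** 3, np.cos]  # illustrative nonlinearities
    x = rng.normal(0, 1, n)
    z = x
    for i in range(depth + 1):                # depth intermediate links plus final link to Y
        f = funcs[i % len(funcs)]
        z = f(z) + 0.2 * rng.normal(0, 1, n)  # additive noise at every layer
    return x, z                               # (cause X, effect Y)

x, y = sample_canm(5000, depth=3)
```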

We are given a set of data $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$. Let $\theta$ be the parameters of the causal mechanism. Combining all the independence relations of the CANM, we can derive its marginal log-likelihood as follows:

$$\mathcal{L}(\theta; X, Y) = \sum_{i=1}^{m} \log p_\theta\big(x^{(i)}, y^{(i)}\big) = \sum_{i=1}^{m} \Big[ \log p\big(x^{(i)}\big) + \log p_\theta\big(y^{(i)} \mid x^{(i)}\big) \Big] = \sum_{i=1}^{m} \Big[ \log p\big(x^{(i)}\big) + \log \int p(N)\, p_\theta\big(y^{(i)} \mid x^{(i)}, N\big)\, dN \Big]. \tag{2}$$

Eq. 2 first decomposes the joint likelihood based on the Markov condition [Spirtes et al.2000] and then applies the independence between the cause and the noise in the second equality, i.e., $p(N \mid x) = p(N)$. At the same time, we replace the latent variables $Z$ with the noises $N$ and rewrite the function $f_{L+1}(\mathrm{PA}_Y)$ as $f(X, \mathbf{N})$, because the last unobserved direct cause $Z_L$ contains all the information of the noise and cause relative to $Y$.

In the above derivation, we used the transformation from $X$ and the noises $\mathbf{N}$ to $Y$. This property of the transformation helps to study identifiability and to find a practical solution. In light of the independence property of the noises, below we propose a variational approach to approximate the marginal log-likelihood as well as to identify the causal direction.

2.1 Variational Solution of CANM

The variational solution to the estimation of CANM consists of two steps. First, we take advantage of the independence property in CANM to replace the latent variables $Z$ with the noises $N$. Second, we find an amortized inference distribution $q_\phi(N \mid X, Y)$ with parameter $\phi$ to approximate the intractable posterior $p_\theta(N \mid X, Y)$, and we jointly optimize a variational lower bound of the marginal log-likelihood. Note that, different from the vanilla VAE, $Y$ can be seen as a function of $X$ and $N$; as a result, $N$ is a function of $X$ and $Y$, and we need to recover $N$ from both $X$ and $Y$. According to Eq. 2, the marginal log-likelihood is the sum of the marginal log-likelihoods over individual data points:

$$\mathcal{L}(\theta; X, Y) = \sum_{i=1}^{m} \log p_\theta\big(x^{(i)}, y^{(i)}\big) \geq \sum_{i=1}^{m} \mathcal{L}\big(\theta, \phi; x^{(i)}, y^{(i)}\big),$$

where $\mathcal{L}(\theta, \phi; x^{(i)}, y^{(i)})$ is the lower bound at data point $(x^{(i)}, y^{(i)})$, resulting from approximating the intractable posterior $p_\theta(N \mid x^{(i)}, y^{(i)})$ by $q_\phi(N \mid x^{(i)}, y^{(i)})$. Under the framework of CANM, the lower bound at each data point can be further written as follows, with the total lower bound being the sum of these terms:

$$\mathcal{L}(\theta, \phi; x, y) = \log p(x) + \mathbb{E}_{q_\phi(N \mid x, y)}\big[\log p_\theta(y \mid x, N)\big] - D_{\mathrm{KL}}\big(q_\phi(N \mid x, y)\,\big\|\,p(N)\big). \tag{3}$$

The details of the derivation can be found in Supplementary A. As shown in Eq. 3, the lower bound is tight at $q_\phi(N \mid x, y) = p_\theta(N \mid x, y)$. That is, when the approximate posterior matches the true posterior, the lower bound is equal to the marginal log-likelihood. Below we will maximize this variational lower bound.

Here, we assume that the distribution of the noises factorizes as $p(N) = \prod_i p(N_i)$. Note that if $N$ is an empty set, the above lower bound reduces to the log-likelihood of the standard additive noise model.
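For completeness, the bound in Eq. 3 follows from the standard Jensen-inequality argument, written here in the present notation:

```latex
% Standard ELBO derivation via Jensen's inequality, in the present notation.
\begin{align*}
\log p_\theta(x, y)
  &= \log \int p(x)\, p(N)\, p_\theta(y \mid x, N)\, dN \\
  &= \log \mathbb{E}_{q_\phi(N \mid x, y)}
       \left[ \frac{p(x)\, p(N)\, p_\theta(y \mid x, N)}{q_\phi(N \mid x, y)} \right] \\
  &\ge \log p(x)
     + \mathbb{E}_{q_\phi(N \mid x, y)} \left[ \log p_\theta(y \mid x, N) \right]
     - D_{\mathrm{KL}}\!\left( q_\phi(N \mid x, y) \,\|\, p(N) \right)
   = \mathcal{L}(\theta, \phi; x, y),
\end{align*}
% with equality iff q_phi(N | x, y) = p_theta(N | x, y).
```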

2.2 Variational Auto-encoder

Figure 3: Toy Example for CANM Variational Auto-encoder.

The design of the variational auto-encoder (VAE) generally follows the typical configuration in [Kingma and Welling2014]. We use $q_\phi(N \mid X, Y)$ as the encoder and $p_\theta(Y \mid X, N)$ as the decoder, with a multilayer perceptron (MLP) serving as a universal approximator for these two functions.

In the encoder phase, the noises of the CANM are inferred by an encoder network with the reparameterization trick. That is, we reparameterize the random variable $N \sim q_\phi(N \mid x, y)$ with a differentiable transformation $N = g_\phi(\epsilon, x, y)$, where $\epsilon \sim \mathcal{N}(0, I)$. The expectation in the lower bound can then be estimated by Monte Carlo sampling with the reparameterization trick.

In the decoder phase, we estimate $\log p_\theta(y \mid x, N)$ by calculating the difference between the sample $y$ and the reconstruction $\hat{y} = \hat{f}(x, N)$ from the decoder, which corresponds to the final-layer noise $N_{L+1} = y - \hat{f}(x, N)$. Finally, by alternating between the encoder and decoder phases, we optimize the lower bound until it converges.

Fig. 3 shows a toy example of the structure of the CANM variational auto-encoder with $L = 1$, where the encoder and decoder networks are deterministic functions with parameters $\phi$ and $\theta$, respectively. In the encoder phase, we encode the samples $(x, y)$ into the noises $N$ using the reparameterization trick, where $N = \mu_\phi(x, y) + \sigma_\phi(x, y) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. In the decoder phase, the sample $y$ is reconstructed by the decoder $\hat{y} = \hat{f}_\theta(x, N)$.
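A hedged sketch of such an architecture in PyTorch is given below; the layer sizes, ReLU activations, and unit-variance Gaussian observation model are our illustrative assumptions, not the paper's exact configuration:

```python
# Sketch of a CANM-style VAE. Encoder: q_phi(N | x, y); decoder: p_theta(y | x, N).
import torch
import torch.nn as nn

class CANMVAE(nn.Module):
    def __init__(self, n_latent, hidden=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * n_latent))  # -> (mu, logvar)
        self.dec = nn.Sequential(nn.Linear(1 + n_latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))             # y_hat = f(x, N)

    def elbo(self, x, y):
        # x, y: float tensors of shape (batch, 1)
        mu, logvar = self.enc(torch.cat([x, y], dim=1)).chunk(2, dim=1)
        eps = torch.randn_like(mu)               # reparameterization trick
        n = mu + eps * torch.exp(0.5 * logvar)   # N = g_phi(eps, x, y)
        y_hat = self.dec(torch.cat([x, n], dim=1))
        # Gaussian log-likelihood (unit variance, up to an additive constant)
        recon = -0.5 * ((y - y_hat) ** 2).sum(dim=1)
        # KL( N(mu, sigma^2) || N(0, I) ) in closed form
        kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)
        # The direction-dependent log p(x) term must be added separately.
        return (recon - kl).mean()
```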

2.3 Practical Algorithm

Finally, we propose a general procedure that makes use of the VAE to estimate the marginal log-likelihood as well as to identify the causal direction.

Input: Data samples $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$.

Output: The causal direction.

1: Split the data into training and test sets;

2: Choose the best number of latent noises by optimizing the variational lower bound (Eq. 3) on the training set and evaluating the performance on the test set;

3: Optimize the lower bound in both directions with the best number of latent noises on the full dataset, obtaining $\mathcal{L}_{X \to Y}$ and $\mathcal{L}_{Y \to X}$ (see Eq. 4), respectively;

4: if $\mathcal{L}_{X \to Y} - \mathcal{L}_{Y \to X} > \epsilon$, where $\epsilon$ is a pre-assigned small positive number, then

5:      Infer $X \to Y$

6: else if $\mathcal{L}_{Y \to X} - \mathcal{L}_{X \to Y} > \epsilon$ then

7:      Infer $Y \to X$

8: else

9:      Non-identifiable

10: end if

Algorithm 1 Inferring causal direction with CANM

Algorithm 1 consists of two phases: the first is model selection, which selects the best number of latent noises, and the second identifies the causal direction. In phase 1, by splitting the data into training and test sets, the best number of noises is selected based on performance on the test set (Lines 1-2). In phase 2, we use the number of latent noises determined in phase 1 to optimize the variational lower bound on the full dataset and then identify the causal direction according to the likelihoods of both directions (Lines 3-10).
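A sketch of Algorithm 1 on top of the CANMVAE class above follows; the training schedule and threshold are illustrative assumptions:

```python
# Fit both candidate directions and compare marginal-likelihood lower bounds.
import torch

def fit_direction(cause, effect, n_latent, steps=2000, lr=1e-2):
    model = CANMVAE(n_latent)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -model.elbo(cause, effect)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model.elbo(cause, effect).item()

def infer_direction(x, y, n_latent, eps=1e-2):
    # Note: for a fair comparison the direction-dependent marginal term
    # log p(cause) must be added to each bound, e.g. via a kernel density
    # estimate; it is omitted here for brevity.
    l_xy = fit_direction(x, y, n_latent)
    l_yx = fit_direction(y, x, n_latent)
    if l_xy - l_yx > eps:
        return "X -> Y"
    if l_yx - l_xy > eps:
        return "Y -> X"
    return "non-identifiable"
```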

3 Identifiability

In this section, we investigate whether there exist CANMs whose generated data also admit a CANM in the reverse (anti-causal) direction. In the following theorem, we derive the noise distribution of the reverse direction by making use of the theory of the Fourier transform [Bracewell and Bracewell1986]. The causal direction is unidentifiable under the CANM if the reverse-direction noise $\tilde{N}$ is independent from the hypothetical cause $Y$ (i.e., the marginal likelihoods for both directions are equal).

Theorem 1 .

Let $X \to Y$ follow the cascade additive noise model. If there exists a backward model of the same form, i.e.,

$$X = g(Y) + \tilde{N}, \qquad \tilde{N} \perp Y, \tag{6}$$

then the noise distribution of the reverse direction must be

$$p_{\tilde{N}}\big(x - g(y)\big) = \frac{p(x)\, p(y \mid x)}{p(y)},$$

where $g$ denotes the function implied by the cascade process.

Proof.

See Supplementary B for a proof. ∎

Roughly speaking, setting aside the linear case, Theorem 1 implies that the noise distribution in the reverse direction generally depends on $y$. To ensure that such noise is independent from $Y$, one strict condition must hold: the right-hand side above must depend on $(x, y)$ only through the difference $\tilde{n} = x - g(y)$. In general, such a condition holds only in restrictive cases. Therefore, in most cases, after the latent noise is recovered, we can identify the causal direction by using the independence property of the noise and the cause.
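To see where the Fourier transform enters, note that under the CANM the forward conditional is a convolution. The following sketch of this key step (the notation $p_{f(x,N)}$, for the density of $f(x, N)$ with $x$ fixed, is ours) prefigures the proof in Supplementary B:

```latex
% Under the CANM, p(y|x) is a convolution of the last-layer noise density
% with the density of f(x, N) for fixed x, so its Fourier transform in y
% factorizes by the convolution theorem:
\begin{align*}
p(y \mid x) &= \int p_{N_{L+1}}\big(y - f(x, n)\big)\, p(n)\, dn
             = \big(p_{N_{L+1}} * p_{f(x, N)}\big)(y), \\
\mathcal{F}_y\{p(y \mid x)\}(\omega)
            &= \mathcal{F}\{p_{N_{L+1}}\}(\omega) \cdot
               \mathcal{F}\{p_{f(x, N)}\}(\omega).
\end{align*}
```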

To further illustrate the implications of Theorem 1, we provide two special cases in the following corollaries. In Corollary 1, we show that the CANM is unidentifiable if the generating process is linear Gaussian. In Corollary 2, we show the connection with the ANM when there are no unmeasured intermediate variables, and show that generic choices of the function, the cause distribution, and the noise distribution make the model identifiable. These two special cases are consistent with previous results.

Corollary 1 .

Assume that the CANM is linear Gaussian, i.e.,

$$Y = \beta X + N, \qquad X \sim \mathcal{N}(\mu_X, \sigma_X^2),$$

where $N \sim \mathcal{N}(0, \sigma_N^2)$ (a cascade of linear Gaussian links always composes into this form); then there exists a backward CANM

$$X = \tilde{\beta} Y + \tilde{N},$$

where $\tilde{\beta} = \frac{\beta \sigma_X^2}{\beta^2 \sigma_X^2 + \sigma_N^2}$ and $\tilde{N}$ is Gaussian and independent of $Y$.

Proof.

See Supplementary C for a proof. ∎
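A quick numerical check of Corollary 1, with illustrative coefficients: in the linear-Gaussian case the backward residual is uncorrelated with, and hence (being jointly Gaussian) independent of, the hypothetical cause:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 1.0, n)            # X ~ N(0, 1), so sigma_X = 1
noise = rng.normal(0.0, 0.5, n)        # N ~ N(0, 0.25)
beta = 2.0
y = beta * x + noise                   # forward linear-Gaussian model

# backward coefficient: beta * sigma_X^2 / (beta^2 sigma_X^2 + sigma_N^2)
beta_rev = beta * 1.0 / (beta ** 2 * 1.0 + 0.25)
resid = x - beta_rev * y               # backward residual
print(np.corrcoef(y, resid)[0, 1])     # ~ 0: backward ANM with independent noise
```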

Corollary 2 .

Suppose that there are no unmeasured intermediate noises in the CANM. If the solution of Eq. 6 exists, then the triple $(f, p_X, p_N)$ must satisfy the differential equation from the ANM [Hoyer et al.2009, Theorem 1] for all $x, y$ with $\nu''\big(y - f(x)\big) f'(x) \neq 0$:

$$\xi''' = \xi'' \left( -\frac{\nu''' f'}{\nu''} + \frac{f''}{f'} \right) - 2 \nu'' f'' f' + \nu' f''' + \frac{\nu' \nu''' f'' f'}{\nu''} - \frac{\nu' (f'')^2}{f'},$$

where $\xi := \log p_X$ and $\nu := \log p_N$.

Proof.

See Supplementary D for a proof. ∎

4 Experiments

4.1 Synthetic Data

In this section, we design three experiments with known ground truth: varying the depth, varying the sample size, and varying the sample size for a fixed structure. All experimental results are averaged over 1000 causal pairs randomly generated by the cascade additive noise model. Code for CANM is available online at https://github.com/DMIRLAB-Group/CANM.

To make the synthetic data sufficiently general, at each depth we randomly generate an additive noise model and stack these models together to obtain the cascade additive noise model. In detail, the cause $X$ is sampled from a random Gaussian mixture model with 3 components. For each layer, the function is generated by cubic spline interpolation, using a 6-point grid as input with 6 randomly generated points as knots for the interpolation; the number of knots is used to control the non-linearity of the function. This generative process follows the procedure given in [Prestwich et al.2016].
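A hedged sketch of this generator follows; the grid range, knot distribution, and noise scale are illustrative assumptions, since the exact settings are not reproduced here:

```python
# Random spline-based ANM layers stacked into a cascade, as described above.
import numpy as np
from scipy.interpolate import CubicSpline

def random_layer_function(rng, n_knots=6, lo=-3.0, hi=3.0):
    grid = np.linspace(lo, hi, n_knots)     # fixed 6-point input grid (range assumed)
    knots = rng.normal(0.0, 1.0, n_knots)   # random knot values control non-linearity
    return CubicSpline(grid, knots)

def sample_cascade(n, depth, rng=np.random.default_rng(0)):
    x = rng.normal(0, 1, n)                 # stand-in for the Gaussian-mixture cause
    z = x
    for _ in range(depth + 1):
        f = random_layer_function(rng)
        z = f(z) + 0.2 * rng.normal(0, 1, n)  # stack an ANM at each depth
    return x, z
```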

The following four algorithms are taken as baselines: ANM [Hoyer et al.2009], CAM [Bühlmann et al.2014], IGCI [Janzing et al.2012], and LiNGAM [Shimizu et al.2006]. We also improve the implementation of ANM by using XGBoost [Chen and Guestrin2016] for regression and the Hilbert-Schmidt independence criterion (HSIC) [Gretton et al.2008] as the independence test. ANM is therefore evaluated in two ways: first, we compare the HSIC statistics of the two directions to determine the direction; second, we select the best significance level ($\alpha$), ranging from 0.01 to 1, to determine the causal direction. The best parameter setting of IGCI is likewise chosen. For the other baseline methods, we use the parameter settings from their original papers. The implementations and parameter settings of LiNGAM and CAM are based on the CompareCausalNetworks package in R [Heinze-Deml et al.2018].
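The two evaluation modes for ANM can be summarized as follows (a sketch assuming the HSIC statistics and p-values for both directions have already been computed):

```python
def decide_by_statistic(hsic_xy, hsic_yx):
    # Smaller HSIC statistic = residual more independent of the input.
    return "X -> Y" if hsic_xy < hsic_yx else "Y -> X"

def decide_by_level(p_xy, p_yx, alpha):
    # Accept a direction only if its independence test is not rejected
    # at level alpha while the other direction is rejected.
    if p_xy > alpha >= p_yx:
        return "X -> Y"
    if p_yx > alpha >= p_xy:
        return "Y -> X"
    return "undecided"
```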

Figure 4: Sensitivity to Depth.

Sensitivity to Depth: Fig. 4 shows the accuracy at different depths with 3000 samples. First, when the depth is equal to 0 (the original additive noise model), CANM, ANM, and CAM all achieve high accuracy. Note that CANM performs similarly to ANM even though CANM allows for unmeasured intermediate variables, which demonstrates the robustness of our method. Second, as the depth increases, the accuracy of CANM remains stable at around 90% with only a slight decrease, while the performance of the other methods drops rapidly as the depth grows. In particular, ANM with a significance level of 0.01 makes almost random decisions when a cascade structure exists.

Figure 5: Sensitivity to Sample.

Sensitivity to Sample Size: Fig. 5 shows the accuracy at different sample sizes with the depth fixed at 3. The results show that even with small sample sizes, CANM still outperforms the other methods. As the sample size increases, the accuracy of CANM grows faster than that of the other methods. Thus, large samples benefit CANM, owing to the variational auto-encoder framework it employs. A similar trend can also be observed for ANM and CAM, while the other methods are less sensitive to the sample size due to their model restrictions.

Figure 6: Sensitivity to Sample in a Fixed Structure.

Sensitivity to Sample Size in a Fixed Structure: Fig. 6 shows the accuracy at different sample sizes using a fixed causal mechanism, randomly generated with depth 3. When the sample size is small, the variance of the likelihood is large; however, the asymmetry between the causal directions is still clear. As the sample size increases, the variance of the likelihood decreases and the accuracy increases, which demonstrates the effectiveness and robustness of CANM as the sample size grows.

4.2 Real World Data

Electricity consumption: The electricity consumption dataset [Prestwich et al.2016] contains 9504 hourly measurements from the energy industry, including the hour of day, the outside temperature, and the electricity load on the power station. The causal mechanism among the three variables is hour of day $\to$ temperature and temperature $\to$ electricity load. The first pair is explained by the heating effect of sunlight, and the second is based on the fact that the usage of heating or air conditioning depends on the temperature. We are interested in whether we can identify that the hour of day is the cause of the electricity load, and what intermediate variable is inferred by CANM.

Figure 7: Hour of Day Against Electricity Load.
Figure 8: Temperature Against Fitted Intermediate Variable.

In general, we successfully identify the correct causal direction, while ANM fails on this pair (its independence test rejects both directions). The prediction of the electricity load is given in Fig. 7. It is interesting to note that there may exist more than one unmeasured variable, e.g., the season, causing different electricity loads at the same hour of day. Such unmeasured variables are successfully captured by CANM, as the predictions separate into upper and lower parts. Furthermore, the intermediate variable inferred by our method has a rather high correlation with the temperature, as shown in Fig. 8, which means that CANM recovers not only the information of the season but also that of the temperature.

Stock Market: The stock market dataset is collected in the Tübingen cause-effect benchmark (https://webdav.tuebingen.mpg.de/cause-effect/) as pairs 66-67. It contains the stock returns of Hutchison, Cheung Kong, and Sun Hung Kai Prop., with the causal relationships Hutchison $\to$ Cheung Kong and Cheung Kong $\to$ Sun Hung Kai Prop. The reason for the first pair is that Cheung Kong owns about 50% of Hutchison. For the second pair, Sun Hung Kai Prop., a typical stock in the Hang Seng Property subindex, is believed to depend on the major stock Cheung Kong. Similarly to the previous experiment, we are interested in whether we can identify that Hutchison's return is the cause of Sun Hung Kai Prop.'s return.

Figure 9: Stock return of Hutchison Against Stock return of Sun Hung Kai Prop.
Figure 10: Stock return of Cheung Kong Against Fitted Intermediate Variable.

Since these three stocks form a causal chain Hutchison $\to$ Cheung Kong $\to$ Sun Hung Kai Prop., using CANM we successfully identify the indirect causal direction, while ANM fails on this pair. Fig. 9 shows the prediction of the stock return of Sun Hung Kai Prop. We also find that the fitted intermediate variable has a high correlation with the stock return of Cheung Kong, as shown in Fig. 10.

5 Conclusion

In this paper, we proposed the cascade nonlinear additive noise model, an extension of the nonlinear additive noise model, to represent indirect causal influences that result from unmeasured intermediate causal variables. We have demonstrated that the independence between the noise and the cause is still generally helpful for determining the causal direction between two variables, as long as the cascade additive noise process holds. We proposed to estimate the model, together with the intermediate causal variables, within the variational auto-encoder framework, and the resulting likelihood indicates the asymmetry between cause and effect. As supported by our theoretical and empirical results, the proposed approach provides an effective method for determining the causal direction from data generated by nonlinear, indirect causal relations.

Acknowledgments

This research was supported in part by the NSFC-Guangdong Joint Fund (U1501254), the Natural Science Foundation of China (61876043), the Natural Science Foundation of Guangdong (2014A030306004, 2014A030308008), the Guangdong High-level Personnel of Special Support Program (2015TQ01X140), and the Pearl River S&T Nova Program of Guangzhou (201610010101). KZ would like to acknowledge the support by National Institutes of Health (NIH) under Contract No. NIH-1R01EB022858-01, FAINR01EB022858, NIH-1R01LM012087, NIH-5U54HG008540-02, and FAIN- U54HG008540, by the United States Air Force under Contract No. FA8650-17-C-7715, and by National Science Foundation (NSF) EAGER Grant No. IIS-1829681. The NIH, the U.S. Air Force, and the NSF are not responsible for the views reported here. We appreciate the comments from anonymous reviewers, which greatly helped to improve the paper.

References

  • [Bracewell and Bracewell1986] Ronald Newbold Bracewell. The Fourier transform and its applications. McGraw-Hill, New York, 1986.
  • [Bühlmann et al.2014] Peter Bühlmann, Jonas Peters, Jan Ernest, et al. Cam: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.
  • [Chen and Guestrin2016] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In SIGKDD, pages 785–794, 2016.
  • [Colombo et al.2012] Diego Colombo, Marloes H Maathuis, Markus Kalisch, and Thomas S Richardson. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pages 294–321, 2012.
  • [Gretton et al.2008] Arthur Gretton, Kenji Fukumizu, Choon H Teo, Le Song, Bernhard Schölkopf, and Alex J Smola. A kernel statistical test of independence. In Advances in neural information processing systems, pages 585–592, 2008.
  • [Heinze-Deml et al.2018] Christina Heinze-Deml, Marloes H. Maathuis, and Nicolai Meinshausen. Causal structure learning. Annual Review of Statistics and Its Application, 2018.
  • [Hoyer et al.2009] Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In NIPS, pages 689–696, 2009.
  • [Janzing et al.2009] Dominik Janzing, Jonas Peters, Joris Mooij, and Bernhard Schölkopf. Identifying confounders using additive noise models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 249–257. AUAI Press, 2009.
  • [Janzing et al.2012] Dominik Janzing, Joris Mooij, Kun Zhang, Jan Lemeire, Jakob Zscheischler, Povilas Daniušis, Bastian Steudel, and Bernhard Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1–31, 2012.
  • [Kingma and Welling2014] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR 2014), April 2014.
  • [Kocaoglu et al.2018] Murat Kocaoglu, Sanjay Shakkottai, Alexandros G Dimakis, Constantine Caramanis, and Sriram Vishwanath. Entropic latent variable discovery. arXiv preprint arXiv:1807.10399, 2018.
  • [Prestwich et al.2016] SD Prestwich, SA Tarim, and I Ozkan. Causal discovery by randomness test. In Proceedings of the 14th International Symposium on Artificial Intelligence and Mathematics, 2016.
  • [Shimizu et al.2006] Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
  • [Spirtes and Zhang2016] Peter Spirtes and Kun Zhang. Causal discovery and inference: concepts and recent methodological advances. Applied Informatics, 3(1):3, Feb 2016.
  • [Spirtes et al.2000] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT press, 2000.
  • [Yu et al.2018] K. Yu, L. Liu, J. Li, and H. Chen. Mining markov blankets without causal sufficiency. IEEE Transactions on Neural Networks and Learning Systems, 29(12):6333–6347, Dec 2018.
  • [Zhang and Hyvärinen2009] Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. In UAI, pages 647–655, 2009.
  • [Zhang et al.2010] K. Zhang, B. Schölkopf, and D. Janzing. Invariant gaussian process latent variable models and application in causal discovery. In 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), pages 717–724. AUAI Press, 2010.
  • [Zhang et al.2015] Kun Zhang, Zhikun Wang, Jiji Zhang, and Bernhard Schölkopf. On estimation of functional causal models: General results and application to the post-nonlinear causal model. ACM Trans. Intell. Syst. Technol., 7(2):13:1–13:22, December 2015.

Supplementary Material

A The Lower Bound for Cascade Nonlinear Additive Noise Model

B Proof of Theorem 1

Theorem 1 (restated).

Let $X \to Y$ follow the cascade additive noise model. If there exists a backward model of the same form, i.e.,

$$X = g(Y) + \tilde{N}, \qquad \tilde{N} \perp Y, \tag{S.1}$$

then the noise distribution of the reverse direction must be

$$p_{\tilde{N}}\big(x - g(y)\big) = \frac{p(x)\, p(y \mid x)}{p(y)},$$

where $g$ denotes the function implied by the cascade process.

Sketch of Proof:    If Eq. S.1 holds, then by the backward additive form we have $p(x \mid y) = p_{\tilde{N}}(x - g(y))$. Based on the derivation of the marginal log-likelihood in Eq. 2 of Section 2, under the CANM the forward conditional $p(y \mid x) = \int p_{N_{L+1}}\big(y - f(x, n)\big)\, p(n)\, dn$ is a convolution in $y$ of the last-layer noise density with the density of $f(x, N)$ for fixed $x$. Applying the Fourier transform to this conditional and making use of the convolution theorem, its transform factorizes into the product of the transforms of these two densities. Combining Eq. S.4 and Eq. S.5 and applying the inverse Fourier transform, we recover $p(y \mid x)$ in terms of the quantities implied by the cascade. Based on Bayes' theorem, $p(x \mid y) = \frac{p(x)\, p(y \mid x)}{p(y)}$, and we further have $p_{\tilde{N}}\big(x - g(y)\big) = \frac{p(x)\, p(y \mid x)}{p(y)}$, as claimed.

C Proof of Corollary 1

Corollary 1 (restated).

Assume that the CANM is linear Gaussian, i.e.,

$$Y = \beta X + N, \qquad X \sim \mathcal{N}(\mu_X, \sigma_X^2),$$

where $N \sim \mathcal{N}(0, \sigma_N^2)$; then there exists a backward CANM

$$X = \tilde{\beta} Y + \tilde{N},$$

where $\tilde{\beta} = \frac{\beta \sigma_X^2}{\beta^2 \sigma_X^2 + \sigma_N^2}$ and $\tilde{N}$ is Gaussian and independent of $Y$.

Proof.

Based on Theorem 1, the noise distribution in the reverse direction can be expressed as $p_{\tilde{N}}(x - \tilde{\beta} y) = \frac{p(x)\, p(y \mid x)}{p(y)}$. Based on Bayes' theorem, $p(y) = \int p(x)\, p(y \mid x)\, dx$, where $p(x)$ is the distribution of the cause. Without loss of generality, let $\mu_X = 0$. The derivation then uses the fact that the Fourier transform of the Gaussian density $\mathcal{N}(0, \sigma^2)$ is $e^{-\sigma^2 \omega^2 / 2}$: all densities involved are Gaussian, so the resulting products and convolutions remain Gaussian. Carrying out these Gaussian integrals and letting $\tilde{\beta} = \frac{\beta \sigma_X^2}{\beta^2 \sigma_X^2 + \sigma_N^2}$, we obtain that $\tilde{N} = X - \tilde{\beta} Y$ follows a Gaussian distribution and is independent of $Y$. ∎

D Proof of Corollary 2

Corollary 2 (restated).

Suppose that there are no unmeasured intermediate noises in the CANM. If the solution of Eq. 6 exists, then the triple $(f, p_X, p_N)$ must satisfy the differential equation from the ANM [Hoyer et al.2009, Theorem 1] for all $x, y$ with $\nu''\big(y - f(x)\big) f'(x) \neq 0$:

$$\xi''' = \xi'' \left( -\frac{\nu''' f'}{\nu''} + \frac{f''}{f'} \right) - 2 \nu'' f'' f' + \nu' f''' + \frac{\nu' \nu''' f'' f'}{\nu''} - \frac{\nu' (f'')^2}{f'},$$

where $\xi := \log p_X$ and $\nu := \log p_N$.

Proof.

Since there are no unobserved intermediate noises, based on Theorem 1 we have $p_{\tilde{N}}\big(x - g(y)\big) = \frac{p(x)\, p_N\big(y - f(x)\big)}{p(y)}$. Then, based on the Fourier inversion theorem, the existence of a solution of Eq. S.10 is equivalent to the existence of a backward additive noise model $X = g(Y) + \tilde{N}$ with $\tilde{N}$ independent of $Y$, which is the standard identifiability setting for the additive noise model. Applying [Hoyer et al.2009, Theorem 1], the triple $(f, p_X, p_N)$ must satisfy the following differential equation for all $x, y$ with $\nu''\big(y - f(x)\big) f'(x) \neq 0$:

$$\xi''' = \xi'' \left( -\frac{\nu''' f'}{\nu''} + \frac{f''}{f'} \right) - 2 \nu'' f'' f' + \nu' f''' + \frac{\nu' \nu''' f'' f'}{\nu''} - \frac{\nu' (f'')^2}{f'},$$

where $\xi := \log p_X$ and $\nu := \log p_N$. ∎
