Abstract

Data analysis for a sample of celestial bodies is generally preceded by a completeness test in order to verify whether the sample objects are proper representatives of the corresponding part of the universe. A data set following a multivariate, continuous, uniform distribution is said to be “complete in space.” This paper introduces a new approach to checking this completeness for any astronomical data set under a multivariate setup. Our proposed procedure, based on multiple tests of hypotheses using nonparametric statistics whose p values are subsequently combined, outperforms others from the literature.

1. Introduction

In astronomy, different catalogs from various sources are generally combined to create a master data set, which is supposed to be “complete in space.” Under the univariate setup, if a sample related to a particular astronomical parameter (variable), original or transformed [1], is uniform in distribution (continuous), then it is referred to as “complete in space.” In this context, the popular test of [1] has been proposed. However, it is restricted to univariate analysis; therefore, it cannot take into account the multivariate structure of the samples, and it provides only a point estimate of the concerned statistic. On the other hand, the other statistical tests implemented in astronomy to check for uniformity of a multidimensional sample involve comparison of each individual variable with a univariate uniform distribution, irrespective of the important dependence structure underlying the multivariate sample under analysis [2, 3]. Therefore, in this paper, we propose a new approach to investigate the completeness of a multivariate data set in space. To the best of our knowledge, it is the first multivariate test of hypothesis to check completeness for an astronomical sample. We discuss two nonparametric tests [4, 5] to check whether the data set follows the multivariate uniform distribution over the range $[0,1]^d$ (denoted by $U([0,1]^d)$) or not, where $d \ (\ge 2)$ is the dimension of the data set and $[0,1]^d$ is the $d$-times Cartesian product of the closed interval $[0,1]$. Any deviation of the given sample from $U([0,1]^d)$ will lead to the rejection of the claim that the data set is “complete in space.”

Establishing a test for multivariate uniformity becomes difficult for higher dimensions $d$, whereas the existing tests are either not well defined or not computationally feasible in high dimensions [6, 7]. In the literature, multiple goodness-of-fit tests are presented for checking the null hypothesis that a sample is from a specific multivariate distribution. However, only a few of them address the multivariate uniformity of a data set. Two popular tests among them are (i) the multivariate Kolmogorov–Smirnov test [8] and (ii) the test based on the empirical characteristic function [9, 10]. The multivariate empirical distribution function has jumps and discontinuities at various points apart from the sample observations, which makes it quite challenging to compute for large $d$. Therefore, an algorithm for the corresponding test statistic is as yet unavailable beyond low dimensions [6, 8]. Any distribution is characterized by its characteristic function, which is consistently estimated by its empirical version. However, computation of the test statistic and the critical value for the test based on the empirical characteristic function is very difficult for high-dimensional samples as well as for big data, which induces greater testing error [9–11]. Avoiding all these concerns, we propose a novel approach as follows.

In this work of Astrostatistics, we suggest a new testing procedure based on multiple nonparametric tests of hypotheses, where we check whether each individual marginal of the data set comes from the univariate, continuous, uniform distribution over the range $[0,1]$ (denoted by $U[0,1]$) or not. Here, we use the fact that if the given multivariate sample follows $U([0,1]^d)$, then all the marginals of the data set will be from $U[0,1]$, and vice versa. Our final decision is taken uniquely by properly combining the dependent multiple tests or their corresponding p values. With the advanced fashion of data collection, we focus on high-dimensional big data from the astronomical field (see [5, 12, 13], and references therein), where our data study shows that the proposed technique is effective and superior compared to its competitors.

This paper is organized as follows. The two proposed methods are described in Section 2. A simulation study is carried out in Section 3. Section 4 presents an application of our proposed tests to an astronomical data set. Finally, Section 5 concludes the paper.

2. Proposed Method

Our main objective is to investigate the completeness of a multivariate sample in space, which is done in terms of hypothesis testing. Suppose $\mathbf{X} = (X_1, \ldots, X_d)$ is a real-valued $d$-variate observation vector and we want to test whether it follows $U([0,1]^d)$ or not, that is, we test

$H_0: \mathbf{X} \sim U([0,1]^d) \quad \text{against} \quad H_1: \mathbf{X} \nsim U([0,1]^d),$

where “$\sim$” is used to mean following and “$\nsim$” not following. We perform our test using the given sample

$\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$

with size $n$.

2.1. Multiple Tests

The abovementioned hypothesis testing can be equivalently performed in terms of the following $d$ multiple tests, which are carried out in a univariate setup for each variable. Here, we implement the fact that if the given multivariate sample follows $U([0,1]^d)$, then all the marginals of the data set will be from $U[0,1]$, and vice versa. The dependent multiple tests are formulated as follows:

$H_{0j}: X_j \sim U[0,1] \quad \text{against} \quad H_{1j}: X_j \nsim U[0,1], \quad j = 1, \ldots, d.$

Then, each of the univariate multiple tests is carried out with the help of the popular nonparametric one-sample tests: (t1) the Kolmogorov–Smirnov test [14] and (t2) the Anderson–Darling test [15], to check whether the given sample for each dimension follows $U[0,1]$ or not. Acceptance of all $H_{0j}$'s for $j = 1, \ldots, d$ concludes with acceptance of $H_0$, whereas rejection of $H_{0j}$ for at least one $j$ causes rejection of $H_0$.
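The following is a minimal sketch (in Python, assuming scipy is available; the function name and defaults are illustrative, not from the paper) of this accept-all/reject-any decision rule applied marginal by marginal with the Kolmogorov–Smirnov test. The authors' final decision additionally combines the marginal statistics or p values, as described in Sections 2.2 and 2.3.

    import numpy as np
    from scipy import stats

    def complete_in_space_ks(X, alpha=0.05):
        """X: (n, d) array; marginal KS tests against U[0, 1].

        Returns (accept_H0, list of marginal p values)."""
        n, d = X.shape
        p_values = []
        for j in range(d):
            # One-sample Kolmogorov-Smirnov test of the j-th marginal vs U[0, 1]
            stat, p = stats.kstest(X[:, j], 'uniform')
            p_values.append(p)
        # H0 is accepted only if no marginal test rejects
        return all(p > alpha for p in p_values), p_values

    # Example on a synthetic multivariate uniform sample
    rng = np.random.default_rng(0)
    decision, p_vals = complete_in_space_ks(rng.uniform(size=(1000, 6)))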

(t1) We implement the univariate, nonparametric, distribution-free, one-sample Kolmogorov–Smirnov test of hypothesis to check whether the unknown continuous distribution function $F$ of a random variable is equal to a completely specified reference distribution $F_0$. This is done in terms of the test statistic

$D_n = \sup_{x} \left| F_n(x) - F_0(x) \right|,$

which involves the distance between the empirical distribution function $F_n$ computed from a random sample of size $n$ and the cumulative distribution function $F_0$ of the reference distribution. The null hypothesis is accepted if the computed test statistic is smaller than or equal to the upper $\alpha$ point of the distribution of the test statistic (equation (3)) under the null.

(t2) Then, we suggest the nonparametric, distribution-free Anderson–Darling test, which is a modification of the Kolmogorov–Smirnov test that assigns more weight to the tails of the distribution for the given sample. It tests whether a univariate sample comes from a population with a specific continuous distribution function $F_0$. When this is true, the transformed values $F_0(X_i)$ follow $U[0,1]$, and the sample is then tested for uniformity [16]. The test statistic is defined as

$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i-1) \left[ \ln F_0\!\left(X_{(i)}\right) + \ln\!\left\{1 - F_0\!\left(X_{(n+1-i)}\right)\right\} \right],$

where $X_{(1)} \le \cdots \le X_{(n)}$ are the order statistics of the sample; values of $A^2$ greater than its upper $\alpha$ point under the null hypothesis reject the null of uniformity against the both-sided alternative.
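Since scipy.stats.anderson does not cover the uniform case, a direct computation of $A^2$ from its definition, with $F_0(x) = x$ on $[0,1]$, can be used instead. The sketch below (our own illustration, not the authors' code) does this for a single marginal.

    import numpy as np

    def anderson_darling_uniform(x, eps=1e-12):
        """x: 1-D sample assumed to lie in [0, 1]; returns the A^2 statistic vs U[0, 1]."""
        x = np.sort(np.asarray(x, dtype=float))
        n = x.size
        u = np.clip(x, eps, 1.0 - eps)           # F0(x) = x for U[0, 1]; guard the logs
        i = np.arange(1, n + 1)
        s = np.sum((2 * i - 1) * (np.log(u) + np.log(1.0 - u[::-1])))
        return -n - s / n

    # Large values of A^2 indicate departure from uniformity; in practice the
    # cutoff can be taken from published tables or estimated by simulation.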

The influence of ties on (t2) varies depending on the characteristics and frequency of the ties present in the data. Ties can have a noticeable impact on the precision of the test and potentially affect the test results. The presence of ties disrupts the estimation of the distribution function, particularly in the tails of the distribution, which leads to inaccurate calculations of the test statistic and p value. If the number of ties is small or if they are evenly distributed across the data set, their impact on (t2) will be minimal. However, when there are numerous ties or when they cluster around specific values, the precision of the test can be compromised.

2.2. Test Statistics

Suppose the statistic for testing $H_{0j}$ against $H_{1j}$, carried out in terms of (t1) or (t2), is $T_j$ for $j = 1, \ldots, d$. Then, the critical region for the right-tailed alternative in the $j$-th one among the multiple tests is given by

$T_j > t_{j,\alpha^*},$

where $t_{j,\alpha^*}$ is the required upper $\alpha^*$ point of the sampling distribution of the proposed test statistic $T_j$.

Here, $\alpha^*$ is the nominal level of significance for each of our marginal tests, which we perform using (t1) the Kolmogorov–Smirnov test, wherein the asymptotic null distribution of the statistic is used [17], and (t2) the Anderson–Darling test with its asymptotic null distribution [18, 19]. We denote the statistic for testing $H_0$ against $H_1$ by $T$, where the test statistics from the multiple tests corresponding to the marginals are combined together with equal weights, which defines

$T = \frac{1}{d} \sum_{j=1}^{d} T_j,$

and subsequently we obtain the critical region $T > t_{\alpha}$, where $t_{\alpha}$ is the upper $\alpha$ point of the sampling distribution of $T$.

Thus, it is a right-tailed test, so the null hypothesis is rejected at the $\alpha$ level of significance if the observed value of $T$ based on the given sample is greater than $t_{\alpha}$. Being a data-driven test, the distribution of $T$ and the corresponding p value (discussed in the following section) are determined empirically.
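One way to obtain such an empirical null distribution of the combined statistic is by Monte Carlo simulation under $U([0,1]^d)$, as in the hedged sketch below (the names combined_stat and null_quantile, and the use of the KS marginal statistic, are our illustrative choices).

    import numpy as np
    from scipy import stats

    def combined_stat(X):
        """Equal-weight average of the marginal KS statistics of an (n, d) sample."""
        return np.mean([stats.kstest(X[:, j], 'uniform').statistic
                        for j in range(X.shape[1])])

    def null_quantile(n, d, alpha=0.05, n_rep=2000, seed=0):
        """Upper-alpha point of the combined statistic simulated under the null."""
        rng = np.random.default_rng(seed)
        sims = [combined_stat(rng.uniform(size=(n, d))) for _ in range(n_rep)]
        return np.quantile(sims, 1.0 - alpha)

    # Reject H0 at level alpha if combined_stat(X) > null_quantile(n, d, alpha).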

2.3. p Value Computation

To obtain the p value of our proposed test, we calculate the p value for the $j$-th marginal test as $p_j$, for $j = 1, \ldots, d$. Since the multiple tests are interdependent, so are their p values. There are various ways to combine these p values among themselves [20–22]; we combine them into a single p value, denoted by p (equation (9)).

The null hypothesis is rejected if the p value “p” computed from the data set is less than its upper point, say $p_{\alpha}$, which is estimated as $\hat{p}_{\alpha}$ by applying the bootstrap technique to the given sample. Thus, $H_0$ is rejected in favor of $H_1$ at the $\alpha$ level of significance if the computed value of $p < \hat{p}_{\alpha}$.
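As an illustration of combining the $d$ dependent marginal p values, the sketch below shows two common choices from the combination literature (equal-weight averaging and the Tippett-style minimum); the paper's specific combination, its equation (9), is not reproduced here, and the function names are ours.

    import numpy as np
    from scipy import stats

    def marginal_p_values(X):
        """KS p values of each marginal of an (n, d) sample against U[0, 1]."""
        return np.array([stats.kstest(X[:, j], 'uniform').pvalue
                         for j in range(X.shape[1])])

    def combine_mean(p):     # equal-weight average of the marginal p values
        return float(np.mean(p))

    def combine_min(p):      # Tippett-style minimum p value
        return float(np.min(p))

    # Because the marginal tests are dependent, neither combination has a standard
    # reference distribution; its cutoff is estimated from the data by bootstrap,
    # as outlined in Section 2.3 and sketched in Section 4.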

3. Simulation

The performance of our proposed testing technique is demonstrated in this section through an extensive simulation study, where we implement both the (t1) and (t2) tests separately. The scenarios from which the samples are drawn are (a) under an independence structure and (b) under a dependence structure. Case (a) has the correlation matrix given by the identity matrix $I_d$, that is, the correlation between the $i$-th and $j$-th components is 0 for $i \neq j$ and 1 for $i = j$. Thus, the correlation matrix is $I_d = \operatorname{diag}(1, \ldots, 1)$.

On the other hand, the dependence structure in (b) is induced in two distinct ways.

(b1) A nonidentity correlation matrix, with nonzero off-diagonal entries, is considered.

(b2) The latter way of generating random samples from the null setup under dependence is carried out through Clayton copula modeling [23, 24], by implementing the multivariate uniform distribution of Cook and Johnson [25], where the scalar parameter involved in the distribution is taken to be 2.

We compute the size and power under (a) and (b) for several choices of the dimension $d$ and the sample size $n$, as we focus on multivariate large astronomical data sets. Both size and power are estimated by Monte Carlo simulation with the number of replications equal to 10,000. The size is estimated as the proportion (out of 10,000) of replications with $H_0$ rejected when the simulated samples are originally drawn from the null distribution. Analogously, the powers are computed when the simulated samples do not come from the null distribution, where we consider the following setups; a code sketch for scenario (b2) is given after this list.

(a1) The multivariate beta (Dirichlet) distribution over the range $[0,1]^d$, with a fixed shape parameter vector and the scale parameter (beta) taken as 3 [26, 27].

(a2) The truncated multivariate normal distribution over the range $[0,1]^d$, with a fixed mean vector and correlation matrix [28, 29]. It is to be noted that these samples are drawn through a Gibbs sampler [21, 30] with a thinning of 10 (that is, every 10th observation is selected) to get rid of the autocorrelation present in the synthetic data.

(a3) The multivariate normal distribution with the same mean vector and correlation matrix as mentioned in (a2) [31, 32].

(a4) A non-null distribution with independent components.

(a5) Another non-null distribution with independent components [33, 34].
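The following is a hedged sketch of scenario (b2): null samples with $U[0,1]$ marginals coupled by a Clayton copula (parameter 2, generated via the Marshall–Olkin frailty construction), together with the empirical size computed as a Monte Carlo rejection proportion. The names rclayton and empirical_size, the placeholder test function, and the numeric choices are ours, not the paper's.

    import numpy as np

    def rclayton(n, d, theta, rng):
        """n draws from a d-dimensional Clayton copula with parameter theta."""
        v = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n, 1))   # shared frailty
        e = rng.exponential(size=(n, d))
        return (1.0 + e / v) ** (-1.0 / theta)   # each column is U[0, 1], Clayton dependence

    def empirical_size(test_rejects, sampler, n_rep=10000):
        """Proportion of null replications on which the supplied test rejects."""
        return float(np.mean([test_rejects(sampler()) for _ in range(n_rep)]))

    # Example wiring (illustrative n, d; my_test_rejects stands for any of the tests above):
    # rng = np.random.default_rng(2)
    # size_b2 = empirical_size(lambda X: my_test_rejects(X, alpha=0.05),
    #                          lambda: rclayton(500, 6, 2.0, rng))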

3.1. Competitor Tests

Several goodness-of-fit tests checking for multivariate uniformity, from Yang and Modarres [35], are considered as competitors: (i) the test based on normal quantiles and (ii) a set of tests based on interpoint distances, as discussed below.

3.1.1. Uniformity Test Based on Normal Quantiles

Suppose the random vectors $\mathbf{X}_i$, for $i = 1, \ldots, n$, constitute a random sample of size $n$ from a population of the random vector $\mathbf{X}$ characterized by a continuous multivariate distribution function $F$. We consider the transformation from $\mathbf{X}_i = (X_{i1}, \ldots, X_{id})$ to $\mathbf{Y}_i = (Y_{i1}, \ldots, Y_{id})$ given by

$Y_{ij} = \Phi^{-1}(X_{ij}), \quad j = 1, \ldots, d,$

where $\Phi$ is the cumulative distribution function of a standard normal distribution. The test statistic under study is a quadratic form in the transformed sample $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ [35].

Under the null setup,

$\mathbf{Y}_i \sim N_d(\mathbf{0}, I_d),$

where $I_d$ is the identity matrix of order $d$ and $N_d(\mathbf{0}, I_d)$ denotes a $d$-variate normal distribution with the null vector as the mean and the dispersion matrix $I_d$. It implies that each $Y_{ij} \sim N(0,1)$ and $\mathbf{Y}_i^{\top}\mathbf{Y}_i \sim \chi^2_d$ (a central chi-square distribution with $d$ degrees of freedom). Then, testing $H_0: \mathbf{X} \sim U([0,1]^d)$ is equivalent to testing that the transformed vectors $\mathbf{Y}_i$ follow $N_d(\mathbf{0}, I_d)$.

The null hypothesis is rejected at the $\alpha$ level of significance if the calculated test statistic exceeds its upper $\alpha$ point under the null.
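A hedged illustration of the normal-quantile idea follows: under $H_0$ the transformed rows are $N_d(\mathbf{0}, I_d)$, so their squared norms are chi-square with $d$ degrees of freedom. The check below uses a KS test of the squared norms against $\chi^2_d$ as a simple surrogate; it is not the exact statistic of Yang and Modarres [35], and the function name is ours.

    import numpy as np
    from scipy import stats

    def normal_quantile_check(X, eps=1e-12):
        """p value of a surrogate check based on the normal-quantile transformation."""
        n, d = X.shape
        Y = stats.norm.ppf(np.clip(X, eps, 1.0 - eps))   # Phi^{-1} applied entrywise
        sq_norms = np.sum(Y ** 2, axis=1)                # ~ chi2(d) under H0
        return stats.kstest(sq_norms, 'chi2', args=(d,)).pvalue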

3.1.2. Uniformity Test Based on Interpoint Distances

For a given sample of real-valued vectors $\mathbf{X}_1, \ldots, \mathbf{X}_n$ on $[0,1]^d$, we use a test based on the first two moments of the interpoint distances [36, 37]. The moments and the distribution of the interpoint distances between multivariate Bernoulli random vectors are investigated by Modarres in [38], whereas the asymptotic properties of the small interpoint distances in a sample are introduced by Jammalamadaka and Janson [39]. The test discussed in this section uses the asymptotic distribution of the sample mean and the sample variance of all interpoint distances.

With $D_{ik} = \lVert \mathbf{X}_i - \mathbf{X}_k \rVert$ denoting the Euclidean distance between the $i$-th and $k$-th observations, the sample mean and the sample variance of the interpoint distances are respectively expressed as

$\bar{D} = \binom{n}{2}^{-1} \sum_{i < k} D_{ik}, \qquad S_D^2 = \binom{n}{2}^{-1} \sum_{i < k} \left( D_{ik} - \bar{D} \right)^2,$

where their corresponding expectations under the null are available in closed form [35].

Under the null $H_0$, the respective variances of $\bar{D}$ and $S_D^2$ are derived by Yang and Modarres [35].

The central limit theorem for U-processes (Arcones and Giné [40]) says that, under the null, the standardized versions of $\bar{D}$ and $S_D^2$ are asymptotically normal, as the first two order moments are asymptotically independent of each other [41, 42]. Any of the resulting standardized statistics, or their combination (see equations (22) and (23)), may be regarded as our test statistic. The null hypothesis is rejected in favor of the two-sided alternative for large values of the statistic, which is done at the $\alpha$ level of significance if the calculated value of the test statistic is larger than its upper $\alpha$ point under the null.
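The sketch below computes the basic ingredients of these competitors, the pairwise Euclidean interpoint distances and their sample mean and variance; standardizing them against the null moments of Yang and Modarres [35] (not reproduced here) yields the test statistics. The function name is ours.

    import numpy as np
    from scipy.spatial.distance import pdist

    def interpoint_summaries(X):
        """Sample mean and variance of the n(n-1)/2 pairwise Euclidean distances."""
        d_ij = pdist(X)                      # condensed vector of pairwise distances
        return d_ij.mean(), d_ij.var(ddof=1)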

3.2. Results

In the simulation study, we fix the nominal level of significance $\alpha$. Tables 1–3 show that the estimated sizes, for samples from the null distributions under (a), (b1), and (b2), all come out close to the nominal level of significance with both the proposed tests (t1) and (t2), for all considered values of $d$ and $n$.

To address ties in (t2), the averaging technique has been used. The averaging technique is a tie-breaking method that resolves tied observations by working with the average of the tied values. Moreover, as our simulated data sets are drawn from the null setup, they contain an insignificant number of ties in each marginal. Hence, the original data set with ties and the modified data set, where ties are resolved using the tie-breaking technique, are almost alike. By averaging the tied observations, (t2) provides more accurate results and a better estimate of the distribution function.

As competitors, we consider the four tests discussed in Section 3.1, referred to by their respective test statistics, and we first investigate their empirical sizes under all the conditions considered for our proposed tests. The tables for the competitor tests show that, among all the competitors, only the test based on normal quantiles attains its nominal size for the (a) independent samples, whereas it fails under the more sophisticated multivariate structures (b1) and (b2). However, it can be deemed a rival against which to compare the performance of our proposed tests.

Just like Tables 1–3, we have also computed the size values of the test (t2), the competitor test based on normal quantiles, and the competitor tests based on the interpoint distance statistics.

Both the proposed tests (t1) and (t2), for the samples from the non-null distribution (a2), exhibit a power that increases with $d$ and/or $n$. The maximum power for (t1) comes out to be 0.542114 (Table 4), whereas (t2) has its highest power of 0.586738 (Table 5). For every choice of $d$ and $n$, the powers of the tests are optimally good, with a value of 1, for samples from each of the other non-null distributions (a1, a3–a5) under consideration.

The power estimated for the first competitor test comes out to be very low under the non-null distribution (a2). However, it gradually increases with an increase in $d$ as well as $n$ (Table 6), with the largest value being 0.2285. Its empirical power takes the value 1 under the distribution (a1). Thus, we comment that, in this situation, our proposed technique with both tests is competitive with this rival. Here, the test statistic involved in the rival is based on the normal quantile transformation, which requires sample values in $[0,1]$; therefore, among all non-null distributions (a1–a5), only the Dirichlet distribution (a1) and the truncated multivariate normal distribution (a2) are considered for the power calculation of this test, as those sample values lie in $[0,1]^d$.

For the latter set of competitors, based on the interpoint distance measures, the empirical power under the non-null distributions (a1) and (a2) increases in $d$ and $n$ and reaches 1 for most values of the pair $(d, n)$ (see Tables 7–10). For the samples from (a3–a5), the powers all attain 1. In spite of this optimal power performance, the use of these tests for identifying “completeness in space” is highly questionable due to their drastic failure to satisfy the size condition, even for the multivariate uniform distribution under independence.

4. Application

We apply our proposed technique to an observed data set in space obtained from the NEWFIRM Medium Band Survey (NMBS). The data set from the NMBS catalog comes in two versions for the photometric samples: the original SExtractor output and a catalog with additional deblending. We consider the first version, which contains the photometric redshifts and rest-frame colors from EAZY and the stellar population synthesis (SPS) variables from FAST using the Bruzual and Charlot [43] models. Here, we study the early type galaxies (ETGs) [44] from the AEGIS 1 catalog, whose redshifts range from 0.5 to 4. As our interest is to study the intrinsic properties of the galaxies, we consider the following parameters (variables) that remain invariant with the change in distance: (i) the ellipticity and (ii) the half-light radius, both measured in a single photometric band, (iii) the redshift of the galaxies, (iv) log(age/yr), (v) log(mass/M$_{\odot}$), and (vi) log(specific star formation rate $\times$ yr).

Our data set consists of the abovementioned variables on 6,661 ETGs. We apply our technique, in terms of the proposed two tests, to investigate whether the data set is “complete in space.” Here, for an observed variable $x$, we consider the transformation to $y$ given by

$y = \frac{\ln|x| - \min(\ln|x|)}{\max(\ln|x|) - \min(\ln|x|)},$

where $|x|$ is the absolute value of $x$, $\ln$ is the natural logarithmic function with base $e$, and the maximum and minimum are taken over the observed values of the variable. This transformation is applied to each of the 6 original variables in such a way that the ranges, under the null hypothesis, remain the same in the transformed space.
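Assuming the transformation takes the min-max rescaled form of $\ln|x|$ reconstructed above, a per-variable sketch is as follows (the function name is ours; zero-valued observations would require an additional guard before taking the logarithm).

    import numpy as np

    def to_unit_interval(x):
        """Map an observed variable x to [0, 1] via ln|x| followed by min-max scaling."""
        z = np.log(np.abs(np.asarray(x, dtype=float)))   # assumes nonzero observations
        return (z - z.min()) / (z.max() - z.min())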

We now implement our tests in terms of the p value (see (9)), where we obtain $\hat{p}_{\alpha}$ (see Section 2.3) through the nonparametric bootstrap technique (Modak and Bandyopadhyay (2018)) as follows; a code sketch of this calibration is given after the list.

(i) For each of the multiple tests, we perform bootstrapping individually.

(ia) Bootstrap samples are drawn from the given data set and used to compute the bootstrap p values for the $j$-th marginal test.

(ib) The upper point for the $j$-th marginal test is estimated using the sampling distribution of the computed bootstrap p values from step (ia).

(ii) Redo steps (ia)-(ib) for each of the remaining marginals.

(iii) For the given data set, the null hypothesis of multivariate uniformity is rejected at the $\alpha$ level of significance if the computed p value falls below the estimated cutoff $\hat{p}_{\alpha}$.
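The sketch below is one plausible reading of steps (i)-(iii): bootstrap resamples of the data yield replicates of the marginal p values, which are combined (illustratively, by their mean) and used to estimate the cutoff $\hat{p}_{\alpha}$ as a quantile of the bootstrap combined p values. The paper's exact estimator appears in its own equations and is not reproduced here; the function names are ours.

    import numpy as np
    from scipy import stats

    def combined_p(X):
        """Illustrative combination: mean of the marginal KS p values vs U[0, 1]."""
        return float(np.mean([stats.kstest(X[:, j], 'uniform').pvalue
                              for j in range(X.shape[1])]))

    def bootstrap_cutoff(X, alpha=0.05, B=1000, seed=0):
        """Quantile-based cutoff for the combined p value from B bootstrap resamples."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        boot = [combined_p(X[rng.integers(0, n, size=n), :]) for _ in range(B)]
        return float(np.quantile(boot, 1.0 - alpha))   # "upper point" of the bootstrap p values

    # Decision: reject multivariate uniformity if combined_p(X) < bootstrap_cutoff(X).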

Based on our procedure, the tests (t1) and (t2) both produce a p value of zero, below their respective cutoff values. Therefore, in the light of the given sample, we reject the null hypothesis at the $\alpha$ level of significance and conclude that the sample does not come from a $U([0,1]^6)$ distribution and hence is not “complete in space.” Moreover, the data set under consideration does not have any ties present in the marginals; hence, no tie-breaking technique (such as the averaging technique of Section 3.2) is required to eradicate the ties from the data set before the application of (t2) to the marginals.

Here, we cross-check our results with the popular and very classical test [1] from the astronomy fraternity. It calls a univariate data set “complete in space” if the mean of the study variable, scaled by its maximum, equals 0.5. However, it is not a statistical test of an appropriate hypothesis; rather, it provides only a point estimate. Moreover, for multivariate data, only the marginal means are determined independently by this procedure. Anyway, the computed values corresponding to the 6 study variables are 0.6136369, 0.3906980, 0.4305935, 0.4915570, 0.5898555, and 0.1268829, respectively. This shows that the mean value for only 1 among the 6 marginals is close to 0.5, whereas 3 of the others are less than 0.5 and 2 are greater. Therefore, the outcome of rejecting the null distribution obtained by our method is supported by this well-known test.
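For completeness, a tiny sketch of this classical check as described above follows; X is assumed to be the array of the six study variables, and the function name is ours.

    import numpy as np

    def classical_ratio_means(X):
        """Per-variable mean of x / max(x); values near 0.5 suggest completeness."""
        X = np.asarray(X, dtype=float)
        return X.mean(axis=0) / X.max(axis=0)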

5. Conclusion

This paper checks the completeness of multivariate astronomical samples, implementing our novel approach. The advised procedure, using the two tests (t1) and (t2), has been shown to perform well with the help of multiple tests of hypotheses and the subsequent combination of the results of the dependent marginal tests. A few characteristics of our technique are listed below.

(1) If an astronomical data set is from a continuous multivariate uniform setup, then it is said to be “complete in space,” and vice versa. Our test is the first, to the best of our knowledge, to check for completeness of an astronomical sample in the multivariate setup.

(2) Our approach, although proposed and analyzed for checking multivariate uniformity, can be used for any other arbitrary, continuous, multivariate distribution.

(3) We have used two univariate, nonparametric, one-sample tests: (t1) Kolmogorov–Smirnov and (t2) Anderson–Darling, to check for uniformity of the data set corresponding to each of the marginals. However, any other test appropriate for use on the multiple tests individually can be implemented analogously. All the shortcomings of the (t1) and (t2) tests have been taken into consideration before their application.

(4) The proposed tests' efficiency, supremacy, and wide applicability for high-dimensional, big data sets are demonstrated through an extensive data study.

(5) Our proposed test is established as an efficient method in astronomy for the objective under analysis.

In the near future, we are planning to develop a new test based on the regression analysis to check for completeness of astronomical samples.

Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The author Prasenjit Banerjee acknowledges the research fellowship provided by the University Grants Commission, India, with UGC Id. NOV2017-422665 and UGC reference no. 1141/(CSIR-UGC Net June 2017).