OPTIMAL ESTIMATIONS FOR COMPILING SYNTHETIC MAPS
Oleg K. Mironov
Institute of Geoenvironmental Geosciences
Russian Academy of Sciences, Moscow
Ulansky per., 13, Moscow, 101000, Russia
e-mail: okm@geoenv.msk.su
ABSTRACT
The problem of compiling synthetic maps is considered from the point of view of exploratory data analysis. A procedure of optimal estimation for qualitative cartographic information is proposed.
Tasks of compiling synthetic maps on the basis of various maps of the same region are typical in ecological mapping and other applications.
Most known examples of solving this problem with GIS techniques use logical operations on the data shown on the source thematic maps (layers).
The corresponding geometric operations are the intersection of regions on the source maps (overlay) and the superposition of various thematic layers.
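In raster form the overlay operation described above can be sketched in a few lines: two thematic layers over the same grid are combined cell by cell, and every distinct pair of source classes becomes one intersection region on the synthetic map. The layer contents below are invented purely for illustration.

```python
# Two hypothetical thematic layers over the same 2x3 grid.
soils = [
    ["clay", "clay", "sand"],
    ["clay", "loam", "sand"],
]
landuse = [
    ["urban", "field", "field"],
    ["urban", "field", "forest"],
]

def overlay(layer_a, layer_b):
    """Combine two equally sized label rasters cell by cell."""
    return [
        [(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(layer_a, layer_b)
    ]

combined = overlay(soils, landuse)
# Each distinct (class_a, class_b) pair is one region of the synthetic map.
classes = sorted(set(c for row in combined for c in row))
print(len(classes))  # number of intersection classes actually present -> 5
```

Even in this toy case, two layers with three legend items each already yield five distinct intersection classes out of six cells, which illustrates how quickly the number of regions grows with more layers.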
The success of compiling a synthetic map by the overlay method depends on two conditions:
If there are numerous source layers and/or a large total number of ranges in their legends, the number of intersection regions becomes very large and their sizes very small. Their classification becomes a difficult and non-obvious task. As a rule, in such a situation users (cartographers) lack good insight for the classification and do not know exactly which algorithm for compiling a synthetic map should be executed. In this case it is necessary first to analyze the source data, to establish the presence of correspondences, regularities and structures in them, and only then to compile a synthetic map.
Traditionally, data analysis methods are applied to experimental data either to check statistical hypotheses (confirmatory analysis) or to find a compact and intuitively understandable description of the data structure at a preliminary stage of a study (exploratory analysis). The most frequently used methods are principal component analysis, discriminant, regression and cluster analysis, and projection pursuit.
In the majority of the mentioned methods the source information is supposed to be quantitative. Therefore in cartography they are effective for the analysis of remote sensing or point sampling data (for example, geochemical).
Attempts to apply data analysis directly to traditional maps meet difficulties connected with the non-quantitative nature of the mapped information. Even for maps of the distribution of quantitative data, e.g. isoline maps of a relief, taking a value at a point of the map requires additional modelling (creating a digital elevation model). These procedures are technologically complicated and cannot supply adequate accuracy, as the error of modelling adds to the errors of compiling, generalization and digitization of the map. For a justified analysis it is necessary to involve primary data (including maps of a larger scale). At the same time, data in an ordinal scale (ranges of a quantitative scale) can easily be obtained directly from a map.
For the analysis or direct evaluation of qualitative map data, cartographers use various digitization procedures. One of the most frequently applied (and the least justified) is the method of expert scores. A score defined by an expert is attached to each item of the source map legends, and regions on the synthetic map are defined by selecting ranges for the sum of scores. The algorithm has an easy software implementation. As a rule, the correspondence between scores for heterogeneous thematic layers is arbitrary, except for rare cases when the scores have an interpretation, e.g. in monetary terms. In other situations the application of the method of expert scores resembles an adjustment to a predetermined answer rather than serious research.
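The expert-score procedure just described can be sketched as follows. The scores and classification ranges here are invented for illustration; their arbitrariness is exactly the weakness the method is criticized for.

```python
# Hypothetical expert scores attached to legend items of two source layers.
scores = {
    ("soils", "clay"): 1, ("soils", "loam"): 2, ("soils", "sand"): 3,
    ("landuse", "urban"): 3, ("landuse", "field"): 1, ("landuse", "forest"): 0,
}

def point_score(point_classes):
    """Sum the expert scores of the legend items covering one map point."""
    return sum(scores[(layer, grade)] for layer, grade in point_classes.items())

def classify(total, ranges=((0, 2, "low"), (2, 4, "medium"), (4, 9, "high"))):
    """Assign a synthetic-map class by the range the score sum falls into."""
    for lo, hi, label in ranges:
        if lo <= total < hi:
            return label

p = {"soils": "clay", "landuse": "urban"}
print(classify(point_score(p)))  # clay(1) + urban(3) = 4 -> "high"
```

Nothing in the procedure itself constrains the choice of the score values: replacing the numbers above with any others produces an equally "valid" synthetic map.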
Multiple correspondence data analysis allows one to select the best, in a certain sense, values for the point estimates. These estimates can be used for compiling synthetic maps and for the analysis of correlations between the entities reflected on the maps.
The source information for the procedure is a set of thematic layers with various partitions of one region into non-intersecting regions. All information is supposed to be given in a nominal (qualitative) scale. A priori information (the technique of compiling the source layers, information about possible connections between the mapped entities) is not involved.
If we set aside the definition of the individual scores, the outcome of the expert score method is a numerical estimate for each point of the map, and the synthetic map shows the distribution of its values. The task of optimal multidimensional digitization of cartographic data consists in determining numerical estimates connected to the input data in the best way.
The optimum criterion is defined as the sum of the multiple correlation coefficients of the quantitative estimates over all points of the map and all source layers. The correlation coefficient of a numerical index, given at the points of the map, with one layer is the ratio of the intergroup sum of squares to the total sum of squares of the index values, where the partition into classes is set by the gradations of the layer.
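The per-layer criterion described above is the classical correlation ratio (intergroup sum of squares over total sum of squares, with groups given by the legend classes of the layer). A minimal sketch, with invented data:

```python
def correlation_ratio(values, classes):
    """Intergroup sum of squares / total sum of squares of `values`,
    where the grouping is the partition given by `classes` (one layer's
    gradations). Ranges from 0 (no connection) to 1 (index constant
    within each gradation)."""
    n = len(values)
    mean = sum(values) / n
    total_ss = sum((v - mean) ** 2 for v in values)
    groups = {}
    for v, c in zip(values, classes):
        groups.setdefault(c, []).append(v)
    between_ss = sum(
        len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values()
    )
    return between_ss / total_ss

# A numerical index at six map points, and one layer's gradation at each point.
index = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
layer = ["a", "a", "a", "b", "b", "b"]
print(round(correlation_ratio(index, layer), 3))  # -> 0.989
```

The full optimum criterion of the text is then the sum of such ratios over all source layers (and, in the multidimensional case, over all indexes).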
The task of optimal multidimensional digitization is dual to that of cluster analysis. Both tasks consist in finding an optimal correspondence between quantitative and qualitative data, and the criterion is set by the same formula. For the task of optimal digitization many partitions into classes are given, and one quantitative index is required. For the task of cluster analysis many numerical indexes are given, and an optimal partition into classes is needed.
As attempts to find one numerical index satisfactorily explaining the differences between the various layers as a rule fail, it is reasonable to pose the task of determining integrated indexes as a multidimensional one. The optimum criterion is the sum of the individual optimum criteria for each numerical index. The various numerical indexes are assumed to be uncorrelated (orthogonal).
The task of multidimensional optimum digitization is solvable, and the solution algorithm allows an effective software implementation in both vector and raster formats of cartographic data. The necessary operations are finding the areas of pairwise intersections of regions and the eigenvectors of a symmetric matrix.
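The two operations just named can be combined in a toy sketch; this is an illustration of the general multiple-correspondence machinery, not necessarily the paper's exact algorithm. For a tiny raster the "areas of pairwise intersections of regions" reduce to co-occurrence counts of gradations, collected in a symmetric (Burt) matrix, and the optimal factors come from the eigenvectors of its normalized form. The cell labels are invented.

```python
import numpy as np

# Six raster cells, each carrying one gradation from each of two layers.
cells = [("clay", "urban"), ("clay", "urban"), ("clay", "field"),
         ("sand", "field"), ("sand", "forest"), ("loam", "field")]
grades = sorted({g for cell in cells for g in cell})
col = {g: j for j, g in enumerate(grades)}

# Indicator matrix: one row per cell, one column per gradation of any layer.
Z = np.zeros((len(cells), len(grades)))
for i, cell in enumerate(cells):
    for g in cell:
        Z[i, col[g]] = 1.0

B = Z.T @ Z              # symmetric matrix of pairwise intersection areas
d = np.sqrt(np.diag(B))  # square roots of the areas (counts) per gradation
S = B / np.outer(d, d)   # symmetric normalization of the area matrix
eigvals, eigvecs = np.linalg.eigh(S)  # ascending eigenvalue order
# The largest eigenvalue (= the number of layers) belongs to a trivial
# constant solution; the next eigenvector gives the first non-trivial
# optimal factor for every gradation.
factors = eigvecs[:, -2] / d
print(dict(zip(grades, np.round(factors, 3))))
```

In vector formats the counts in B would be replaced by actual polygon intersection areas; the eigenvector step is unchanged, which is why the algorithm works in both representations.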
Omitting mathematical details, the solution can be presented as follows. For each number of indexes there are values of the optimal factors, defined for each gradation on each layer. The solution has the following interesting properties: 1) the optimal estimate at each point of the map is the sum of the optimal factors of the gradations describing this point; 2) the optimal factor of a gradation is proportional to the sum of the optimal estimates over the points of the map described by this gradation.
Property 1) shows that, from a computing point of view, optimal multidimensional digitization is similar to the method of expert scores, but the values of the estimates are not chosen at the arbitrariness of an expert; they are calculated by an optimization procedure. Property 2) shows that the optimal factors, in turn, depend on the optimal territorially distributed estimates. Therefore the proximity or distance of the values of the optimal factors for gradations may be interpreted (in a qualitative sense) as the proximity or distance of the mapped entities themselves. If an interpretation can be assigned to the optimal factors of the gradations, the map of the distribution of the optimal estimates may be explained too.
The optimal factors for the gradations can be used for the classification of connected or anomalous groups. As the value of the optimal factor of a gradation is proportional to the sum of the optimal estimates over the points of the map described by this gradation, gradations with close regions should be characterized by close values of the optimal factors, and vice versa. It is also reasonable to consider two-dimensional scatterplots of the projections of the gradations onto subspaces spanned by pairs of factors. The proximity of points on such a scatterplot can testify to a likeness or correlation of the corresponding informative concepts. It is possible to compile maps of the distribution of the optimal factors as integrated indexes. When the optimal gradation factors admit an explanation, these maps can also be interpreted.
Although the algorithm does not take the order of gradations into account, in almost all examples known to the author in which such an order exists, the values of the first optimal factor for the gradations turn out to be ordered as well. Moreover, for almost all layers the increase or decrease of the first factor indicates a correlation with some informative indications (depending on the concrete data). These examples testify to the effectiveness of the described approach.