I seem to be missing something about what you are trying to do.
I'm not sure that I understand what you are saying about the pop size in
a tract not influencing the typology of tracts. Usually the
variables for a clustering are put on some kind of similar scale, such as
z-scores or percentages. I have thought of, but not used, controlling for
size by using regression residuals.
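
To make those three scaling options concrete, here is a minimal sketch in
Python with NumPy. The counts, the column meanings, and the function names
are all made up for illustration; the residual version assumes a simple
linear regression on total tract size, which is only one way to do it.

```python
import numpy as np

def to_percentages(counts):
    """Convert each row of raw counts to row percentages (a tract's composition)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return 100.0 * counts / totals

def zscores(x):
    """Standardize each column to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

def size_residuals(y, size):
    """Residuals of y after a simple linear regression (with intercept) on size."""
    y = np.asarray(y, dtype=float)
    size = np.asarray(size, dtype=float)
    design = np.column_stack([np.ones_like(size), size])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

# Hypothetical data: rows are tracts; columns are owner-occupied, rented, other.
counts = np.array([[120, 60, 20],
                   [30, 150, 20],
                   [400, 500, 100],
                   [10, 5, 5]])

pct = to_percentages(counts)          # removes size, keeps profile shape
z = zscores(pct)                      # puts the columns on a common scale
resid = size_residuals(counts[:, 1],  # rented units with tract size partialled out
                       counts.sum(axis=1))
```

Either `z` or a matrix of residuals like `resid` could then be passed to
whatever clustering routine you use.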
Think of a "type" or "class" as a group of tracts that have a
particular type of profile. Clustering tries to find groups of
cases (tracts) that are very similar to each other, and distinct from
the cases assigned to other groups. Profiles can differ in shape,
elevation, and scatter. You need to consider how tracts are
designated in the first place. Often there is some sense of homogeneity
in designating tracts. My guess is that tracts with many people
will mostly be apartments.
In the past, weighting cases has not usually made sense in geopolitical
In clustering of geopolitical areas, you usually have all of the pop of
Further, you are not trying to estimate some parameters of a population.
One result of a clustering is a nominal-level variable that is
useful in other analyses (i.e., clustering as a data reduction technique).
Hierarchical methods have only been of value in my work as "slices"
across the tree at different levels. What has been of most use to me
recently in geopolitical clustering and data mining work has been SPSS's
new TWO-STEP clustering approach.
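
As an illustration of taking "slices" across a tree, here is a minimal
sketch in Python using SciPy's hierarchical clustering. It is not the SPSS
TwoStep procedure. The data are randomly generated around three made-up
centers, and the use of Ward's minimum-variance linkage is an assumption
chosen for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical standardized profiles: 30 tracts, 3 variables, drawn around
# three made-up centers so the tree has some structure to find.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [0.0, 5.0, 5.0]])
profiles = np.vstack([c + rng.normal(scale=0.5, size=(10, 3)) for c in centers])

# Build the tree once with Ward's minimum-variance criterion.
tree = linkage(profiles, method="ward")

# Take "slices" across the tree: one flat labeling per requested level.
# With criterion="maxclust", each slice has at most k clusters.
slices = {k: fcluster(tree, t=k, criterion="maxclust") for k in (2, 3, 4)}

for k, labels in slices.items():
    # fcluster labels start at 1, so drop the empty zero bin.
    print(k, np.bincount(labels)[1:])
```

Each entry of `slices` is a nominal variable with one label per tract, so
the solutions at different levels can be cross-tabulated against each other
or against outside variables.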
Since the time when I did typologies of counties at the US Census Bureau
in the 70's, more work has been done on clustering compositional data.
I have been meaning to look into this for my work on finding profiles of
US Congressional Districts, but have not gotten to it yet.
Hope this helps.
Social Research Consultants
University Park, MD USA
Moritz Lennert wrote:
>Fionn Murtagh said:
>>Normal situation for clustering used with correspondence analysis. Code
>>in Java is at http://astro.u-strasbg.fr/~fmurtagh/mda-sw I will put
>>code in C for the clustering there soon, and in R for all - corresp.
>>analysis and hier. clustering (weights on cases, min. var. criterion,
>>reciprocal nearest neigh. algorithm) .
>In order to (hopefully) make my question clearer, here is an explanation
>of what we are doing currently:
>We have the census tracts of the city of Brussels. We have a series of
>data concerning the housing market in each census tract (type of
>ownership, number of rented apartments, etc). In order to put a bit of
>order into this information, we would like to run a cluster analysis to
>identify different types of ownership/housing structures.
>My question stems from the fact that the total population of housing units
>differs quite strongly from one census tract to the other. We do not want
>tracts with small populations to have the same influence on the types as
>tracts with large populations. Thus the idea of weighting each census
>tract according to its population.
>You seem to be saying that this is a standard situation when clustering
>results of correspondence analyses, but is this used in general
>agglomerative clustering algorithms?