CLASS-L Archives

January 2004


Moritz Lennert <[log in to unmask]>
Reply To:
Classification, clustering, and phylogeny estimation
Mon, 26 Jan 2004 18:38:37 +0100
Art Kendall said:
> I seem to be missing something about what you are trying to do.
> I'm not sure that I understand what you are saying about the pop size in
> a tract not influencing the typology of tracts.  Usually the
> variables for a clustering are put on some kind of similar scale such as
> z-scores or percentages. I have thought of, but not used, controlling for
> size by using regression residuals.

What I mean is that, given a data matrix based on census tracts which
contain different numbers of housing units, we would like to use
a standardized version of this matrix for calculating distances, weighting
each tract by a given amount (in our case the number of housing units in
the tract). Up to now we have used the following formula to do this:

dist(a,b) = euclidean_distance(a,b) * (number of units in a * number of
units in b) / (number of units in a + number of units in b)

Where a and b are vectors representing census tracts. Thus, at equal
euclidean distance, a pair of tracts with a high number of housing units
each will be counted as more distant from each other than a pair of tracts
with a low number of units each.
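
As a minimal sketch, the distance formula above could be written in Python as follows. The function name weighted_distance, the argument names, and the sample values are illustrative assumptions, not part of our actual code.

```python
import math

def weighted_distance(a, b, units_a, units_b):
    """Euclidean distance between tract vectors a and b, scaled by
    units_a * units_b / (units_a + units_b) as in the formula above."""
    euclid = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return euclid * (units_a * units_b) / (units_a + units_b)

# Hypothetical tracts: same Euclidean distance (5.0), different unit counts.
small_pair = weighted_distance([0.0, 0.0], [3.0, 4.0], 10, 10)
large_pair = weighted_distance([0.0, 0.0], [3.0, 4.0], 100, 100)
print(small_pair, large_pair)  # the pair of large tracts counts as more distant
```

With these sample values the pair of large tracts is ten times as "distant" as the pair of small tracts, although the Euclidean distance is identical.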

Then, once we have identified the smallest distance on the basis of the
above formula, we calculate the new cluster centroid (clust(aUb)) in the
following way:

For each member i of the vector clust(aUb):

clust(aUb)[i]=a[i]*(number of units in a) + b[i]*(number of units in b)

Thus, the resulting centroid is a weighted combination of the two, lying
closer to whichever of a and b has the larger number of housing units.
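
A minimal Python sketch of this centroid update follows. It assumes that the weighted sum is divided by the total number of units in a and b; the formula as written above does not show that division, but without it the result would not lie between a and b. The function name merged_centroid and the sample values are illustrative only.

```python
def merged_centroid(a, b, units_a, units_b):
    """Unit-weighted centroid of two tract (or cluster) vectors.

    Assumption: the weighted sum is normalized by units_a + units_b,
    which the formula in the message does not state explicitly.
    """
    total = units_a + units_b
    return [(x * units_a + y * units_b) / total for x, y in zip(a, b)]

# Hypothetical example: b has three times as many units as a,
# so the merged centroid lies three quarters of the way towards b.
centroid = merged_centroid([0.0, 0.0], [10.0, 20.0], 100, 300)
print(centroid)
```

The merged cluster would then carry units_a + units_b as its own weight in later merges.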

The combination of these two formulas is what we mean by weighting the
"influence" of each tract by its unit population.

The general idea behind this is that we believe that small numbers of units
can create more arbitrarily diverse situations and that by weighting in
this manner, we can avoid some of this arbitrariness.

> Think of a "type" or "class" as a group of tracts that have a
> particular type of profile.  Clustering tries to find groups of
> cases (tracts) that are very similar to each other, and distinct from
> the cases assigned to other groups.  Profiles can differ in shape,
> elevation, and scatter.  You need to consider how tracts are
> designated in the first place. Often there is some sense of homogeneity
> in designating tracts. My guess is that tracts with many people or
> housing units will mostly be apartments.

I will have to check this, but I'm not sure if this is true in such a
linear way.

> In the past, weighting cases has not usually made sense in geopolitical
> clustering applications.
> In clustering of geopolitical areas, you usually have all of the pop of
> entities.
> Further, you are not trying to estimate some parameters of a population.
> One result of a clustering is to find a nominal level variable, that is
> useful in other analyses (i.e., clustering as a data reduction technique).

Yes, we are certainly doing this from a data reduction perspective. We often
use either clustering or factorial analysis (mostly PCA) in this manner.
As mentioned above, the idea of weighting stems from the desire to avoid
some of the arbitrariness introduced by small pop sizes.

> Hierarchical methods have only been of value in my work as "slices"
> across the tree at different levels.  What has been of most use to me
> recently in geopolitical clustering and data mining work has been SPSS's
> new TWO-STEP clustering approach.

This resembles the pam and clara functions in R's cluster package,
developed on the basis of Kaufman, L. and Rousseeuw, P.J. (1990). Finding
Groups in Data: An Introduction to Cluster Analysis. Wiley, New York. But
I don't know if the algorithms are really the same.

> Since the time when I did typologies of counties at the US Census Bureau
> in the 70's, more work has been done on clustering compositional data.
> I have been meaning to look into this for my work on finding profiles of
> US Congressional Districts, but have not gotten to it yet.
> Hope this helps.

Definitely, your questions help clarify my own thoughts. Thank you.


> Art
> [log in to unmask]
> Social Research Consultants
> University Park, MD USA
> (301) 864-5570
> Moritz Lennert wrote:
>>Fionn Murtagh said:
>>>Normal situation for clustering used with correspondence analysis.  Code
>>>in Java is at   I will put
>>>code in C for the clustering there soon, and in R for all - corresp.
>>>analysis and hier. clustering  (weights on cases, min. var. criterion,
>>>reciprocal nearest neigh. algorithm) .
>>In order to (hopefully) make my question clearer, here is an explanation
>>of what we are doing currently:
>>We have the census tracts of the city of Brussels. We have a series of
>>data concerning the housing market in each census tract (type of
>>ownership, number of rented apartments, etc). In order to put a bit of
>>order into this information, we would like to run a cluster analysis to
>>identify different types of ownership/housing structures.
>>My question stems from the fact that the total population of housing
>>units differs quite strongly from one census tract to the other. We do
>>not want
>>tracts with small populations to have the same influence on the types as
>>tracts with large populations. Thus the idea of weighting each census
>>tract according to its population.
>>You seem to be saying that this is a standard situation when clustering
>>results of correspondence analyses, but is this used in general
>>agglomerative clustering algorithms?