

when the dimension is further reduced, a gradual loss of information occurs. It is therefore reasonable to believe that the problem of preserving information increases in proportion to the number of removed dimensions.

When it comes to the second drawback, the lack of subspace cluster identification, this is probably the main contributor to the poor clustering quality. Clearly, in a binary vector of size 10104 where 1 indicates the presence and 0 the absence of a diagnosis, the number of 0's is of much greater magnitude than the number of 1's. Moreover, the fact that a patient had a diagnosis is much more significant than the fact that a patient never had that diagnosis. Weighting 0's and 1's equally, as the Euclidean distance does, therefore probably results in a distance calculation in which the contribution from the 0's completely dominates the contribution from the 1's. This was further explored in Section 6.2.
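The dominance of the 0's can be illustrated with a small sketch (illustrative code, not the thesis implementation): for long sparse binary vectors, the Euclidean distance stays small for any pair of patients because of the agreement on 0's, while the Jaccard coefficient ignores shared 0's and therefore separates overlapping from disjoint patients on its full scale.

```python
import math

def euclidean(a, b):
    # Euclidean distance: every position, including shared 0's, contributes.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_distance(a, b):
    # Jaccard distance: only positions where at least one vector has a 1 count.
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m01 = sum(1 for x, y in zip(a, b) if x != y)
    if m11 + m01 == 0:
        return 0.0
    return m01 / (m11 + m01)

n = 10000
# Two patients sharing one diagnosis out of three, padded with thousands of 0's,
# and a third patient with no diagnoses in common with the first.
p1 = [1, 1, 0] + [0] * (n - 3)
p2 = [1, 0, 1] + [0] * (n - 3)
p3 = [0, 0, 0, 1, 1] + [0] * (n - 5)

print(euclidean(p1, p2), euclidean(p1, p3))          # about 1.41 vs 2.0
print(jaccard_distance(p1, p2), jaccard_distance(p1, p3))  # about 0.67 vs 1.0
```

Both Euclidean values are tiny compared to the maximum possible distance of 100 in this space, because the vectors agree on thousands of 0's; the Jaccard distance, in contrast, spans most of its range between the overlapping and the disjoint pair.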

6.6 Medical certificate clustering

As mentioned in Section 5.2.2 there are several possible reasons why the clustering performed on the medical certificates did not manage to form significant groups. In the following, hypotheses about such reasons are stated. For each hypothesis, possible strategies for dealing with the problem are discussed.

Hypothesis 1: There are no natural clusters in the data set. This hypothesis suggests that there is no clustering tendency in the data set. It implies that interesting clusters will never appear, regardless of the choice of attributes.

(Gri05) argues that there should exist potential groups of patients among the medical certificates, especially among the long-term certificates. As an example, he proposes that there should exist a potential partition between patients reported sick due to real illness and patients reported sick due to a lack of function or well-being in working life. He suggests that such groups can be revealed by comparing the degree of subjectivity or objectivity of a diagnosis, such that patient groups with a clearly demonstrable ailment could be separated from patients with more vague or personally experienced diagnoses.

A concrete potential group could be formed by uneducated but gifted housewives who started working at the age of 35-40 in not very challenging jobs. According to (Gri05), a significant number of patients reported sick long-term have this background. Typical diagnoses for this group are vague ones such as depressive disorder or back syndrome.


Other potential groups of patients are groups that arise due to reorganised or closed companies. In a small or medium-sized community like the one from which the data is collected, a reorganised or closed firm is likely to affect the sick-reporting statistics such that groups of patients with a similar occupational background appear in the data set.

These arguments do not prove that there exist groups in the data set used for the clustering task. For instance, the data set may be too small and therefore not constitute a representative selection of the population. However, (Gri05) possesses detailed knowledge about the data set, which indicates that it is reasonable to believe that such groups exist. Possible explanations of why such groups did not appear during the clustering performed in this work are explored in the following.

Hypothesis 2: There are no natural clusters based on the selected attributes. This hypothesis suggests that the data set potentially contains clusters, but that the selected attributes are unsuitable for forming interesting clusters, regardless of attribute weighting.

(Gri05) argues that the selection of attributes utilised for the medical certificate clustering seems sensible. However, for some of the attributes, defining subgroups of values, or ranking or detailing the attributes, could bring more relevant information into the clustering process. He suggests some possible strategies for including this additional relevant information about the selected attributes.

First, due to the motives just described, he suggests including information that describes the degree of subjectivity or objectivity of a diagnosis. For the musculoskeletal ailments, a typically objective diagnosis is rheumatoid arthritis, while more subjective diagnoses are back or neck syndrome. Corresponding examples from the set of psychological ailments are the objective disorders psychosis and schizophrenia compared to the subjective diagnoses depressive disorder and anxiety disorder.

Secondly, (Gri05) indicates that there is a correlation between occupation and type of illness, which points to the advantage of including more detailed information about a patient's occupation. For all medical certificates the occupation of the patient is registered. Unfortunately, the directives for these notes are not very strict, which results in an unbounded number of possible occupation values. To utilise this information, a manual inspection and a grouping of the values is required. This could help reveal potential correlations and could also help identify the previously mentioned groups that arose due to reorganisations in working life.


Additionally, (Gri05) suggests some new attributes that could introduce new knowledge into the clustering. One of these attributes is hospital referrals. He argues that especially referrals prescribed some time in advance of the sickness reporting can indicate a seriously ill patient. He also mentions x-ray pictures as an indication of seriously ill patients.

Another possible attribute that could be included is the nationality of the patient. The patient journal system offers the possibility to register this information, but the normal practice is to omit it. Such information could possibly reveal interesting patterns, given the known connection between foreigners and both unemployment and social mechanisms such as exclusion. There could also be correlations between nationality and ailments; the occurrence of, for example, type 2 diabetes is known to be frequent among some groups of foreigners due to the change of diet related to relocation.

Hypothesis 3: The uninteresting results arose due to the utilisation of unsuitable methods. This hypothesis suggests that the selected features could potentially be used to form interesting clusters, but that the choice of methods, such as the distance measure or the clustering algorithm, caused the poor results.

The most crucial part of the clustering process is probably the distance calculation. The strategy for measuring distances was probably too simple in this clustering task. Age was the only attribute for which the potential distance values constituted a range. For the other attributes, the compared objects were considered either equal or not equal; the degree of similarity was not taken into account. A possible method that could improve the distance measurement of an attribute is to structure its values in groups, hierarchies or ranges. For instance, the correlations between ICPC codes calculated in the ICPC clustering task could be utilised also in this clustering task to calculate diversified distances between codes and code groups. A similar grading could possibly be performed for marital and occupational status. Obviously, married patients and cohabitants could be considered more similar to each other than to widows or single people. Likewise, social security recipients and the unemployed could be considered similar to each other and dissimilar to employed patients.
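As a sketch of such a diversified distance for one attribute, assuming an illustrative grouping and weight scale that are not taken from the thesis:

```python
# Hypothetical grouping of marital-status values: values within the same group
# are considered closer to each other than to values in other groups.
MARITAL_GROUPS = {
    "married": "couple", "cohabitant": "couple",
    "single": "alone", "widowed": "alone", "divorced": "alone",
}

def marital_distance(a, b):
    # 0.0 for identical values, 0.5 within the same group, 1.0 across groups
    # (the 0/0.5/1 scale is an illustrative assumption).
    if a == b:
        return 0.0
    if MARITAL_GROUPS[a] == MARITAL_GROUPS[b]:
        return 0.5
    return 1.0

print(marital_distance("married", "cohabitant"))  # 0.5
print(marital_distance("married", "widowed"))     # 1.0
```

The same pattern generalises to occupational status or, with a distance table derived from the ICPC code correlations, to diagnosis codes.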

As mentioned in Section 4.2.3, the clustering algorithm utilised in this clustering task was hierarchical clustering. All three merge strategies were tested, and especially the average distance strategy gave some interesting smaller clusters. This points to the importance of selecting an appropriate algorithm. It is reasonable to believe, from the results achieved both by clustering and by counting, that the potentially interesting clusters in this data set are of varying size, and that clustering algorithms such as hierarchical clustering with maximum distance, which search for spherical clusters of similar size, will fail. The minimum distance strategy and the average distance strategy could potentially identify clusters of dissimilar size but, apart from the few interesting clusters formed by average distance, they fail in this clustering task, probably due to the algorithms' lack of iterative object reassignment.

A hierarchical agglomerative clustering algorithm that both handles categorical values and identifies clusters of varying size is the Rock algorithm. Rock introduces two new concepts:

• neighbor: an object's neighbors are the other objects that are considerably similar to it

• link: the number of links between two objects equals the number of their common neighbors

The decision of which clusters to merge is based upon the number of links between the clusters.
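These two concepts can be sketched as follows, assuming Jaccard similarity over sets of categorical values and an illustrative similarity threshold theta (both the similarity function and the threshold value are assumptions for the sketch):

```python
def jaccard_sim(a, b):
    # similarity between two sets of categorical values
    return len(a & b) / len(a | b) if a | b else 1.0

def neighbors(objects, theta):
    # an object's neighbors: the other objects at least theta-similar to it
    return {i: {j for j, o in enumerate(objects)
                if i != j and jaccard_sim(objects[i], o) >= theta}
            for i in range(len(objects))}

def links(objects, theta):
    # link(i, j) = number of common neighbors of objects i and j
    nb = neighbors(objects, theta)
    n = len(objects)
    return {(i, j): len(nb[i] & nb[j])
            for i in range(n) for j in range(i + 1, n)}

objs = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"x", "y"}]
print(links(objs, theta=0.4))
```

Note that objects 0 and 2 are linked through their common neighbor 1 even though they are not similar enough to be neighbors themselves; this transitive evidence is what lets Rock grow clusters of varying size.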

Another type of algorithm that handles categorical values is the family of k-medoid algorithms, which are variations of the k-means algorithm where the clustering is based on medoids, actual objects chosen as cluster representatives, instead of mean points. However, these algorithms tend to form spherical clusters of similar size.

An interesting group of algorithms for this task is the group of density-based algorithms. These algorithms consider regions of high density as clusters and regions of low density as noise. The low-density regions are thus not included in any cluster, which can potentially prevent outliers from weakening the cluster concepts. The density-based algorithms consider only the local region of an object to decide whether it should grow into a cluster, which makes the growth of clusters of arbitrary shape possible. The density-based algorithms were initially rejected in this work due to their number of influential user parameters. However, a rational choice of parameters could potentially have produced interesting results.
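The density-based idea can be sketched with a minimal DBSCAN-style procedure (illustrative code, not part of this work). The influential user parameters appear here as eps, the neighbourhood radius, and min_pts, the minimum neighbourhood size; both names and values are assumptions for the sketch.

```python
import math

def dbscan(points, eps, min_pts):
    # labels: None = unvisited, -1 = noise, otherwise a cluster id
    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = [j for j in range(len(points))
                 if math.dist(points[i], points[j]) <= eps]
        if len(seeds) < min_pts:
            labels[i] = -1           # low-density region: noise, joins no cluster
            continue
        labels[i] = cid
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid      # noise point reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nbs = [k for k in range(len(points))
                   if math.dist(points[j], points[k]) <= eps]
            if len(nbs) >= min_pts:  # only core points grow the cluster further
                queue.extend(nbs)
        cid += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=3))  # two clusters, (50, 50) labelled noise
```

Because growth is decided purely locally, the clusters may take arbitrary shapes, and the isolated point is left outside every cluster instead of distorting one of them.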


7 Conclusion

This thesis demonstrates that the application of clustering methods to a patient record can potentially identify well-known medical information and thereby also potentially identify so far undiscovered significant knowledge. However, only a minority of the methods tried out in this thesis managed to reveal known information. Conclusions regarding the fitness of the tested methods are given in the following. It must be emphasised that the conclusions only consider the methods' suitability for the data sets to which they were applied, and that they may well lead to different conclusions when applied to other data sets.

The findings regarding merge strategies may be summarised thus:

• The maximum distance strategy showed the overall best performance compared to the average distance and the minimum distance strategies. The maximum distance strategy tended to form conceptually meaningful clusters of similar size. Both the minimum distance strategy and the average distance strategy tended to form one single large cluster, the former to a greater degree than the latter.

• In this thesis the average distance strategy showed a tendency to identify smaller, conceptually meaningful clusters that were not identified by the maximum distance strategy. This indicates that the minimum and the average distance strategies can potentially perform better than the maximum distance strategy when the natural clusters underlying the data set are elongated or of dissimilar size.
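The difference between the strategies can be sketched with a toy agglomerative procedure on one-dimensional data (illustrative, far simpler than the implementation in Appendix A): on an elongated chain next to a compact group, minimum distance follows the chain while maximum distance cuts it into similar-sized pieces.

```python
def cluster_distance(c1, c2, dist, strategy):
    # all pairwise distances between members of the two clusters
    ds = [dist(a, b) for a in c1 for b in c2]
    if strategy == "min":          # minimum distance (single link)
        return min(ds)
    if strategy == "max":          # maximum distance (complete link)
        return max(ds)
    return sum(ds) / len(ds)       # average distance

def agglomerate(points, k, strategy):
    # merge the closest pair of clusters until only k clusters remain
    dist = lambda a, b: abs(a - b)
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]],
                                                   dist, strategy))
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

pts = [0, 1, 2, 3, 4, 20, 21]      # an elongated chain plus a compact pair
print(agglomerate(pts, 3, "min"))  # [[0, 1, 2, 3, 4], [20], [21]]
print(agglomerate(pts, 3, "max"))  # [[0, 1], [2, 3, 4], [20, 21]]
```

Minimum distance keeps the elongated cluster whole; maximum distance, searching for compact clusters of similar size, breaks it apart.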

The findings regarding distance measures may be summarised thus:

• Lift correlation and the Jaccard coefficient performed better than the Euclidean distance for long binary vectors with a major portion of 0's.

• The clusterings obtained by lift correlation seemed more meaningful than the clusterings obtained by the Jaccard coefficient.
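A sketch of lift correlation for two binary diagnosis vectors, using the standard definition of lift (the exact conversion from lift to a distance used in the experiments is not repeated here):

```python
def lift(a, b):
    # lift correlation between two binary vectors: the observed co-occurrence
    # rate divided by the rate expected if the two diagnoses were independent
    n = len(a)
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    expected = (sum(a) / n) * (sum(b) / n)
    if expected == 0:
        return 0.0
    return (both / n) / expected

# diagnoses co-occurring more often than chance would predict give lift > 1
a = [1, 1, 0, 0, 0, 0, 0, 0]
b = [1, 1, 1, 0, 0, 0, 0, 0]
print(lift(a, b))
```

Like the Jaccard coefficient, lift is driven by the 1's; unlike Jaccard, it also normalises by how common each diagnosis is, so co-occurrence of rare diagnoses counts for more.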

The findings regarding quality measures may be summarised thus:

• The quality measures evaluated in this thesis, namely the Dunn index, the modified Hubert Γ statistic and the Davies-Bouldin index, agreed neither with each other nor with medical expertise in regard to what is considered a good clustering.


• The three measures are based on different clustering characteristics. The qualities measured by different quality indices will therefore at times contradict each other. This emphasises the importance of the user being conscious of exactly what the index shows when the results are interpreted.

• The measured results did not reflect the observed degree of meaning in the clusterings. This underlines the absolute necessity of human inspection and interpretation of the clusterings.

• The experiments demonstrate that the measured quality is highly dependent on the distance measure used. This implies the following:

– A measured quality value is not absolute, but relative to the other values measured with the same distance measure.

– Comparing quality values calculated using different distance measures is not meaningful.
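This relativity follows directly from how the indices are defined; a minimal sketch of the Dunn index (illustrative, simpler than the implementation in Appendix A) shows that its value is built entirely from distances, so changing the distance measure changes the scale of the index itself:

```python
def dunn_index(clusters, dist):
    # Dunn index: smallest between-cluster distance divided by the largest
    # within-cluster diameter; higher values suggest compact, separated clusters
    min_between = min(dist(a, b)
                      for i, c1 in enumerate(clusters)
                      for c2 in clusters[i + 1:]
                      for a in c1 for b in c2)
    max_diameter = max(dist(a, b) for c in clusters for a in c for b in c)
    return min_between / max_diameter

d = lambda a, b: abs(a - b)
compact = [[0, 1], [10, 11]]   # compact, well-separated clusters
overlap = [[0, 6], [5, 11]]    # clusters that nearly touch
print(dunn_index(compact, d), dunn_index(overlap, d))
```

The two values (9.0 and about 0.17 here) are only comparable because the same distance function was used for both clusterings; with another distance function the numbers would live on a different scale entirely.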

The findings regarding principal component analysis may be summarised thus:

• The PCA experiments demonstrated that a substantial reduction of the number of attributes potentially causes only an insignificant loss of information. For the ICPC clustering, the number of features was reduced from 7000 to 226 without any influence at all on the results; the clusterings obtained from the two data sets were identical.

• PCA increased the conceptual clustering quality compared to the quality obtained by clustering the full data set.

• The clustering tendency increased when the number of features decreased, so that strategies that gave no more than one cluster for the full data set produced several sensible clusters for the reduced data set.
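A minimal SVD-based sketch of such a reduction (illustrative, and independent of the PCA routine listed in Appendix A): when most of the variation in a high-dimensional data set lies in a few directions, projecting onto those directions discards many features but little information.

```python
import numpy as np

def pca_reduce(X, k):
    # project the rows of X onto the k principal components with most variance
    Xc = X - X.mean(axis=0)
    # SVD of the centred data: the rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# 200 synthetic samples in 50 dimensions whose variation is almost entirely
# explained by 2 latent directions plus a little noise (all values illustrative)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

reduced = pca_reduce(X, 2)
print(reduced.shape)  # (200, 2): 50 features reduced to 2
```

In this constructed case the two retained components carry practically all of the variance, mirroring the 7000-to-226 reduction above that left the clusterings unchanged.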

An additional finding in this thesis was that a significant part of the problems associated with clustering is the seemingly overwhelming number of available strategies. The number of possible attribute subsets to represent a data object is very large. There is a great number of available strategies for each preprocessing step, such as those for normalisation and for the replacement of missing values, and there are plenty of distance measures and numerous clustering algorithms. The greatest challenge connected with a clustering task is to make the correct choices in this labyrinth of possible strategies.


Finally, one additional experience gained from this work was that it would be advantageous to design the application areas and the concrete data sets in advance of the choice of clustering methods, to increase the suitability of the selected methods to the data sets.



References

[BWH01] Rohan A. Baxter, Graham J. Williams, and Hongxing He. Feature selection for temporal health records. Proceedings of the 5th Asia-Pacific Conference on Knowledge Discovery and Data Mining, 2001.

[Com04] Wonca International Classification Committee. Den internasjonale klassifikasjon for primærhelsetjenesten. Wonca International Classification Committee, 2004.

[CR98] James C. Bezdek and Nikhil R. Pal. Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 28, no. 3, June 1998.

[Fis36] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, pp. 179-188, 1936.

[Gra] Graphviz - graph visualization software. http://www.graphviz.org/About.php.

[Gri05] Anders Grimsmo, 2005. Personal correspondence.

[H.D02] Margaret H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall, 2002.

[HK01] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Academic Press, 2001.

[HPA] The Health Personnel Act. http://www.lovdata.no/all/nl-19990702-064.html.

[ICP] ICPC-2 - International Classification of Primary Care. http://www.kith.no/templates/kith_WebPage____1186.aspx.

[I.T02] I. T. Jolliffe. Principal Component Analysis, Second Edition. Springer, 2002.

[ML00] M. Last, O. Maimon, and A. Kandel. Knowledge discovery in mortality records: an info-fuzzy approach. Medical Data Mining and Knowledge Discovery, K. Cios (Ed.), Studies in Fuzziness and Soft Computing, Vol. 60, Springer-Verlag, pp. 211-235, 2001.

[Pat02] Anne Patrikainen. Projected clustering of high-dimensional binary data. PhD thesis, 2002.

[PDA] The Personal Data Act. http://www.ub.uio.no/ujur/ulovdata/lov-20000414-031-eng.pdf.

[PHD] Personal Health Data Filing System Act. http://www.ub.uio.no/ujur/ulovdata/lov-20010518-024-eng.pdf.

[Pro] Profdoc. http://www.profdoc.no.

[RAR95] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. 1995.

[R.B61] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.

[The] R. http://www.r-project.org.

[TK99] Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition. Academic Press, 1999.

[Tsu01] S. Tsumoto. Discovery of clinical knowledge in databases extracted from hospital information systems. Medical Knowledge Discovery, 2001.

[WL02] Machiel Westerdijk and Martijn Ludwig. Product specification by finding homogeneous groups of care episodes in hospital data. PCSE conference, 2002.


A Source code

This appendix lists the source code utilised in this work. Section A.1 contains a summary of the methods, while Section A.2 lists the Python code.

A.1 Method summary

FUNCTIONS

PCA(inputarray, outputnum)
    performs principal component analysis of an array

avg_link(clusters, pmatrix, ofile, freqcodes, binpats, codes, binproxmatrix, binproxmean, binproxvar, distmeasure, attrtype)
    implementation of the average distance strategy

calcDistBetweenClusters(means, distmeasure)
    calculates the distance between the means of each cluster

calculateDistance(pat1, pat2, distmeasure)
    controls the distance calculation

calculateEuclid(pat1, pat2)
    calculation of Euclidean distance

calculateJaccard(pat1, pat2)
    calculation of Jaccard coefficient

calculateMeans(pats, clusters)
    calculation of means during k-means

clustering(inputfile, outputfile, clustalg, strategy, numclusters, distmeasure, pca, numpcafeatures, attrtype)
    controls choices of clustering strategies

comparePat(file1, file2)
    removes patients that do not occur in both the temporary and the long-term group

countCodes(inputfile)
    counts total number and frequency of codes

createBinVectors(patfile, codes)
    creates binary vectors

davies_Bouldin(clusters, binpats, distmeasure)
    calculates the Davies-Bouldin index for a clustering

dunn_index(clusters, binpmatrix)
    calculates the Dunn index for a clustering

findCodes(patlist, freqcodes, binpats, codes)
    returns code and textual description for the most frequent codes

findIllformInfo(list, binpats)
    returns information about the medical certificates

hierarchical(pats, strategy, distmeasure, outputfile, freqcodes, binpats, codes, attrtype)
    implementation of pure hierarchical clustering

kmeans(pats, meansnr, distmeasure, outputfile, freqcodes, binpats, codes, attrtype)
    implementation of the k-means algorithm

max_link(clusters, pmatrix, ofile, freqcodes, binpats, codes, binproxmatrix, binproxmean, binproxvar, distmeasure, attrtype)
    implementation of the maximum distance strategy

mod_Hubert_stat(clusters, binpats, p, pmean, pvar, distmeasure)
    calculates the modified Hubert gamma statistic for a clustering

preprocessIllform(inputfile, outputfile)
    performs preprocessing of medical certificates: inserts default values, normalises the age attribute, deletes fields that lack crucial information

quality_measure_help(binpats, distmeasure)
    calculates the proximity matrix and its mean and variance for the original binary matrix; used for calculating the Hubert statistic when the proximity matrix is changed due to PCA

single_link(clusters, pmatrix, ofile, freqcodes, binpats, codes, binproxmatrix, binproxmean, binproxvar, distmeasure, attrtype)
    implementation of the minimum distance strategy

writeGraphViz(pmatrix, codes)
    writes information about a clustering in a valid GraphViz input format

writeResult(clusters, iteration, ofile, freqcodes, binpats, codes, binpmatrix, binpmean, binpvar, distmeasure, attrtype)
    writes results from the clustering procedure to file

A.2 Python source code

from Numeric import *
from rpy import *
from MLab import mean
import random
import math

def clustering(inputfile, outputfile, clustalg, strategy, numclusters,
               distmeasure, pca, numpcafeatures, attrtype):
    '''
    controls choices of clustering strategies
    '''
    if attrtype == "binary_codes" or attrtype == "ICPCcluster":
        codes, freqcodes = countCodes(inputfile)
        pats = createBinVectors(inputfile, codes)
        # cluster ICPC codes, not patients
        if attrtype == "ICPCcluster":
            pats = transpose(pats)
    elif attrtype == "illform":
        print "illform worked"
        pats = []
        infile = open(inputfile, "r")
        for line in infile:
            localpat = []
            for item in (line.strip().split())[1:]:
                localpat.append(item)
            pats.append(localpat)
        freqcodes = 0
        codes = 0
    # PCA reduction
    if pca == "PCA":
        redpats = PCA(pats, numpcafeatures)
    else:
        redpats = pats[:]
    # select clustering algorithm
    if clustalg == "hierarchical":
        hierarchical(redpats, strategy, distmeasure, outputfile,
                     freqcodes, pats, codes, attrtype)
    return pats

def kmeans(pats, meansnr, distmeasure, outputfile, freqcodes, binpats,
           codes, attrtype):
    '''
    implementation of the k-means algorithm
    '''
    means = []
    for item in meansnr:
        means.append(pats[item])
    ooutput = open(outputfile, "w")
    iteration = 1
    clusterlist = []
    oldclusterlist = [0]
    binproxmatrix, binproxmean, binproxvar = \
        quality_measure_help(pats, distmeasure)
    while oldclusterlist != clusterlist:
        if 1 < iteration < 40:
            means = calculateMeans(pats, clusters)
        clusters = {}
        for i, object1 in enumerate(pats):
            mindist = 999
            mean = -1
            for j, object2 in enumerate(means):
                dist = calculateDistance(object1, object2, distmeasure)
                if dist < mindist:
                    mindist = dist
                    mean = j
            if clusters.has_key(mean):
                clusters[mean].append(i)
            else:
                clusters[mean] = [i]
        oldclusterlist = clusterlist[:]
        clusterlist = clusters.values()
        clusterlist.sort()
        if clusterlist != oldclusterlist:
            writeResult(clusterlist, iteration, ooutput, freqcodes, binpats,
                        codes, binproxmatrix, binproxmean, binproxvar,
                        distmeasure, attrtype)
        iteration += 1

def hierarchical(pats, strategy, distmeasure, outputfile, freqcodes,
                 binpats, codes, attrtype):
    '''
    implementation of pure hierarchical clustering
    '''
    if attrtype == "binary_codes" or attrtype == "ICPCcluster":
        binproxmatrix, binproxmean, binproxvar = \
            quality_measure_help(binpats, distmeasure)
    elif attrtype == "illform":
        attrtypes = pats[0]
        pats = pats[1:]
    # calculate proximity matrix
    pmatrix = [0.0]
    pmatrix = pmatrix * len(pats)
    for i, item in enumerate(pmatrix):
        pmatrix[i] = [0.0]
        pmatrix[i] = pmatrix[i] * len(pats)
    for i, pat1 in enumerate(pats):
        for j, pat2 in enumerate(pats[i + 1:]):
            if attrtype == "illform":
                pat1and2 = [pat1, pat2]
                pmatrix[i][j + i + 1] = \
                    calculateDistance(pat1and2, attrtypes, distmeasure)
            else:
                pmatrix[i][j + i + 1] = \
                    calculateDistance(pat1, pat2, distmeasure)
    if attrtype == "illform":
        binproxmatrix = []
        for item in pmatrix:
            binproxmatrix.append(item[:])
    clusters = []
    for i, line in enumerate(pats):
        clusters.append([i])
    ofile = open(outputfile, "w")
    if strategy == "single":
        clusters = single_link(clusters, pmatrix, ofile, freqcodes, binpats,
                               codes, binproxmatrix, binproxmean, binproxvar,
                               distmeasure, attrtype)
    if strategy == "max":
        clusters = max_link(clusters, pmatrix, ofile, freqcodes, binpats,
                            codes, binproxmatrix, binproxmean, binproxvar,
                            distmeasure, attrtype)

def avg_link(clusters, pmatrix, ofile, freqcodes, binpats, codes,
             binproxmatrix, binproxmean, binproxvar, distmeasure, attrtype):
    '''
    implementation of the average distance strategy
    '''
    # find the pair of clusters with the smallest distance
    mindist = 999
    for i, item in enumerate(pmatrix[:-1]):
        localmin = min(item[i + 1:])
        if localmin < mindist:
            mindist = localmin
            ipos = i
    jpos = pmatrix[ipos][ipos + 1:].index(mindist)
    jpos = jpos + ipos + 1
    # calculate new averages
    k = 0
    l = len(pmatrix)
    pmatrix[ipos][jpos] = 0.0
    while k < l:
        pmatrix[min(k, ipos)][max(k, ipos)] = (
            pmatrix[min(k, ipos)][max(k, ipos)] * len(clusters[ipos])
            + pmatrix[min(k, jpos)][max(k, jpos)] * len(clusters[jpos])
        ) / (len(clusters[ipos]) + len(clusters[jpos]))
        k += 1
    # pop and append
    pmatrix.pop(jpos)
    for i, item in enumerate(pmatrix):
        pmatrix[i].pop(jpos)
    for item in clusters[jpos]:
        clusters[ipos].append(item)
    clusters.pop(jpos)
    if len(clusters) < 50:
        dunnfile.write(str(dunn_index(clusters, binproxmatrix)) + "\n")
        hubertfile.write(str(mod_Hubert_stat(clusters, binpats,
                                             binproxmatrix, binproxmean,
                                             binproxvar, distmeasure)) + "\n")
        dbfile.write(str(davies_Bouldin(clusters, binpats,
                                        distmeasure)) + "\n")
    if len(clusters) < 11:  # write to file only the last iterations
        writeResult(clusters, len(clusters), ofile, freqcodes, binpats,
                    codes, binproxmatrix, binproxmean, binproxvar,
                    distmeasure, attrtype)

def calculateDistance(pat1, pat2, distmeasure):
    '''
    controls the distance calculation
    '''

def quality_measure_help(binpats, distmeasure):
    '''
    calculates the proximity matrix and its mean and variance
    for the original binary matrix.
    Used for calculating the Hubert statistic when the proximity
    matrix is changed due to PCA.
    '''
    pmatrix = [0.0]
    pmatrix = pmatrix * len(binpats)
    for i, item in enumerate(pmatrix):
        pmatrix[i] = [0.0]
        pmatrix[i] = pmatrix[i] * len(binpats)
    for i, pat1 in enumerate(binpats):
        for j, pat2 in enumerate(binpats):
            pmatrix[i][j] = calculateDistance(pat1, pat2, distmeasure)
    meanpmatrix = mean(mean(pmatrix))
    varpmatrix = 0
    for i, item1 in enumerate(pmatrix):
        for j, item2 in enumerate(item1):
            varpmatrix += pmatrix[i][j] ** 2 - meanpmatrix ** 2
    varpmatrix /= len(pmatrix) ** 2
    varpmatrix = sqrt(varpmatrix)
    return pmatrix, meanpmatrix, varpmatrix

def mod_Hubert_stat(clusters, binpats, p, pmean, pvar, distmeasure):
    '''
    calculates the modified Hubert gamma statistic for a clustering
    '''
    if len(clusters) > 1:
        l = [0]
        l = l * len(binpats)
        clusterdict = {}
        for i, cluster in enumerate(clusters):
            clusterdict[i] = cluster
            for patid in cluster:
                l[patid] = i
        # calculate mean for each cluster