Visual Mining of Text Collections
Rosane Minghim1and Haim Levkowitz2
1ICMC - Instituto de Ciências Matemáticas e de Computação University of São Paulo
Brazil
2Institute for Visualization and Perception Research University of Massachusetts Lowell
USA
Abstract
What happens if you have to examine and reach conclusions from a considerable number of textual documents? If you are faced with this task or with developing tools for completing this task, this tutorial is for you.
Examining text is crucial for many different types of applications. Even applications that rely on additional types of data (such as images, signals, simulations) usually have complementary or alternative text based output. The challenge of interpreting content and extracting useful information from a document collection is the target of efforts in various areas of computer science. Fields such as Text Mining try to extract knowledge automatically or semi-automatically from collected textual information; however, due to the multi-dimensional characteristics of text it is paramount to couple these algorithms with meaningful visual representations in order to improve performance and allow the discovery of relevant information within a text data set.
Since it is not feasible to go through the entire documents’ content in detail due to data sizes and time constraints, Visual Text Mining (VTM) — the combination of Text Mining and Visualization — is focused on developing tools to help users extract meaning from text collections without extensive reading.
In this tutorial we introduce the necessary background and the graphical techniques involved in Visual Text Mining of document collections.
Contents
1 Overview, motivation, goals
2 Two test cases: Visual maps of flash news and scientific papers 2.1 Visual mapping of news flashes
2.2 Visual mapping of scientific papers 3 Basic Concepts
3.1 Text processing and information retrieval 3.2 Data and text mining
3.3 Projection techniques
3.4 Visual representations: graphs, surfaces, volumes, triangulations 4 From Visualization to Visual Text Mining
4.1 Visualization techniques for multidimensional data
4.2 Visualization techniques and systems for handling document collections 4.3 Visual text mining
5 Projection Based Visualization and its Application to Visual Text Mining 5.1 Projection techniques and point placement strategies
5.2 Mapping text collections via projections and point placement 5.3 Topic extraction and visualization
5.4 Further Examples
6 Conclusions, Current Challenges, Future Trends 7 Acknowledgments
Appendices
A Technical Report: VISUAL MAPPING OF TEXT COLLECTIONS USING AN APPROXIMATION OF THE KOLMOGOROV COMPLEXITY A.1 Introduction
A.2 Previous Work
A.3 Projection techniques for text visualization
A.4 Kolmogorov Complexity as a means to define distance between texts A.5 Results
A.6 Conclusions
B Technical Report: CONTENT-BASED DOCUMENT MAPS USING FAST PROJECTIONS AND TOPIC DETECTION B.1 Multidimensional Projections for Mapping Collections of Documents
B.2 Exploring Content-based Document Maps B.3 Further Remarks
References
1. Overview, motivation, goals
Text collections are generally considered data sets with a high number of dimensions, where a dimension is a term or expression (an n-gram) of importance within the domain of the document collection. In a conventional vector represen- tation of a document data set, the number of dimensions for a few hundred documents of moderate size can reach a few thousand.
Regardless of these difficulties and their implications for the analysis of text collections, there is a growing number of applications that can benefit from tools to effectively support analysis of document sets and from that analysis, allow the user to reach conclusions and to make decisions. The range of potential applications vary from health studies and diag- nosis based on medical records to investigation of unlaw- ful activities. The nature of the textual source is also quite varied. Documents can be snippets from web searches, RSS feeds of various kinds, scientific papers, reports, newspapers articles, automatically generated health test reports, patents, and so on. Every day new applications start to rely on text solely or combined with other data sources (such as table data and images). There is, therefore a compelling need for tools that combine user driven and automatically extracted displays to support the analysis of text collections and rela- tionships amongst texts.
Finding ways to support analysts in quickly and mean- ingfully extracting meaning from text collections is also a strategic issue for the area of Visual Analytics, defined as
"the science of analytical reasoning facilitated by interactive visual interfaces" (from http://nvac.pnl.gov/). The fields of Text Mining and Visual Text Mining aim at combining visu- alization and mining approaches to achieve solutions for the exploration of text collections.
Handling high-dimensional data poses many problems to researchers and data analysts in general. Visualization tech- niques seek to bridge the gap between users’ visual per- ception and reasoning capabilities and analytical techniques.
Nonetheless, finding intuitive ways to visualize large high- dimensional data sets is a difficult problem. Traditional mul- tidimensional visualizations, such as scatter plot matrices, parallel coordinate plots or pixel-based techniques [OL03]
operate by mapping each data attribute into a corresponding visual axis or other representation. As such, they can only handle well a very limited number of attributes and, there- fore, are not directly applicable to complex data consisting of many attributes (dimensions) [MPL06].
A widely explored alternative to handle this type of data is to reduce dimensionality prior to visualization, e.g., by projecting the high-dimensional data points into a lower di- mensional space (2D, 3D) that is more amenable to user in- terpretation. Various techniques exist to achieve such a map- ping, mostly based on dimensionality reduction, dimension- ality clustering, or point placement strategies. The resulting
projection can be displayed through a suitable visual repre- sentation — e.g., points placed on a plane, graphs, surfaces or volumes — that can be navigated and explored by a data analyst. Projections may be created based on different cri- teria, but typically they strive to preserve distance relation- ships amongst data points as defined in the original space.
Information loss is inevitable and the extent to which this distance-preserving goal is achieved depends on the preci- sion of the projection.
Figure1depicts the overall process in generating interac- tive visual representations of document data sets via projec- tions (or, in fact, other point placement strategies too). The user can feed the system with (1) a document collection, (2) a structured data table, or (3) a distance matrix. In the first case, documents are either converted into a vector represen- tation or compared using a similarity measure that handles text against text. From vector representations, distance cal- culations can also be performed. From distance relationships the 2D projection is generated and the user can interact with its visual representation to gain insight to the data set.
This tutorial treats the problem of mapping and exploring, through visual representations, textual data sets and results of information extraction from textual data sets of various different natures. We describe the basic principles for gen- erating such visualizations and make an effort to convey the benefits of achieving a solid graphical view of automatically extracted relationships. The problem of facilitating text anal- ysis is by far an open issue. We hope to identify, suggest and explain the current technologies available and the challenges ahead for professionals dedicated to both using and develop- ing techniques for that purpose. In this tutorial, we offer an overview of the existing techniques underlying techniques to allow the construction of visual document maps from un- structured data sets, that is, without relying on previously extracted meta data. The resulting map should offer insight into the contents of the document data set to support exam- ination and relief the user from extensive reading otherwise necessary.
We begin in the next Section by illustrating the creation and exploration of visual maps of example text collections of very different natures in two quite diverse areas of appli- cation, by use of projection techniques.
Following that we describe, in Section 3, some of the most important basic concepts and techniques involved in this area.
Section4reviews the techniques published in the litera- ture to achieve some of the goals stated here.
Projection based visualizations and their use for visualiza- tion of text collections are detailed in Section5. Finally, we discuss other underlying mining issues that go together with text analysis, such as Topic Extraction.
Figure 1:Generating projection maps of multidimensional data sets.
2. Two test cases: Visual maps of flash news and scientific papers
Visualizations of text documents and their meta-data can take many forms, whether or not they are coupled with data mining techniques. Since our focus is on non-structured data, here we illustrate the mapping based on content of two document collections bearing very different characteristics.
The goal would be, through an analysis process involving such maps, to be able to extract and map features of the data set in a way that can help locate patterns. In each ap- plications, however, what needs to be located is different. In the first case, the mapping of news flashes, the target is to have an overview of the main events that have been reported on Web sites of some relevant news agencies. The second case illustrated here, the mapping of scientific papers, high- lights the possibilities of identifying general subjects as well as specific issues and finding relevant material to explore fur- ther in academic and scientific applications.
Visually, the resulting representation used here is a graph, whose nodes represent points of the data set (individual tex- tual documents) and whose edges represent some type of relationship between the connected nodes. The graph dis- plays and interactions presented in this Section were gener- ated using PEx, the Projection Explorer, a freely available tool for multi-dimensional visualization via projections and point placement strategies with text processing capabilities (see http://infoserver.lcad.icmc.usp.br/infovis2/PEx).
An alternative view of the same text maps is the landscape or surface view, where vertices are points and meshes cod- ify additional attributes by mapping scalar fields to geomet- ric or visual attributes (height, color, glyphs, etc.). Surface views in the following text are generated using the Super Spider tool, a recent evolution of its precursor, the Spider Cursor [MLN∗05], that can be used to explore landscape plots.
The visual representation employed here, whose construc-
tion will be described along the text and slides of this Tuto- rial, can also be the result of various other tools for multidi- mensional data projection (see Section4.2, except, maybe, for particularly visual attribute mappins and for the flying quadric in Figure5.
2.1. Visual mapping of news flashes
The first data set we shall explore was built by gathering RSS feeds of news flashes from the WWW sites of four news agencies: Associated Press, BBC, CNN and Reuters. They represent all the news made available as RSS during approx- imately 24 hours in April 2006. After some elimination of duplicates, the gathering process resulted in 2,684 news files.
For the visual mapping, we have used title and content only, with the goal that similar news (supposedly news treating the same subjects) are mapped to the same neighborhood.
For the exploration after mapping, the full file can be used, which contains also the date and the source of the news (the name of the news agency).
The size of the text in an RSS feed (which, for many ap- plications, synthesizes a larger file) is usually quite small, typically 25 to 80 words only. This test case bears resem- blance to many applications available today that wish to an- alyze RSS feeds of all sorts of information (for instance, dis- cussion groups, support groups, article repositories, patent repositories, dictionaries, libraries, and thousands of other such repositories).
Figure2shows pictures of two different maps from this data set. In Figure2(a), which employs a multi-dimensional projection technique called Least Square Projection, or LSP [PNML07,PNML06], the region labeled A is a group of news articles on the Vioxx dispute. Region B centralizes all news on an immigration bill undergoing discussion at the US senate. Group C concentrates the news on a trial related to the Enron case (a particularly important deposition was
about to take place). By coloring nodes according to the source of the news, it is also possible to identify the degree of importance a particular news agency gave to a subject by es- timating the number of points with its color. Conversely, the same observation supports identification of the news agen- cies that dedicated a lot of — or just a little — effort to report a particular event.
Figure2(b)shows the same map as Figure2(a), slightly zoomed in, now presenting a set of labels that support iden- tification of the subject mostly tackled by a particular user- selected group of documents. The set of labels in that pic- ture, calculated by a term covariance approach (see Sec- tion5.3.1), identify some of the central events of those two days.
Interpretation of groups in this type of projection is very similar to interpretation of clusters, that is, highly similar data is identified by proximity in the 2D display. In the very middle of the display remain the groups that have been ’un- resolved’ by the current context, which means that according to the strategy employed for similarity in this domain, it was not possible to distinguish them from the others. Coloring in Figure2(b)was carried out using a hierarchical cluster- ing method in 2D (in this case k-means with average linkage weighting). It helps distinguish regions of higher density of points (in the ‘blue’ end of the spectrum) from regions of lower densities (moving towards red as regions become more sparse).
Figure2(c)shows the visual result of another map of the same data set, built using a new technique for point place- ment of this type of similarity-based data, called Neighbor- joining Tree (or N-J tree) [CPMT07]. An N-J tree is a type of similarity tree visualization that employs the same prin- ciples of phylogeny tree reconstruction [SN87] to reflect the organization of data points (in this case documents) accord- ing to their similarity. The interpretation of that type of vi- sualization is slightly different than straight projections. In an N-J tree, very similar (according to the similarity mea- sure employed) data points are gathered in the same and nearby branches of the tree. Proximity on the 2D plane it- self does not imply similarity. Interpretation of similarity trees needs, therefore, the branching information to be effec- tive. Branches guide selection and focus also; the neighbor- hood of a point is extracted searching up and down the tree.
Groups of documents with content that is consistent (that is, that are best resolved by the similarity measure employed) are placed at the outer branches. Figure2(c)uses covariance labels to illustrate the placement of some of the main news in the data set.
Closer examination of the flash news visualizations re- veals that many of them are just headlines followed by a short text of the sort ‘read full story for details’. Figure3(a) highlights, in blue, the points where that occurs. Removing those can help unclutter the view, which was done by build- ing a new map after removing those blue points (see Fig-
ure3(b)). Similarly, other searches on the map can help high- light, select or eliminate specific files and groups of files.
The news data set, after the reduction illustrated in Fig- ure3, has 2450 news items remaining. The similarity tree by neighbor-joining of that reduced data set is presented in Figure4(a). That picture also shows a selection of a branch for further examination. The region is zoomed in on in Fig- ure4(b), and further levels of topic extraction can be ob- served. We can see in the central part of that picture that the similarity calculations gathered news on various linked Palestinian issues in neighboring branches. These issues varied from formation of the Hamas government to Israeli strikes in Gaza to halting of financial aid from Europe and the United States. The lighter label in Figure4(b)is the title of one of the news files in that map.
A surface view of a projection-based document map can be built by creating a mesh from the projected points and then using scalar fields attributed to vertices to reflect prop- erties. The result of such a process is illustrated in Figure5.
Various attributes can be mapped to scalar fields, from search results (for instance the number of appearances of a particu- lar expression) to the results of clustering, classification, and categorization of data points. In Figures5(a)and5(b)color reflects a 2D clustering (also by k-means) of the points in the projection, employed to help locate pockets of closely projected documents. Figure5(c)illustrates the capabilities of simultaneous presentation of multiple scalar fields using visual and geometrical attributes. In that picture, height rep- resents the count of the number of times the word ‘Bush’
appears in the document (higher points are higher counts, naturally). Each of the news sources (four in all) is reflected by a different color. That same information is redundantly re- flected on the color of a ‘flying’ super-quadric that changes color as the user browses through the map. The visual and geometric mappings of the super-quadric can reflect prop- erties of the currently explored point (or focus point). In the case of Figure5(c), only its color changes, mapping the news agency that published the document in the focus point. Label by title clarifies what the news is about.
Finally, another type of projection, called Proj- Clus [PM06] is used to map the same news data set explored thus far (see Figure6). Figure6(a)shows the graph view with 2D k-means clustering for grouping location and Figure6(b)shows the mesh view of the same map. In Figure6(b) color is hierarchical clustering, that is, darker blue regions are regions of largest concentrations of points, down to red for sparser regions. ProjClus tends to spread the clusters around more, which is confirmed by larger regions with the same top density. Although that property sometimes confuses the boundaries of clusters, ProjClus still bears good ’proximity by similarity’ precision and can help estimate the local density of a region since it diminishes cluttering compared to LSP.
Both projections thus far employed in the illustrations are
(a) LSP map. Colors represent different news sources (4 in all).
(b) Labeling of regions of map in Figure2(a)for general topic identification.
(c) Tree-like map visualization by neighbor-joining with labels.
Figure 2:Document maps of 2,684 news flashes collected for approximately 24 hours in April 2006.
reasonably fast, leading to an environment were simultane- ous examination and coordination of various maps is feasi- ble. Similarity trees have higher computational cost but it is still manageable for data sets whose size is in the hundreds of points, even in the thousands, if similarity is pre-calculated and stored. Neighbor-joining algorithms are typicallyO(n3) orO(n2)at best.
2.2. Visual mapping of scientific papers
The second data set to be illustrated here is composed of 676 papers in four areas of knowledge: Case-based Rea-
soning (CBR), Inductive Logic Programming (ILP), Infor- mation Retrieval (IR) and Sonification (SON). CBR and ILP documents were extracted from Lecture Notes in Com- puter Science (LNCS) in those subjects. IR and SON papers were retrieved from internet repositories. They are pseudo- classified according to their source. The maps were built us- ing papers’ titles, authors, affiliations and references.
In addition to those documents, six others were added to the data set. These were papers of work published by mem- bers present and past of our research group; the goal was to evaluate their local and global positioning on the map.
From those, five papers were highly related. They refer to
(a) Highlighting news with words ‘full’ and
‘story’ on map of Figure2(a) (b) New map without news highlighted in Figure3(a). 2450 news files remained Figure 3:Using search and corpus manipulation to eliminate unwanted files
(a) N-J tree of the reduced news data. Selection of focus region.
(b) Scaled presentation of focus region and its surroundings . Figure 4:Focusing on a part of a Map for further exploration
the evolution of a sonification system and its corresponding user evaluation. Two of them refer to the first version of the system and the other three to its newest version. The last pa- per was also published having as authors members of our research team, but on a completely different subject (a tech- nique used for image segmentation). Let us call those six additional papers the ‘intruders’.
Figure7shows general and specific views of the papers data set, mapped by LSP. It can be seen in Figure7(a), both by coloring and by labeling (each color here represents a pseudo-class), that the projection technique was capable of separating well this heterogeneous document collection by their subject areas. Moreover, subjects with an established and more focused body of techniques, such as CBR and ILP
(a) Mesh generated from a Delaunay triangulation of news map in Figure3(b). Color is k-means clustering of the pro- jected points.
(b) Surface view of news set. Color and height codify the same scalar field of3(b).
(c) Surface View. Color reflects the source of the news; height codifies the number of appear- ances of the word ‘Bush’ in that particular text; the color of ‘flying’ geometry on the display confirms the source of the news (i.e., the news agency here represented in red) of the focus point; the focus is highlighted by its incident edges colored in blue; label is title of the focus point.
Figure 5:Surface representation and exploration of content by visual attributes.
tend to be displayed less sparsely than a newer more diverse area such as Information Retrieval. In the various regions where points of different colors appear close to one another, further examination reveals that there was a high content cor- relation that lead to that proximity. Most of the time they are applications of techniques in those areas (CBR, ILP, IR, SON) for particular goals (such as modeling, manipulation, or study of particular data sets such as biological sequences).
To check that capability further, it can be seen in Fig- ure 7(b) that all except one of the intruder papers were mapped in the same group. The five sonification papers share a very close neighborhood within the general sonifica- tion group, as would be desired. The sixth, unrelated paper, joined another separate group, that seemed to gather together
papers from different subject areas. Closer examination by focussing on that particular cluster and loading the docu- ments represented there (see Figure7(c)) shows that these papers deal with various techniques (in all subject areas) de- signed to treat imaging tasks (such as retrieval, manipulation and segmentation). In fact all papers in the paper data set dealing with imaging techniques were placed in that group.
Connections amongst nodes on the graph can carry vari- ous meanings. For instance, they may reflect neighborhoods within the data set. Figure8illustrates the display of simi- larity relationships also as edges on the graph. In Figure8(a) documents are connected according to the similarity mea- sure employed for the creation of the map. Each point is con- nected to its next neighbor in the similarity matrix. It can be
(a) ProjClus map, which spreads groups on the 2D plane; color reflects 2D clus- tering
(b) Mesh view of ProjClus news map; color reflects levels of concentration of points.
Figure 6:Map by ProjClus of the reduced news set.
seen that there are a few points whose next neighbors are not within the same general group. That is to be expected, since the main groups are not at all disjunct as far as content sim- ilarity is concerned. Still, the small number of ’crossings’
outside the clusters indicates the good degree of grouping presented by the projection. New connections can be added.
For instance, each point can be connected to its two nearest neighbors in the original data space (see Figure8(b)); in this case more connections across groups appear, and they can reveal further associations apart from those already revealed by grouping and first neighbors, adding to the possibilities of exploration.
Many other relationships could similarly be mapped to edges on a map. Focussing on parts of the map shows in fur- ther detail the subjects within groups. Detail of two groups of the map (see Figure8(d)) reveals that, within the general sonification group, there are separate groups for audio pro- cessing (including retrieval), computer music, sound in user interfaces, data sonification, and sonification for the visually impaired.
Display on the 2D plane of multidimensional data can be done using a number of techniques. As well as projections, other point placement strategies can be employed. Some- times it is useful to arrange (or rearrange) a projection ac- cording to the connections amongst points. That is illustrated in Figure 9(a)and 9(b). Other important issues in docu- ment map exploration are topic extraction and visualization, which help the user find interesting groups of documents to explore further. Figure 9(c)illustrates that using a proper technique (selective generation of association rules, in this
case), further detail of topics approached by documents can be viewed.
Finally, we map the scientific paper data set using N-J trees. Since the tree reflects an organization of the similarity measure, it reveals subtleties not easily identifiable in pro- jection layouts. For instance, in Figure10(a)and10(b), just by having two different approaches to display the same tree, it is possible to locate different levels of grouping and sepa- ration of groups. Additionally, in the same pictures it can be seen that the positioning of the ‘intruder’ papers in the same neighborhood happens in two levels for those dealing with sonification. The two papers referring to the first version of the sonification system are gathered in the same branch while the others are together in another branch nearby. Fig- ure10(c)shows an N-J tree map built from another similar- ity measure, in this case Normalized Compressed Distance (NCD) [TMP07,CV05], see below for more details. Over- all NCD gives similar separation of greater groups for this same data set. However, close examination shows that not all relationships are maintained. For instance, in that map all the intruders were, according to that NCD, deemed more similar than using vector representation. This type of presen- tation gives a good reflection of how a particular similarity measure behaves with a data set.
The cases shown here illustrate the use of projection- based displays together with mining techniques (such as clustering, covariance term extraction and association rules extraction) in an integrated manner, with combined available technology to build maps for exploration of text collections.
These and other tools put together can support improvement
(a) LSP map with main topic labels, automatically extracted. (b) ‘Intruders’ papers, five on sonification and one on a segmentation technique, are displayed in green in the highlighted re- gions.
(c) File content view of papers in group that gathers techniques for imaging.
Figure 7:Map of scientific papers in four classes, CBR, ILP, IR and SON, using LSP.
in examination of larger text collections without extensive reading.
The next section starts detailing concepts and techniques related to the construction and exploration of such maps.
3. Basic Concepts
3.1. Text processing and information retrieval 3.1.1. Introduction and topics
We summarize the fundamentals of text processing and information retrieval here. The main topics to be explored are:
• Modeling for information retrieval
• Retrieval evaluation methods
• Query languages and operations
• Text and multimedia languages and properties
• Text operations
• Indexing and searching
Whereas this include non-text aspects, such as multimedia languages and properties, we leave those here to provide a more comprehensive picture of the subject.
3.1.2. Motivation
Information Retrieval (IR) has become a major factor in today’s information-centered world. Whereas in the past, only very technical people used to interact with data sources
— actually, mostly databases — today lay people used data sources on a daily basis. Every Web search, every Web-based purchase, and many other application on- and off-line, all fo- cus on data sources and data bases, whether the user is or is not aware of it.
3.1.3. Information versus data retrieval
3.1.3.1. Data retrieval : Data retrieval interacts with databases, which are sources of structured data. Databases do not usually contain broad information about a specific
“subject or topic”, but, rather, store specific data in struc- tured tables, such as catalogs, populations data, and the like.
Retrieval from a database requires some knowledge of a query language — the most prevalent variations on SQL, the Standard Query Language. Data retrieval will only succeed if the database includes specific instances of precise query terms. A slight misspelling can make the difference between hit and miss, between success and failure.
(a) Connection by 1-neighboring points according to similarity measure.
(b) Connection by 2-neighboring points according to similarity measure.
(c) Selection of part of the map
(d) Focus on the part of the Map selected in Figure8(c), with labeling by covariance
Figure 8:Maps of scientific papers with connection representing similarity in the original space.
3.1.3.2. Information Retrieval (IR) : IR, on the other hand, focuses on retrieval from unstructured of documents that are related by topic or subject. Query terms are not nec- essarily expected to be present in their exact form in a doc- ument for it to be returned in the result list. Instead, IR sys- tems attempt to interpret the contents of information items
— mostly, text documents in a collection. This interpreta- tion involves a comparison between the entered query and the documents in the collection, computation of a “differ- ence” or “distance” measure, and ranking of the returned documents according to their relevance to the user’s query.
3.1.4. Information retrieval at center of stage
Information Retrieval has been utilized by librarians and other information professionals for close to three decades.
However, with the penetration of the World-Wide Web, it is now one of the most important tools a wide variety of users utilize in their day-to-day interactions with informa- tion. The Web has become the single largest repository of, as well as the unified interface to most of the information people have to interact with. However, the Web presents a
significant obstacle to effective and efficient information har- vesting, namely, the absence of a well-defined underlying data model for Web-based information. This typically leads to information definition and structure that is frequently of low quality.
3.1.5. Basic concepts of IR
The IR process starts with a user’s need. The user’s initial task is to translate her information need into a query, spec- ified in a language that is provided and understood by the system. We have all grown accustomed to entering multi- ple daily queries into our favorite search engine. If the user’s query is entered into an IR system, she typically would spec- ify a set of “key words”, which are expected to convey the semantics of the information needed. Using a more rigid data retrieval system, a much more rigorous query expression is required, such as a regular express ion, which contains con- straints to be satisfied by objects in the answer set. In both cases, the user searches for some useful information execut- ing a retrieval task.
(a) Rearranging the position of the subgraph of Figure8(c) by using a force-directed placement strategy.
(b) Labeling by covariance of main topics of detail map.
(c) Detailed topic extraction by association rules [LPMP07].
Figure 9:Exploration of a sub-graph with rearrangement and topic extraction and visualization.
A schematic overview of the basic iterative retrieval pro- cess, the logical view of a document, and a more detailed picture of the retrieval process, can all be found in the ac- companying slides.
3.1.6. A brief history of Information Retrieval
Here, we summarize the history of Information Retrieval, covering the following topics:
1. Early developments;
2. Information Retrieval in the library;
3. the Web and digital libraries.
3.1.7. Early developments
For roughly 4000 years, people have been organizing in- formation to be able to later retrieve and use it as they need.
One of the most common examples is the familiar Table of Contents found in many books and other text documents.
After Gutenberg’s invention of the moveable printing press in 1455, which has made books available and affordable to the general population, rather than just to the wealthiest few who could afford them, the volume of information people have acquired grew beyond just a few books. This, in turn, triggered the development of specialized data structures for faster access to stored information. For example, an old pop- ular data structure for faster IR still found in many books is the index.
An index is a collection of selected words and concepts that serve as associated pointers to related documents and other information sources. Indices are at the core of any modern IR system. They provide faster access to data and they speed the task of query processing. For centuries, in- dices (aka indexes) were created manually, according to cat- egorization hierarchies. Today, libraries still use categorical hierarchy — which have usually been conceived by human subjects from the library sciences field — to classify vol- umes and documents.
Computers have lead to the automatic construction of large indexes, which has lead to another view of the retrieval problem, one that is much more related to the system itself than to the user’s needs.
We, thus, face two views of the IR problem, a Computer- centered view, and a human-centered one.
3.1.8. A computer-centered view of the IR problem As far as computer systems are concerned, the IR problem has the following components:
• build efficient indexes;
• process user queries with high performance; and
• develop ranking efficient algorithms.
All, with one goal in mind: to improve the ‘quality’ of the retrieved answer set .
(a) N-J tree based on vector rep- resentation using cosine distance;
initial position
(b) Vector based similarity tree, after repositioning the nodes.
(c) N-J tree from NCD similarities.
Figure 10:Similarity trees for the scientific papers data set.
3.1.9. A human-centered view of the IR problem From the human user’s point of view, the IR problem can be summarized as:
• Study the typical behavior of users;
• understand their main needs; and
• determine how such understanding affects the organiza- tion and operation of retrieval systems.
From this point of view, a keyword-based query process- ing might be seen as unlikely to yield a good solution in the long run, leading to the quest for other, more promising al- ternatives.
3.1.10. Information Retrieval in the library
Libraries were among the first to adopt IR systems. Those systems were developed initially by academic institutions but later by commercial vendors started offering more re- fined tools and systems. We observe first, second and third generation features in those systems.
3.1.10.1. First generation . The first generation of library IR systems was basically an automation of previous tech- nologies, mostly card catalogs. They allowed searches based on author name and title only.
3.1.10.2. Second generation . Here, increased search functionality added search by subject headings, keywords, and some more complex query facilities.
3.1.10.3. Third generation . More recently deployed, these systems focus on improved graphical interfaces, elec-
tronic forms, hypertext features, and open system architec- tures.
3.1.11. The Web and digital libraries
The Web has become the primary source of all kinds of information. With the spiral growth of the amounts of infor- mation available on the Web, search engines have become among the most important Web tools — and assets. Today’s search engines continue to use indexes that are very simi- lar to those used by librarians a century ago. So what has changed?
We observe three dramatic and fundamental changes:
First, access to various information sources has become a lot cheaper, reaching wider audiences than ever possible be- fore. Second, advances in digital communications have lead to easier, cheaper, and thus greater access to networks, al- lowing distant, quick access to vast amounts of information throughout the world. And, third, the freedom to post what- ever one would like to, has caused the Web to become ever more popular. For the first time in history, most people have an almost free access to a large publishing medium, mak- ing the Web (and modern digital libraries) a highly inter- active medium, facilitating the exchange of messages, pho- tos, documents, software, video; making interactive chatting convenient and at low (or no) cost, cause overall a funda- mental shift in the current communication and information flow paradigm.
3.1.12. The future: three main questions
We are still not through; we do not have all the knowledge, all the answers. The following still need to be addressed:
First, despite the high interactivity, people still find it dif- ficult — impossible at times — to retrieve the information that would be relevant to their information needs, leading to the question “which techniques have the potential to yield retrieval results of higher quality?”
Second, ever increasing demand for access is causing the need for quicker response to be more and more a pressing factor. We therefore need to find out which techniques can support faster indexes and smaller query response times?
And,
Third, the quality of the retrieval task is greatly affected by the user’s interaction with system, so “how can better un- derstanding of the user’s behavior affect the design and de- ployment of new information retrieval strategies? ” 3.1.13. Practical issues
The following issues have also great influence on the en- tire IR problem and solution set.
• Electronic commerce, a major trend on Web;
• security;
• privacy;
• intellectual property rights and publisher responsibilities;
• internationalization; and how to deal with multimedia (images, sound, video, etc.)
3.1.14. Overview of the rest of IR topics
There are many components to IR. We list here the major topics; these remain, however, outside the scope (and time and space limitations) of this tutorial. The reader is encour- age to explore these further. A very comprehensive treatment
— and, thus, a good place to start — would be [BYRN99].
1. IR modeling;
2. retrieval evaluation;
3. query languages and operations;
4. text and multimedia languages and properties;
5. text operations;
6. indexing and searching;
7. parallel and distributed IR;
8. user interfaces and visualization; and 9. searching the Web
3.1.15. Text preprocessing for IR, mining, and visualization
In order to reduce the space dimension and to allow re- fining the text model, text is processed prior to extraction of the vector representation. This process typically involves three steps: (i) removing stopwords, i.e., non-informative words, such as articles, prepositions and such, plus any words known to lack relevance to the context; (ii) stemming, which reduces words to their radicals (e.g., ‘motivation’,
‘motivate’, ‘motivating’, would be all reduced to ‘motivat’);
and (iii) frequency counting, so as to remove terms that oc- cur too sparsely or too often and hence have little differential
capability. Setting suitable frequency thresholds for discard- ing terms typically demands user knowledge and interaction.
Corpus vector representation enables handling the map gen- eration problem as one of mapping objects defined in a high dimensional space into a 2D (or 3D) visual representation space. Clustering and projections are typical approaches to handle the problem. Determining vector similarity — a nec- essary step for both processes — requires defining a metric in the high dimensional vector space that allows computing distances between vectors. The distance between any pair of document vectors is a measure of their proximity, or se- mantic similarity. Distances for all pairs can be stored in a triangular distance matrix of dimensionsnxn, wherenis the number of documents.
An alternative to the vector space model has been pro- posed based on Kolmogorov complexity approximation [TMP07], called Normalized Compressed Distance (NCD).
NCD computes a distance between a pair of documents straight from their textual content, rather than from the inter- mediate vector representation. The Kolmogorov complexity of a stringx, denotedK(x), is the size of a description of xproduced by an optimal specification method. The condi- tional Kolmogorov complexityK(x|y)ofxwith respect to another stringycan also be defined as the amount of infor- mation thatxdoes not have abouty.
An information distance between two given strings can be specified based on their conditional Kolmogorov com- plexity. Though the problem of evaluating a string’s Kol- mogorov complexity is non-computable, a solution can be approximated through compression; this yields the basis for the definition of NCD. We include a technical report in Ap- pendixAon a formulation of NCD for text collections at the end of the Tutorial notes.
Other alternatives for pairwise similarity calculation, par- ticularly those based on information theory, have also being studied recently [AF03].
3.2. Data and text mining
3.2.1. The knowledge discovery process
Data Exploration is the process of searching and analyz- ing databases to discover implicit but potentially useful in- formation, mostly for the purpose of supporting the decision making process. The goals of data exploration are:
• Convey information;
• discover new knowledge; and
• identify structure, patterns, anomalies, trends, and rela- tionships.
Major Data Mining Tasks, such as, summarization, asso- ciation, classification, prediction, clustering, and time-series analysis use major techniques, including Linear Regression Trees, Non-Linear Regression, MARS, Naive Bayes, K- Means and K-Median, Neural Networks, Association Rules,
Decision Trees, Principal Component Analysis, Support Vector Machines, and Genetic Algorithms. These, in turn, are based on statistical tools, such as, Missing Value Impu- tation, Normalization techniques, Error & Variational Anal- ysis, and Confidence Estimates, to list just a few.
3.3. Projection techniques
Data sources have increased substantially both in size and complexity, but extracting useful information from them is still a challenge. One measure of data complexity is the num- ber of attributes associated with each instance of data. Con- sider, for example, data from a demographic census: a data instance records attributes such as age, sex, education, oc- cupation, income, and so forth. Considering each data at- tribute as a data dimension, if we havemsuch attributes each data instance can be interpreted as anm-dimensional vector placed in anm-dimensional definition space.
In traditional statistical analysis, data instances with three or more dimensions are known as multivariate or hyper- variate data. In Information Visualization such data is usu- ally referred to as multidimensional. Conventional methods, such as scatter plots or bar charts, normally employed to as- sist data interpretation, are not directly applicable to mul- tidimensional data. Moreover, identification of patterns and models grows more difficult as dimensionality increases, and lack of proper representations can severely impair interpre- tation.
A common way to handle dimensionality is to reduce the number of dimensions, so that strategies that are known to work well with low-dimensional data can be applied.Multi- dimensional Projectiontechniques are one example of such a strategy. A multidimensional projection technique typi- cally maps the data into a p-dimensional space with p= {1,2,3}, while retaining, in the target space, some infor- mation about distance relationships among the data items in their original definition space. In this way, a graphical representation can be created to take advantage of the hu- man visual ability to recognize structures or patterns based on similarity, such as clusters of elements.
Formally, let X = {x1,x2, . . . ,xn} be a set of m- dimensional data, with δ(xi,xj) a dissimilarity (distance) measure between twom-dimensional data instances, and let Y ={y1,y2, . . . ,yn}be a set of points in a p-dimensional target space, with p={1,2,3}andd(yi,yj)a (Euclidean) distance between two points of the target space. A multi- dimensional projection technique can be described as a bi- jective function f:X→Y that seeks to make|δ(xi,xj)− d(f(xi),f(xj))|as close to zero as possible,∀xi,xj∈X.
3.4. Visual representations: graphs, surfaces, volumes, triangulations
Many visual representations have being utilized to repre- sent data. Among the most commonly used ones one finds
maps, graphs and networks, surface- and volume representa- tions. The use of maps is so widespread that it is not neces- sary to discuss them, as every one knows their utilization for geographical applications. The concept has been used also to represent data in which some form of “proximity”, whether in physical space or as determined by other mea- sures, is needed to be presented.
Whereas maps are particularly useful to represent physi- cal proximity or its conceptual equivalent, graphs and net- works have been primarily utilized where the relationships among participants are to be emphasized. In these types of situations, measurable physical is not important; what is im- portant is how nodes are related to each other. For example, the communication patterns among members of an organi- zation can be nicely visualized by a graph, in which a link between two nodes represents the communication between them. A directed link can show who initiated the commu- nication. A network of such communications can quickly show influential figures, such as “authorities” (those who are considered knowledgeable about a variety of subjects) and
“hubs” (those are know who to seek when a question needs to be answered). The mapping of authorities and hubs have been the cornerstones of the original “page rank”, the algo- rithm that provided the initial implementation of Google’s search engine. In addition, graphs and networks are com- mon representations for processes in which transitions occur between states, under certain conditions.
Spatial data — and, at times, data that can be mapped to a spatial representation — have been often represented by vol- umes, or their surfaces. Volume representations preserve the values of data throughout the volume, whereas surface rep- resentations discard internal values and only preserve values of the external surfaces selected to be displayed.
The two most crucial aspects of either are the data struc- tures utilized to represent the data, and rendering techniques to display the data on a screen. Multiple data structures have been proposed for storing data of three (or higher) dimen- sions. Among them, in addition to the straightforward carte- sian coordinates, are various hierarchical representations, in- cluding trees of various type. The discussion of such data structures are beyond the scope of this tutorial; the reader is referred to the standard data structure literature for details.
Surface representations have included various triangula- tion techniques, as well as cuberille approaches. Triangula- tion techniques approximate the surface with a mesh of tri- angles that attempt to follow the surface as closely as possi- ble. Cuberille techniques try to approximate the surface with a mesh of squares rather than rectangles. Some use graph concepts to implement optimal coverage.
The display of three- or higher-dimensional data on a two- dimensional display requires projection first. Surface and volume rendering techniques have been covered extensively in the computer graphics and visualization literature. The
reader is referred to this literature to familiarize her/himself with the technical details entailed.
The accompanying slides provide some examples of vari- ous of these techniques.
4. From Visualization to Visual Text Mining
4.1. Visualization techniques for multidimensional data
“One picture is worth a thousand words” is an old cliche, but it is based in the fundamentals of human perception and processing of information. Sixty percent of all input for deci- sion making comes into the brain of a normal-visioned per- son from the visual system. It has been demonstrated that, in the event of contradicting information between the visual input and that of another sense (e.g., tactile), the visual stim- ulus with “win” over the other one, even if the other sense is providing correct information and the visual input is erro- neous!
Graphics and Visualization are meant to help the user see (understand), remember, compute, analyze, discover, enjoy, and much much more.
Multiple goals are served by different types of visualiza- tion tasks, and the appropriate techniques. The most com- mon visualizations used are for the purpose of presentation of known information. Here, facts to be presented are known (though they may not represent the truth), the visualization process is to choose and tune the appropriate visualization technique, and the result is a high-quality visualization of the data and analysis to present facts (often without the au- thor’s presence). Confirmatory analysis visualization starts with some hypotheses about the data, proceeds through a goal-oriented examination of the hypotheses, and results in a visualization of the data to confirm, accept, or reject the hypotheses. Finally, exploratory analysis starts with no hy- potheses about the data, proceeds with an interactive, usually undirected search for structures, trends, patterns or anoma- lies, and yields a visualization of the data to lead to some hypotheses about the data.
All of these utilize a wide variety of technique, some are pure visualizations, some are integrated with analysis, all utilizing to one degree or another an assortment of interac- tion tools, which help the user control the process and inter- act with it to yield optimal results.
Some pure visualization techniques include 2D and 3D Scatterplots, Matrix of Scatterplots, Statistical Charts, Line and Multi-line Graphs, Parallel Coordinates, Circle Seg- ments, Polar Charts, Survey Plots, Heatmaps, Height Maps, Iconographic Displays, RadViz, PolyViz, and many more.
Among those that are integrated with analysis one finds Projection Pursuit, Dimensional Stacking, Sammon Plots, Multi-Dimensional Scaling, Principal Component Analysis (PCA) and Principal Curves, and Self Organizing Maps.
Interactions include Selection, Probing, Querying, Grand Tours, and Non-linear Zooms.
It would require hundreds (if not thousands) of pages and hours to discuss all of these in even a moderate amount of details, so we will limit our discussions here to only the very basic concepts, and to a few methods that are particularly relevant to our main focus, Visual Text Mining.
4.1.1. The visualization problem
The visualization problem can be summarized as: Massive amounts of data from various sources, including databases, simulations, sensors, decision systems, and more; limited screen space; and little knowledge about the human percep- tual system and the process of information transfer.
Visualization is a method of computing. It trans- forms the symbolic into the geometric, enabling researchers to observe their simulations and com- putations. Visualization offers a method for seeing the unseen. [MDE87]
Visualization now includes other data representations, such as auditory, haptic and tactile, potentially more in the future.
4.1.2. Classifications of visualization techniques We are primarily interested in those visualization tech- niques that are suitable for multi-dimensional data sets, as visual text mining deals in high dimensional data.
Many classifications have been proposed for multi- dimensional data visualization techniques.
4.1.2.1. Point-based Point-based techniques map the di- mensions of a data record to the attributes of a point on the display, including its size, shape, color, motion, and — oc- casionally — sound at the point. The most recognized point- based technique is the scatterplot. Scatterplot-based visual- ization techniques include OmniViz Galaxy (OmniViz, Inc.), Temple MVV Graphics [MGTS90,MTS91], [CC92], and Splat Visualizer and Scatter Visualizer (from Purple Insight).
Because the traditional scatterplot is essentially a two- dimensional data technique, various approaches have been propose to extend scatterplots to higher dimensions. These approaches include layout extensions, either via matrices or radially, and shape extension, utilizing icons and glyphs.
Scatterplot Matrices provide an xy layout of k- dimensional data in a total ofk(k−1)/2 scatterplots, each showing two of thekdimensions against each other. Clearly, askgrows, the number of scatterplots grow too. E.g., for k=100 the number of scatterplot will be 4,950. The con- cept has even been extended to scatterplot cubes, following the same idea.
RadVis [Hof99] is a point-based technique that organizes
data dimensions around a circle. Data dimension values function like springs, exerting forces over a data record in various directions. The result is a collection of point clouds, clustered based on dimension values.
Hyperslice is a matrix ofk2slices through ak-dimensional data set. Slices can be determined interactively [vWvL93].
4.1.2.2. Line-based Line-based techniques include bar charts, line graphs, parallel coordinates.
4.1.2.3. Hierarchical, graphs, trees These methods uti- lize hierarchical structure, networks, graphs, and trees to present data that is inherently suitable for such structures.
Among the best known technique one finds Eick’s SeeNet [BEW95], which utilizes spoke, helix, and sphere layouts to plot networks of nodes and their relationships; SeeNet 3D [Eic96], which is an extension of SeeNet to three di- mensional rendering; SeeNet ArchView, a member of the SeeNet family, which uses arches as the visual primitives;
MBone [MB94]; and DBMiner [HCC∗97].
4.1.2.4. Iconographic displays Iconographic displays are perceptually driven displays. They generalize the notion of a pixel to that of an icon (or a glyph). By doing so, they in- crease the number of parameters “displayed”. Icons’ visual, color, and auditory attributes are mapped to data parame- ters, thus making the icon “data-driven”. The icons are dis- played en-masse and thus harness the perceptual powers of the human early visual system. When presented, icons are available for user interaction. These methods have been suc- cessfully used in data fusion of multiple parameters / dimen- sions.
Among the best known iconographics/glyph systems are Chernoff faces [Che73], Andrews’ graphs, Stars or circular plots (Ward et al., various), Stick-figure icons [PG88], Color and Texture [H.91], Beddows’ Embedded blocks, Smith’s Sound (1989-90), and numerous other examples dated dur- ing the nineties.
The accompanying slides provide many examples of these various visualizations.
4.2. Visualization techniques and systems for handling document collections
Several different techniques for visualization of textual results from Web and other searches have been proposed ( [ABY03,LA00,SCL∗99,BY96]). While these techniques are capable of displaying large document bodies, they tend to make the location of specific relevant reading material diffi- cult. We focus here on complementary tools to support map- ping of documents in a way that helps locate neighboring similarities between individual text documents and groups of documents. We therefore assume a pre-filtering task that re- duces the universe of targeted documents to a few hundreds
(maybe thousands) in a few areas of interest (not necessarily pre-determined).
Many techniques for text document analysis and visual- ization exist. They usually search for a representation of the content of an individual document or text fragment (e.g., [MWBF98], [RES98], [RES99]), of document collections (e.g., [LA00], [BCG∗99], [Wei01]), or of themes approached in text documents (e.g., [HHWN02], [Wis99], [WTP∗95]) in order to meet these goals.
The process of text document analysis and visualization usually involves three phases: (i) pre-processing; (ii) dimen- sion reduction; and (iii) attribute mapping to (at least) a vi- sual representation and presentation.
The pre-processing phase takes as input the document col- lection to be analyzed, and produces an intermediate form, usually based on the vector space model [Sal91] whereby documents are represented as points in a vector space. In this representation each document is represented by a vector whose dimensions are terms (n-grams). The vector coordi- nates are the weights of the terms based on their frequency of occurrence in the document. Typically, dimensions reach the thousands even for small to medium-sized databases. Trans- formation of a document collection into a vector space is preceded by elimination of non-influential words (such as stopwords), reduction of words to their radicals (stemming), and some sort of frequency counting (various exist).
The second phase, dimension reduction, typically in- volves removal of words that are either too frequent or too rare in the collection, and clustering dimensions to generate new ’combined’ attributes.
The most common way to extract structure from a text document collection is by applying some dimensional reduc- tion technique over the resulting vector representation. Sys- tems that implement such approaches include those based on Multi-dimensional Scaling (MDS), Principal Component Analysis (PCA) or Latent Semantic Analysis (LSA), which work with statistical measures for subspace reduction, as well as Self-Organizing Maps (SOM), which employ neu- ral computation ( [Wei01], [Wis99], [BCG∗99], [KHLK98], [WTP∗95]).
In the third phase, such techniques can be used to reduce the dimension to two and then plot the data onto 2D space.
However, multidimensional reduction techniques can cause some difficulties, such as [HWR03]: high informa- tion loss when applied directly to two dimensions (for dis- play); reduction in input dimensions do not seem to affect greatly the outcome; and there is an inherent discretiza- tion problem associated with techniques such as SOM, by which individual documents in groups are not distinguish- able. For our goal here, dimension reduction can pose an additional problem: when used to display the results in 2D, the mappings to subspaces may define groups of ‘similar
documents’, but it is not possible to locally relate neigh- boring documents. Previously [LMMP06], LSA has been successfully applied to generate document maps with high local content relationships, but the high computational cost remains a problem as well as the handling of vector trans- formations. Minghim et al. [MPL06] have proposed faster mapping approaches that posses the ability to associate doc- uments by similarity, based on dimensional reduction by fast projections (such as Fastmap [FL95] and Nearest Neighbor Projection (NNP) [TMN03]. The gain in processing time is attained by using these projection techniques, which provide an initial point placement that is prone to speed up force- based improvement schemes. Those too are based on the vector representations. Those projections provide an initial placement on the visualization plane, but also require fur- ther use of point placement strategies [TMN03] in order to recover information lost during projections.
Point placement strategies, including force-based point placement, have been used before to generate document dis- plays [Cha96,AKS∗02a,GSAK04]. They can avoid, partially or completely, the extensive calculations needed in dimen- sional reduction techniques by starting with a semi-random point placement and re-adjusting their position based on at- traction by similarity. However, our experience trying to find relations in their application to text content has been that the precision of the placement is limited as applied to our goals for these maps.
Other major techniques to visualize inter-document simi- larities are document networks ( [FFW91], [TC89]), spring embedings ( [CC92], [SA98]), and document clustering ( [AOL93], [HP96], [LC96], [ZE98]). Of those, only docu- ment clustering is sufficiently fast and produces intuitive re- sults to warrant consideration ( [ZE00], [HP96], [Zam98], [ZK05], [CHR03], [RK04]). It is often employed in com- bination with dimensional reduction and SOM ( [IR01], [Wei01], [LA00], [Koh97], [Lin91], [Mer97]).
Document clustering techniques provide a way to relate documents to each other, as well as to a set of topics. A doc- ument is assigned to a cluster if it contains the contents of the cluster’s topic, and is generally similar in contents to that topic and to the other documents in the cluster. Documents in the same cluster are expected to be more similar to each other than to documents in other clusters [ZK05]. Since doc- uments may contain multiple topics, some approaches allow a document to be assigned to more than one cluster, poten- tially resulting in overlapping clusters. Usually, intra-cluster relationships are not provided as part of the results. However, they are very useful to provide general overviews of large collections, although they usually have to be interpreted by users with certain level of expertise.
Some approaches completely avoid the high dimension- ality problem by simply ordering the most used terms in the document and employing the first N terms [RES98], [RES99]. These strategies work well for single document
representation and for association of a limited number of documents, and even for some degree of clustering. How- ever, they also lack a way to clearly relate different doc- uments and display levels of similarity. Other approaches (such as the one by Carey et al. [CHR03]) combine a num- ber of different strategies to allow various views of the same document set, potentially improving focusing and analysis tasks.
Maps resulting from the techniques mentioned above are meant to present various properties of documents, includ- ing content similarity, co-citation, term co-occurrence and attribute-value matchings, such as author or publication date.
For a detailed description of available techniques for text document mapping and its applications, systems and chal- lenges, we refer the reader to Borner et al. [BCB03].
Few systems for general multi-dimensional visualization add projections in their tool suite. TheHybrid Information Visualization Environment(HIVE) [RC03], however, imple- ments clustering and layout algorithms to enable exploratory visualization of multidimensional data. An intuitive config- uration process by the user is its main strength. Neverthe- less, the layout algorithms implemented are approximations of the original FDP model, which fail to generate visual- izations that group (separate) the data items based on their similarity (dissimilarity).
Projections have nevertheless been intensively explored in the context of visualizing unstructured data, particu- larly text document collections. Two typical layout ap- proaches in this context areInfoSky[AKS∗02b,GSAK04]
andGalaxies[WTP∗95,Wis99], the later one incorporated into the IN-SPIRET M system (see http://in-spire.pnl.gov/).
Both systems display documents as points in 2D space following a similarity-based layout. Clustering and projec- tions are employed to position groups of documents and to reduce the number of distance calculations. IN-SPIRE builds that into a dimensional reduction approach, similar to Single Value Decomposition or Latent Semantic Index- ing [PRTV98,DFK∗04]. InfoSky also creates a hierarchical display by embedding structure into the similarity relation- ships, so a user can focus in and out analogously to manipu- lating a telescope.
Within our research group, we have also devel- oped — and made freely available — a tool suite for multidimensional visualization based on pro- jections, called PEx (the Projection Explorer — http://infoserver.lcad.icmc.usp.br/infovis2/PEx). PEx is an experimental platform written in Java that generates vi- sual maps of multi-dimensional data sets based on classical and novel projection and point placement strategies, with particular focus on text processing.
One way to measure the similarity between text doc- uments in our case is by calculating cosine the dis- tance [MPL06] between the vector representations. The other is using the “Kolmogorov distance” [TMP07]. This