Visual Mining of Text Collections

(1)

Visual Mining of Text Collections

Rosane Minghim¹and Haim Levkowitz²

1ICMC - Instituto de Ciências Matemáticas e de Computação University of São Paulo

Brazil

2Institute for Visualization and Perception Research University of Massachusetts Lowell

USA

Abstract

What happens if you have to examine and reach conclusions from a considerable number of textual documents? If you are faced with this task or with developing tools for completing this task, this tutorial is for you.

Examining text is crucial for many different types of applications. Even applications that rely on additional types of data (such as images, signals, simulations) usually have complementary or alternative text based output. The challenge of interpreting content and extracting useful information from a document collection is the target of efforts in various areas of computer science. Fields such as Text Mining try to extract knowledge automatically or semi-automatically from collected textual information; however, due to the multi-dimensional characteristics of text it is paramount to couple these algorithms with meaningful visual representations in order to improve performance and allow the discovery of relevant information within a text data set.

Since it is not feasible to go through the entire documents’ content in detail due to data sizes and time constraints, Visual Text Mining (VTM) — the combination of Text Mining and Visualization — is focused on developing tools to help users extract meaning from text collections without extensive reading.

In this tutorial we introduce the necessary background and the graphical techniques involved in Visual Text Mining of document collections.

(2)

Contents

1 Overview, motivation, goals

2 Two test cases: Visual maps of flash news and scientific papers 2.1 Visual mapping of news flashes

2.2 Visual mapping of scientific papers 3 Basic Concepts

3.1 Text processing and information retrieval 3.2 Data and text mining

3.3 Projection techniques

3.4 Visual representations: graphs, surfaces, volumes, triangulations 4 From Visualization to Visual Text Mining

4.1 Visualization techniques for multidimensional data

4.2 Visualization techniques and systems for handling document collections 4.3 Visual text mining

5 Projection Based Visualization and its Application to Visual Text Mining 5.1 Projection techniques and point placement strategies

5.2 Mapping text collections via projections and point placement 5.3 Topic extraction and visualization

5.4 Further Examples

6 Conclusions, Current Challenges, Future Trends 7 Acknowledgments

Appendices

A Technical Report: VISUAL MAPPING OF TEXT COLLECTIONS USING AN APPROXIMATION OF THE KOLMOGOROV COMPLEXITY A.1 Introduction

A.2 Previous Work

A.3 Projection techniques for text visualization

A.4 Kolmogorov Complexity as a means to define distance between texts A.5 Results

A.6 Conclusions

B Technical Report: CONTENT-BASED DOCUMENT MAPS USING FAST PROJECTIONS AND TOPIC DETECTION B.1 Multidimensional Projections for Mapping Collections of Documents

B.2 Exploring Content-based Document Maps B.3 Further Remarks

References

(3)

1. Overview, motivation, goals

Text collections are generally considered data sets with a high number of dimensions, where a dimension is a term or expression (an n-gram) of importance within the domain of the document collection. In a conventional vector representation of a document data set, the number of dimensions for a few hundred documents of moderate size can reach a few thousand.

Regardless of these difficulties and their implications for the analysis of text collections, there is a growing number of applications that can benefit from tools to effectively support analysis of document sets and from that analysis, allow the user to reach conclusions and to make decisions. The range of potential applications vary from health studies and diag- nosis based on medical records to investigation of unlaw- ful activities. The nature of the textual source is also quite varied. Documents can be snippets from web searches, RSS feeds of various kinds, scientific papers, reports, newspapers articles, automatically generated health test reports, patents, and so on. Every day new applications start to rely on text solely or combined with other data sources (such as table data and images). There is, therefore a compelling need for tools that combine user driven and automatically extracted displays to support the analysis of text collections and relationships amongst texts.

Finding ways to support analysts in quickly and mean- ingfully extracting meaning from text collections is also a strategic issue for the area of Visual Analytics, defined as

"the science of analytical reasoning facilitated by interactive visual interfaces" (from http://nvac.pnl.gov/). The fields of Text Mining and Visual Text Mining aim at combining visualization and mining approaches to achieve solutions for the exploration of text collections.

Handling high-dimensional data poses many problems to researchers and data analysts in general. Visualization techniques seek to bridge the gap between users’ visual perception and reasoning capabilities and analytical techniques.

Nonetheless, finding intuitive ways to visualize large high- dimensional data sets is a difficult problem. Traditional multidimensional visualizations, such as scatter plot matrices, parallel coordinate plots or pixel-based techniques [OL03]

operate by mapping each data attribute into a corresponding visual axis or other representation. As such, they can only handle well a very limited number of attributes and, therefore, are not directly applicable to complex data consisting of many attributes (dimensions) [MPL06].

A widely explored alternative to handle this type of data is to reduce dimensionality prior to visualization, e.g., by projecting the high-dimensional data points into a lower dimensional space (2D, 3D) that is more amenable to user interpretation. Various techniques exist to achieve such a mapping, mostly based on dimensionality reduction, dimensionality clustering, or point placement strategies. The resulting

projection can be displayed through a suitable visual representation — e.g., points placed on a plane, graphs, surfaces or volumes — that can be navigated and explored by a data analyst. Projections may be created based on different cri- teria, but typically they strive to preserve distance relationships amongst data points as defined in the original space.

Information loss is inevitable and the extent to which this distance-preserving goal is achieved depends on the precision of the projection.

Figure1depicts the overall process in generating interactive visual representations of document data sets via projections (or, in fact, other point placement strategies too). The user can feed the system with (1) a document collection, (2) a structured data table, or (3) a distance matrix. In the first case, documents are either converted into a vector representation or compared using a similarity measure that handles text against text. From vector representations, distance calculations can also be performed. From distance relationships the 2D projection is generated and the user can interact with its visual representation to gain insight to the data set.

This tutorial treats the problem of mapping and exploring, through visual representations, textual data sets and results of information extraction from textual data sets of various different natures. We describe the basic principles for generating such visualizations and make an effort to convey the benefits of achieving a solid graphical view of automatically extracted relationships. The problem of facilitating text analysis is by far an open issue. We hope to identify, suggest and explain the current technologies available and the challenges ahead for professionals dedicated to both using and developing techniques for that purpose. In this tutorial, we offer an overview of the existing techniques underlying techniques to allow the construction of visual document maps from unstructured data sets, that is, without relying on previously extracted meta data. The resulting map should offer insight into the contents of the document data set to support examination and relief the user from extensive reading otherwise necessary.

We begin in the next Section by illustrating the creation and exploration of visual maps of example text collections of very different natures in two quite diverse areas of application, by use of projection techniques.

Following that we describe, in Section 3, some of the most important basic concepts and techniques involved in this area.

Section4reviews the techniques published in the literature to achieve some of the goals stated here.

Projection based visualizations and their use for visualization of text collections are detailed in Section5. Finally, we discuss other underlying mining issues that go together with text analysis, such as Topic Extraction.

(4)

Figure 1:Generating projection maps of multidimensional data sets.

2. Two test cases: Visual maps of flash news and scientific papers

Visualizations of text documents and their meta-data can take many forms, whether or not they are coupled with data mining techniques. Since our focus is on non-structured data, here we illustrate the mapping based on content of two document collections bearing very different characteristics.

The goal would be, through an analysis process involving such maps, to be able to extract and map features of the data set in a way that can help locate patterns. In each applications, however, what needs to be located is different. In the first case, the mapping of news flashes, the target is to have an overview of the main events that have been reported on Web sites of some relevant news agencies. The second case illustrated here, the mapping of scientific papers, highlights the possibilities of identifying general subjects as well as specific issues and finding relevant material to explore further in academic and scientific applications.

Visually, the resulting representation used here is a graph, whose nodes represent points of the data set (individual textual documents) and whose edges represent some type of relationship between the connected nodes. The graph displays and interactions presented in this Section were generated using PEx, the Projection Explorer, a freely available tool for multi-dimensional visualization via projections and point placement strategies with text processing capabilities (see http://infoserver.lcad.icmc.usp.br/infovis2/PEx).

An alternative view of the same text maps is the landscape or surface view, where vertices are points and meshes codify additional attributes by mapping scalar fields to geometric or visual attributes (height, color, glyphs, etc.). Surface views in the following text are generated using the Super Spider tool, a recent evolution of its precursor, the Spider Cursor [MLN^∗05], that can be used to explore landscape plots.

The visual representation employed here, whose construc-

tion will be described along the text and slides of this Tuto- rial, can also be the result of various other tools for multidimensional data projection (see Section4.2, except, maybe, for particularly visual attribute mappins and for the flying quadric in Figure5.

2.1. Visual mapping of news flashes

The first data set we shall explore was built by gathering RSS feeds of news flashes from the WWW sites of four news agencies: Associated Press, BBC, CNN and Reuters. They represent all the news made available as RSS during approximately 24 hours in April 2006. After some elimination of duplicates, the gathering process resulted in 2,684 news files.

For the visual mapping, we have used title and content only, with the goal that similar news (supposedly news treating the same subjects) are mapped to the same neighborhood.

For the exploration after mapping, the full file can be used, which contains also the date and the source of the news (the name of the news agency).

The size of the text in an RSS feed (which, for many applications, synthesizes a larger file) is usually quite small, typically 25 to 80 words only. This test case bears resem- blance to many applications available today that wish to analyze RSS feeds of all sorts of information (for instance, discussion groups, support groups, article repositories, patent repositories, dictionaries, libraries, and thousands of other such repositories).

Figure2shows pictures of two different maps from this data set. In Figure2(a), which employs a multi-dimensional projection technique called Least Square Projection, or LSP [PNML07,PNML06], the region labeled A is a group of news articles on the Vioxx dispute. Region B centralizes all news on an immigration bill undergoing discussion at the US senate. Group C concentrates the news on a trial related to the Enron case (a particularly important deposition was

(5)

about to take place). By coloring nodes according to the source of the news, it is also possible to identify the degree of importance a particular news agency gave to a subject by es- timating the number of points with its color. Conversely, the same observation supports identification of the news agencies that dedicated a lot of — or just a little — effort to report a particular event.

Figure2(b)shows the same map as Figure2(a), slightly zoomed in, now presenting a set of labels that support identification of the subject mostly tackled by a particular user- selected group of documents. The set of labels in that picture, calculated by a term covariance approach (see Sec- tion5.3.1), identify some of the central events of those two days.

Interpretation of groups in this type of projection is very similar to interpretation of clusters, that is, highly similar data is identified by proximity in the 2D display. In the very middle of the display remain the groups that have been ’un- resolved’ by the current context, which means that according to the strategy employed for similarity in this domain, it was not possible to distinguish them from the others. Coloring in Figure2(b)was carried out using a hierarchical clustering method in 2D (in this case k-means with average linkage weighting). It helps distinguish regions of higher density of points (in the ‘blue’ end of the spectrum) from regions of lower densities (moving towards red as regions become more sparse).

Figure2(c)shows the visual result of another map of the same data set, built using a new technique for point placement of this type of similarity-based data, called Neighbor- joining Tree (or N-J tree) [CPMT07]. An N-J tree is a type of similarity tree visualization that employs the same principles of phylogeny tree reconstruction [SN87] to reflect the organization of data points (in this case documents) according to their similarity. The interpretation of that type of visualization is slightly different than straight projections. In an N-J tree, very similar (according to the similarity measure employed) data points are gathered in the same and nearby branches of the tree. Proximity on the 2D plane itself does not imply similarity. Interpretation of similarity trees needs, therefore, the branching information to be effective. Branches guide selection and focus also; the neighborhood of a point is extracted searching up and down the tree.

Groups of documents with content that is consistent (that is, that are best resolved by the similarity measure employed) are placed at the outer branches. Figure2(c)uses covariance labels to illustrate the placement of some of the main news in the data set.

Closer examination of the flash news visualizations reveals that many of them are just headlines followed by a short text of the sort ‘read full story for details’. Figure3(a) highlights, in blue, the points where that occurs. Removing those can help unclutter the view, which was done by build- ing a new map after removing those blue points (see Fig-

ure3(b)). Similarly, other searches on the map can help high- light, select or eliminate specific files and groups of files.

The news data set, after the reduction illustrated in Fig- ure3, has 2450 news items remaining. The similarity tree by neighbor-joining of that reduced data set is presented in Figure4(a). That picture also shows a selection of a branch for further examination. The region is zoomed in on in Fig- ure4(b), and further levels of topic extraction can be ob- served. We can see in the central part of that picture that the similarity calculations gathered news on various linked Palestinian issues in neighboring branches. These issues varied from formation of the Hamas government to Israeli strikes in Gaza to halting of financial aid from Europe and the United States. The lighter label in Figure4(b)is the title of one of the news files in that map.

A surface view of a projection-based document map can be built by creating a mesh from the projected points and then using scalar fields attributed to vertices to reflect properties. The result of such a process is illustrated in Figure5.

Various attributes can be mapped to scalar fields, from search results (for instance the number of appearances of a particular expression) to the results of clustering, classification, and categorization of data points. In Figures5(a)and5(b)color reflects a 2D clustering (also by k-means) of the points in the projection, employed to help locate pockets of closely projected documents. Figure5(c)illustrates the capabilities of simultaneous presentation of multiple scalar fields using visual and geometrical attributes. In that picture, height represents the count of the number of times the word ‘Bush’

appears in the document (higher points are higher counts, naturally). Each of the news sources (four in all) is reflected by a different color. That same information is redundantly reflected on the color of a ‘flying’ super-quadric that changes color as the user browses through the map. The visual and geometric mappings of the super-quadric can reflect properties of the currently explored point (or focus point). In the case of Figure5(c), only its color changes, mapping the news agency that published the document in the focus point. Label by title clarifies what the news is about.

Finally, another type of projection, called Proj- Clus [PM06] is used to map the same news data set explored thus far (see Figure6). Figure6(a)shows the graph view with 2D k-means clustering for grouping location and Figure6(b)shows the mesh view of the same map. In Figure6(b) color is hierarchical clustering, that is, darker blue regions are regions of largest concentrations of points, down to red for sparser regions. ProjClus tends to spread the clusters around more, which is confirmed by larger regions with the same top density. Although that property sometimes confuses the boundaries of clusters, ProjClus still bears good ’proximity by similarity’ precision and can help estimate the local density of a region since it diminishes cluttering compared to LSP.

Both projections thus far employed in the illustrations are

(6)

(a) LSP map. Colors represent different news sources (4 in all).

(b) Labeling of regions of map in Figure2(a)for general topic identification.

(c) Tree-like map visualization by neighbor-joining with labels.

Figure 2:Document maps of 2,684 news flashes collected for approximately 24 hours in April 2006.

reasonably fast, leading to an environment were simultaneous examination and coordination of various maps is feasible. Similarity trees have higher computational cost but it is still manageable for data sets whose size is in the hundreds of points, even in the thousands, if similarity is pre-calculated and stored. Neighbor-joining algorithms are typicallyO(n³) orO(n²)at best.

2.2. Visual mapping of scientific papers

The second data set to be illustrated here is composed of 676 papers in four areas of knowledge: Case-based Rea-

soning (CBR), Inductive Logic Programming (ILP), Infor- mation Retrieval (IR) and Sonification (SON). CBR and ILP documents were extracted from Lecture Notes in Com- puter Science (LNCS) in those subjects. IR and SON papers were retrieved from internet repositories. They are pseudo- classified according to their source. The maps were built using papers’ titles, authors, affiliations and references.

In addition to those documents, six others were added to the data set. These were papers of work published by members present and past of our research group; the goal was to evaluate their local and global positioning on the map.

From those, five papers were highly related. They refer to

(7)

(a) Highlighting news with words ‘full’ and

‘story’ on map of Figure2(a) (b) New map without news highlighted in Figure3(a). 2450 news files remained Figure 3:Using search and corpus manipulation to eliminate unwanted files

(a) N-J tree of the reduced news data. Selection of focus region.

(b) Scaled presentation of focus region and its surroundings . Figure 4:Focusing on a part of a Map for further exploration

the evolution of a sonification system and its corresponding user evaluation. Two of them refer to the first version of the system and the other three to its newest version. The last paper was also published having as authors members of our research team, but on a completely different subject (a technique used for image segmentation). Let us call those six additional papers the ‘intruders’.

Figure7shows general and specific views of the papers data set, mapped by LSP. It can be seen in Figure7(a), both by coloring and by labeling (each color here represents a pseudo-class), that the projection technique was capable of separating well this heterogeneous document collection by their subject areas. Moreover, subjects with an established and more focused body of techniques, such as CBR and ILP

(8)

(a) Mesh generated from a Delaunay triangulation of news map in Figure3(b). Color is k-means clustering of the projected points.

(b) Surface view of news set. Color and height codify the same scalar field of3(b).

(c) Surface View. Color reflects the source of the news; height codifies the number of appearances of the word ‘Bush’ in that particular text; the color of ‘flying’ geometry on the display confirms the source of the news (i.e., the news agency here represented in red) of the focus point; the focus is highlighted by its incident edges colored in blue; label is title of the focus point.

Figure 5:Surface representation and exploration of content by visual attributes.

tend to be displayed less sparsely than a newer more diverse area such as Information Retrieval. In the various regions where points of different colors appear close to one another, further examination reveals that there was a high content cor- relation that lead to that proximity. Most of the time they are applications of techniques in those areas (CBR, ILP, IR, SON) for particular goals (such as modeling, manipulation, or study of particular data sets such as biological sequences).

To check that capability further, it can be seen in Fig- ure 7(b) that all except one of the intruder papers were mapped in the same group. The five sonification papers share a very close neighborhood within the general sonification group, as would be desired. The sixth, unrelated paper, joined another separate group, that seemed to gather together

papers from different subject areas. Closer examination by focussing on that particular cluster and loading the documents represented there (see Figure7(c)) shows that these papers deal with various techniques (in all subject areas) de- signed to treat imaging tasks (such as retrieval, manipulation and segmentation). In fact all papers in the paper data set dealing with imaging techniques were placed in that group.

Connections amongst nodes on the graph can carry various meanings. For instance, they may reflect neighborhoods within the data set. Figure8illustrates the display of similarity relationships also as edges on the graph. In Figure8(a) documents are connected according to the similarity measure employed for the creation of the map. Each point is connected to its next neighbor in the similarity matrix. It can be

(9)

(a) ProjClus map, which spreads groups on the 2D plane; color reflects 2D clustering

(b) Mesh view of ProjClus news map; color reflects levels of concentration of points.

Figure 6:Map by ProjClus of the reduced news set.

seen that there are a few points whose next neighbors are not within the same general group. That is to be expected, since the main groups are not at all disjunct as far as content similarity is concerned. Still, the small number of ’crossings’

outside the clusters indicates the good degree of grouping presented by the projection. New connections can be added.

For instance, each point can be connected to its two nearest neighbors in the original data space (see Figure8(b)); in this case more connections across groups appear, and they can reveal further associations apart from those already revealed by grouping and first neighbors, adding to the possibilities of exploration.

Many other relationships could similarly be mapped to edges on a map. Focussing on parts of the map shows in further detail the subjects within groups. Detail of two groups of the map (see Figure8(d)) reveals that, within the general sonification group, there are separate groups for audio processing (including retrieval), computer music, sound in user interfaces, data sonification, and sonification for the visually impaired.

Display on the 2D plane of multidimensional data can be done using a number of techniques. As well as projections, other point placement strategies can be employed. Some- times it is useful to arrange (or rearrange) a projection according to the connections amongst points. That is illustrated in Figure 9(a)and 9(b). Other important issues in document map exploration are topic extraction and visualization, which help the user find interesting groups of documents to explore further. Figure 9(c)illustrates that using a proper technique (selective generation of association rules, in this

case), further detail of topics approached by documents can be viewed.

Finally, we map the scientific paper data set using N-J trees. Since the tree reflects an organization of the similarity measure, it reveals subtleties not easily identifiable in projection layouts. For instance, in Figure10(a)and10(b), just by having two different approaches to display the same tree, it is possible to locate different levels of grouping and separation of groups. Additionally, in the same pictures it can be seen that the positioning of the ‘intruder’ papers in the same neighborhood happens in two levels for those dealing with sonification. The two papers referring to the first version of the sonification system are gathered in the same branch while the others are together in another branch nearby. Fig- ure10(c)shows an N-J tree map built from another similarity measure, in this case Normalized Compressed Distance (NCD) [TMP07,CV05], see below for more details. Over- all NCD gives similar separation of greater groups for this same data set. However, close examination shows that not all relationships are maintained. For instance, in that map all the intruders were, according to that NCD, deemed more similar than using vector representation. This type of presentation gives a good reflection of how a particular similarity measure behaves with a data set.

The cases shown here illustrate the use of projection- based displays together with mining techniques (such as clustering, covariance term extraction and association rules extraction) in an integrated manner, with combined available technology to build maps for exploration of text collections.

These and other tools put together can support improvement

(10)

(a) LSP map with main topic labels, automatically extracted. (b) ‘Intruders’ papers, five on sonification and one on a segmentation technique, are displayed in green in the highlighted regions.

(c) File content view of papers in group that gathers techniques for imaging.

Figure 7:Map of scientific papers in four classes, CBR, ILP, IR and SON, using LSP.

in examination of larger text collections without extensive reading.

The next section starts detailing concepts and techniques related to the construction and exploration of such maps.

3. Basic Concepts

3.1. Text processing and information retrieval 3.1.1. Introduction and topics

We summarize the fundamentals of text processing and information retrieval here. The main topics to be explored are:

• Modeling for information retrieval

• Retrieval evaluation methods

• Query languages and operations

• Text and multimedia languages and properties

• Text operations

• Indexing and searching

Whereas this include non-text aspects, such as multimedia languages and properties, we leave those here to provide a more comprehensive picture of the subject.

3.1.2. Motivation

Information Retrieval (IR) has become a major factor in today’s information-centered world. Whereas in the past, only very technical people used to interact with data sources

— actually, mostly databases — today lay people used data sources on a daily basis. Every Web search, every Web-based purchase, and many other application on- and off-line, all focus on data sources and data bases, whether the user is or is not aware of it.

3.1.3. Information versus data retrieval

3.1.3.1. Data retrieval : Data retrieval interacts with databases, which are sources of structured data. Databases do not usually contain broad information about a specific

“subject or topic”, but, rather, store specific data in structured tables, such as catalogs, populations data, and the like.

Retrieval from a database requires some knowledge of a query language — the most prevalent variations on SQL, the Standard Query Language. Data retrieval will only succeed if the database includes specific instances of precise query terms. A slight misspelling can make the difference between hit and miss, between success and failure.

(11)

(a) Connection by 1-neighboring points according to similarity measure.

(b) Connection by 2-neighboring points according to similarity measure.

(c) Selection of part of the map

(d) Focus on the part of the Map selected in Figure8(c), with labeling by covariance

Figure 8:Maps of scientific papers with connection representing similarity in the original space.

3.1.3.2. Information Retrieval (IR) : IR, on the other hand, focuses on retrieval from unstructured of documents that are related by topic or subject. Query terms are not necessarily expected to be present in their exact form in a document for it to be returned in the result list. Instead, IR systems attempt to interpret the contents of information items

— mostly, text documents in a collection. This interpretation involves a comparison between the entered query and the documents in the collection, computation of a “difference” or “distance” measure, and ranking of the returned documents according to their relevance to the user’s query.

3.1.4. Information retrieval at center of stage

Information Retrieval has been utilized by librarians and other information professionals for close to three decades.

However, with the penetration of the World-Wide Web, it is now one of the most important tools a wide variety of users utilize in their day-to-day interactions with information. The Web has become the single largest repository of, as well as the unified interface to most of the information people have to interact with. However, the Web presents a

significant obstacle to effective and efficient information har- vesting, namely, the absence of a well-defined underlying data model for Web-based information. This typically leads to information definition and structure that is frequently of low quality.

3.1.5. Basic concepts of IR

The IR process starts with a user’s need. The user’s initial task is to translate her information need into a query, specified in a language that is provided and understood by the system. We have all grown accustomed to entering multiple daily queries into our favorite search engine. If the user’s query is entered into an IR system, she typically would spec- ify a set of “key words”, which are expected to convey the semantics of the information needed. Using a more rigid data retrieval system, a much more rigorous query expression is required, such as a regular express ion, which contains constraints to be satisfied by objects in the answer set. In both cases, the user searches for some useful information execut- ing a retrieval task.

(12)

(a) Rearranging the position of the subgraph of Figure8(c) by using a force-directed placement strategy.

(b) Labeling by covariance of main topics of detail map.

(c) Detailed topic extraction by association rules [LPMP07].

Figure 9:Exploration of a sub-graph with rearrangement and topic extraction and visualization.

A schematic overview of the basic iterative retrieval process, the logical view of a document, and a more detailed picture of the retrieval process, can all be found in the accompanying slides.

3.1.6. A brief history of Information Retrieval

Here, we summarize the history of Information Retrieval, covering the following topics:

1. Early developments;

2. Information Retrieval in the library;

3. the Web and digital libraries.

3.1.7. Early developments

For roughly 4000 years, people have been organizing information to be able to later retrieve and use it as they need.

One of the most common examples is the familiar Table of Contents found in many books and other text documents.

After Gutenberg’s invention of the moveable printing press in 1455, which has made books available and affordable to the general population, rather than just to the wealthiest few who could afford them, the volume of information people have acquired grew beyond just a few books. This, in turn, triggered the development of specialized data structures for faster access to stored information. For example, an old popular data structure for faster IR still found in many books is the index.

An index is a collection of selected words and concepts that serve as associated pointers to related documents and other information sources. Indices are at the core of any modern IR system. They provide faster access to data and they speed the task of query processing. For centuries, indices (aka indexes) were created manually, according to categorization hierarchies. Today, libraries still use categorical hierarchy — which have usually been conceived by human subjects from the library sciences field — to classify volumes and documents.

Computers have lead to the automatic construction of large indexes, which has lead to another view of the retrieval problem, one that is much more related to the system itself than to the user’s needs.

We, thus, face two views of the IR problem, a Computer- centered view, and a human-centered one.

3.1.8. A computer-centered view of the IR problem As far as computer systems are concerned, the IR problem has the following components:

• build efficient indexes;

• process user queries with high performance; and

• develop ranking efficient algorithms.

All, with one goal in mind: to improve the ‘quality’ of the retrieved answer set .

(13)

(a) N-J tree based on vector representation using cosine distance;

initial position

(b) Vector based similarity tree, after repositioning the nodes.

(c) N-J tree from NCD similarities.

Figure 10:Similarity trees for the scientific papers data set.

3.1.9. A human-centered view of the IR problem From the human user’s point of view, the IR problem can be summarized as:

• Study the typical behavior of users;

• understand their main needs; and

• determine how such understanding affects the organization and operation of retrieval systems.

From this point of view, a keyword-based query processing might be seen as unlikely to yield a good solution in the long run, leading to the quest for other, more promising alternatives.

3.1.10. Information Retrieval in the library

Libraries were among the first to adopt IR systems. Those systems were developed initially by academic institutions but later by commercial vendors started offering more re- fined tools and systems. We observe first, second and third generation features in those systems.

3.1.10.1. First generation . The first generation of library IR systems was basically an automation of previous technologies, mostly card catalogs. They allowed searches based on author name and title only.

3.1.10.2. Second generation . Here, increased search functionality added search by subject headings, keywords, and some more complex query facilities.

3.1.10.3. Third generation . More recently deployed, these systems focus on improved graphical interfaces, elec-

tronic forms, hypertext features, and open system architec- tures.

3.1.11. The Web and digital libraries

The Web has become the primary source of all kinds of information. With the spiral growth of the amounts of information available on the Web, search engines have become among the most important Web tools — and assets. Today’s search engines continue to use indexes that are very similar to those used by librarians a century ago. So what has changed?

We observe three dramatic and fundamental changes:

First, access to various information sources has become a lot cheaper, reaching wider audiences than ever possible before. Second, advances in digital communications have lead to easier, cheaper, and thus greater access to networks, al- lowing distant, quick access to vast amounts of information throughout the world. And, third, the freedom to post what- ever one would like to, has caused the Web to become ever more popular. For the first time in history, most people have an almost free access to a large publishing medium, making the Web (and modern digital libraries) a highly interactive medium, facilitating the exchange of messages, pho- tos, documents, software, video; making interactive chatting convenient and at low (or no) cost, cause overall a fundamental shift in the current communication and information flow paradigm.

3.1.12. The future: three main questions

We are still not through; we do not have all the knowledge, all the answers. The following still need to be addressed:

(14)

First, despite the high interactivity, people still find it difficult — impossible at times — to retrieve the information that would be relevant to their information needs, leading to the question “which techniques have the potential to yield retrieval results of higher quality?”

Second, ever increasing demand for access is causing the need for quicker response to be more and more a pressing factor. We therefore need to find out which techniques can support faster indexes and smaller query response times?

And,

Third, the quality of the retrieval task is greatly affected by the user’s interaction with system, so “how can better understanding of the user’s behavior affect the design and de- ployment of new information retrieval strategies? ” 3.1.13. Practical issues

The following issues have also great influence on the entire IR problem and solution set.

• Electronic commerce, a major trend on Web;

• security;

• privacy;

• intellectual property rights and publisher responsibilities;

• internationalization; and how to deal with multimedia (images, sound, video, etc.)

3.1.14. Overview of the rest of IR topics

There are many components to IR. We list here the major topics; these remain, however, outside the scope (and time and space limitations) of this tutorial. The reader is encour- age to explore these further. A very comprehensive treatment

— and, thus, a good place to start — would be [BYRN99].

1. IR modeling;

2. retrieval evaluation;

3. query languages and operations;

4. text and multimedia languages and properties;

5. text operations;

6. indexing and searching;

7. parallel and distributed IR;

8. user interfaces and visualization; and 9. searching the Web

3.1.15. Text preprocessing for IR, mining, and visualization

In order to reduce the space dimension and to allow re- fining the text model, text is processed prior to extraction of the vector representation. This process typically involves three steps: (i) removing stopwords, i.e., non-informative words, such as articles, prepositions and such, plus any words known to lack relevance to the context; (ii) stemming, which reduces words to their radicals (e.g., ‘motivation’,

‘motivate’, ‘motivating’, would be all reduced to ‘motivat’);

and (iii) frequency counting, so as to remove terms that occur too sparsely or too often and hence have little differential

capability. Setting suitable frequency thresholds for discard- ing terms typically demands user knowledge and interaction.

Corpus vector representation enables handling the map generation problem as one of mapping objects defined in a high dimensional space into a 2D (or 3D) visual representation space. Clustering and projections are typical approaches to handle the problem. Determining vector similarity — a necessary step for both processes — requires defining a metric in the high dimensional vector space that allows computing distances between vectors. The distance between any pair of document vectors is a measure of their proximity, or semantic similarity. Distances for all pairs can be stored in a triangular distance matrix of dimensionsnxn, wherenis the number of documents.

An alternative to the vector space model has been proposed based on Kolmogorov complexity approximation [TMP07], called Normalized Compressed Distance (NCD).

NCD computes a distance between a pair of documents straight from their textual content, rather than from the intermediate vector representation. The Kolmogorov complexity of a stringx, denotedK(x), is the size of a description of xproduced by an optimal specification method. The conditional Kolmogorov complexityK(x|y)ofxwith respect to another stringycan also be defined as the amount of information thatxdoes not have abouty.

An information distance between two given strings can be specified based on their conditional Kolmogorov complexity. Though the problem of evaluating a string’s Kol- mogorov complexity is non-computable, a solution can be approximated through compression; this yields the basis for the definition of NCD. We include a technical report in Ap- pendixAon a formulation of NCD for text collections at the end of the Tutorial notes.

Other alternatives for pairwise similarity calculation, particularly those based on information theory, have also being studied recently [AF03].

3.2. Data and text mining

3.2.1. The knowledge discovery process

Data Exploration is the process of searching and analyz- ing databases to discover implicit but potentially useful information, mostly for the purpose of supporting the decision making process. The goals of data exploration are:

• Convey information;

• discover new knowledge; and

• identify structure, patterns, anomalies, trends, and relationships.

Major Data Mining Tasks, such as, summarization, association, classification, prediction, clustering, and time-series analysis use major techniques, including Linear Regression Trees, Non-Linear Regression, MARS, Naive Bayes, K- Means and K-Median, Neural Networks, Association Rules,

(15)

Decision Trees, Principal Component Analysis, Support Vector Machines, and Genetic Algorithms. These, in turn, are based on statistical tools, such as, Missing Value Impu- tation, Normalization techniques, Error & Variational Anal- ysis, and Confidence Estimates, to list just a few.

3.3. Projection techniques

Data sources have increased substantially both in size and complexity, but extracting useful information from them is still a challenge. One measure of data complexity is the number of attributes associated with each instance of data. Con- sider, for example, data from a demographic census: a data instance records attributes such as age, sex, education, oc- cupation, income, and so forth. Considering each data attribute as a data dimension, if we havemsuch attributes each data instance can be interpreted as anm-dimensional vector placed in anm-dimensional definition space.

In traditional statistical analysis, data instances with three or more dimensions are known as multivariate or hyper- variate data. In Information Visualization such data is usually referred to as multidimensional. Conventional methods, such as scatter plots or bar charts, normally employed to as- sist data interpretation, are not directly applicable to multidimensional data. Moreover, identification of patterns and models grows more difficult as dimensionality increases, and lack of proper representations can severely impair interpretation.

A common way to handle dimensionality is to reduce the number of dimensions, so that strategies that are known to work well with low-dimensional data can be applied.Multi- dimensional Projectiontechniques are one example of such a strategy. A multidimensional projection technique typically maps the data into a p-dimensional space with p= {1,2,3}, while retaining, in the target space, some information about distance relationships among the data items in their original definition space. In this way, a graphical representation can be created to take advantage of the human visual ability to recognize structures or patterns based on similarity, such as clusters of elements.

Formally, let X = {x₁,x₂, . . . ,xn} be a set of m- dimensional data, with δ(xi,xj) a dissimilarity (distance) measure between twom-dimensional data instances, and let Y ={y₁,y₂, . . . ,yn}be a set of points in a p-dimensional target space, with p={1,2,3}andd(yi,yj)a (Euclidean) distance between two points of the target space. A multidimensional projection technique can be described as a bi- jective function f:X→Y that seeks to make|δ(xi,xj)− d(f(xi),f(xj))|as close to zero as possible,∀xi,xj∈X.

3.4. Visual representations: graphs, surfaces, volumes, triangulations

Many visual representations have being utilized to represent data. Among the most commonly used ones one finds

maps, graphs and networks, surface- and volume representations. The use of maps is so widespread that it is not necessary to discuss them, as every one knows their utilization for geographical applications. The concept has been used also to represent data in which some form of “proximity”, whether in physical space or as determined by other measures, is needed to be presented.

Whereas maps are particularly useful to represent physical proximity or its conceptual equivalent, graphs and networks have been primarily utilized where the relationships among participants are to be emphasized. In these types of situations, measurable physical is not important; what is important is how nodes are related to each other. For example, the communication patterns among members of an organization can be nicely visualized by a graph, in which a link between two nodes represents the communication between them. A directed link can show who initiated the communication. A network of such communications can quickly show influential figures, such as “authorities” (those who are considered knowledgeable about a variety of subjects) and

“hubs” (those are know who to seek when a question needs to be answered). The mapping of authorities and hubs have been the cornerstones of the original “page rank”, the algo- rithm that provided the initial implementation of Google’s search engine. In addition, graphs and networks are common representations for processes in which transitions occur between states, under certain conditions.

Spatial data — and, at times, data that can be mapped to a spatial representation — have been often represented by volumes, or their surfaces. Volume representations preserve the values of data throughout the volume, whereas surface representations discard internal values and only preserve values of the external surfaces selected to be displayed.

The two most crucial aspects of either are the data structures utilized to represent the data, and rendering techniques to display the data on a screen. Multiple data structures have been proposed for storing data of three (or higher) dimensions. Among them, in addition to the straightforward carte- sian coordinates, are various hierarchical representations, including trees of various type. The discussion of such data structures are beyond the scope of this tutorial; the reader is referred to the standard data structure literature for details.

Surface representations have included various triangulation techniques, as well as cuberille approaches. Triangula- tion techniques approximate the surface with a mesh of tri- angles that attempt to follow the surface as closely as possible. Cuberille techniques try to approximate the surface with a mesh of squares rather than rectangles. Some use graph concepts to implement optimal coverage.

The display of three- or higher-dimensional data on a two- dimensional display requires projection first. Surface and volume rendering techniques have been covered extensively in the computer graphics and visualization literature. The

(16)

reader is referred to this literature to familiarize her/himself with the technical details entailed.

The accompanying slides provide some examples of various of these techniques.

4. From Visualization to Visual Text Mining

4.1. Visualization techniques for multidimensional data

“One picture is worth a thousand words” is an old cliche, but it is based in the fundamentals of human perception and processing of information. Sixty percent of all input for decision making comes into the brain of a normal-visioned per- son from the visual system. It has been demonstrated that, in the event of contradicting information between the visual input and that of another sense (e.g., tactile), the visual stim- ulus with “win” over the other one, even if the other sense is providing correct information and the visual input is erro- neous!

Graphics and Visualization are meant to help the user see (understand), remember, compute, analyze, discover, enjoy, and much much more.

Multiple goals are served by different types of visualization tasks, and the appropriate techniques. The most common visualizations used are for the purpose of presentation of known information. Here, facts to be presented are known (though they may not represent the truth), the visualization process is to choose and tune the appropriate visualization technique, and the result is a high-quality visualization of the data and analysis to present facts (often without the author’s presence). Confirmatory analysis visualization starts with some hypotheses about the data, proceeds through a goal-oriented examination of the hypotheses, and results in a visualization of the data to confirm, accept, or reject the hypotheses. Finally, exploratory analysis starts with no hypotheses about the data, proceeds with an interactive, usually undirected search for structures, trends, patterns or anomalies, and yields a visualization of the data to lead to some hypotheses about the data.

All of these utilize a wide variety of technique, some are pure visualizations, some are integrated with analysis, all utilizing to one degree or another an assortment of interaction tools, which help the user control the process and interact with it to yield optimal results.

Some pure visualization techniques include 2D and 3D Scatterplots, Matrix of Scatterplots, Statistical Charts, Line and Multi-line Graphs, Parallel Coordinates, Circle Seg- ments, Polar Charts, Survey Plots, Heatmaps, Height Maps, Iconographic Displays, RadViz, PolyViz, and many more.

Among those that are integrated with analysis one finds Projection Pursuit, Dimensional Stacking, Sammon Plots, Multi-Dimensional Scaling, Principal Component Analysis (PCA) and Principal Curves, and Self Organizing Maps.

Interactions include Selection, Probing, Querying, Grand Tours, and Non-linear Zooms.

It would require hundreds (if not thousands) of pages and hours to discuss all of these in even a moderate amount of details, so we will limit our discussions here to only the very basic concepts, and to a few methods that are particularly relevant to our main focus, Visual Text Mining.

4.1.1. The visualization problem

The visualization problem can be summarized as: Massive amounts of data from various sources, including databases, simulations, sensors, decision systems, and more; limited screen space; and little knowledge about the human perceptual system and the process of information transfer.

Visualization is a method of computing. It trans- forms the symbolic into the geometric, enabling researchers to observe their simulations and com- putations. Visualization offers a method for seeing the unseen. [MDE87]

Visualization now includes other data representations, such as auditory, haptic and tactile, potentially more in the future.

4.1.2. Classifications of visualization techniques We are primarily interested in those visualization techniques that are suitable for multi-dimensional data sets, as visual text mining deals in high dimensional data.

Many classifications have been proposed for multidimensional data visualization techniques.

4.1.2.1. Point-based Point-based techniques map the dimensions of a data record to the attributes of a point on the display, including its size, shape, color, motion, and — oc- casionally — sound at the point. The most recognized point- based technique is the scatterplot. Scatterplot-based visualization techniques include OmniViz Galaxy (OmniViz, Inc.), Temple MVV Graphics [MGTS90,MTS91], [CC92], and Splat Visualizer and Scatter Visualizer (from Purple Insight).

Because the traditional scatterplot is essentially a two- dimensional data technique, various approaches have been propose to extend scatterplots to higher dimensions. These approaches include layout extensions, either via matrices or radially, and shape extension, utilizing icons and glyphs.

Scatterplot Matrices provide an xy layout of k- dimensional data in a total ofk(k−1)/2 scatterplots, each showing two of thekdimensions against each other. Clearly, askgrows, the number of scatterplots grow too. E.g., for k=100 the number of scatterplot will be 4,950. The concept has even been extended to scatterplot cubes, following the same idea.

RadVis [Hof99] is a point-based technique that organizes

(17)

data dimensions around a circle. Data dimension values function like springs, exerting forces over a data record in various directions. The result is a collection of point clouds, clustered based on dimension values.

Hyperslice is a matrix ofk²slices through ak-dimensional data set. Slices can be determined interactively [vWvL93].

4.1.2.2. Line-based Line-based techniques include bar charts, line graphs, parallel coordinates.

4.1.2.3. Hierarchical, graphs, trees These methods utilize hierarchical structure, networks, graphs, and trees to present data that is inherently suitable for such structures.

Among the best known technique one finds Eick’s SeeNet [BEW95], which utilizes spoke, helix, and sphere layouts to plot networks of nodes and their relationships; SeeNet 3D [Eic96], which is an extension of SeeNet to three dimensional rendering; SeeNet ArchView, a member of the SeeNet family, which uses arches as the visual primitives;

MBone [MB94]; and DBMiner [HCC^∗97].

4.1.2.4. Iconographic displays Iconographic displays are perceptually driven displays. They generalize the notion of a pixel to that of an icon (or a glyph). By doing so, they in- crease the number of parameters “displayed”. Icons’ visual, color, and auditory attributes are mapped to data parameters, thus making the icon “data-driven”. The icons are displayed en-masse and thus harness the perceptual powers of the human early visual system. When presented, icons are available for user interaction. These methods have been successfully used in data fusion of multiple parameters / dimensions.

Among the best known iconographics/glyph systems are Chernoff faces [Che73], Andrews’ graphs, Stars or circular plots (Ward et al., various), Stick-figure icons [PG88], Color and Texture [H.91], Beddows’ Embedded blocks, Smith’s Sound (1989-90), and numerous other examples dated during the nineties.

The accompanying slides provide many examples of these various visualizations.

4.2. Visualization techniques and systems for handling document collections

Several different techniques for visualization of textual results from Web and other searches have been proposed ( [ABY03,LA00,SCL^∗99,BY96]). While these techniques are capable of displaying large document bodies, they tend to make the location of specific relevant reading material difficult. We focus here on complementary tools to support mapping of documents in a way that helps locate neighboring similarities between individual text documents and groups of documents. We therefore assume a pre-filtering task that reduces the universe of targeted documents to a few hundreds

(maybe thousands) in a few areas of interest (not necessarily pre-determined).

Many techniques for text document analysis and visualization exist. They usually search for a representation of the content of an individual document or text fragment (e.g., [MWBF98], [RES98], [RES99]), of document collections (e.g., [LA00], [BCG^∗99], [Wei01]), or of themes approached in text documents (e.g., [HHWN02], [Wis99], [WTP^∗95]) in order to meet these goals.

The process of text document analysis and visualization usually involves three phases: (i) pre-processing; (ii) dimension reduction; and (iii) attribute mapping to (at least) a visual representation and presentation.

The pre-processing phase takes as input the document collection to be analyzed, and produces an intermediate form, usually based on the vector space model [Sal91] whereby documents are represented as points in a vector space. In this representation each document is represented by a vector whose dimensions are terms (n-grams). The vector coordinates are the weights of the terms based on their frequency of occurrence in the document. Typically, dimensions reach the thousands even for small to medium-sized databases. Trans- formation of a document collection into a vector space is preceded by elimination of non-influential words (such as stopwords), reduction of words to their radicals (stemming), and some sort of frequency counting (various exist).

The second phase, dimension reduction, typically involves removal of words that are either too frequent or too rare in the collection, and clustering dimensions to generate new ’combined’ attributes.

The most common way to extract structure from a text document collection is by applying some dimensional reduction technique over the resulting vector representation. Sys- tems that implement such approaches include those based on Multi-dimensional Scaling (MDS), Principal Component Analysis (PCA) or Latent Semantic Analysis (LSA), which work with statistical measures for subspace reduction, as well as Self-Organizing Maps (SOM), which employ neural computation ( [Wei01], [Wis99], [BCG^∗99], [KHLK98], [WTP^∗95]).

In the third phase, such techniques can be used to reduce the dimension to two and then plot the data onto 2D space.

However, multidimensional reduction techniques can cause some difficulties, such as [HWR03]: high information loss when applied directly to two dimensions (for display); reduction in input dimensions do not seem to affect greatly the outcome; and there is an inherent discretiza- tion problem associated with techniques such as SOM, by which individual documents in groups are not distinguish- able. For our goal here, dimension reduction can pose an additional problem: when used to display the results in 2D, the mappings to subspaces may define groups of ‘similar

(18)

documents’, but it is not possible to locally relate neighboring documents. Previously [LMMP06], LSA has been successfully applied to generate document maps with high local content relationships, but the high computational cost remains a problem as well as the handling of vector trans- formations. Minghim et al. [MPL06] have proposed faster mapping approaches that posses the ability to associate documents by similarity, based on dimensional reduction by fast projections (such as Fastmap [FL95] and Nearest Neighbor Projection (NNP) [TMN03]. The gain in processing time is attained by using these projection techniques, which provide an initial point placement that is prone to speed up force- based improvement schemes. Those too are based on the vector representations. Those projections provide an initial placement on the visualization plane, but also require further use of point placement strategies [TMN03] in order to recover information lost during projections.

Point placement strategies, including force-based point placement, have been used before to generate document displays [Cha96,AKS^∗02a,GSAK04]. They can avoid, partially or completely, the extensive calculations needed in dimensional reduction techniques by starting with a semi-random point placement and re-adjusting their position based on at- traction by similarity. However, our experience trying to find relations in their application to text content has been that the precision of the placement is limited as applied to our goals for these maps.

Other major techniques to visualize inter-document similarities are document networks ( [FFW91], [TC89]), spring embedings ( [CC92], [SA98]), and document clustering ( [AOL93], [HP96], [LC96], [ZE98]). Of those, only document clustering is sufficiently fast and produces intuitive results to warrant consideration ( [ZE00], [HP96], [Zam98], [ZK05], [CHR03], [RK04]). It is often employed in combination with dimensional reduction and SOM ( [IR01], [Wei01], [LA00], [Koh97], [Lin91], [Mer97]).

Document clustering techniques provide a way to relate documents to each other, as well as to a set of topics. A document is assigned to a cluster if it contains the contents of the cluster’s topic, and is generally similar in contents to that topic and to the other documents in the cluster. Documents in the same cluster are expected to be more similar to each other than to documents in other clusters [ZK05]. Since documents may contain multiple topics, some approaches allow a document to be assigned to more than one cluster, potentially resulting in overlapping clusters. Usually, intra-cluster relationships are not provided as part of the results. However, they are very useful to provide general overviews of large collections, although they usually have to be interpreted by users with certain level of expertise.

Some approaches completely avoid the high dimensionality problem by simply ordering the most used terms in the document and employing the first N terms [RES98], [RES99]. These strategies work well for single document

representation and for association of a limited number of documents, and even for some degree of clustering. How- ever, they also lack a way to clearly relate different documents and display levels of similarity. Other approaches (such as the one by Carey et al. [CHR03]) combine a number of different strategies to allow various views of the same document set, potentially improving focusing and analysis tasks.

Maps resulting from the techniques mentioned above are meant to present various properties of documents, including content similarity, co-citation, term co-occurrence and attribute-value matchings, such as author or publication date.

For a detailed description of available techniques for text document mapping and its applications, systems and challenges, we refer the reader to Borner et al. [BCB03].

Few systems for general multi-dimensional visualization add projections in their tool suite. TheHybrid Information Visualization Environment(HIVE) [RC03], however, imple- ments clustering and layout algorithms to enable exploratory visualization of multidimensional data. An intuitive config- uration process by the user is its main strength. Neverthe- less, the layout algorithms implemented are approximations of the original FDP model, which fail to generate visualizations that group (separate) the data items based on their similarity (dissimilarity).

Projections have nevertheless been intensively explored in the context of visualizing unstructured data, particularly text document collections. Two typical layout approaches in this context areInfoSky[AKS^∗02b,GSAK04]

andGalaxies[WTP^∗95,Wis99], the later one incorporated into the IN-SPIRE^{T M} system (see http://in-spire.pnl.gov/).

Both systems display documents as points in 2D space following a similarity-based layout. Clustering and projections are employed to position groups of documents and to reduce the number of distance calculations. IN-SPIRE builds that into a dimensional reduction approach, similar to Single Value Decomposition or Latent Semantic Index- ing [PRTV98,DFK^∗04]. InfoSky also creates a hierarchical display by embedding structure into the similarity relationships, so a user can focus in and out analogously to manipu- lating a telescope.

Within our research group, we have also developed — and made freely available — a tool suite for multidimensional visualization based on projections, called PEx (the Projection Explorer — http://infoserver.lcad.icmc.usp.br/infovis2/PEx). PEx is an experimental platform written in Java that generates visual maps of multi-dimensional data sets based on classical and novel projection and point placement strategies, with particular focus on text processing.

One way to measure the similarity between text documents in our case is by calculating cosine the distance [MPL06] between the vector representations. The other is using the “Kolmogorov distance” [TMP07]. This