Developing an Extended Task Framework for Exploratory Data Analysis Along the Structure of Time


K. Matkovic and G. Santucci (Editors)


T. Lammarsch, A. Rind, W. Aigner, and S. Miksch

Institute of Software Technology and Interactive Systems (ISIS), Vienna University of Technology, Austria

Abstract

Exploratory data analysis of time-oriented data is an important goal that Visual Analytics has to tackle. When users from real-world domains are asked about time-oriented tasks, they often refer to the unique structure of time (e.g., calendars or time primitives). Several task frameworks have been developed, but none of them combines a complete, systematic approach with explicit attention to the structure of time. To fill this gap, we complement an established task framework with a rule set that explicitly models the structure of time for tasks. This rule set makes it possible to formulate tasks consistently for evaluating time-oriented data analysis methods.

Categories and Subject Descriptors (according to ACM CCS): Information Systems [H.1.1]: Models and Principles—Systems and Information Theory; Computing Methodologies [I.m]: Miscellaneous—

1. Introduction

Human judgement plays a fundamental role in Visual Analytics (VA) and is primarily mediated through interactive visual interfaces [TC05]. Therefore, it is necessary to take the users into account and be aware of their goals and mental models. For exploratory data analysis (EDA) of time-oriented data, users usually consider the structure of time, for example the aspect of calendric systems (see Section 3).

Smuc et al. [SML^∗09] present detailed examples resulting from an insight study:

“Starting in the morning, it rises to a peak around 10 or 11 a.m. It then calms down by noon, but there is a second peak around 4 or 5 p.m., after which it decreases again.”

“The first Monday is high, the second is lower, but it rises again on the third and fourth.”

The authors organize these insights using both a bottom-up and a top-down approach, but both approaches remain tied to specific examples, even though the authors try to generalize from them. Thus, they cannot make a statement about the completeness of the insights or explain for which kinds of insights a tool is suitable [SML^∗09]. Existing task frameworks, like the one by Andrienko and Andrienko [AA06], approach this problem by starting at the most general and abstract level, where it is possible to define a complete set of tasks. For example, they phrase tasks like “look for the characteristics at a given reference” and provide a formal rule set that describes these.

They do provide details in the form of illustrative example cases, and only those are formulated according to the aspects of the structure of time. However, these example cases do not cover the design space completely, and the rules account for the unique characteristics of time only implicitly. Thus, there is a gap between the complete formal a-priori definition of tasks, as performed for example in the Andrienko and Andrienko [AA06] task framework (AATF), and task lists that stem from free exploration, as shown for example by Smuc et al. [SML^∗09], or from arbitrary consideration by task developers. To evaluate an application in a top-down approach, or to evaluate the completeness of insights in a bottom-up approach, a task taxonomy for the dataset used is necessary.

The structure of time imposes a number of aspects on such taxonomies that are always the same. The actual tasks contain a subset of them. We phrase these aspects so that they fit into the AATF, which we use because it is formally complete but also extendable (see Section 2). To this end, we adapt the aspects to the framework’s formalism. The result is a rule set that explains how to phrase tasks in a way that pays heed to the specific characteristics of time-oriented data. Hence, a main contribution of our work is a task framework that guides the development of test cases.

2. Related Work

Many task frameworks exist in the visualization and HCI communities. Most of them are concerned with low-level

© The Eurographics Association 2012.

DOI: 10.2312/PE/EuroVAST/EuroVA12/031-035


tasks [AS05, TC05]. Shneiderman [Shn96] presents a task by data type taxonomy, listing seven tasks. Amar et al. [AES05] determine a taxonomy of ten analytical tasks from 196 concrete tasks. The user intents that Yi et al. [YKSJ07] abstract from the interaction techniques described in academic literature and commercial systems can also be regarded as a low-level task framework. These frameworks are general and do not cater to the unique structure of time. Even though Shneiderman tackles time-oriented data, his considerations are limited to intervals and their relations, which our approach covers in Section 4.2. Tasks related to time have received special attention in the context of geographic information systems (GIS). Peuquet [Peu94] proposes a triad framework for GIS comprising three perspectives: space, time, and objects. This allows her to discern three possible task types, each asking for one perspective while the other two are given.

MacEachren [Mac95] presents a more detailed list of tasks relating to time in maps: existence of an entity, temporal location, time interval, temporal texture, rate of change, sequence, and synchronization. Most of these frameworks are simple lists of tasks, where each task is described by typical questions and typical answers. While some frameworks such as [Peu94] are on a very high level of abstraction, for others like [Mac95] and [AES05] it is hard to show completeness.

To overcome these problems, Andrienko and Andrienko formulate a task framework (AATF) [AA06] that allows a fine-grained description of exploration tasks and that is complete with respect to their chosen data model and level of abstraction. Their data model distinguishes between referential and characteristic components and explains the data set as a function that associates each reference with a characteristic. In addition, they work with relations between references or characteristics. In a time series, for example, the time points are references, the values are characteristics, and a 20% increase of value is a relation. Tasks are categorized as lookup, comparison, or relation seeking, depending on the target and the constraints of the task. Furthermore, they distinguish between elementary tasks and synoptic tasks. The former are concerned with the characteristics or references of separate data elements, whereas the latter examine behaviors or patterns of the data set or subsets of the data. The AATF and its underlying data model only consider the structure of time implicitly, which means that these aspects are considered in principle, but are only phrased in terms of examples and not explicitly on the formal level. Yet, their formal definitions and structured approach allow us to tackle the structure of time as an extension of this framework. Therefore, research in that area also has to be considered.

3. The Structure of Time

According to Aigner et al. [AMST11], time-oriented data can be categorized according to

Scale Time scale can be ordinal, discrete, or continuous.

Scope Temporal data can be given in the form of instants (“point-based”) or intervals (“interval-based”).

Arrangement Time can be linear or cyclic. Cyclic time can be modeled as a periodic grouping of granularities.

Viewpoints Temporal data is usually ordered. Variants are branching time and multiple perspectives.

Granularities Time can be divided according to structures that, for example, derive from calendric systems. A full and formal definition is given by Bettini et al. [BJW00]. They base their work on a view of the discrete time domain as composed of atomic units called chronons. A granularity is defined as a mapping from integers that represent chronons of the discrete time domain to subsets. They also define it as the union of a number of granules, making a granule a set of a certain number of integers from the discrete time domain. Furthermore, they define grouping operations that allow finer granularities to be grouped into coarser granularities. For example, if the chronons are days, they can be grouped into months or years.

Time Primitives Instants are a model for single points in time, intervals for ranges between instants. Spans are durations (of intervals) without a fixed position. Time primitives can be used to model scope, but it is also possible to treat several point-based data elements as an interval. Allen [All83] provides a set of possible relations between intervals, which is a time-related expansion of order theory. The relations are further extended by Aigner et al. [AMST11].

Determinacy Time-oriented data can contain uncertainties. Aigner et al. [AMST11, AMTB05] show that indeterminate instants and intervals can be modeled by using a combination of standard intervals and spans.
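To make the granularity aspect listed above more tangible, the following minimal Python sketch reads a granularity as a function from a granule index to a set of integer chronons and groups a finer granularity into a coarser one. The names `week`, `group`, and `fortnight`, the choice of days as chronons, and the convention that week 0 starts at chronon 0 are illustrative assumptions; they are not part of the formalism of Bettini et al. [BJW00].

```python
def week(i: int) -> frozenset:
    """Granule i of an assumed week granularity: seven consecutive day chronons."""
    return frozenset(range(7 * i, 7 * i + 7))


def group(finer, size: int):
    """Build a coarser granularity whose granule j is the union of `size` finer granules."""
    def coarser(j: int) -> frozenset:
        chronons = set()
        for i in range(size * j, size * j + size):
            chronons |= finer(i)
        return frozenset(chronons)
    return coarser


# Grouping two weeks at a time yields a (hypothetical) fortnight granularity.
fortnight = group(week, 2)
print(sorted(fortnight(1)))  # chronons 14 through 27
```

Calendar granularities such as months have granules of varying size, so a fixed-size grouping like this one only covers the regular case.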

4. Tasks for Time-oriented Data

We intend to apply the AATF, but the aspects of time’s structure require special considerations. In the following, we add this part to the task framework. The AATF usually considers time as a reference. For most EDA cases involving time-oriented data, this approach seems sensible. We will show an important exception in Section 4.4.

4.1. Scale

As the task framework itself is rather abstract, it does not have requirements regarding scale. When introducing time as a reference, we still have to consider it. In practical applications, time can only be measured discretely. So on the one hand, when two characteristics seem to happen at the same time, humans can decide based on domain knowledge that this is not possible, but they cannot deduce it from the data if the level of discretization is too coarse. Our relations, on the other hand, work for both kinds of data. The difference between discrete time and ordinal time becomes apparent when dealing with relations between references. For ordinal time, the relations between two references $r_1, r_2 \in R_O$, with $R_O$ being the ordinal time domain, are:

$r_1 = r_2$: “$r_1$ and $r_2$ happen at the same time”

$r_1 < r_2$: “$r_1$ happens before $r_2$”

$r_1 > r_2$: “$r_2$ happens before $r_1$”

Logical combinations of those relations are also possible.

For discrete time, all relations between two references $r_1, r_2 \in R_D$, with $R_D$ being the discrete time domain, can be brought to the form $r_1 - r_2 = d$ with $d \in \mathbb{Z}$: “there are $d$ chronons between $r_1$ and $r_2$.”
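The difference between the two scales can be sketched in a few lines of Python. The function names are hypothetical, and integers stand in for references in both cases; for ordinal time, only their order is used.

```python
def ordinal_relation(r1, r2) -> str:
    """Ordinal time: only the order of two references is known."""
    if r1 == r2:
        return "r1 and r2 happen at the same time"
    if r1 < r2:
        return "r1 happens before r2"
    return "r2 happens before r1"


def discrete_relation(r1: int, r2: int) -> int:
    """Discrete time: every relation reduces to the signed distance d = r1 - r2."""
    return r1 - r2


print(ordinal_relation(2, 5))   # r1 happens before r2
print(discrete_relation(5, 2))  # 3 chronons between r1 and r2
```

The ordinal answer can always be derived from the discrete one (the sign of `d`), but not the other way round, which is why the discrete cases in the following sections refine the ordinal ones.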

4.2. Time Primitives

Modeling time primitives also allows for including the different variants of scope as well as determinacy. When tasks with time as reference are formulated considering time primitives, the possible relations between them have to be used in relations between references. Each reference can be an instant in time or an interval in time. Aigner et al. [AMST11, p. 59] show variants without considering scale; we formulate them first for ordinal scale, then for discrete scale. An interval is a range in time that starts at an instant and finishes at an instant. Let $r_1, r_2, s_1, s_2, e_1, e_2 \in R_O$ be instant references in the ordinal time domain and $\mathbf{r}_1, \mathbf{r}_2$ be interval references, where $\mathbf{r}_1$ starts at the instant $s_1$ and finishes at the instant $e_1$, while $\mathbf{r}_2$ is similarly given by $s_2, e_2$. For instants, the cases are the same as shown in Section 4.1. Following the notation of Allen [All83], new cases (that partially overlap) are:

$r_1 < s_1$: “$r_1$ happens before $\mathbf{r}_1$”

$r_1 = s_1$: “$r_1$ starts $\mathbf{r}_1$”

$r_1 = e_1$: “$r_1$ finishes $\mathbf{r}_1$”

$s_1 < r_1 < e_1$: “$r_1$ happens during $\mathbf{r}_1$”

$e_1 < s_2$: “$\mathbf{r}_1$ happens before $\mathbf{r}_2$”

$s_1 < s_2 \wedge s_2 \le e_1 \wedge e_1 < e_2$: “$\mathbf{r}_1$ overlaps $\mathbf{r}_2$”

$s_1 = s_2 \wedge e_1 < e_2$: “$\mathbf{r}_1$ starts $\mathbf{r}_2$”

$s_2 < s_1 \wedge e_1 < e_2$: “$\mathbf{r}_1$ happens during $\mathbf{r}_2$”

$s_1 > s_2 \wedge e_1 = e_2$: “$\mathbf{r}_1$ finishes $\mathbf{r}_2$”

The relation of two intervals meeting each other cannot be formulated for ordinal data.
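The ordinal cases listed above translate directly into comparisons. The following Python sketch is one possible encoding; the `Interval` type and the function names are hypothetical, and integers are used only because they are ordered. Because the cases are checked in the order of the list, the first matching one is returned, and constellations that the list does not name yield `None`.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    start: int  # instant s at which the interval starts
    end: int    # instant e at which the interval finishes


def instant_vs_interval(r: int, iv: Interval) -> Optional[str]:
    """Relation of an instant r to an interval, using order comparisons only."""
    if r < iv.start:
        return "happens before"
    if r == iv.start:
        return "starts"
    if r == iv.end:
        return "finishes"
    if iv.start < r < iv.end:
        return "happens during"
    return None  # r lies after the interval; this case is not in the list


def interval_vs_interval(a: Interval, b: Interval) -> Optional[str]:
    """Relation of interval a to interval b, using order comparisons only."""
    s1, e1 = a
    s2, e2 = b
    if e1 < s2:
        return "happens before"
    if s1 < s2 and s2 <= e1 and e1 < e2:
        return "overlaps"
    if s1 == s2 and e1 < e2:
        return "starts"
    if s2 < s1 and e1 < e2:
        return "happens during"
    if s1 > s2 and e1 == e2:
        return "finishes"
    return None  # e.g. equal intervals or inverse relations


print(interval_vs_interval(Interval(1, 5), Interval(3, 8)))  # overlaps
```

Note that no branch can express "meets": deciding that nothing lies between $e_1$ and $s_2$ needs a distance, which ordinal time does not provide.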

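When considering a discrete scale, we can again use the difference $d \in \mathbb{Z}$ as the number of chronons between instants $r_1, r_2, s_1, s_2, e_1, e_2 \in R_D$. Interval references are written in bold: $\mathbf{r}_1$ runs from $s_1$ to $e_1$ and $\mathbf{r}_2$ from $s_2$ to $e_2$.

$s_1 - r_1 = d$: “$r_1$ happens $d$ chronons before $\mathbf{r}_1$”

$s_1 = r_1$: “$r_1$ starts $\mathbf{r}_1$”

$e_1 = r_1$: “$r_1$ finishes $\mathbf{r}_1$”

$r_1 - s_1 = d \wedge d > 0 \wedge e_1 - r_1 > 0$: “$r_1$ happens during $\mathbf{r}_1$, $d$ chronons after the start”

$s_2 - e_1 = d \wedge d > 0$: “$\mathbf{r}_1$ happens $d$ chronons before $\mathbf{r}_2$”

$s_2 - e_1 = 1$: “$\mathbf{r}_1$ meets $\mathbf{r}_2$”

$s_2 - s_1 = d_1 \wedge d_1 > 0 \wedge e_2 - e_1 = d_2 \wedge d_2 > 0 \wedge s_2 \le e_1$: “$\mathbf{r}_1$ overlaps $\mathbf{r}_2$, starting $d_1$ chronons earlier and ending $d_2$ chronons earlier”

$s_1 = s_2 \wedge e_2 - e_1 = d \wedge d > 0$: “$\mathbf{r}_1$ starts $\mathbf{r}_2$, ending $d$ chronons earlier”

$s_1 - s_2 = d_1 \wedge d_1 > 0 \wedge e_2 - e_1 = d_2 \wedge d_2 > 0$: “$\mathbf{r}_1$ happens during $\mathbf{r}_2$, starting $d_1$ chronons later and ending $d_2$ chronons earlier”

$s_1 - s_2 = d \wedge d > 0 \wedge e_1 = e_2$: “$\mathbf{r}_1$ finishes $\mathbf{r}_2$, starting $d$ chronons later”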
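For two intervals, the discrete cases can be sketched as a function that returns the name of the relation together with the chronon distances it carries. This is a minimal illustration under stated assumptions: intervals are `(start, end)` pairs of integer chronons with both ends included, the function name is hypothetical, and "meets" is tested before "happens before" because it is the special case with a distance of one chronon.

```python
from typing import Optional, Tuple


def discrete_interval_relation(a: Tuple[int, int],
                               b: Tuple[int, int]) -> Optional[Tuple[str, tuple]]:
    """Relation of interval a to interval b plus the distances that qualify it."""
    s1, e1 = a
    s2, e2 = b
    if s2 - e1 == 1:
        return ("meets", ())
    if s2 - e1 > 0:
        return ("happens before", (s2 - e1,))
    if s2 - s1 > 0 and e2 - e1 > 0 and s2 <= e1:
        return ("overlaps", (s2 - s1, e2 - e1))
    if s1 == s2 and e2 - e1 > 0:
        return ("starts", (e2 - e1,))
    if s1 - s2 > 0 and e2 - e1 > 0:
        return ("happens during", (s1 - s2, e2 - e1))
    if s1 - s2 > 0 and e1 == e2:
        return ("finishes", (s1 - s2,))
    return None  # constellation not covered by the list


print(discrete_interval_relation((1, 3), (4, 8)))  # ('meets', ())
print(discrete_interval_relation((1, 5), (3, 8)))  # ('overlaps', (2, 3))
```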

Finally, it is possible that spans are references. However, they can only be related to other spans, and then they can be treated like integers.

4.3. Viewpoints

An ordered dataset is the normal case, and branching time is usually considered in conjunction with predicting values. For EDA, this is out of scope, but it is an important case when advancing to further tasks with a broader scope. Multiple perspectives can be modeled in the AATF by defining each one as a data function. All the tasks that consider more than one function can access these perspectives. For example, the task to compare two different attributes corresponding to the same reference, $?y_1, y_2, \lambda: f_1(r) = y_1; f_2(r) = y_2; y_1 \lambda y_2$ [AA06, p. 66], can be phrased “compare the degree of customer satisfaction as reported by group 1 with the degree as reported by group 2”. Multiple perspectives can also be used to model dynamic systems, like the interplay between valid time and transaction time in temporal databases.
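As a small illustration of this comparison task, each perspective can be held as its own data function and both can be queried at the same reference. In the Python sketch below, the dictionaries, the quarter labels, and the satisfaction scores are invented for the example, and the returned phrase plays the role of the unknown relation $\lambda$.

```python
def compare(f1: dict, f2: dict, r):
    """Task ?y1, y2, lambda: f1(r) = y1; f2(r) = y2; y1 lambda y2."""
    y1, y2 = f1[r], f2[r]
    if y1 < y2:
        relation = "lower than"
    elif y1 > y2:
        relation = "higher than"
    else:
        relation = "equal to"
    return y1, y2, relation


# Hypothetical satisfaction scores reported by two groups for the same quarters.
group_1 = {"Q1": 3.8, "Q2": 4.1}
group_2 = {"Q1": 4.0, "Q2": 4.1}

print(compare(group_1, group_2, "Q1"))  # (3.8, 4.0, 'lower than')
```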

4.4. Granularities

Granularities are formed by grouping time, so in many cases it makes sense to consider granules as references, as is usually done with time. As the AATF is based on a symmetric data model, this does not limit the possibilities. We integrate granularities in the way that is most convenient for each of two different task groups: (1) Performing the tasks as defined in the AATF, but basing the reference domain on granularities instead of flat and linear time. (2) Finding the granularities that are relevant in the first place. Applications of the calendar aspect of time have so far only been considered on a basis where the important granularities in a dataset are already known. Finding those granularities is a challenging task in its own right that we also describe.

4.4.1. Granularities in the Reference Domain

Without granularities, the references for time-oriented data are timestamps. To use granules as a measure, we need to count them. Bettini et al. [BJW00] define a way to assign labels to granules. We use a simplified form that is more compatible with the AATF: Let $g(t) = l$; $t \in R_D$; $l \in \mathbb{Z}$ be the label function that maps a chronon in the discrete time domain to an integer label. This label refers to a granule of a granularity; for example, 1 can have a text equivalent of January. A dataset using granularities therefore has a data function $\dot{f}(l) = c$, mapping a granule label reference to a characteristic $c$. A conventional data function $f(x)$ can be mapped to $\dot{f}(l)$. However, as granularities are formed by grouping, the characteristics also need to be grouped. Possibilities include replacing one characteristic by a set of them, or aggregating them, for example by mean, median, or sum.
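The mapping from a conventional data function to a granule-based one can be sketched as follows. The sketch assumes that chronons are days counted from 1 January 1970 and that the granularity of interest is the month; `g_month`, `aggregate`, and the toy data are hypothetical names and values chosen for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

EPOCH = date(1970, 1, 1)  # assumption: chronon 0 is this day


def g_month(t: int) -> int:
    """Label function g(t) = l: maps a day chronon to its month label (1 = January)."""
    return (EPOCH + timedelta(days=t)).month


def aggregate(f: dict, g, agg=mean) -> dict:
    """Turn f(x) into f_dot(l) by grouping characteristics per label and aggregating."""
    groups = defaultdict(list)
    for t, c in f.items():
        groups[g(t)].append(c)
    return {label: agg(cs) for label, cs in sorted(groups.items())}


# Toy data function: the value at chronon t is simply t (January and February 1970).
f = {t: float(t) for t in range(59)}
f_dot = aggregate(f, g_month)
print(f_dot)  # {1: 15.0, 2: 44.5}
```

Passing `agg=list` instead of `mean` gives the other variant mentioned above, in which one characteristic is replaced by a set of them.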

Furthermore, a data function can be formed as $\dot{f}(l_1; l_2; \ldots) = c$, having different values for granule combinations, like January in 1970, and so on. All tasks working on chronons can also be performed working on granules of one granularity. Furthermore, it is possible to use two different functions using different granularities, but stemming from the same original function, when tasks with two functions are performed. For example, $?y_1, y_2, \lambda: \dot{f}_1(l_1; l_2) = y_1; \dot{f}_2(l_2) = y_2; y_1 \lambda y_2$ can mean “compare the value in January 1970 with the average of 1970”. Behavior comparison tasks gain an important meaning in conjunction with granularities. Often, a pattern is characterized by stating that a range in time belonging to one granularity is similar to another granularity. For example, “bridging days are similar to holidays”.
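The comparison “January 1970 versus the average of 1970” can be written down as two granule-based functions derived from the same original data. The Python sketch below assumes day chronons counted from 1 January 1970, aggregation by mean, and invented data in which each day's value equals its month number; the function names are hypothetical.

```python
from datetime import date, timedelta
from statistics import mean

EPOCH = date(1970, 1, 1)  # assumption: chronon 0 is this day


def day(t: int) -> date:
    return EPOCH + timedelta(days=t)


# Hypothetical original data function f(x) for all 365 days of 1970.
f = {t: float(day(t).month) for t in range(365)}


def f1_dot(month: int, year: int) -> float:
    """f1_dot(l1; l2): mean over the days of one month in one year."""
    return mean([c for t, c in f.items()
                 if day(t).month == month and day(t).year == year])


def f2_dot(year: int) -> float:
    """f2_dot(l2): mean over all days of one year."""
    return mean([c for t, c in f.items() if day(t).year == year])


def relation(y1: float, y2: float) -> str:
    """The unknown lambda of the task, answered as a phrase."""
    if y1 < y2:
        return "lower than"
    return "higher than" if y1 > y2 else "equal to"


y1, y2 = f1_dot(1, 1970), f2_dot(1970)
print(f"January 1970 ({y1}) is {relation(y1, y2)} the 1970 average ({y2:.2f})")
```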

4.4.2. Finding Granularities

When searching for the granularities that are important for a dataset, many comparisons with different label functions are needed. This is easier when the label functions are treated as equivalent to data functions. In that case, the chronons are the references and the labels are the characteristics. A simple task can look like this: $?y, l, x: f(x) = y; g(x) = l; y \Lambda l$, and an example would be “Which Januarys have high average values?”. The same task could then be performed with other granularities until something significant shows up, rendering one granularity interesting. Connectional tasks (see AATF [AA06, p. 124]) seem more suitable for finding granularities: $\rho(f(x), g(x) \mid x \in R)$ can be considered the mutual behavior of the data and a granule label, which can be directly translated to the question “does this granularity have an influence?”. The scatterplots used as an example in the AATF [AA06, p. 126] can only show the influence of one granularity at a time, but visualizations like GROOVE [LAB^∗09], based on the recursive pattern technique by Keim et al. [KKA95], can show the mutual behavior of one data characteristic and four or more different granularities.
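One simple way to put a number on the question “does this granularity have an influence?” is the correlation ratio (eta squared), the share of the variance of the data that is explained by the granule label. The choice of this statistic is an assumption made here for illustration; the AATF does not prescribe a particular measure for $\rho$. The data and the two candidate label functions in the sketch are invented.

```python
from collections import defaultdict
from statistics import mean


def influence(f: dict, g) -> float:
    """Correlation ratio of f(x) with respect to the label g(x); 0 = none, 1 = full."""
    values = list(f.values())
    overall = mean(values)
    total = sum((y - overall) ** 2 for y in values)
    if total == 0:
        return 0.0  # constant data cannot be explained by any label
    groups = defaultdict(list)
    for x, y in f.items():
        groups[g(x)].append(y)
    between = sum(len(ys) * (mean(ys) - overall) ** 2 for ys in groups.values())
    return between / total


def day_of_week(x: int) -> int:
    return x % 7


def day_of_four(x: int) -> int:
    return x % 4  # a hypothetical four-chronon cycle as a competing candidate


# Invented data: a peak every seventh chronon over four weeks.
f = {x: 10.0 if x % 7 == 0 else 1.0 for x in range(28)}

print(round(influence(f, day_of_week), 3))  # close to 1: the weekly cycle explains the data
print(round(influence(f, day_of_four), 3))  # close to 0: the four-chronon cycle does not
```

Running the same function over a list of candidate label functions corresponds to repeating the task with other granularities until one of them stands out.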

4.5. Application and Rule Set

If we consider one of the insights from Section 1, like “The first Monday is high, the second is lower, but it rises again on the third and fourth.” [SML^∗09], the task would be “describe the behavior of the characteristic value over the Mondays”, which can be formulated as $?p: \beta(\dot{f}(l; 1) \mid l \in \mathbb{Z}) \approx p$, where $l$ is a variable week label and the second parameter fixes the day to Monday. The AATF also allows time to be spread across more dimensions in the form $?p: \beta(f(x_1; x_2) \mid x_1, x_2 \in R) \approx p$, but it provides no rules for how to distribute time.
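The Monday task can be sketched by fixing the day label and letting the week label vary. The Python example below uses ISO week numbers as week labels and invented values for January 1970; the helper names and the crude step-wise description standing in for the pattern $p$ are assumptions for illustration.

```python
from datetime import date, timedelta

# Hypothetical daily values for January 1970; only the Mondays differ from 1.0.
MONDAYS = {date(1970, 1, 5): 9.0, date(1970, 1, 12): 5.0,
           date(1970, 1, 19): 7.0, date(1970, 1, 26): 8.0}
f = {}
for offset in range(31):
    current_day = date(1970, 1, 1) + timedelta(days=offset)
    f[current_day] = MONDAYS.get(current_day, 1.0)


def values_over_weekday(data: dict, weekday: int = 1) -> list:
    """Collect (ISO week label, value) pairs for one weekday; 1 means Monday."""
    return [(d.isocalendar()[1], data[d])
            for d in sorted(data) if d.isoweekday() == weekday]


def describe(series: list) -> list:
    """A crude stand-in for the behavior pattern p: the direction of each step."""
    steps = []
    for (_, previous), (_, current) in zip(series, series[1:]):
        if current > previous:
            steps.append("rises")
        elif current < previous:
            steps.append("falls")
        else:
            steps.append("stays")
    return steps


mondays = values_over_weekday(f)
print(mondays)            # [(2, 9.0), (3, 5.0), (4, 7.0), (5, 8.0)]
print(describe(mondays))  # ['falls', 'rises', 'rises']
```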

Another example: Data of a stock index and individual buy and sell orders about the stock are to be analyzed. The task $?R_1, R_2, p_1, p_2: R_1 \Psi R_2; \beta(f(x) \mid x \in R_1) \approx p_1; \beta(f(x) \mid x \in R_2) \approx p_2; p_1 \Lambda p_2$ could lead to the question “are there any times with many sales while the stock price is dropping?”, but this is only one of many. For the data dimension, many and few transactions and falling and rising stocks are well-known terms, but what about time? Section 4.2 gives a full list of relations. Between an instant and an interval, these are happens before, starts, finishes, and happens during; between two intervals, they are happens before, meets, overlaps, starts, happens during, and finishes. People developing a task set might need to decide which of them to include, and whether they need only ordinal or also discrete relations. But they have a list to check, and might, for example, find the important case of “are there any times with many sell orders meeting an interval when the stock price is dropping?”, a possible cue for insider trading.
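The “meets” question from this example can be sketched on toy data. In the Python code below, the two series, the threshold of 30 orders, and all function names are invented; intervals are maximal runs of consecutive chronons, written as `(start, end)` with both ends included, and “meets” is the discrete relation $s_2 - e_1 = 1$ from Section 4.2.

```python
def runs(flags: list) -> list:
    """Maximal runs of consecutive True flags, as inclusive (start, end) intervals."""
    intervals, start = [], None
    for t, flag in enumerate(flags):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            intervals.append((start, t - 1))
            start = None
    if start is not None:
        intervals.append((start, len(flags) - 1))
    return intervals


def meets(a: tuple, b: tuple) -> bool:
    """Discrete 'meets': b starts exactly one chronon after a ends."""
    return b[0] - a[1] == 1


# Invented data, one value per chronon.
sell_orders = [2, 3, 40, 55, 4, 2, 3, 1]
price = [100, 101, 101, 102, 99, 97, 93, 94]

many_sales = runs([n > 30 for n in sell_orders])  # the threshold is an assumption
dropping = runs([t > 0 and price[t] < price[t - 1] for t in range(len(price))])
hits = [(a, b) for a in many_sales for b in dropping if meets(a, b)]

print(many_sales, dropping, hits)
```

Here the burst of sell orders in chronons 2 to 3 is immediately followed by the price drop in chronons 4 to 6, so the pair is reported as a hit.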

To discern if all tasks have been found (or to state which tasks have to be searched), task developers have to phrase all relevant tasks by going through the AATF and for each task, going through the aspects mentioned in this paper:

Scale/Time Primitives When the task involves relations on time, go through all time primitive relations according to the scale of the dataset.

Viewpoints When the dataset has multiple viewpoints, phrase all tasks involving two functions accordingly, calling the different viewpoints.

Granularities Phrase the tasks for finding the appropriate granularities. Only a finite list of granularities can be checked, but this list can be expanded by an automated search for cycles in the dataset. Perform the tasks to actually find the granularities. Phrase the tasks involving relations on time using the granularities found.

A complete list would most likely exceed the number of tasks that can be performed in a study, but task developers can use it to make sure nothing important is missed.
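The checklist character of the rule set can be illustrated by enumerating combinations mechanically. The Python sketch below is only one conceivable way to do this: it covers the interval relations of Section 4.2, the task templates and granularities passed in are placeholders, and the function names are hypothetical.

```python
from itertools import product

# Interval-interval relations from Section 4.2 that exist on an ordinal scale.
ORDINAL_INTERVAL_RELATIONS = ["happens before", "overlaps", "starts",
                              "happens during", "finishes"]


def interval_relations(scale: str) -> list:
    """Relations available for the given scale; 'meets' needs a discrete scale."""
    if scale == "ordinal":
        return list(ORDINAL_INTERVAL_RELATIONS)
    if scale == "discrete":
        return ORDINAL_INTERVAL_RELATIONS + ["meets"]
    raise ValueError(f"unsupported scale: {scale}")


def checklist(task_templates: list, scale: str, granularities: list) -> list:
    """Every (task, relation, granularity) combination a developer could consider."""
    return list(product(task_templates, interval_relations(scale), granularities))


candidates = checklist(["lookup", "comparison"], "discrete", ["day", "week"])
print(len(candidates))  # 24
```

Task developers would then strike out the combinations that are irrelevant for their dataset, which keeps the selection explicit.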

5. Conclusion and Future Work

First, we have listed restrictions of state-of-the-art task frameworks. Second, we have described the structure of time, which is an important influence on time-oriented data, and shown that it is not considered sufficiently by existing task frameworks. To help set up data-centric tasks for a top-down analysis or evaluate the results of a bottom-up analysis, we have then provided a rule set for integrating the structure of time into a complete and formal task framework. This rule set makes it possible to formulate tasks consistently for evaluating time-oriented data analysis methods.

So far, our rule set does not contain concrete tasks. These tasks can be formulated for time-oriented data in general, but in practice, it will be more important to formulate them directly for a dataset that will be used to test various systems. So the main part of future work will be the application of this work. The tasks as defined in the AATF [AA06] are used as a basis for EDA. Further task groups that VA intends to solve are forecasting and developing options [TC05, KMS^∗08]. Task frameworks involving these groups also have to consider the structure of time.

Acknowledgments This work was supported by the FWF Austrian Science Fund. Project number: P22883. We also wish to thank Heidrun Schumann and Christian Tominski from the University of Rostock for their valuable input.


References

[AA06] ANDRIENKO N., ANDRIENKO G.: Exploratory analysis of spatial and temporal data: a systematic approach. Springer, Berlin, 2006.

[AES05] AMAR R., EAGAN J., STASKO J. T.: Low-Level components of analytic activity in information visualization. In Proc. IEEE Symp. Information Visualization (INFOVIS 2005) (2005), pp. 111–117.

[All83] ALLEN J.: Maintaining knowledge about temporal intervals. Communications of the ACM 26, 11 (1983), 832–843.

[AMST11] AIGNER W., MIKSCH S., SCHUMANN H., TOMINSKI C.: Visualization of Time-Oriented Data. Springer, London, UK, 2011.

[AMTB05] AIGNER W., MIKSCH S., THURNHER B., BIFFL S.: PlanningLines: Novel Glyphs for Representing Temporal Uncertainties and their Evaluation. In Proc. 9th Int. Conf. Information Visualisation (IV 2005) (2005), Banissi E., et al., (Eds.), IEEE, pp. 457–463.

[AS05] AMAR R. A., STASKO J. T.: Knowledge precepts for design and evaluation of information visualizations. IEEE Trans. Visualization and Computer Graphics 11, 4 (2005), 432–442.

[BJW00] BETTINI C., JAJODIA S., WANG S.: Time Granularities in Databases, Data Mining and Temporal Reasoning. Springer, Berlin, 2000.

[KKA95] KEIM D., KRIEGEL H.-P., ANKERST M.: Recursive Pattern: A Technique for Visualizing very Large Amounts of Data. In Proc. IEEE Visualization (Vis95) (1995), pp. 279–286.

[KMS^∗08] KEIM D., MANSMANN F., SCHNEIDEWIND J., THOMAS J., ZIEGLER H.: Visual Analytics: Scope and Challenges. In Visual Data Mining, Simoff S. J., Böhlen M. H., Mazeika A., (Eds.), LNCS 4404. Springer, Berlin, 2008, pp. 76–90.

[LAB^∗09] LAMMARSCH T., AIGNER W., BERTONE A., GÄRTNER J., MAYR E., MIKSCH S., SMUC M.: Hierarchical Temporal Patterns and Interactive Aggregated Views for Pixel-based Visualizations. In Proc. Int. Conf. Information Visualization (IV09) (2009), IEEE, pp. 44–49.

[Mac95] MACEACHREN A. M.: How Maps Work. Guilford Press, New York, 1995.

[Peu94] PEUQUET D. J.: It’s about time: A conceptual framework for the representation of temporal dynamics in geographic information systems. Annals of the Association of American Geographers 84, 3 (1994), 441–461.

[Shn96] SHNEIDERMAN B.: The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In Proc. IEEE Symp. Visual Languages (1996), pp. 336–343.

[SML^∗09] SMUC M., MAYR E., LAMMARSCH T., AIGNER W., MIKSCH S., GÄRTNER J.: To Score or Not to Score? Tripling Insights for Participatory Design. IEEE Computer Graphics and Applications 29, 3 (2009), 29–38.

[TC05] THOMAS J., COOK K.: Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE, 2005.

[YKSJ07] YI J. S., KANG Y. A., STASKO J. T., JACKO J. A.: Toward a deeper understanding of the role of interaction in information visualization. IEEE Trans. Visualization and Computer Graphics 13, 6 (2007), 1224–1231.