<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://recsyswiki.com/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Usabart</id>
	<title>RecSysWiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://recsyswiki.com/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Usabart"/>
	<link rel="alternate" type="text/html" href="https://recsyswiki.com/wiki/Special:Contributions/Usabart"/>
	<updated>2026-04-22T15:34:06Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.34.2</generator>
	<entry>
		<id>https://recsyswiki.com/index.php?title=User:Usabart&amp;diff=2027</id>
		<title>User:Usabart</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=User:Usabart&amp;diff=2027"/>
		<updated>2013-10-16T03:55:56Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== How to include students in your conference ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''TL;DR: if we really care about the future of our field, we need to let all students register for the student rate, and allow them to attend the banquet.'''&lt;br /&gt;
&lt;br /&gt;
RecSys2009 was my very first conference, and it was an amazing experience. I was not even a PhD student yet, but the European project I was working on had some funding, and they decided to let me attend. I felt welcomed and included in this small and vocal community. Sure, talking to the bigwigs like Pearl, Joe and John was still a bit scary, but I loved how friendly and open everyone was. I presented a poster there, and the experience of &amp;quot;selling&amp;quot; my work to like-minded professionals was extremely useful... it got me excited about doing more research and contributing to this wonderful field.&lt;br /&gt;
&lt;br /&gt;
To put my money where my mouth is, the next year I organized my own workshop (UCERSTI), and it was amazing to see that these famous senior researchers attended it, and liked it. By then, I had started to feel like a regular myself, inviting new members into the community that is recsys.&lt;br /&gt;
&lt;br /&gt;
Over time the number of conferences I attend has steadily increased; this year I am attending 8 of them! But I always seem to come back to recsys... even though my research is not central to the topic, I still regard it as my home conference. Why? Because it is a small community that is very accessible to students like me. Despite my hyperactive demeanor and the occasional vitriol on Twitter, I feel like people accept me. In fact, the recsys regulars respect and include all students.&lt;br /&gt;
&lt;br /&gt;
But this year there were two things that put students at a disadvantage. That's not to say that students suddenly didn't feel welcome anymore, but these developments do endanger student inclusion and should therefore, in my opinion, be discontinued. These two things are 1) the (since reversed) decision to make authors pay the full non-student rate, and 2) the exclusion of students from the banquet.&lt;br /&gt;
&lt;br /&gt;
'''Why charging students a lower rate is a good idea'''&lt;br /&gt;
Let's face it: conferences, especially in computing, are expensive. Half the time you have to fly to a different continent, stay at a hotel, pay a registration fee... and someone has to pay for this. Universities typically have a limited budget, so anything that makes it cheaper for students to attend helps enormously. In fact, I've talked to a number of professors, and they tell me that they probably wouldn't let their students attend if they had to pay the full rate.&lt;br /&gt;
This year the recsys organizers decided that all authors (even students) would have to pay the full conference rate. This leads to the weird situation that presenting students pay a higher rate than non-presenting students. If that sounds backwards to you, it's because it is. If anything, presenters should pay less: they contribute to the conference by having a paper; they make the conference happen.&lt;br /&gt;
After some angry emails and tweets, the organizers decided to revert this decision. This is a relief... I am afraid that if this hadn't happened, most student papers would have been presented by their advisors instead, and these students would have missed out on this extremely useful opportunity to learn how to present their work.&lt;br /&gt;
&lt;br /&gt;
'''Why not admitting students to the banquet is a bad idea'''&lt;br /&gt;
Only last night I found out that the &amp;quot;all inclusive&amp;quot; student registration was a lie: unlike at virtually every other conference, students were not invited to the banquet. I was stopped at the door, and I was about to throw a tantrum, but instead I just brazenly bluffed myself in (how else would I have been able to give the Foster City presentation?).&lt;br /&gt;
The absence of students became painfully clear when none of the runners-up for the best paper award (all students) were there to accept their recognition. For those (and all other) students, this is a missed opportunity to network for industry or post-doctoral positions.&lt;br /&gt;
But the problem I have with this policy runs deeper. In his presentation, the conference chair highlighted how important students are to the community, but excluding students from things like the banquet makes an implicit statement that students are a separate, less important part of it. If students are really that important, they deserve the same treatment as (or even better treatment than) professors and practitioners.&lt;br /&gt;
&lt;br /&gt;
'''Funding and moderation'''&lt;br /&gt;
The most obvious response to my argument is of course &amp;quot;there is not enough money&amp;quot;. But we have to think about where the money comes from and where it goes. I feel that this year the funding was lower than in previous years, and that's understandable with the financial crisis, but we have to be very careful to spread these burdens across the community equally, not place them just on the students. Maybe we don't need a catered lunch. Maybe the banquet doesn't need unlimited wine. Maybe we don't need fancy cakes at every coffee break. These things are all unnecessary luxuries that we've gotten used to, but that we should easily be able to do without. I'm sure the profs will give up their cake in return for more student involvement.&lt;br /&gt;
&lt;br /&gt;
'''How to make it even better'''&lt;br /&gt;
So I want to make a plea for next year: let all students register for the student rate, and make sure that they are allowed to attend every part of the conference. Beyond that, I think we can make it even better for students, especially for new attendees. Lately I've been to some business school conferences, and they typically have a &amp;quot;new member orientation&amp;quot; session during the first coffee break. This session shows new attendees what the conference is about, explains some of the traditions, and encourages them to actively participate. I think we should do something like that at recsys as well.&lt;br /&gt;
&lt;br /&gt;
'''Because we rely on students to keep this wonderful community growing!'''&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Beyond_Algorithms:_An_HCI_Perspective_on_Recommender_Systems&amp;diff=318</id>
		<title>Beyond Algorithms: An HCI Perspective on Recommender Systems</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Beyond_Algorithms:_An_HCI_Perspective_on_Recommender_Systems&amp;diff=318"/>
		<updated>2011-03-01T08:01:53Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;== Groundbreaking work? == In the field of recommender systems, the paper &amp;quot;[http://www.inf.unibz.it/~ricci/ATIS/papers/swearingen01beyond.pdf Beyond Algorithms: An HCI Perspectiv...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Groundbreaking work? ==&lt;br /&gt;
In the field of recommender systems, the paper &amp;quot;[http://www.inf.unibz.it/~ricci/ATIS/papers/swearingen01beyond.pdf Beyond Algorithms: An HCI Perspective on Recommender Systems]&amp;quot; by Kirsten Swearingen and Rashmi Sinha is often cited as one of the first papers to address the usability of recommender systems. In marketing and information systems research, a paper by Gerald Häubl and Valerie Trifts titled &amp;quot;[http://www.ecs.umass.edu/~eshittu/Toyin/Consumer%20Decision%20Making%20In%20Online%20Shopping%20Environment.pdf Consumer Decision Making in Online Shopping Environments: The Effects of Interactive Decision Aids]&amp;quot; is often cited as the first attempt to address this topic, and this paper predates Swearingen and Sinha.&lt;br /&gt;
&lt;br /&gt;
== Short summary ==&lt;br /&gt;
The paper studies several different existing book and movie recommender systems from both a [[Quantitative user experiments or field trials|quantitative]] and [[Qualitative user-studies|qualitative]] perspective, using both [[Objective evaluation measures|objective]] and [[Subjective evaluation measures|subjective]] evaluation metrics. Swearingen and Sinha find that users prefer friends' recommendations over system recommendations. Moreover, they conclude that the usefulness of a recommender system can be predicted by the number of good and useful recommendations it provides, the detail of its item descriptions, the transparency of its reasoning, and the number of trust-generating recommendations (recommendations the user already knows and likes) it provides. The time and effort it takes to get to the recommendations does not seem to matter, and the total number of recommendations the system provides also has no effect.&lt;br /&gt;
&lt;br /&gt;
== Critical reflection ==&lt;br /&gt;
The paper by Swearingen and Sinha is also often cited as a '''good''' example of HCI research in recommender systems. However, the paper has several methodological deficiencies that may force us to weaken this favorable appraisal.&lt;br /&gt;
&lt;br /&gt;
As the systems studied in this paper differ from each other on several dimensions, it becomes very hard to attribute the differences between the systems to a specific quality of a system. The authors try to get around this by getting direct subjective measures of these qualities. However, since these qualities are correlated within the different systems, the effects can be confounded. Tight statistical control of additional factors can potentially solve this issue, and a regression analysis would provide such control. The authors, however, chose to do a correlation analysis, in which there is no statistical control for the correlation between predictor variables. For instance, let's say that all systems that give detailed item descriptions also provide more insight into their reasoning, and all systems that do not provide detailed descriptions also do not provide this insight. In that case, insight and description detail are highly correlated, and it is impossible to disentangle their effects on the usefulness of the recommender system.&lt;br /&gt;
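&lt;br /&gt;
A minimal simulation sketch of this point (simulated data and hypothetical variable names, not the authors' dataset): when two predictors are nearly collinear, both show a strong pairwise correlation with the outcome, while a multiple regression that controls for the other predictor can reveal that only one of them carries an independent effect.&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
import pandas as pd&lt;br /&gt;
import statsmodels.formula.api as smf&lt;br /&gt;
&lt;br /&gt;
rng = np.random.default_rng(42)&lt;br /&gt;
n = 200&lt;br /&gt;
insight = rng.normal(size=n)&lt;br /&gt;
# description detail is nearly collinear with insight (confounded predictors)&lt;br /&gt;
detail = insight + rng.normal(scale=0.5, size=n)&lt;br /&gt;
# usefulness truly depends on insight only&lt;br /&gt;
usefulness = 0.8 * insight + rng.normal(size=n)&lt;br /&gt;
df = pd.DataFrame({'insight': insight, 'detail': detail, 'usefulness': usefulness})&lt;br /&gt;
&lt;br /&gt;
# pairwise correlations: both predictors correlate strongly with usefulness&lt;br /&gt;
print(df.corr().round(2))&lt;br /&gt;
# multiple regression: detail loses its apparent effect once insight is controlled for&lt;br /&gt;
print(smf.ols('usefulness ~ insight + detail', data=df).fit().summary())&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;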
&lt;br /&gt;
Another problem is that the described experiment only has 19 participants. The authors try to get around this by letting each user use each of the 6 systems. They avoid order effects by randomizing the order of presentation, but they do not test for the significance of an order effect. Furthermore, it seems like their correlational measures treat the six observations per participant as separate data points, while in fact these data are very likely to be correlated (as they come from the same user). The power of the correlation tests is thereby artificially inflated, and the reported results are likely to be insignificant after controlling for repeated measurements.&lt;br /&gt;
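&lt;br /&gt;
A rough simulation sketch of this issue (simulated data, not the study's): if the 6 observations per participant are pooled as if they were 114 independent data points, a nominal Pearson correlation test rejects far more often than the 5% it promises, even when the two measures are entirely unrelated and only share within-participant clustering.&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
from scipy import stats&lt;br /&gt;
&lt;br /&gt;
rng = np.random.default_rng(0)&lt;br /&gt;
n_users, n_systems, n_sims, alpha = 19, 6, 2000, 0.05&lt;br /&gt;
false_pos = 0&lt;br /&gt;
for _ in range(n_sims):&lt;br /&gt;
    # x and y are truly unrelated, but each is clustered within participants&lt;br /&gt;
    ux = rng.normal(size=(n_users, 1))&lt;br /&gt;
    uy = rng.normal(size=(n_users, 1))&lt;br /&gt;
    x = ux + rng.normal(size=(n_users, n_systems))&lt;br /&gt;
    y = uy + rng.normal(size=(n_users, n_systems))&lt;br /&gt;
    # the naive test pretends the 19 * 6 = 114 observations are independent&lt;br /&gt;
    r, p = stats.pearsonr(x.ravel(), y.ravel())&lt;br /&gt;
    false_pos += np.less(p, alpha)&lt;br /&gt;
print(false_pos / n_sims)  # substantially above the nominal 0.05&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;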
&lt;br /&gt;
Moreover, several correlations include measurements of time, or numbers/percentages. These metrics usually do not have a homogeneous error distribution. Due to heteroscedasticity, the calculated error of the Pearson correlation is likely to be incorrect.&lt;br /&gt;
&lt;br /&gt;
Furthermore, the correlations in their Table 1 are the only metrics for which significance tests are provided. Throughout the rest of the text, the authors repeatedly make claims about correlations or differences between systems, without providing the needed statistical evidence for these claims: &lt;br /&gt;
* &amp;quot;the perceived usefulness of a recommender system went up with an increase in the number of trust-generating recommendations (p6)&amp;quot;&lt;br /&gt;
* &amp;quot;This small design change correlated with a dramatic increase in % useful recommendation (p7)&amp;quot;&lt;br /&gt;
* &amp;quot;Navigation and layout seemed to be the most important factors--they correlated with ease of use and perceived usefulness of system (p7)&amp;quot;&lt;br /&gt;
* &amp;quot; % Good Recommendations was positively related to Perceived System Transparency (p8)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Also, it is unclear how &amp;quot;RS usefulness&amp;quot; was measured. It seems that this was a single questionnaire item, which is inadequate for robust measurement (unless the item is thoroughly validated in previous studies). If this was in fact measured with multiple items, then the authors should have reported a reliability measure for the constructed scale.&lt;br /&gt;
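&lt;br /&gt;
For reference, such a reliability measure (Cronbach's alpha) can be computed directly from the raw item responses; a minimal sketch, assuming a hypothetical data frame with one column per item and one row per respondent:&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import pandas as pd&lt;br /&gt;
&lt;br /&gt;
def cronbach_alpha(items):&lt;br /&gt;
    # items: pandas DataFrame, one column per questionnaire item, one row per respondent&lt;br /&gt;
    k = items.shape[1]&lt;br /&gt;
    item_variances = items.var(axis=0, ddof=1).sum()&lt;br /&gt;
    total_variance = items.sum(axis=1).var(ddof=1)&lt;br /&gt;
    return k / (k - 1) * (1 - item_variances / total_variance)&lt;br /&gt;
&lt;br /&gt;
# e.g. cronbach_alpha(responses[['useful_1', 'useful_2', 'useful_3']])&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;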
&lt;br /&gt;
Finally, the authors compare (although again not statistically) the perceived quality of recommendations provided by the system with those provided by the users' friends. There is however no explanation of how these friends' recommendations were gathered, or how they were presented to the user. A different presentation method, or merely the fact that users knew these recommendations were provided by their friends, may have caused the difference in recommendation quality.&lt;br /&gt;
&lt;br /&gt;
The authors do provide insightful qualitative comments from their users. Such qualitative findings however do not warrant the generalization of the results beyond the studied systems (something the authors acknowledge in the Limitations section, but seem to ignore in the rest of the paper when drawing conclusions), and arguably also raise doubts about the ascribed status of the paper as a seminal work.&lt;br /&gt;
[[category:user-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Standardization_of_user-centric_evaluation&amp;diff=317</id>
		<title>Standardization of user-centric evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Standardization_of_user-centric_evaluation&amp;diff=317"/>
		<updated>2011-03-01T04:56:21Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Impossible standardization ==&lt;br /&gt;
Standardization of user-centric research is difficult because the procedures, methods, and metrics used are highly context-dependent. In other words: Two user-centric research projects rarely use exactly the same system, a similar set of users, or the same evaluation metric. More importantly, because both usability and user experience are multi-dimensional, they rarely have the same goal. Rigidly standardized evaluation metrics are thus not feasible in user-centric research. However, early attempts have been made to integrate user-centric research findings under a common framework.&lt;br /&gt;
&lt;br /&gt;
== Generic frameworks ==&lt;br /&gt;
The concept of [[usability]] can be traced back to the cognitive psychological concepts of perception, attention, and memory. An early conceptualization of usability is provided by Don Norman in his seminal work [http://interface-design-10.wdfiles.com/local--files/october-1/DesignofEverydaythings.pdf The Design of Everyday Things]. Norman describes the interaction between users and systems as consisting of two gulfs: the gulf of execution and the gulf of evaluation. In the gulf of execution, the user, who has a goal, has to formulate an intention, translate this into the correct action sequence, and then perform this action sequence using the system's interface. In the gulf of evaluation, the user has to perceive the state of the system, interpret this state, and then evaluate the state in relation to the original goal. Interaction is thus a perpetual bridging of the gulfs of execution and evaluation. In order to bridge the gulfs effectively, users create an internal mental model (the Use model) of the system, which represents their beliefs about how the system works. The creation of such a Use model is helped by the feedforward and feedback provided by the system. The more the Use model resembles the actual way the system works (the System model), the better the usability of the system. Jakob Nielsen provides a classification of [http://www.useit.com/papers/heuristic/heuristic_list.html 10 usability evaluation heuristics]. Nielsen uses these heuristics as guidelines for his Heuristic Evaluation method. However, the guidelines can also be used to categorize usability problems.&lt;br /&gt;
&lt;br /&gt;
The concept of [[user experience]] can be traced back to the social psychological concepts of attitude, intention, and behavior. The most influential model in this respect is the [http://www-unix.oit.umass.edu/~psyc661/pdf/tpb.obhdp.pdf Theory of Planned Behavior (TPB)] (and its predecessor the Theory of Reasoned Action, TRA) by Icek Ajzen and Martin Fishbein. This model claims that our behavioral intentions are based on attitudinal and normative evaluations, and that these intentions, given enough behavioral control, lead to actual behaviors. The attitudinal part of this model has been adopted in the [http://www.istheory.yorku.ca/Technologyacceptancemodel.htm Technology Acceptance Model (TAM)]; the normative part of TPB has been adopted in the [http://www.istheory.yorku.ca/UTAUT.htm Unified Theory of Acceptance and Use of Technology (UTAUT)].&lt;br /&gt;
&lt;br /&gt;
== A descriptive framework ==&lt;br /&gt;
Bo Xiao and Izak Benbasat have made an extensive effort to integrate existing work on user-centric recommender system evaluation into a framework in their 2007 paper titled &amp;quot;[http://opim.wharton.upenn.edu/~kartikh/reading/ib1.pdf e-Commerce Product Recommendation Agents: Use, Characteristics, and Impacts]&amp;quot;. Although the paper focuses mainly on the body of research available in the field of Information Systems, it also includes some work from the Human-Computer Interaction and Recommender Systems fields.&lt;br /&gt;
&lt;br /&gt;
== An integrative framework ==&lt;br /&gt;
Bart Knijnenburg et al. provide an [http://www.usabart.nl/portfolio/framework.pdf evaluation framework] that can be used as a guideline for conducting and analyzing quantitative user-centric research on recommender systems. Their approach focuses on [[quantitative user experiments or field trials]], and includes [[Subjective evaluation measures|subjective]] as well as [[Objective evaluation measures|objective]] evaluation measures. Specifically, it reasons that objective system aspects, as perceived by the user (subjective system aspects), impact the user's attitude (user experience) and behavior (interaction). Attitude and behavior are also influenced by personal and situational characteristics. The framework by Knijnenburg et al. is not meant as a standardized evaluation metric to provide an &amp;quot;experience score&amp;quot; for a single recommender system, but as a guideline for controlled experiments that compare two or more systems that systematically differ in one or more objective system aspects. Using the framework, researchers can measure and explain the influence of these system aspects on the user experience.&lt;br /&gt;
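&lt;br /&gt;
As a rough illustration of how such a chain of effects can be analyzed, the sketch below uses a series of ordinary regressions (a simplification of the structural equation models usually used with this framework); the data file and column names are hypothetical.&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import pandas as pd&lt;br /&gt;
import statsmodels.formula.api as smf&lt;br /&gt;
&lt;br /&gt;
# hypothetical experiment: two algorithm conditions (objective system aspect),&lt;br /&gt;
# a perceived-quality scale (subjective system aspect), a satisfaction scale&lt;br /&gt;
# (experience), clicks (interaction), and expertise (personal characteristic)&lt;br /&gt;
df = pd.read_csv('experiment.csv')&lt;br /&gt;
&lt;br /&gt;
# does the manipulated objective aspect change the user's perception?&lt;br /&gt;
m1 = smf.ols('perceived_quality ~ C(condition) + expertise', data=df).fit()&lt;br /&gt;
# does perception, rather than the raw manipulation, drive the experience?&lt;br /&gt;
m2 = smf.ols('satisfaction ~ perceived_quality + C(condition) + expertise', data=df).fit()&lt;br /&gt;
# does the experience translate into interaction behavior?&lt;br /&gt;
m3 = smf.ols('clicks ~ satisfaction + expertise', data=df).fit()&lt;br /&gt;
&lt;br /&gt;
for m in (m1, m2, m3):&lt;br /&gt;
    print(m.params.round(2))&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;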
&lt;br /&gt;
== A multi-dimensional metric ==&lt;br /&gt;
Pearl Pu and Li Chen also provide a [http://ceur-ws.org/Vol-612/paper3.pdf user-centric evaluation framework for recommender systems]. In contrast to Knijnenburg et al., they provide a specific list of questionnaire items that can be used to provide a standardized, multi-dimensional evaluation of a recommender system. The framework breaks down into perceived system qualities (recommendation quality, interaction adequacy, interface adequacy; similar to Knijnenburg et al.'s subjective system aspects), beliefs (perceived ease of use, perceived usefulness, and control and transparency; in Knijnenburg et al.'s framework these are divided over subjective system aspects and experience), attitudes (a more generic evaluation of the user experience in terms of satisfaction and trust; similar to Knijnenburg et al.'s experience), and behavioral intentions (intention to use and return to the system; partly similar to Knijnenburg et al.'s interaction).&lt;br /&gt;
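&lt;br /&gt;
To make the layering concrete, the sketch below writes this structure out as a simple data structure; the construct names follow the description above, but the item wordings are illustrative placeholders rather than the validated items from the paper.&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
# placeholder item wordings, grouped by the four layers described above&lt;br /&gt;
questionnaire = {&lt;br /&gt;
    'perceived system qualities': {&lt;br /&gt;
        'recommendation quality': ['The recommended items matched my interests.'],&lt;br /&gt;
        'interaction adequacy': ['The system let me express my preferences adequately.'],&lt;br /&gt;
        'interface adequacy': ['The layout of the interface was adequate.'],&lt;br /&gt;
    },&lt;br /&gt;
    'beliefs': {&lt;br /&gt;
        'perceived ease of use': ['I quickly became familiar with the system.'],&lt;br /&gt;
        'perceived usefulness': ['The system helped me find what I was looking for.'],&lt;br /&gt;
        'control and transparency': ['I understood why the items were recommended to me.'],&lt;br /&gt;
    },&lt;br /&gt;
    'attitudes': {&lt;br /&gt;
        'satisfaction': ['Overall, I am satisfied with the system.'],&lt;br /&gt;
        'trust': ['The system can be trusted.'],&lt;br /&gt;
    },&lt;br /&gt;
    'behavioral intentions': {&lt;br /&gt;
        'intention to use and return': ['I will use this system again.'],&lt;br /&gt;
    },&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
def construct_score(responses, items):&lt;br /&gt;
    # responses maps an item wording to a 1-5 Likert answer; returns the scale mean&lt;br /&gt;
    return sum(responses[i] for i in items) / len(items)&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;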
&lt;br /&gt;
== Benefits of using a framework ==&lt;br /&gt;
A recommender systems framework can be used to integrate existing and new research under a common denominator. Standardized terms can be used to compare research findings and to uncover gaps or inconsistencies in existing work. If a framework is adequately validated, it allows for a more robust measurement of subjective concepts, and the possibility to [[Simplifying user-centric evaluation|simplify evaluation]].&lt;br /&gt;
[[Category:User-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Simplifying_user-centric_evaluation&amp;diff=316</id>
		<title>Simplifying user-centric evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Simplifying_user-centric_evaluation&amp;diff=316"/>
		<updated>2011-03-01T04:55:49Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Reducing complexity ==&lt;br /&gt;
Aside from being time and resource intensive, user-centric evaluation of recommender systems can be a complicated endeavor. This especially holds for [[quantitative user experiments or field trials]] using one or more [[subjective evaluation measures]]. For quick iterations on an existing system, researchers may instead use [[qualitative user-studies]], but such [[formative evaluation]] methods cannot be used to generate statistically conclusive research findings.&lt;br /&gt;
&lt;br /&gt;
Some work has been done to simplify quantitative evaluation by reducing the complexity of measurement. Instead of creating a questionnaire with 5-7 items per concept, one can ask only the one or two most accurate items. Taking this one step further, one can forego asking questions altogether, and instead measure certain behavioral aspects that have been shown to correlate with specific subjective constructs. Previous research can be used to select questions or behavioral measures that robustly measure certain concepts. [[Standardization of user-centric evaluation|Standardized evaluation frameworks]] are helpful sources of robust metrics in this respect.&lt;br /&gt;
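&lt;br /&gt;
One way such a reduction could be done in practice is sketched below (hypothetical pilot data and column names): from pilot responses to a full multi-item scale, keep the one or two items that correlate most strongly with the rest of the scale, and administer only those in later studies.&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import pandas as pd&lt;br /&gt;
&lt;br /&gt;
# pilot responses: one column per questionnaire item of the same construct (1-5 Likert)&lt;br /&gt;
pilot = pd.read_csv('pilot_satisfaction_items.csv')&lt;br /&gt;
&lt;br /&gt;
# corrected item-total correlation: each item against the sum of the remaining items&lt;br /&gt;
item_total = {&lt;br /&gt;
    item: pilot[item].corr(pilot.drop(columns=item).sum(axis=1))&lt;br /&gt;
    for item in pilot.columns&lt;br /&gt;
}&lt;br /&gt;
# keep the two most representative items for the shortened questionnaire&lt;br /&gt;
short_scale = sorted(item_total, key=item_total.get, reverse=True)[:2]&lt;br /&gt;
print(short_scale)&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;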
&lt;br /&gt;
Once measurement has been simplified, the next step is to simplify evaluation. By only selecting one or two questions per concept, one avoids the need for Structural Equation Modeling or Factor Analysis. Instead of path models, simple linear regression or correlation can be used to evaluate the effects, provided that previous work has considered possible mediation effects.&lt;br /&gt;
&lt;br /&gt;
== Existing simplifications ==&lt;br /&gt;
Two simplifications have been proposed. Pearl Pu and Li Chen offer a simplified version of their [http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-612/paper3.pdf user-centric evaluation framework] that is essentially a subset of their questionnaire. Similarly, Knijnenburg et al. propose the idea of simplifying their [http://www.usabart.nl/portfolio/framework.pdf evaluation framework] as an evaluation toolbox.&lt;br /&gt;
[[category:User-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Category:User-centric_evaluation&amp;diff=315</id>
		<title>Category:User-centric evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Category:User-centric_evaluation&amp;diff=315"/>
		<updated>2011-03-01T04:44:59Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Whereas originally the field of recommender systems heavily focused on offline evaluation, recently awareness has grown that the [[usability]] and [[user experience]] of recommender systems should be tested in online evaluations with real users. User-centric evaluation methods can be broadly categorized into [[qualitative user-studies]] and [[quantitative user experiments or field trials]]. User studies typically combine [[objective evaluation measures]] with [[subjective evaluation measures]], typically in the form of [[design critiques]], [[interviews]], and [[questionnaires]].&lt;br /&gt;
&lt;br /&gt;
User-centric evaluation has had difficulties gaining popularity as an evaluation method, because it is often difficult to test new algorithms or systems with real users. Early examples in the recommender systems literature have been of [[Paper:Swearingen &amp;amp; Sinha (2001), Beyond Algorithms: An HCI Perspective on Recommender Systems|questionable quality]]. There have been a few suggestions for [[Standardization of user-centric evaluation|standardization]] and [[Simplifying user-centric evaluation|simplification]] of the user-centric evaluation process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Category:User-centric_evaluation&amp;diff=314</id>
		<title>Category:User-centric evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Category:User-centric_evaluation&amp;diff=314"/>
		<updated>2011-03-01T04:44:36Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Whereas originally the field of recommender systems heavily focused on offline evaluation, recently awareness has grown that the [[usability]] and [[user experience]] of recommender systems should be tested in online evaluations with real users. User-centric evaluation methods can be broadly categorized into [[qualitative user-studies]] and [[quantitative user experiments or field trials]]. User studies typically combine [[objective evaluation measures]] with [[subjective evaluation measures]], typically in the form of [[design critiques]], [[interviews]], and [[questionnaires]].&lt;br /&gt;
&lt;br /&gt;
User-centric evaluation has had difficulties gaining popularity as an evaluation method, because it is often difficult to test new algorithms or systems with real users. Early examples in the recommender systems literature have been of [[Paper:Swearingen &amp;amp; Sinha (2001), Beyond Algorithms: An HCI Perspective on Recommender Systems|questionable quality]]. There have been a few suggestions for [[Standardization of user-centric evaluation|standardization]] and [[Simplifying user-centric evaluation|simplification]] of the user-centric evaluation process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Simplifying_user-centric_evaluation&amp;diff=313</id>
		<title>Simplifying user-centric evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Simplifying_user-centric_evaluation&amp;diff=313"/>
		<updated>2011-03-01T04:33:41Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;== Reducing complexity == Aside from being time and resource intensive, user-centric evaluation of recommender systems can be a complicated endeavor. This especially holds for [[...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Reducing complexity ==&lt;br /&gt;
Aside from being time and resource intensive, user-centric evaluation of recommender systems can be a complicated endeavor. This especially holds for [[quantitative user experiments or field trials]] using one or more [[subjective evaluation measures]]. For quick iterations on an existing system, researchers may instead use [[qualitative user-studies]], but such [[formative evaluation]] methods cannot be used to generate statistically conclusive research findings.&lt;br /&gt;
&lt;br /&gt;
Some work has been done to simplify quantitative evaluation by reducing the complexity of measurement. Instead of creating a questionnaire with 5-7 items per concept, one can ask only the one or two most accurate items. Taking this one step further, one can forego asking questions altogether, and instead measure certain behavioral aspects that have been shown to correlate with specific subjective constructs. Previous research can be used to select questions or behavioral measures that robustly measure certain concepts. [[Standardization of user-centric evaluation|Standardized evaluation frameworks]] are helpful sources of robust metrics in this respect.&lt;br /&gt;
&lt;br /&gt;
Once measurement has been simplified, the next step is to simplify evaluation. By only selecting one or two questions per concept, one avoids the need for Structural Equation Modeling or Factor Analysis. Instead of path models, simple linear regression or correlation can be used to evaluate the effects, provided that previous work has considered possible mediation effects.&lt;br /&gt;
&lt;br /&gt;
== Existing simplifications ==&lt;br /&gt;
Two simplifications have been proposed. Pearl Pu and Li Chen offer a simplified version of their [http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-612/paper3.pdf user-centric evaluation framework] that is essentially a subset of their questionnaire. Similarly, Knijnenburg et al. propose the idea of simplifying their [http://www.usabart.nl/portfolio/framework.pdf evaluation framework] as an evaluation toolbox.&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Category:User-centric_evaluation&amp;diff=312</id>
		<title>Category:User-centric evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Category:User-centric_evaluation&amp;diff=312"/>
		<updated>2011-03-01T04:27:58Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Whereas originally the field of recommender systems heavily focused on offline evaluation, recently awareness has grown that the [[usability]] and [[user experience]] of recommender systems should be tested in online evaluations with real users. User-centric evaluation methods can be broadly categorized into [[qualitative user-studies]] and [[quantitative user experiments or field trials]]. User studies typically combine [[objective evaluation measures]] with [[subjective evaluation measures]], typically in the form of [[design critiques]], [[interviews]], and [[questionnaires]].&lt;br /&gt;
&lt;br /&gt;
User-centric evaluation has had difficulties gaining popularity as an evaluation method, because it is often difficult to test new algorithms or systems with real users. Early examples in the recommender systems literature have been of questionable quality. There have been a few suggestions for [[Standardization of user-centric evaluation|standardization]] and [[Simplifying user-centric evaluation|simplification]] of the user-centric evaluation process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Standardization_of_user-centric_evaluation&amp;diff=310</id>
		<title>Standardization of user-centric evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Standardization_of_user-centric_evaluation&amp;diff=310"/>
		<updated>2011-03-01T04:26:59Z</updated>

		<summary type="html">&lt;p&gt;Usabart: moved Standardization of user-centric evaluation metrics to Standardization of user-centric evaluation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Impossible standardization ==&lt;br /&gt;
Standardization of user-centric research is difficult because the procedures, methods, and metrics used are highly context-dependent. In other words: Two user-centric research projects rarely use exactly the same system, a similar set of users, or the same evaluation metric. More importantly, because both usability and user experience are multi-dimensional, they rarely have the same goal. Rigidly standardized evaluation metrics are thus not feasible in user-centric research. However, early attempts have been made to integrate user-centric research findings under a common framework.&lt;br /&gt;
&lt;br /&gt;
== Generic frameworks ==&lt;br /&gt;
The concept of [[usability]] can be traced back to the cognitive psychological concepts of perception, attention, and memory. An early conceptualization of usability is provided by Don Norman in his seminal work [http://interface-design-10.wdfiles.com/local--files/october-1/DesignofEverydaythings.pdf The Design of Everyday Things]. Norman describes the interaction between users and systems as consisting of two gulfs: the gulf of execution and the gulf of evaluation. In the gulf of execution, the user, who has a goal, has to formulate an intention, translate this into the correct action sequence, and then perform this action sequence using the system's interface. In the gulf of evaluation, the user has to perceive the state of the system, interpret this state, and then evaluate the state in relation to the original goal. Interaction is thus a perpetual bridging of the gulfs of execution and evaluation. In order to bridge the gulfs effectively, users create an internal mental model (the Use model) of the system, which represents their beliefs about how the system works. The creation of such a Use model is helped by the feedforward and feedback provided by the system. The more the Use model resembles the actual way the system works (the System model), the better the usability of the system. Jakob Nielsen provides a classification of [http://www.useit.com/papers/heuristic/heuristic_list.html 10 usability evaluation heuristics]. Nielsen uses these heuristics as guidelines for his Heuristic Evaluation method. However, the guidelines can also be used to categorize usability problems.&lt;br /&gt;
&lt;br /&gt;
The concept of [[user experience]] can be traced back to the social psychological concepts of attitude, intention, and behavior. The most influential model in this respect is the [http://www-unix.oit.umass.edu/~psyc661/pdf/tpb.obhdp.pdf Theory of Planned Behavior (TPB)] (and its predecessor the Theory of Reasoned Action, TRA) by Icek Ajzen and Martin Fishbein. This model claims that our behavioral intentions are based on attitudinal and normative evaluations, and that these intentions, given enough behavioral control, lead to actual behaviors. The attitudinal part of this model has been adopted in the [http://www.istheory.yorku.ca/Technologyacceptancemodel.htm Technology Acceptance Model (TAM)]; the normative part of TPB has been adopted in the [http://www.istheory.yorku.ca/UTAUT.htm Unified Theory of Acceptance and Use of Technology (UTAUT)].&lt;br /&gt;
&lt;br /&gt;
== A descriptive framework ==&lt;br /&gt;
Bo Xiao and Izak Benbasat have made an extensive effort to integrate existing work on user-centric recommender system evaluation into a framework in their 2007 paper titled &amp;quot;[http://opim.wharton.upenn.edu/~kartikh/reading/ib1.pdf e-Commerce Product Recommendation Agents: Use, Characteristics, and Impacts]&amp;quot;. Although the paper focuses mainly on the body of research available in the field of Information Systems, it also includes some work from the Human-Computer Interaction and Recommender Systems fields.&lt;br /&gt;
&lt;br /&gt;
== An integrative framework ==&lt;br /&gt;
Bart Knijnenburg et al. provide an [http://www.usabart.nl/portfolio/framework.pdf evaluation framework] that can be used as a guideline for conducting and analyzing quantitative user-centric research on recommender systems. Their approach focuses on [[quantitative user experiments or field trials]], and includes [[Subjective evaluation measures|subjective]] as well as [[Objective evaluation measures|objective]] evaluation measures. Specifically, it reasons that objective system aspects, as perceived by the user (subjective system aspects), impact the user's attitude (user experience) and behavior (interaction). Attitude and behavior are also influenced by personal and situational characteristics. The framework by Knijnenburg et al. is not meant as a standardized evaluation metric to provide an &amp;quot;experience score&amp;quot; for a single recommender system, but as a guideline for controlled experiments that compare two or more systems that systematically differ in one or more objective system aspects. Using the framework, researchers can measure and explain the influence of these system aspects on the user experience.&lt;br /&gt;
&lt;br /&gt;
== A multi-dimensional metric ==&lt;br /&gt;
Pearl Pu and Li Chen also provide a [http://ceur-ws.org/Vol-612/paper3.pdf user-centric evaluation framework for recommender systems]. In contrast to Knijnenburg et al., they provide a specific list of questionnaire items that can be used to provide a standardized, multi-dimensional evaluation of a recommender system. The framework breaks down into perceived system qualities (recommendation quality, interaction adequacy, interface adequacy; similar to Knijnenburg et al.'s subjective system aspects), beliefs (perceived ease of use, perceived usefulness, and control and transparency; in Knijnenburg et al.'s framework these are divided over subjective system aspects and experience), attitudes (a more generic evaluation of the user experience in terms of satisfaction and trust; similar to Knijnenburg et al.'s experience), and behavioral intentions (intention to use and return to the system; partly similar to Knijnenburg et al.'s interaction).&lt;br /&gt;
&lt;br /&gt;
== Benefits of using a framework ==&lt;br /&gt;
A recommender systems framework can be used to integrate existing and new research under a common denominator. Standardized terms can be used to compare research findings and to uncover gaps or inconsistencies in existing work. If a framework is adequately validated, it allows for a more robust measurement of subjective concepts, and the possibility to [[Simplifying user-centric evaluation|simplify evaluation]].&lt;br /&gt;
[[Category:Evaluation]]&lt;br /&gt;
[[Category:User-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Standardization_of_user-centric_evaluation_metrics&amp;diff=311</id>
		<title>Standardization of user-centric evaluation metrics</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Standardization_of_user-centric_evaluation_metrics&amp;diff=311"/>
		<updated>2011-03-01T04:26:59Z</updated>

		<summary type="html">&lt;p&gt;Usabart: moved Standardization of user-centric evaluation metrics to Standardization of user-centric evaluation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Standardization of user-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Standardization_of_user-centric_evaluation&amp;diff=309</id>
		<title>Standardization of user-centric evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Standardization_of_user-centric_evaluation&amp;diff=309"/>
		<updated>2011-03-01T04:14:11Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Impossible standardization ==&lt;br /&gt;
Standardization of user-centric research is difficult because the procedures, methods, and metrics used are highly context-dependent. In other words: Two user-centric research projects rarely use exactly the same system, a similar set of users, or the same evaluation metric. More importantly, because both usability and user experience are multi-dimensional, they rarely have the same goal. Rigidly standardized evaluation metrics are thus not feasible in user-centric research. However, early attempts have been made to integrate user-centric research findings under a common framework.&lt;br /&gt;
&lt;br /&gt;
== Generic frameworks ==&lt;br /&gt;
The concept of [[usability]] can be traced back to the cognitive psychological concepts of perception, attention, and memory. An early conceptualization of usability is provided by Don Norman in his seminal work [http://interface-design-10.wdfiles.com/local--files/october-1/DesignofEverydaythings.pdf The Design of Everyday Things]. Norman describes the interaction between users and systems as consisting of two gulfs: the gulf of execution and the gulf of evaluation. In the gulf of execution, the user, who has a goal, has to formulate an intention, translate this into the correct action sequence, and then perform this action sequence using the system's interface. In the gulf of evaluation, the user has to perceive the state of the system, interpret this state, and then evaluate the state in relation to the original goal. Interaction is thus a perpetual bridging of the gulfs of execution and evaluation. In order to bridge the gulfs effectively, users create an internal mental model (the Use model) of the system, which represents their beliefs about how the system works. The creation of such a Use model is helped by the feedforward and feedback provided by the system. The more the Use model resembles the actual way the system works (the System model), the better the usability of the system. Jakob Nielsen provides a classification of [http://www.useit.com/papers/heuristic/heuristic_list.html 10 usability evaluation heuristics]. Nielsen uses these heuristics as guidelines for his Heuristic Evaluation method. However, the guidelines can also be used to categorize usability problems.&lt;br /&gt;
&lt;br /&gt;
The concept of [[user experience]] can be traced back to the social psychological concepts of attitude, intention, and behavior. The most influential model in this respect is the [http://www-unix.oit.umass.edu/~psyc661/pdf/tpb.obhdp.pdf Theory of Planned Behavior (TPB)] (and its predecessor the Theory of Reasoned Action, TRA) by Icek Ajzen and Martin Fishbein. This model claims that our behavioral intentions are based on attitudinal and normative evaluations, and that these intentions, given enough behavioral control, lead to actual behaviors. The attitudinal part of this model has been adopted in the [http://www.istheory.yorku.ca/Technologyacceptancemodel.htm Technology Acceptance Model (TAM)]; the normative part of TPB has been adopted in the [http://www.istheory.yorku.ca/UTAUT.htm Unified Theory of Acceptance and Use of Technology (UTAUT)].&lt;br /&gt;
&lt;br /&gt;
== A descriptive framework ==&lt;br /&gt;
Bo Xiao and Izak Benbasat have made an extensive effort to integrate existing work on user-centric recommender system evaluation into a framework in their 2007 paper titled &amp;quot;[http://opim.wharton.upenn.edu/~kartikh/reading/ib1.pdf e-Commerce Product Recommendation Agents: Use, Characteristics, and Impacts]&amp;quot;. Although the paper focuses mainly on the body of research available in the field of Information Systems, it also includes some work from the Human-Computer Interaction and Recommender Systems fields.&lt;br /&gt;
&lt;br /&gt;
== An integrative framework ==&lt;br /&gt;
Bart Knijnenburg et al. provide an [http://www.usabart.nl/portfolio/framework.pdf evaluation framework] that can be used as a guideline for conducting and analyzing quantitative user-centric research on recommender systems. Their approach focuses on [[quantitative user experiments or field trials]], and includes [[Subjective evaluation measures|subjective]] as well as [[Objective evaluation measures|objective]] evaluation measures. Specifically, it reasons that objective system aspects, as perceived by the user (subjective system aspects), impact the user's attitude (user experience) and behavior (interaction). Attitude and behavior are also influenced by personal and situational characteristics. The framework by Knijnenburg et al. is not meant as a standardized evaluation metric to provide an &amp;quot;experience score&amp;quot; for a single recommender system, but as a guideline for controlled experiments that compare two or more systems that systematically differ in one or more objective system aspects. Using the framework, researchers can measure and explain the influence of these system aspects on the user experience.&lt;br /&gt;
&lt;br /&gt;
== A multi-dimensional metric ==&lt;br /&gt;
Pearl Pu and Li Chen also provide a [http://ceur-ws.org/Vol-612/paper3.pdf user-centric evaluation framework for recommender systems]. In contrast to Knijnenburg et al., they provide a specific list of questionnaire items that can be used to provide a standardized, multi-dimensional evaluation of a recommender system. The framework breaks down into perceived system qualities (recommendation quality, interaction adequacy, interface adequacy; similar to Knijnenburg et al.'s subjective system aspects), beliefs (perceived ease of use, perceived usefulness, and control and transparency; in Knijnenburg et al.'s framework these are divided over subjective system aspects and experience), attitudes (a more generic evaluation of the user experience in terms of satisfaction and trust; similar to Knijnenburg et al.'s experience), and behavioral intentions (intention to use and return to the system; partly similar to Knijnenburg et al.'s interaction).&lt;br /&gt;
&lt;br /&gt;
== Benefits of using a framework ==&lt;br /&gt;
A recommender systems framework can be used to integrate existing and new research under a common denominator. Standardized terms can be used to compare research findings and to uncover gaps or inconsistencies in existing work. If a framework is adequately validated, it allows for a more robust measurement of subjective concepts, and the possibility to [[Simplifying user-centric evaluation|simplify evaluation]].&lt;br /&gt;
[[Category:Evaluation]]&lt;br /&gt;
[[Category:User-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Standardization_of_user-centric_evaluation&amp;diff=308</id>
		<title>Standardization of user-centric evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Standardization_of_user-centric_evaluation&amp;diff=308"/>
		<updated>2011-03-01T04:10:58Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;== Impossible standardization == Standardization of user-centric research is difficult because the procedures, methods, and metrics used are highly context-dependent. In other wo...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Impossible standardization ==&lt;br /&gt;
Standardization of user-centric research is difficult because the procedures, methods, and metrics used are highly context-dependent. In other words: Two user-centric research projects rarely use exactly the same system, a similar set of users, or the same evaluation metric. More importantly, because both usability and user experience are multi-dimensional, they rarely have the same goal. Rigidly standardized evaluation metrics are thus not feasible in user-centric research. However, early attempts have been made to integrate user-centric research findings under a common framework.&lt;br /&gt;
&lt;br /&gt;
== Generic frameworks ==&lt;br /&gt;
The concept of [[usability]] can be traced back to the cognitive psychological concepts of perception, attention, and memory. An early conceptualization of usability is provided by Don Norman in his seminal work [http://interface-design-10.wdfiles.com/local--files/october-1/DesignofEverydaythings.pdf The Design of Everyday Things]. Norman describes the interaction between users and systems as consisting of two gulfs: the gulf of execution and the gulf of evaluation. In the gulf of execution, the user, who has a goal, has to formulate an intention, translate this into the correct action sequence, and then perform this action sequence using the system's interface. In the gulf of evaluation, the user has to perceive the state of the system, interpret this state, and then evaluate the state in relation to the original goal. Interaction is thus a perpetual bridging of the gulfs of execution and evaluation. In order to bridge the gulfs effectively, users create an internal mental model (the Use model) of the system, which represents their beliefs about how the system works. The creation of such a Use model is helped by the feedforward and feedback provided by the system. The more the Use model resembles the actual way the system works (the System model), the better the usability of the system. Jakob Nielsen provides a classification of [http://www.useit.com/papers/heuristic/heuristic_list.html 10 usability evaluation heuristics]. Nielsen uses these heuristics as guidelines for his Heuristic Evaluation method. However, the guidelines can also be used to categorize usability problems.&lt;br /&gt;
&lt;br /&gt;
The concept of [[user experience]] can be traced back to the social psychological concepts of attitude, intention, and behavior. The most influential model in this respect is the [http://www-unix.oit.umass.edu/~psyc661/pdf/tpb.obhdp.pdf Theory of Planned Behavior (TPB)] (and its predecessor the Theory of Reasoned Action, TRA) by Icek Ajzen and Martin Fishbein. This model claims that our behavioral intentions are based on attitudinal and normative evaluations, and that these intentions, given enough behavioral control, lead to actual behaviors. The attitudinal part of this model has been adopted in the [http://www.istheory.yorku.ca/Technologyacceptancemodel.htm Technology Acceptance Model (TAM)]; the normative part of TPB has been adopted in the [http://www.istheory.yorku.ca/UTAUT.htm Unified Theory of Acceptance and Use of Technology (UTAUT)].&lt;br /&gt;
&lt;br /&gt;
== A descriptive framework ==&lt;br /&gt;
Bo Xiao and Izak Benbasat have made an extensive effort to integrate existing work on user-centric recommender system evaluation into a framework in their 2007 paper titled &amp;quot;[http://opim.wharton.upenn.edu/~kartikh/reading/ib1.pdf e-Commerce Product Recommendation Agents: Use, Characteristics, and Impacts]&amp;quot;. Although the paper focuses mainly on the body of research available in the field of Information Systems, it also includes some work from the Human-Computer Interaction and Recommender Systems fields.&lt;br /&gt;
&lt;br /&gt;
== An integrative framework ==&lt;br /&gt;
Bart Knijnenburg et al. provide an [http://www.usabart.nl/portfolio/framework.pdf evaluation framework] that can be used as a guideline for conducting and analyzing quantitative user-centric research on recommender systems. Their approach focuses on [[quantitative user experiments or field trials]], and includes [[Subjective evaluation measures|subjective]] as well as [[Objective evaluation measures|objective]] evaluation measures. Specifically, it reasons that objective system aspects, as perceived by the user (subjective system aspects), impact the user's attitude (user experience) and behavior (interaction). Attitude and behavior are also influenced by personal and situational characteristics. The framework by Knijnenburg et al. is not meant as a standardized evaluation metric to provide an &amp;quot;experience score&amp;quot; for a single recommender system, but as a guideline for controlled experiments that compare two or more systems that systematically differ in one or more objective system aspects. Using the framework, researchers can measure and explain the influence of these system aspects on the user experience.&lt;br /&gt;
&lt;br /&gt;
== A multi-dimensional metric ==&lt;br /&gt;
Pearl Pu and Li Chen also provide a [http://ceur-ws.org/Vol-612/paper3.pdf user-centric evaluation framework for recommender systems]. In contrast to Knijnenburg et al., they provide a specific list of questionnaire items that can be used to provide a standardized, multi-dimensional evaluation of a recommender system. The framework breaks down into perceived system qualities (recommendation quality, interaction adequacy, interface adequacy; similar to Knijnenburg et al.'s subjective system aspects), beliefs (perceived ease of use, perceived usefulness, and control and transparency; in Knijnenburg et al.'s framework these are divided over subjective system aspects and experience), attitudes (a more generic evaluation of the user experience in terms of satisfaction and trust; similar to Knijnenburg et al.'s experience), and behavioral intentions (intention to use and return to the system; partly similar to Knijnenburg et al.'s interaction).&lt;br /&gt;
&lt;br /&gt;
== Benefits of using a framework ==&lt;br /&gt;
A recommender systems framework can be used to integrate existing and new research under a common denominator. Standardized terms can be used to compare research findings and to uncover gaps or inconsistencies in existing work. If a framework is adequately validated, it allows for a more robust measurement of subjective concepts, and the possibility to [[Simplifying user-centric evaluation|simplify evaluation]].&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Category:User-centric_evaluation&amp;diff=307</id>
		<title>Category:User-centric evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Category:User-centric_evaluation&amp;diff=307"/>
		<updated>2011-03-01T02:57:12Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Whereas originally the field of recommender systems heavily focused on offline evaluation, recently awareness has grown that the [[usability]] and [[user experience]] of recommender systems should be tested in online evaluations with real users. User-centric evaluation methods can be broadly categorized into [[qualitative user-studies]] and [[quantitative user experiments or field trials]]. User studies typically combine [[objective evaluation measures]] with [[subjective evaluation measures]], typically in the form of [[design critiques]], [[interviews]], and [[questionnaires]].&lt;br /&gt;
&lt;br /&gt;
User-centric evaluation has had difficulties gaining popularity as an evaluation method, because it is often difficult to test new algorithms or systems with real users. Early examples in the recommender systems literature have been of questionable quality. There have been a few suggestions for [[Standardization of user-centric evaluation metrics|standardization]] and [[Simplifying user-centric evaluation|simplification]] of the user-centric evaluation process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Category:User-centric_evaluation&amp;diff=306</id>
		<title>Category:User-centric evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Category:User-centric_evaluation&amp;diff=306"/>
		<updated>2011-03-01T02:56:32Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Whereas originally the field of recommender systems heavily focused on offline evaluation, recently awareness has grown that the [[usability]] and [[user experience]] of recommender systems should be tested in online evaluations with real users. User-centric evaluation methods can be broadly categorized into [[qualitative user-studies]] and [[quantitative user experiments or field trials]]. User studies typically combine [[objective evaluation measures]] with [[subjective evaluation measures]], typically in the form of [[design critiques]], [[interviews]], and [[questionnaires]].&lt;br /&gt;
&lt;br /&gt;
User-centric evaluation has had difficulties gaining popularity as an evaluation method, because it is often difficult to test new algorithms or systems with real users. Early examples in the recommender systems literature have been of questionable quality. There have been a few suggestions for [[Standardization of user-centric evaluation metrics|standardization]] and [[Simplifying user-centric evaluation|simplification]] of the user-centric evaluation process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Summative_evaluation&amp;diff=305</id>
		<title>Summative evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Summative_evaluation&amp;diff=305"/>
		<updated>2011-03-01T02:50:18Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The goal of '''summative evaluation''' is to find out whether feature P causes quality X (regardless of the system that uses feature P). The usual summative approach is to test system A versus system B, where the two systems differ only on feature P, and then to measure quality X to see whether it differs between the two systems. Summative methods include A/B tests (field trials) and controlled experiments.&lt;br /&gt;
&lt;br /&gt;
Researchers planning to do a [[user-centric recommender system evaluation]] need to be aware of the [[trade-offs between formative and summative evaluation]].&lt;br /&gt;
[[Category:Evaluation]]&lt;br /&gt;
[[Category:User-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Trade-offs_between_formative_and_summative_evaluation&amp;diff=304</id>
		<title>Trade-offs between formative and summative evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Trade-offs_between_formative_and_summative_evaluation&amp;diff=304"/>
		<updated>2011-03-01T02:50:05Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Formative evaluation]] is quicker and cheaper to conduct than summative evaluation, and the results are more straightforward. However, the method is less suitable for adaptive systems (including recommender systems), because it is hard to find out what exactly causes (problems with) the usability or user experience. In [[summative evaluation]] you need to define hypotheses beforehand, you can only focus on a few aspects at a time, and the analysis is more complex. You also need more test users to ensure adequate statistical power. On the other hand, summative evaluation makes it easier to test adaptive systems, because you can single out the effect of specific features. Moreover, the results of summative evaluation are more generalizable and can be statistically validated.&lt;br /&gt;
[[Category:Evaluation]]&lt;br /&gt;
[[Category:User-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Formative_evaluation&amp;diff=303</id>
		<title>Formative evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Formative_evaluation&amp;diff=303"/>
		<updated>2011-03-01T02:49:32Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The goal of '''formative evaluation''' is to improve a certain system A in terms of a certain quality X. The usual formative approach is to test system A qualitatively, focusing on quality X and looking for improvements to certain features P, Q and R that will increase quality X.&lt;br /&gt;
&lt;br /&gt;
Researchers planning to do a [[user-centric recommender system evaluation]] need to be aware of the [[trade-offs between formative and summative evaluation]].&lt;br /&gt;
[[Category:Evaluation]]&lt;br /&gt;
[[Category:User-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Qualitative_user-studies&amp;diff=302</id>
		<title>Qualitative user-studies</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Qualitative_user-studies&amp;diff=302"/>
		<updated>2011-03-01T02:49:19Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Qualitative user studies are characterized by their reliance on rich data from a small number of test participants. These types of studies are often used for [[formative evaluation]]. Results of qualitative user studies are typically not statistically validated, and therefore their generalizability is limited. On the other hand, the depth of the evaluation makes these studies ideal for quick design iterations on prototype systems.&lt;br /&gt;
&lt;br /&gt;
Formative methods include [[think-aloud user studies]], [[heuristic evaluation]], and [[cognitive walkthrough]].&lt;br /&gt;
[[Category:Evaluation]]&lt;br /&gt;
[[Category:User-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Main_Page&amp;diff=301</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Main_Page&amp;diff=301"/>
		<updated>2011-03-01T02:48:38Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DISPLAYTITLE:&amp;lt;span style=&amp;quot;display:none&amp;quot;&amp;gt;{{FULLPAGENAME}}&amp;lt;/span&amp;gt;}}&lt;br /&gt;
&amp;lt;div id=&amp;quot;mainpage&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
__NOTOC__&lt;br /&gt;
&amp;lt;!-- Beginning of header section --&amp;gt;&lt;br /&gt;
{|style=&amp;quot;width:100%;margin-top:+.9em;background-color:#fcfcfc;border:1px solid #ccc&amp;quot;&lt;br /&gt;
|style=&amp;quot;width:56%;color:#000&amp;quot;|&lt;br /&gt;
{|style=&amp;quot;width:100%;border:solid 0px;background:none&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width:100%;text-align:center;white-space:nowrap;color:#000&amp;quot; |&lt;br /&gt;
&amp;lt;div style=&amp;quot;font-size:162%;border:none;margin: 0;padding:.1em;color:#000&amp;quot;&amp;gt;Welcome to the Recommender Systems Wiki (RecSysWiki) &amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;top:+0.2em;font-size: 95%&amp;quot;&amp;gt;''to facilitate the sharing of information on all aspects of [[Recommender System|Recommender Systems]]''&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div id=&amp;quot;articlecount&amp;quot; style=&amp;quot;width:100%;text-align:center;font-size:85%;&amp;quot;&amp;gt;{{NUMBEROFPAGES}} pages and {{NUMBEROFARTICLES}} articles in the RecSysWiki as of {{CURRENTDAYNAME}}, {{CURRENTMONTHNAME}} {{CURRENTDAY}}, {{CURRENTYEAR}}&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;width:100%;text-align:center;font-size:85%;&amp;quot;&amp;gt;started on February 10th, 2011&amp;lt;/div&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
This Wiki is intended as a space for everything related to the topic of [[Recommender System|Recommender Systems]].&lt;br /&gt;
&lt;br /&gt;
Registration is open to anyone who wishes to contribute. &lt;br /&gt;
&lt;br /&gt;
Currently there are very few pages in the wiki, but we're hoping content will be added over time.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- End of header section / beginning of left-column --&amp;gt;&lt;br /&gt;
{|style=&amp;quot;border-spacing:8px;margin:0px -8px&amp;quot;&lt;br /&gt;
|class=&amp;quot;MainPageBG&amp;quot; style=&amp;quot;width:55%;border:1px solid #cef2e0;background-color:#f5fffa;vertical-align:top;color:#000&amp;quot;|&lt;br /&gt;
{|width=&amp;quot;100%&amp;quot; cellpadding=&amp;quot;2&amp;quot; cellspacing=&amp;quot;5&amp;quot; style=&amp;quot;vertical-align:top;background-color:#f5fffa&amp;quot;&lt;br /&gt;
! &amp;lt;h2 style=&amp;quot;margin:0;background-color:#cef2e0;font-size:120%;font-weight:bold;border:1px solid #a3bfb1;text-align:left;color:#000;padding:0.2em 0.4em;&amp;quot;&amp;gt;Categories in RecSysWiki&amp;lt;/h2&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;color:#000&amp;quot;|&amp;lt;!-- Please keep this list in alphabetical order --&amp;gt;&lt;br /&gt;
* [[:Category:Books|Books]]&lt;br /&gt;
* [[:Category:Competition|Competitions]]&lt;br /&gt;
* [[:Category:Conferences|Conferences]]&lt;br /&gt;
* [[:Category:Datasets|Datasets]]&lt;br /&gt;
* [[:Category:Educational Recommender Systems|Educational Recommender Systems]]&lt;br /&gt;
* [[:Category:Evaluation|Evaluation]]&lt;br /&gt;
* [[:Category:Evaluation measure|Evaluation Measures]]&lt;br /&gt;
* [[:Category:Literature|Literature]]&lt;br /&gt;
* [[:Category:Methods|Methods]]&lt;br /&gt;
* [[:Category:Movie Recommendation|Movie Recommendation]]&lt;br /&gt;
* [[:Category:Music Recommendation|Music Recommendation]]&lt;br /&gt;
* [[:Category:Papers|Papers]]&lt;br /&gt;
* [[:Category:Research Groups|Research Groups]]&lt;br /&gt;
* [[:Category:Task|Task]]&lt;br /&gt;
* [[:Category:User-centric evaluation|User-centric evaluation]]&lt;br /&gt;
* [[:Category:Workshops|Workshops]]&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|}&amp;lt;!-- Start of right-column --&amp;gt;&lt;br /&gt;
|class=&amp;quot;MainPageBG&amp;quot; style=&amp;quot;width:65%;border:1px solid #cedff2;background-color:#f5faff;vertical-align:top&amp;quot;|&lt;br /&gt;
{| width=&amp;quot;100%&amp;quot; cellpadding=&amp;quot;2&amp;quot; cellspacing=&amp;quot;5&amp;quot; style=&amp;quot;vertical-align:top;background-color:#f5faff&amp;quot;&lt;br /&gt;
!&lt;br /&gt;
&amp;lt;h2 style=&amp;quot;margin:0;background-color:#cedff2;font-size:120%;font-weight:bold;border:1px solid #a3b0bf;text-align:left;color:#000;padding:0.2em 0.4em;&amp;quot;&amp;gt;Information on RecSysWiki&amp;lt;/h2&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;color:#000&amp;quot;|&amp;lt;!-- Please keep this list in alphabetical order --&amp;gt;&lt;br /&gt;
* [[:RecSysWiki:Current events| Current Events]]&lt;br /&gt;
* [[Special:Statistics|RecSysWiki statistics]]&lt;br /&gt;
* [[Special:AllPages|All pages]]&lt;br /&gt;
* [[Special:Categories|All categories]]&lt;br /&gt;
* [[Special:WantedPages|Wanted pages]]&lt;br /&gt;
* [[To Do List]] - ''please help by contributing''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:RecSys Wiki Information]]&lt;br /&gt;
{{#TwitterFBLike:}}&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=User-centric_recommender_system_evaluation&amp;diff=300</id>
		<title>User-centric recommender system evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=User-centric_recommender_system_evaluation&amp;diff=300"/>
		<updated>2011-03-01T02:47:47Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Replaced content with &amp;quot;this can be deleted (promoted to category)&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;this can be deleted (promoted to category)&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Category:User-centric_evaluation&amp;diff=299</id>
		<title>Category:User-centric evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Category:User-centric_evaluation&amp;diff=299"/>
		<updated>2011-03-01T02:46:58Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;Whereas originally the field of recommender systems heavily focused on offline evaluation, recently awareness has grown that the usability and user experience of recommen...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Whereas originally the field of recommender systems heavily focused on offline evaluation, recently awareness has grown that the [[usability]] and [[user experience]] of recommender systems should be tested in online evaluations with real users. User-centric evaluation methods can be broadly categorized into [[qualitative user-studies]] and [[quantitative user experiments or field trials]]. User studies typically combine [[objective evaluation measures]] with [[subjective evaluation measures]], the latter usually in the form of [[design critiques]], [[interviews]], and [[questionnaires]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Subjective_evaluation_measures&amp;diff=298</id>
		<title>Subjective evaluation measures</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Subjective_evaluation_measures&amp;diff=298"/>
		<updated>2011-03-01T02:46:32Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Measuring usability and user experience ==&lt;br /&gt;
'''Subjective evaluation measures''' are expressions of the users about the system or their interaction with the system. They are therefore typically used to evaluate the usability and user experience of recommender systems. In qualitative studies, subjective measures are user comments, interviews, or questionnaire responses. Subjective evaluations can also be used quantitatively. In this case, closed-format responses (typically questionnaire items) are required for statistical analysis.&lt;br /&gt;
&lt;br /&gt;
== Good questions ==&lt;br /&gt;
Care has to be taken that the elicitation of user responses does not interfere with the actual responses they give. Double-barreled questions (&amp;quot;Did the recommender provide novel and relevant items?&amp;quot;) can cause confusion and are often very imprecise (what if the user found the items novel, but not relevant?). Leading questions (&amp;quot;How great was our system?&amp;quot;) and imbalanced response categories (&amp;quot;How do you rate our system?&amp;quot; - bad, good, great or awesome) can inadvertently push the participants' answers in a certain direction. A typical way to avoid these issues is to ask the user to agree or disagree with a number of statements on a 5- or 7-point scale, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The system helped me make better choices.&amp;quot; - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The system did not provide me any benefits.&amp;quot; - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree&lt;br /&gt;
&lt;br /&gt;
Note that in order to avoid response format bias, it is good practice to provide both positively and negatively phrased items. Also note that the middle category is not the same as &amp;quot;not applicable&amp;quot;, which should be a separate category (if provided at all).&lt;br /&gt;
&lt;br /&gt;
== Multiple items, scale development ==&lt;br /&gt;
Usability and user experience concepts such as &amp;quot;satisfaction&amp;quot;, &amp;quot;usefulness&amp;quot;, and &amp;quot;choice difficulty&amp;quot; are rather nuanced, and it is very hard to measure these concepts robustly with just a single question. It is therefore better practice to ask multiple questions per concept. There are two ways to combine the answers to these questions into a single scale. The simplest approach is to sum the answers to the questions (making sure to reverse-code the negatively phrased ones). In order for this to be a valid approach, a reliability analysis should be performed on the answers (Cronbach's alpha). This procedure handles each scale separately. &lt;br /&gt;
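&lt;br /&gt;
As a minimal sketch (in Python; the item names q1-q4 and the responses are hypothetical, with q4 negatively phrased), reverse-coding and Cronbach's alpha for one scale could look like this:&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import pandas as pd&lt;br /&gt;
&lt;br /&gt;
# hypothetical 5-point responses, one row per participant&lt;br /&gt;
items = pd.DataFrame({'q1': [4, 5, 3, 4], 'q2': [4, 4, 3, 5],&lt;br /&gt;
                      'q3': [5, 4, 2, 4], 'q4': [2, 1, 4, 2]})&lt;br /&gt;
items['q4'] = 6 - items['q4']  # reverse-code the negatively phrased item&lt;br /&gt;
&lt;br /&gt;
def cronbach_alpha(scale):&lt;br /&gt;
    # alpha = k/(k-1) * (1 - sum of item variances / variance of the sum score)&lt;br /&gt;
    k = scale.shape[1]&lt;br /&gt;
    item_var = scale.var(axis=0, ddof=1).sum()&lt;br /&gt;
    total_var = scale.sum(axis=1).var(ddof=1)&lt;br /&gt;
    return (k / (k - 1)) * (1 - item_var / total_var)&lt;br /&gt;
&lt;br /&gt;
print(cronbach_alpha(items))      # values above roughly 0.7 are usually considered reliable&lt;br /&gt;
items['scale_score'] = items.sum(axis=1)  # the summed scale&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;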
&lt;br /&gt;
The more advanced approach is to construct and test all scales at the same time with a factor analysis. A factor analysis evaluates the latent structure of a set of responses by analyzing its covariance matrix. An exploratory factor analysis tries to create an &amp;quot;elegant&amp;quot; factor solution with a specified number of factors. A confirmatory factor analysis tests a predefined factor structure. Even when the factor structure is theoretically determined beforehand, it is good practice to check whether an exploratory factor analysis returns the predicted factor structure. Often, one or two items do not fit the predicted factor structure (they contribute to the wrong factor, several factors, or none of the factors); these items can be deleted from the analysis.&lt;br /&gt;
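&lt;br /&gt;
A minimal exploratory-factor-analysis sketch (Python with scikit-learn; the responses DataFrame and the choice of two factors are assumptions for illustration):&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
from sklearn.decomposition import FactorAnalysis&lt;br /&gt;
&lt;br /&gt;
# responses: participants x questionnaire items&lt;br /&gt;
fa = FactorAnalysis(n_components=2, rotation='varimax')  # rotation requires scikit-learn 0.24 or later&lt;br /&gt;
fa.fit(responses)&lt;br /&gt;
loadings = fa.components_.T   # one row per item, one column per factor&lt;br /&gt;
# items that load weakly on their intended factor, or strongly on several factors,&lt;br /&gt;
# are candidates for removal, as described above&lt;br /&gt;
print(loadings.round(2))&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;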
&lt;br /&gt;
Taking this one step further, one can check for measurement invariance. This procedure ensures that the answers of different types of participants (e.g. males and females, those using system PA and those using system PB) adhere to the same conceptual structure. E.g.: Does &amp;quot;satisfaction&amp;quot; mean the same thing for experts and novices?&lt;br /&gt;
&lt;br /&gt;
Developing a robust scale is usually a complex procedure that takes several iterations. After deleting &amp;quot;bad&amp;quot; questions, a scale should consist of at least 5-7 items to be a robust measurement of the underlying concept. To ensure enough power for adequate scale development, one should have about 5 responses per item. Simultaneously developing 5 robust subjective scales, then, takes about 150 participants. Finally, the developed scales should be correlated (triangulated) with other subjective or objective measures to ensure their external validity. A good subjective scale, however, provides results that are usually far more robust than most [[objective evaluation measures]] which are typically inherently noisy.&lt;br /&gt;
&lt;br /&gt;
== Structural Equation Models ==&lt;br /&gt;
A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These '''Structural Equation Models''' provide added statistical power, because they can use the estimated robustness of the constructed scales to provide better estimates of the regression coefficients. Experimental manipulations and [[objective evaluation measures]] can be included into the Structural Equation Model, and the fit of the entire model can be tested as well as the specific regression coefficients.&lt;br /&gt;
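&lt;br /&gt;
One way to fit such a model is with an SEM package; a rough sketch using the semopy library (the lavaan-style model description, the factor and column names, and the data layout are all hypothetical):&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import semopy&lt;br /&gt;
&lt;br /&gt;
# measurement model (factors defined by items) plus structural regressions&lt;br /&gt;
desc = '''&lt;br /&gt;
Quality =~ q1 + q2 + q3&lt;br /&gt;
Satisfaction =~ s1 + s2 + s3&lt;br /&gt;
Quality ~ condition&lt;br /&gt;
Satisfaction ~ Quality&lt;br /&gt;
'''&lt;br /&gt;
model = semopy.Model(desc)&lt;br /&gt;
model.fit(data)            # data: one row per participant, items plus a 0/1 condition column&lt;br /&gt;
print(model.inspect())     # factor loadings, regression coefficients, p-values&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;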
&lt;br /&gt;
[[Category:Evaluation]]&lt;br /&gt;
[[Category:Evaluation measure]]&lt;br /&gt;
[[Category:User-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Objective_evaluation_measures&amp;diff=297</id>
		<title>Objective evaluation measures</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Objective_evaluation_measures&amp;diff=297"/>
		<updated>2011-03-01T02:46:00Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Aside from the typical recsys measures such as accuracy and precision, in live experiments several other '''objective evaluation measures''' can be taken. These typically concern the users' behavior with the system: number of log-ins, session length, clicks, item views, and purchases. Objective evaluations provide a ground truth for the effect of the system on its users. However, it is sometimes hard to interpret differences in user behavior. For example, if users of a video recommender system click on more clips to watch, does this mean that the user experience is better (more consumption) or worse (more browsing)? In this case, the number of clips watched from beginning to end would be a better measure. Better yet, one can combine objective and [[subjective evaluation measures]]. In [[quantitative user experiments or field trials]], this entails correlating the objective and subjective measures. In [[qualitative user-studies]] (specifically observational studies), this entails asking the user to think aloud while performing tasks/work.&lt;br /&gt;
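&lt;br /&gt;
As a minimal sketch of the quantitative case (Python; the DataFrame df and its column names are hypothetical), correlating an objective and a subjective measure could look like this:&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
from scipy.stats import pearsonr&lt;br /&gt;
&lt;br /&gt;
# df: one row per user, with a behavioral measure and a questionnaire scale score&lt;br /&gt;
r, p = pearsonr(df['clips_watched_to_end'], df['satisfaction_scale'])&lt;br /&gt;
print(r, p)   # direction and strength of the objective-subjective relation&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;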
&lt;br /&gt;
[[Category:Evaluation]]&lt;br /&gt;
[[Category:Evaluation measure]]&lt;br /&gt;
[[Category:User-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=User-centric_recommender_system_evaluation&amp;diff=296</id>
		<title>User-centric recommender system evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=User-centric_recommender_system_evaluation&amp;diff=296"/>
		<updated>2011-03-01T02:45:41Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Whereas originally the field of recommender systems heavily focused on offline evaluation, recently awareness has grown that the [[usability]] and [[user experience]] of recommender systems should be tested in online evaluations with real users. User-centric evaluation methods can be broadly categorized into [[qualitative user-studies]] and [[quantitative user experiments or field trials]]. User studies typically combine [[objective evaluation measures]] with [[subjective evaluation measures]], the latter usually in the form of [[design critiques]], [[interviews]], and [[questionnaires]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Evaluation]]&lt;br /&gt;
[[Category:User-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Quantitative_user_experiments_or_field_trials&amp;diff=295</id>
		<title>Quantitative user experiments or field trials</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Quantitative_user_experiments_or_field_trials&amp;diff=295"/>
		<updated>2011-03-01T02:45:24Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Quantitative evaluation is summative evaluation ==&lt;br /&gt;
Quantitative user experiments and field trials are forms of [[summative evaluation]]: they try to find the effect of feature P on quality X, usually by comparing several versions of the system that differ only in terms of P. The difference between a field trial and an experiment is that in a field trial a real system is tested with its real users, while a user experiment often uses a prototype or a downgraded system. The field trial focuses primarily on features of the system, while an experiment can also investigate psychological phenomena in more detail, due to the tight control the experimenter has over the system setup.&lt;br /&gt;
&lt;br /&gt;
== Study participants ==&lt;br /&gt;
The formally correct procedure for gathering participants in a quantitative user experiment is to randomly select them from the target population (the potential users of the system). This is usually not a feasible approach, so instead a convenience sample is often taken: invitations are sent to participants or posted on a website, and recipients/readers are urged to participate in the study. When taking a convenience sample, one has to take care to prevent self-selection bias: those who participate in the study may differ in their behaviors and attitudes from those who choose not to participate. Asking friends, family or coworkers is often not a good idea, because these people may have an intrinsic sympathy towards the experimenter. Demographic data can be gathered to get an indication of the match between the participants and the potential users of the system.&lt;br /&gt;
&lt;br /&gt;
Although it is not a problem to tell users that they will be evaluating a new recommender system, it is not a good idea to explain to participants the exact purpose (or worse: the expected findings) of the study, because participants are often too willing to please the experimenter and may therefore unconsciously behave as the experimenter expects. It is, however, very good practice to inform participants of the purpose (and results) of the study after the experiment has been completed.&lt;br /&gt;
&lt;br /&gt;
==Simple setup and evaluation ==&lt;br /&gt;
When testing multiple systems that differ only in aspect P (as is usually the case), participants are randomly assigned to one of the systems, called the experimental conditions. This randomization ensures that the users in the different conditions are roughly equally distributed in their intrinsic characteristics (such as age, gender, and domain knowledge). The only difference between the systems, then, is aspect P.&lt;br /&gt;
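&lt;br /&gt;
A minimal random-assignment sketch (Python; the participant list is hypothetical):&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import random&lt;br /&gt;
&lt;br /&gt;
participants = ['p01', 'p02', 'p03', 'p04', 'p05', 'p06', 'p07', 'p08']&lt;br /&gt;
conditions = ['PA', 'PB']&lt;br /&gt;
&lt;br /&gt;
random.shuffle(participants)&lt;br /&gt;
# deal participants out in round-robin fashion so the groups stay (nearly) the same size&lt;br /&gt;
assignment = {p: conditions[i % len(conditions)] for i, p in enumerate(participants)}&lt;br /&gt;
print(assignment)&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;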
&lt;br /&gt;
During or after the interaction, the experimenters measure a certain outcome X that they believe differs between different values of P. Outcomes can be [[objective evaluation measures]] or [[subjective evaluation measures]].&lt;br /&gt;
&lt;br /&gt;
Example: An experimenter may predict that users of an eCommerce recommender system with the new algorithm &amp;quot;PB&amp;quot; spend more money than those using the old algorithm &amp;quot;PA&amp;quot;. In this case, participants are randomly assigned to each condition (PA and PB) and their total expenditures are measured. Afterwards, a t-test can be conducted to test the difference in expenditures between PA and PB. In order for such a test to have adequate power to detect a difference between PA and PB, the test typically needs at least 20 (preferably 50) participants per condition. The t-test provides a p-value: the probability of observing a difference at least as large as the observed one if the null hypothesis (PA = PB) were true. We usually reject the null hypothesis when p&amp;lt;0.05.&lt;br /&gt;
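&lt;br /&gt;
A minimal sketch of this comparison (Python with SciPy; the expenditure values are hypothetical):&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
from scipy import stats&lt;br /&gt;
&lt;br /&gt;
# total expenditures per participant in each condition&lt;br /&gt;
spend_pa = [12.0, 0.0, 35.5, 8.0, 14.0, 22.5]&lt;br /&gt;
spend_pb = [18.0, 5.0, 41.0, 15.5, 9.0, 30.0]&lt;br /&gt;
&lt;br /&gt;
# Welch's t-test (does not assume equal variances in the two groups)&lt;br /&gt;
t, p = stats.ttest_ind(spend_pb, spend_pa, equal_var=False)&lt;br /&gt;
print(t, p)   # reject the null hypothesis PA = PB when p is below 0.05&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;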
&lt;br /&gt;
If there are more than two conditions, an ANOVA replaces the t-test. One could of course just do multiple t-tests between each of the different conditions; with 5 conditions, however, there are 10 pairs of t-tests to be conducted. If we take 0.05 as a cut-off value for the probability, we reject the null hypothesis despite the absence of an effect in about 5% of the cases on average. With 10 such tests, there is a large chance that at least one of these tests is significant despite the absence of a real effect. The ANOVA first conducts an omnibus test over all the conditions, and then adjusts the cut-off values for p in post-hoc analyses of the individual differences. If there are any predictions on which conditions should differ, one can instead use planned contrasts.&lt;br /&gt;
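&lt;br /&gt;
A minimal ANOVA-plus-post-hoc sketch (Python; the DataFrame df with columns 'spend' and 'condition' is hypothetical):&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
from scipy import stats&lt;br /&gt;
from statsmodels.stats.multicomp import pairwise_tukeyhsd&lt;br /&gt;
&lt;br /&gt;
# omnibus test over all conditions&lt;br /&gt;
groups = [g['spend'].values for _, g in df.groupby('condition')]&lt;br /&gt;
f, p = stats.f_oneway(*groups)&lt;br /&gt;
print(f, p)&lt;br /&gt;
&lt;br /&gt;
# post-hoc pairwise comparisons with adjusted cut-off values&lt;br /&gt;
print(pairwise_tukeyhsd(df['spend'], df['condition']))&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;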
&lt;br /&gt;
== Covariates, confounders and interactions == &lt;br /&gt;
Measuring user characteristics can improve the power of an experiment by introducing them into the analysis as covariates. In the example above, one could measure the users' annual income, which is likely to be related to expenditures as well. Taking annual income into account reduces the residual variance of the expenditures, and thereby increases the precision of our estimate of the effect of PA versus PB.&lt;br /&gt;
&lt;br /&gt;
One may also test several aspects (e.g. P and Q) at the same time. In this case, a separate condition is created for each combination of P and Q. In the example above, one could also manipulate the length of the list of recommendations (e.g. 5 or 10). We would then have 4 conditions: PA-Q5, PB-Q5, PA-Q10 and PB-Q10. Again, participants should be assigned randomly, and about 20 participants are needed in each condition. If P and Q were not independently manipulated (e.g. if we only tested PA-Q5 versus PB-Q10), the effects of P and Q would be confounded. This means that there is no way to find out whether the effect on expenditures was caused by P or by Q.&lt;br /&gt;
&lt;br /&gt;
When there are multiple predictors and/or covariates, one can use ANCOVA or Multiple Linear Regression (MLR) to analyze the results. These two methods are essentially equivalent. With multiple predictors and/or covariates, one can also test the interactions between them. For instance: algorithm PA may result in more expenditures when it gives only 5 recommendations, while algorithm PB may result in more expenditures when it gives 10 recommendations. The ANCOVA and MLR procedures provide options to specify and test such interactions.&lt;br /&gt;
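&lt;br /&gt;
A minimal regression sketch with a covariate and an interaction (Python with statsmodels; the DataFrame df and its column names are hypothetical):&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import statsmodels.formula.api as smf&lt;br /&gt;
&lt;br /&gt;
# algorithm (PA/PB) and list_length (5/10) are manipulated factors; income is a covariate&lt;br /&gt;
model = smf.ols('spend ~ C(algorithm) * C(list_length) + income', data=df).fit()&lt;br /&gt;
print(model.summary())   # the C(algorithm):C(list_length) term is the interaction&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;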
&lt;br /&gt;
Note that ANCOVA and MLR assume that the modeled outcome is an unrestricted variable with homogeneous variance. Our example already violates this assumption: expenditures cannot take a negative value. The problem of a restricted range can be solved by transforming the variable, for instance using a log or square-root transformation (this works for both outcome and predictor variables: we would use it for annual income as well). The problem of heterogeneous variance can be solved by using Poisson regression for counts/rates (e.g. the number of products bought) or logistic regression for binary data (e.g. whether the user returns to the site within a week, yes or no).&lt;br /&gt;
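&lt;br /&gt;
A minimal sketch of these alternatives (Python with statsmodels; the DataFrame df and its column names are hypothetical):&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
import statsmodels.formula.api as smf&lt;br /&gt;
&lt;br /&gt;
# Poisson regression for a count outcome (number of products bought)&lt;br /&gt;
poisson_fit = smf.poisson('n_purchases ~ C(algorithm) + np.log(income + 1)', data=df).fit()&lt;br /&gt;
&lt;br /&gt;
# logistic regression for a binary outcome (returned to the site within a week, 0 or 1)&lt;br /&gt;
logit_fit = smf.logit('returned ~ C(algorithm)', data=df).fit()&lt;br /&gt;
&lt;br /&gt;
print(poisson_fit.summary())&lt;br /&gt;
print(logit_fit.summary())&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;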
&lt;br /&gt;
== Within-subjects a.k.a. repeated measures experiments ==&lt;br /&gt;
A useful way to reduce the number of users needed for an experiment is to do a within-subjects experiment. In such an experiment, participants do not use one but all of the experimental systems, and measures are taken for each of these interactions. Analysis can now focus on differences within users (each participant serves as their own control) instead of differences between user groups, which increases the power of the analysis.&lt;br /&gt;
&lt;br /&gt;
The problem with within-subjects experiments is that the order of the conditions may influence the outcome. Participants may be more enthusiastic the first time they use a system (novelty effect) or become bored after one or two interactions (user fatigue). When subjective measures are taken, users will inherently compare their interaction with the preceding interactions, and a comparison of B with A may be different from a comparison of A with B. A typical way to deal with this problem is to include all possible orders in the experiment (PA-&amp;gt;PB and PB-&amp;gt;PA) and randomly assign users to an order. Not all orders are needed; a Latin Square design in which each condition takes each position in the order once is often good enough. The effect of &amp;quot;position&amp;quot; can be used as a predictor in the analysis.&lt;br /&gt;
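&lt;br /&gt;
A rough sketch of counterbalancing and of including presentation order in the analysis (Python with statsmodels; the users list, the long-format DataFrame long_df, and their column names are hypothetical):&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import statsmodels.formula.api as smf&lt;br /&gt;
&lt;br /&gt;
# with two conditions, counterbalancing means half the users get PA first, half PB first&lt;br /&gt;
orders = [['PA', 'PB'], ['PB', 'PA']]&lt;br /&gt;
user_order = {user: orders[i % 2] for i, user in enumerate(users)}&lt;br /&gt;
&lt;br /&gt;
# long_df: one row per user x condition, with the measured outcome and the&lt;br /&gt;
# position (1 or 2) at which that condition was presented&lt;br /&gt;
mixed = smf.mixedlm('rating ~ C(condition) + position', data=long_df,&lt;br /&gt;
                    groups=long_df['user']).fit()&lt;br /&gt;
print(mixed.summary())   # the condition effect, controlling for presentation order&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;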
&lt;br /&gt;
When evaluating usability or user experience with a within-subjects experiment, the effect of order can be so prominent that it overshadows all other effects. The order may also produce all kinds of unpredicted interaction effects. It is therefore advisable to use a standard between-subjects experiment wherever possible.&lt;br /&gt;
&lt;br /&gt;
== Mediators and path models ==&lt;br /&gt;
Often, not one but several outcome measures are taken. This allows experimenters to test the effect of aspect P on several outcomes, e.g. perceived recommendation quality (X) and expenditures (Y). However, X can in this case also be used as a covariate in the analysis of the effect of P on Y. If P causes X and if X causes Y, then X is said to be a mediator of the effect of P on Y. If, after controlling for X, there is no residual effect of P on Y, then X is said to fully mediate the effect of P on Y. Using mediation, one can build path models of effects, such as P-&amp;gt;X-&amp;gt;Y-&amp;gt;Z. Statistical software exists that can fit the regressions associated with path models simultaneously.&lt;br /&gt;
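&lt;br /&gt;
A minimal mediation sketch using two separate regressions (Python with statsmodels; the DataFrame df and its column names are hypothetical):&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
import statsmodels.formula.api as smf&lt;br /&gt;
&lt;br /&gt;
# does the manipulated aspect P (algorithm) affect the mediator X (perceived quality)?&lt;br /&gt;
m1 = smf.ols('quality ~ C(algorithm)', data=df).fit()&lt;br /&gt;
&lt;br /&gt;
# does P still affect the outcome Y (expenditures) once X is controlled for?&lt;br /&gt;
m2 = smf.ols('spend ~ C(algorithm) + quality', data=df).fit()&lt;br /&gt;
&lt;br /&gt;
print(m1.summary())&lt;br /&gt;
print(m2.summary())   # a near-zero algorithm coefficient here suggests full mediation&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;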
[[Category:Evaluation]]&lt;br /&gt;
[[Category:User-centric evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=User-centric_recommender_system_evaluation&amp;diff=294</id>
		<title>User-centric recommender system evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=User-centric_recommender_system_evaluation&amp;diff=294"/>
		<updated>2011-03-01T02:44:26Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Whereas originally the field of recommender systems heavily focused on offline evaluation, recently awareness has grown that the [[usability]] and [[user experience]] of recommender systems should be tested in online evaluations with real users. User-centric evaluation methods can be broadly categorized into [[qualitative user-studies]] and [[quantitative user experiments or field trials]]. User studies typically combine [[objective evaluation measures]] with [[subjective evaluation measures]], the latter usually in the form of [[design critiques]], [[interviews]], and [[questionnaires]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=User-centric_recommender_system_evaluation&amp;diff=293</id>
		<title>User-centric recommender system evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=User-centric_recommender_system_evaluation&amp;diff=293"/>
		<updated>2011-03-01T02:42:37Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Whereas originally the field of recommender systems heavily focused on offline evaluation, recently awareness has grown that the [[usability]] and [[user experience]] of recommender systems should be tested in online evaluations with real users. User-centric evaluation methods can be broadly categorized into [[qualitative user-studies]] and [[quantitative user experiments or field trials]]. User studies typically combine [[objective evaluation measures]] with [[subjective evaluation measures]], the latter usually in the form of [[design critiques]], [[interviews]], and [[questionnaires]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Evaluation]]&lt;br /&gt;
[[Category:User studies]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Subjective_evaluation_measures&amp;diff=238</id>
		<title>Subjective evaluation measures</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Subjective_evaluation_measures&amp;diff=238"/>
		<updated>2011-02-21T22:09:53Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Measuring usability and user experience ==&lt;br /&gt;
'''Subjective evaluation measures''' are expressions of the users about the system or their interaction with the system. They are therefore typically used to evaluate the usability and user experience of recommender systems. In qualitative studies, subjective measures are user comments, interviews, or questionnaire responses. Subjective evaluations can also be used quantitatively. In this case, closed-format responses (typically questionnaire items) are required for statistical analysis.&lt;br /&gt;
&lt;br /&gt;
== Good questions ==&lt;br /&gt;
Care has to be taken that the elicitation of user responses does not interfere with the actual responses they give. Double-barreled questions (&amp;quot;Did the recommender provide novel and relevant items?&amp;quot;) can cause confusion and are often very imprecise (what if the user found the items novel, but not relevant?). Leading questions (&amp;quot;How great was our system?&amp;quot;) and imbalanced response categories (&amp;quot;How do you rate our system?&amp;quot; - bad, good, great or awesome) can inadvertently push the participants' answers in a certain direction. A typical way to avoid these issues is to ask the user to agree or disagree with a number of statements on a 5- or 7-point scale, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The system helped me make better choices.&amp;quot; - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The system did not provide me any benefits.&amp;quot; - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree&lt;br /&gt;
&lt;br /&gt;
Note that in order to avoid response format bias, it is good practice to provide both positively and negatively phrased items. Also note that the middle category is not the same as &amp;quot;not applicable&amp;quot;, which should be a separate category (if provided at all).&lt;br /&gt;
&lt;br /&gt;
== Multiple items, scale development ==&lt;br /&gt;
Usability and user experience concepts such as &amp;quot;satisfaction&amp;quot;, &amp;quot;usefulness&amp;quot;, and &amp;quot;choice difficulty&amp;quot; are rather nuanced, and it is very hard to measure these concepts robustly with just a single question. It is therefore better practice to ask multiple questions per concept. There are two ways to combine the answers to these questions into a single scale. The simplest approach is to sum the answers to the questions (making sure to reverse-code the negatively phrased ones). In order for this to be a valid approach, a reliability analysis should be performed on the answers (Cronbach's alpha). This procedure handles each scale separately. &lt;br /&gt;
&lt;br /&gt;
The more advanced approach is to construct and test all scales at the same time with a factor analysis. A factor analysis evaluates the latent structure of a set of responses by analyzing its covariance matrix. An exploratory factor analysis tries to create an &amp;quot;elegant&amp;quot; factor solution with a specified number of factors. A confirmatory factor analysis tests a predefined factor structure. Even when the factor structure is theoretically determined beforehand, it is good practice to check whether an exploratory factor analysis returns the predicted factor structure. Often, one or two items do not fit the predicted factor structure (they contribute to the wrong factor, several factors, or none of the factors); these items can be deleted from the analysis.&lt;br /&gt;
&lt;br /&gt;
Taking this one step further, one can check for measurement invariance. This procedure ensures that the answers of different types of participants (e.g. males and females, those using system PA and those using system PB) adhere to the same conceptual structure. E.g.: Does &amp;quot;satisfaction&amp;quot; mean the same thing for experts and novices?&lt;br /&gt;
&lt;br /&gt;
Developing a robust scale is usually a complex procedure that takes several iterations. After deleting &amp;quot;bad&amp;quot; questions, a scale should consist of at least 5-7 items to be a robust measurement of the underlying concept. To ensure enough power for adequate scale development, one should have about 5 responses per item. Simultaneously developing 5 robust subjective scales, then, takes about 150 participants. Finally, the developed scales should be correlated (triangulated) with other subjective or objective measures to ensure their external validity. A good subjective scale, however, provides results that are usually far more robust than most [[objective evaluation measures]] which are typically inherently noisy.&lt;br /&gt;
&lt;br /&gt;
== Structural Equation Models ==&lt;br /&gt;
A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These '''Structural Equation Models''' provide added statistical power, because they can use the estimated robustness of the constructed scales to provide better estimates of the regression coefficients. Experimental manipulations and [[objective evaluation measures]] can be included into the Structural Equation Model, and the fit of the entire model can be tested as well as the specific regression coefficients.&lt;br /&gt;
&lt;br /&gt;
[[Category:Evaluation]]&lt;br /&gt;
[[Category:Evaluation measure]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Subjective_evaluation_measures&amp;diff=236</id>
		<title>Subjective evaluation measures</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Subjective_evaluation_measures&amp;diff=236"/>
		<updated>2011-02-21T22:05:14Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Measuring usability and user experience ==&lt;br /&gt;
'''Subjective evaluation measures''' are expressions of the users about the system or their interaction with the system. They are therefore typically used to evaluate the usability and user experience of recommender systems. In qualitative studies, subjective measures are user comments, interviews, or questionnaire responses. Subjective evaluations can also be used quantitatively. In this case, closed-format responses (typically questionnaire items) are required for statistical analysis.&lt;br /&gt;
&lt;br /&gt;
== Good questions ==&lt;br /&gt;
Care has to be taken that the elicitation of user responses does not interfere with the actual responses they give. Double-barreled questions (&amp;quot;Did the recommender provide novel and relevant items?&amp;quot;) can cause confusion and are often very imprecise (what if the user found the items novel, but not relevant?). Leading questions (&amp;quot;How great was our system?&amp;quot;) and imbalanced response categories (&amp;quot;How do you rate our system?&amp;quot; - bad, good, great or awesome) can inadvertently push the participants' answers in a certain direction. A typical way to avoid these issues is to ask the user to agree or disagree with a number of statements on a 5- or 7-point scale, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The system helped me make better choices.&amp;quot; - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The system did not provide me any benefits.&amp;quot; - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree&lt;br /&gt;
&lt;br /&gt;
Note that in order to avoid response format bias, it is good practice to provide both positively and negatively phrased items. Also note that the middle category is not the same as &amp;quot;not applicable&amp;quot;, which should be a separate category (if provided at all).&lt;br /&gt;
&lt;br /&gt;
== Multiple items, scale development ==&lt;br /&gt;
Usability and user experience concepts such as &amp;quot;satisfaction&amp;quot;, &amp;quot;usefulness&amp;quot;, and &amp;quot;choice difficulty&amp;quot; are rather nuanced, and it is very hard to measure these concepts robustly with just a single question. It is therefore better practice to ask multiple questions per concept. There are two ways to combine the answers to these questions into a single scale. The simplest approach is to sum the answers to the questions (making sure to reverse-code the negatively phrased ones). In order for this to be a valid approach, a reliability analysis should be performed on the answers (Cronbach's alpha). This procedure handles each scale separately. &lt;br /&gt;
&lt;br /&gt;
The more advanced approach is to construct and test all scales at the same time with a factor analysis. A factor analysis evaluates the latent structure of a set of responses by analyzing its covariance matrix. An exploratory factor analysis tries to create an &amp;quot;elegant&amp;quot; factor solution with a specified number of factors. A confirmatory factor analysis tests a predefined factor structure. Even when the factor structure is theoretically determined beforehand, it is good practice to check whether an exploratory factor analysis returns the predicted factor structure. Often, one or two items do not fit the predicted factor structure (they contribute to the wrong factor, several factors, or none of the factors); these items can be deleted from the analysis.&lt;br /&gt;
&lt;br /&gt;
Taking this one step further, one can check for measurement invariance. This procedure ensures that the answers of different types of participants (e.g. males and females, those using system PA and those using system PB) adhere to the same conceptual structure. E.g.: Does &amp;quot;satisfaction&amp;quot; mean the same thing for experts and novices?&lt;br /&gt;
&lt;br /&gt;
Developing a robust scale is usually a complex procedure that takes several iterations. After deleting &amp;quot;bad&amp;quot; questions, a scale should consist of at least 5-7 items to be a robust measurement of the underlying concept. To ensure enough power for adequate scale development, one should have about 5 responses per item. Simultaneously developing 5 robust subjective scales, then, takes about 150 participants. Finally, the developed scales should be correlated (triangulated) with other subjective or objective measures to ensure their external validity. A good subjective scale, however, provides results that are usually far more robust than most [[objective evaluation measures]] which are typically inherently noisy.&lt;br /&gt;
&lt;br /&gt;
== Structural Equation Models ==&lt;br /&gt;
A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These '''Structural Equation Models''' provide added statistical power, because they can use the estimated robustness of the constructed scales to provide better estimates of the regression coefficients. Experimental manipulations and [[objective evaluation measures]] can be included into the Structural Equation Model, and the fit of the entire model can be tested as well as the specific regression coefficients.&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Subjective_evaluation_measures&amp;diff=235</id>
		<title>Subjective evaluation measures</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Subjective_evaluation_measures&amp;diff=235"/>
		<updated>2011-02-21T22:03:43Z</updated>

		<summary type="html">&lt;p&gt;Usabart: /* Good questions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Measuring usability and user experience ==&lt;br /&gt;
'''Subjective evaluation measures''' are expressions of the users about the system or their interaction with the system. They are therefore typically used to evaluate the usability and user experience of recommender systems. In qualitative studies, subjective measures are user comments, interviews, or questionnaire responses. Subjective evaluations can also be used quantitatively. In this case, closed-format responses (typically questionnaire items) are required for statistical analysis.&lt;br /&gt;
&lt;br /&gt;
== Good questions ==&lt;br /&gt;
Care has to be taken that the elicitation of user responses does not interfere with the actual responses they give. Double-barreled questions (&amp;quot;Did the recommender provide novel and relevant items?&amp;quot;) can cause confusion and are often very imprecise (what if the user found the items novel, but not relevant?). Leading questions (&amp;quot;How great was our system?&amp;quot;) and imbalanced response categories (&amp;quot;How do you rate our system?&amp;quot; - bad, good, great or awesome) can inadvertently push the participants' answers in a certain direction. A typical way to avoid these issues is to ask the user to agree or disagree with a number of statements on a 5- or 7-point scale, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The system helped me make better choices.&amp;quot; - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The system did not provide me any benefits.&amp;quot; - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree&lt;br /&gt;
&lt;br /&gt;
Note that in order to avoid response format bias, it is good practice to provide both positively and negatively phrased items. Also note that the middle category is not the same as &amp;quot;not applicable&amp;quot;, which should be a separate category (if provided at all).&lt;br /&gt;
&lt;br /&gt;
== Multiple items, scale development ==&lt;br /&gt;
Usability and user experience concepts such as &amp;quot;satisfaction&amp;quot;, &amp;quot;usefulness&amp;quot;, and &amp;quot;choice difficulty&amp;quot; are rather nuanced, and it is very hard to measure these concepts robustly with just a single question. It is therefore better practice to ask multiple questions per concept. There are two ways to combine the answers to these questions into a single scale. The simplest approach is to sum the answers to the questions (making sure to reverse-code the negatively phrased ones). In order for this to be a valid approach, a reliability analysis should be performed on the answers (Cronbach's alpha). This procedure handles each scale separately. &lt;br /&gt;
&lt;br /&gt;
The more advanced approach is to construct and test all scales at the same time with a factor analysis. A factor analysis evaluates the latent structure of a set of responses by analyzing its covariance matrix. An exploratory factor analysis tries to create an &amp;quot;elegant&amp;quot; factor solution with a specified number of factors. A confirmatory factor analysis tests a predefined factor structure. Even when the factor structure is theoretically determined beforehand, it is good practice to check whether an exploratory factor analysis returns the predicted factor structure. Often, one or two items do not fit the predicted factor structure (they contribute to the wrong factor, several factors, or none of the factors); these items can be deleted from the analysis.&lt;br /&gt;
&lt;br /&gt;
Taking this one step further, one can check for measurement invariance. This procedure ensures that the answers of different types of participants (e.g. males and females, those using system PA and those using system PB) adhere to the same conceptual structure. E.g.: Does &amp;quot;satisfaction&amp;quot; mean the same thing for experts and novices?&lt;br /&gt;
&lt;br /&gt;
Developing a robust scale is usually a complex procedure that takes several iterations. After deleting &amp;quot;bad&amp;quot; questions, a scale should consist of at least 5-7 items to be a robust measurement of the underlying concept. To ensure enough power for adequate scale development, one should have about 5 responses per item. Simultaneously developing 5 robust subjective scales, then, takes about 150 participants. Finally, the developed scales should be correlated (triangulated) with other subjective or objective measures to ensure their external validity. A good subjective scale, however, provides results that are usually far more robust than most [[objective evaluation measures]] which are typically inherently noisy.&lt;br /&gt;
&lt;br /&gt;
== Structural Equation Models ==&lt;br /&gt;
A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These '''Structural Equation Models''' provide added statistical power, because they can use the estimated robustness of the constructed scales to provide better estimates of the regression coefficients. Experimental manipulations and [[objective evaluation measures]] can be included into the Structural Equation Model, and the fit of the entire model can be tested as well as the specific regression coefficients.&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Subjective_evaluation_measures&amp;diff=234</id>
		<title>Subjective evaluation measures</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Subjective_evaluation_measures&amp;diff=234"/>
		<updated>2011-02-21T22:03:28Z</updated>

		<summary type="html">&lt;p&gt;Usabart: /* Multiple items, scale development */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Measuring usability and user experience ==&lt;br /&gt;
'''Subjective evaluation measures''' are expressions of the users about the system or their interaction with the system. They are therefore typically used to evaluate the usability and user experience of recommender systems. In qualitative studies, subjective measures are user comments, interviews, or questionnaire responses. Subjective evaluations can also be used quantitatively. In this case, closed-format responses (typically questionnaire items) are required for statistical analysis.&lt;br /&gt;
&lt;br /&gt;
== Good questions ==&lt;br /&gt;
Care has to be taken that the elicitation of user responses does not interfere with the actual responses they give. Double-barreled questions (&amp;quot;Did the recommender provide novel and relevant items?&amp;quot;) can cause confusion and are often very imprecise (what if the user found the items novel, but not relevant?). Leading questions (&amp;quot;How great was our system?&amp;quot;) and imbalanced response categories (&amp;quot;How do you rate our system?&amp;quot; - bad, good, great or awesome) can inadvertently push the participants' answers in a certain direction. A typical way to avoid these issues is to ask the user to agree or disagree with a number of statements on a 5- or 7-point scale, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The system helped me make better choices.&amp;quot; - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree&lt;br /&gt;
&amp;quot;The system did not provide me any benefits.&amp;quot; - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree&lt;br /&gt;
&lt;br /&gt;
Note that in order to avoid response format bias, it is good practice to provide both positively and negatively phrased items. Also note that the middle category is not the same as &amp;quot;not applicable&amp;quot;, which should be a separate category (if provided at all).&lt;br /&gt;
&lt;br /&gt;
== Multiple items, scale development ==&lt;br /&gt;
Usability and user experience concepts such as &amp;quot;satisfaction&amp;quot;, &amp;quot;usefulness&amp;quot;, and &amp;quot;choice difficulty&amp;quot; are rather nuanced, and it is very hard to measure these concepts robustly with just a single question. It is therefore better practice to ask multiple questions per concept. There are two ways to combine the answers to these questions into a single scale. The simplest approach is to sum the answers to the questions (making sure to reverse-code the negatively phrased ones). In order for this to be a valid approach, a reliability analysis should be performed on the answers (Cronbach's alpha). This procedure handles each scale separately. &lt;br /&gt;
&lt;br /&gt;
The more advanced approach is to construct and test all scales at the same time with a factor analysis. A factor analysis evaluates the latent structure of a set of responses by analyzing its covariance matrix. An exploratory factor analysis tries to create an &amp;quot;elegant&amp;quot; factor solution with a specified number of factors. A confirmatory factor analysis tests a predefined factor structure. Even when the factor structure is theoretically determined beforehand, it is good practice to check whether an exploratory factor analysis returns the predicted factor structure. Often, one or two items do not fit the predicted factor structure (they contribute to the wrong factor, several factors, or none of the factors); these items can be deleted from the analysis.&lt;br /&gt;
&lt;br /&gt;
Taking this one step further, one can check for measurement invariance. This procedure ensures that the answers of different types of participants (e.g. males and females, those using system PA and those using system PB) adhere to the same conceptual structure. E.g.: Does &amp;quot;satisfaction&amp;quot; mean the same thing for experts and novices?&lt;br /&gt;
&lt;br /&gt;
Developing a robust scale is usually a complex procedure that takes several iterations. After deleting &amp;quot;bad&amp;quot; questions, a scale should consist of at least 5-7 items to be a robust measurement of the underlying concept. To ensure enough power for adequate scale development, one should have about 5 responses per item. Simultaneously developing 5 robust subjective scales, then, takes about 150 participants. Finally, the developed scales should be correlated (triangulated) with other subjective or objective measures to ensure their external validity. A good subjective scale, however, provides results that are usually far more robust than most [[objective evaluation measures]] which are typically inherently noisy.&lt;br /&gt;
&lt;br /&gt;
== Structural Equation Models ==&lt;br /&gt;
A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These '''Structural Equation Models''' provide added statistical power, because they can use the estimated robustness of the constructed scales to provide better estimates of the regression coefficients. Experimental manipulations and [[objective evaluation measures]] can be included into the Structural Equation Model, and the fit of the entire model can be tested as well as the specific regression coefficients.&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Subjective_evaluation_measures&amp;diff=233</id>
		<title>Subjective evaluation measures</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Subjective_evaluation_measures&amp;diff=233"/>
		<updated>2011-02-21T22:02:10Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;== Measuring usability and user experience == &amp;quot;&amp;quot;Subjective evaluation measures&amp;quot;&amp;quot; are expressions of the users about the system or their interaction with the system. They are ther...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Measuring usability and user experience ==&lt;br /&gt;
&amp;quot;&amp;quot;Subjective evaluation measures&amp;quot;&amp;quot; are expressions of the users about the system or their interaction with the system. They are therefore typically used to evaluate the usability and user experience of recommender systems. In qualitative studies, subjective measures are user comments, interviews, or questionnaire responses. Subjective evaluations can also be used quantitatively. In this case, closed-format responses (typically questionnaire items) are required for statistical analysis.&lt;br /&gt;
&lt;br /&gt;
== Good questions ==&lt;br /&gt;
Care has to be taken that the way user responses are elicited does not bias the answers users give. Double-barreled questions (&amp;quot;Did the recommender provide novel and relevant items?&amp;quot;) can cause confusion and are often very imprecise (what if the user found the items novel, but not relevant?). Leading questions (&amp;quot;How great was our system?&amp;quot;) and imbalanced response categories (&amp;quot;How do you rate our system?&amp;quot; - bad, good, great or awesome) can inadvertently push the participants' answers in a certain direction. A typical way to avoid these issues is to ask the user to agree or disagree with a number of statements on a 5- or 7-point scale, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The system helped me make better choices.&amp;quot; - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree&lt;br /&gt;
&amp;quot;The system did not provide me any benefits.&amp;quot; - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree&lt;br /&gt;
&lt;br /&gt;
Note that in order to avoid response format bias, it is good practice to provide both positively and negatively phrased items. Also note that the middle category is not the same as &amp;quot;not applicable&amp;quot;, which should be a separate category (if provided at all).&lt;br /&gt;
&lt;br /&gt;
== Multiple items, scale development ==&lt;br /&gt;
Usability and user experience concepts such as &amp;quot;satisfaction&amp;quot;, &amp;quot;usefulness&amp;quot;, and &amp;quot;choice difficulty&amp;quot; are rather nuanced, and it is very hard to measure these concepts robustly with just a single question. It is therefore better practice to ask multiple questions per concept. There are two ways to combine the answers to these questions into a single scale. The simplest approach is to sum the answers to the questions (making sure to reverse-code the negatively phrased ones). For this to be a valid approach, a reliability analysis (Cronbach's alpha) should be performed on the answers. This procedure handles each scale separately. &lt;br /&gt;
&lt;br /&gt;
The more advanced approach is to construct and test all scales at the same time with a factor analysis. A factor analysis evaluates the latent structure of a set of responses by analyzing its covariance matrix. An exploratory factor analysis tries to create an &amp;quot;elegant&amp;quot; factor solution with a specified number of factors. A confirmatory factor analysis tests a predefined factor structure. Even when the factor structure is theoretically determined beforehand, it is good practice to check whether an exploratory factor analysis returns the predicted factor structure. Often, one or two items do not fit the predicted factor structure (they contribute to the wrong factor, to several factors, or to none of the factors); these items can be deleted from the analysis.&lt;br /&gt;
&lt;br /&gt;
Taking this one step further, one can check for measurement invariance. This procedure checks whether the answers of different types of participants (e.g. males and females, or those using system PA and those using system PB) adhere to the same conceptual structure. E.g.: does &amp;quot;satisfaction&amp;quot; mean the same thing for experts and novices?&lt;br /&gt;
&lt;br /&gt;
Developing a robust scale is usually a complex procedure that takes several iterations. After deleting &amp;quot;bad&amp;quot; questions, a scale should consist of at least 5-7 items to be a robust measurement of the underlying concept. To ensure enough power for adequate scale development, one should have about 5 responses per item. Simultaneously developing 5 robust subjective scales, then, takes about 150 participants. Finally, the developed scales should be correlated (triangulated) with other subjective or objective measures to ensure their external validity.&lt;br /&gt;
&lt;br /&gt;
== Structural Equation Models ==&lt;br /&gt;
A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These &amp;quot;&amp;quot;Structural Equation Models&amp;quot;&amp;quot; provide added statistical power, because they can use the estimated robustness of the constructed scales to provide better estimates of the regression coefficients. Experimental manipulations and [[objective evaluation measures]] can be included in the Structural Equation Model, and the fit of the entire model can be tested as well as the specific regression coefficients.&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Objective_evaluation_measures&amp;diff=207</id>
		<title>Objective evaluation measures</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Objective_evaluation_measures&amp;diff=207"/>
		<updated>2011-02-21T21:23:47Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Aside from the typical recsys measures such as accuracy and precision, in live experiments several other '''objective evaluation measures''' can be taken. These typically concern the users' behavior with the system: number of log-ins, session length, clicks, item views, and purchases. Objective evaluations provide a ground truth for the effect of the system on its users. However, it is sometimes hard to interpret differences in user behavior. For example, if users of a video recommender system click on more clips to watch, does this mean that the user experience is better (more consumption) or worse (more browsing)? In this case, the number of clips watched from beginning to end would be a better measure. Better yet, one can combine objective and [[subjective evaluation measures]]. In [[quantitative user experiments or field trials]], this entails correlating the objective and subjective measures. In [[qualitative user-studies]] (specifically observational studies), this entails asking the user to think aloud while performing tasks/work.&lt;br /&gt;
[[Category:Evaluation]]&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Objective_evaluation_measures&amp;diff=199</id>
		<title>Objective evaluation measures</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Objective_evaluation_measures&amp;diff=199"/>
		<updated>2011-02-21T21:13:12Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;Aside from the typical recsys measures such as accuracy and precision, in live experiments several other objective evaluation measures can be taken. These typically concern the u...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Aside from the typical recsys measures such as accuracy and precision, in live experiments several other objective evaluation measures can be taken. These typically concern the users' behavior with the system: number of log-ins, session length, clicks, item views, and purchases. Objective evaluations provide a ground truth for the effect of the system on its users. However, it is sometimes hard to interpret differences in user behavior. For example, if users of a video recommender system click on more clips to watch, does this mean that the user experience is better (more consumption) or worse (more browsing)? In this case, the number of clips watched from beginning to end would be a better measure. Better yet, one can triangulate the objective measure with [[subjective evaluation measures]].&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Quantitative_user_experiments_or_field_trials&amp;diff=196</id>
		<title>Quantitative user experiments or field trials</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Quantitative_user_experiments_or_field_trials&amp;diff=196"/>
		<updated>2011-02-21T21:06:10Z</updated>

		<summary type="html">&lt;p&gt;Usabart: /* Simple setup and evaluation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Quantitative evaluation is summative evaluation ==&lt;br /&gt;
Quantitative user experiments and field trials are forms of [[summative evaluation]]: they try to find the effect of feature P on quality X, usually by comparing several versions of the system that differ only in terms of P. The difference between a field trial and an experiment is that in a field trial a real system is tested with its real users, while a user experiment often uses a prototype or a downgraded system. The field trial focuses primarily on features of the system, while an experiment can also investigate psychological phenomena in more detail, due to the tight control the experimenter has over the system setup.&lt;br /&gt;
&lt;br /&gt;
== Study participants ==&lt;br /&gt;
The official procedure to gather participants in a quantitative user experiment is to randomly select them from the target population (the potential users of the system). This is usually not a feasible approach, so instead a convenience sample is often taken: invitations are sent to participants or posted on a website, and recipients/readers are urged to participate in the study. When taking a convenience sample, one has to take care to prevent a self-selection bias: those who participate in the study may differ in their behaviors and attitudes from those who choose not to participate. Asking friends, family or coworkers is often not a good idea, because these people may have an intrinsic sympathy towards the experimenter. Demographic data can be gathered to get an indication of the match between the participants and the potential users of the system.&lt;br /&gt;
&lt;br /&gt;
Although it is not a problem to tell users that they will be evaluating a new recommender system, it is not a good idea to explain to participants the exact purpose (or worse: the expected findings) of the study, because participants are often too willing to please the experimenter and may therefore unconsciously behave as the experimenter expects. It is however very good practice to inform participants of the purpose (and results) of the study after the experiment has been completed.&lt;br /&gt;
&lt;br /&gt;
==Simple setup and evaluation ==&lt;br /&gt;
When testing multiple systems that differ only in aspect P (as is usually the case), participants are randomly assigned to one of the systems, called the experimental conditions. This randomization assures that the users in the different conditions are roughly equally distributed in their intrinsic characteristics (such as age, gender, and domain knowledge). The only difference between the systems, then, is aspect P.&lt;br /&gt;
&lt;br /&gt;
During or after the interaction, the experimenters measure a certain outcome X that they believe differs between different values of P. Outcomes can be [[objective evaluation measures]] or [[subjective evaluation measures]].&lt;br /&gt;
&lt;br /&gt;
Example: An experimenter may predict that users of an eCommerce recommender system with the new algorithm &amp;quot;PB&amp;quot; spend more money than those using the old algorithm &amp;quot;PA&amp;quot;. In this case, participants are randomly assigned to each condition (PA and PB) and their total expenditures are measured. Afterwards, a t-test can be conducted to test the difference in expenditures between PA and PB. In order for such a test to have adequate power to detect a difference between PA and PB, it typically needs at least 20 (preferably 50) participants per condition. The t-test provides a p-value: the probability of finding a difference at least as large as the observed one if the null hypothesis (PA = PB) were true. We usually reject the null hypothesis when p&amp;lt;0.05.&lt;br /&gt;
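&lt;br /&gt;
A minimal sketch of this test (in Python with SciPy; the simulated expenditures and group sizes are purely illustrative):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
from scipy import stats&lt;br /&gt;
&lt;br /&gt;
rng = np.random.default_rng(1)&lt;br /&gt;
spend_pa = rng.gamma(shape=2.0, scale=20.0, size=50)  # placeholder expenditures, old algorithm PA&lt;br /&gt;
spend_pb = rng.gamma(shape=2.0, scale=25.0, size=50)  # placeholder expenditures, new algorithm PB&lt;br /&gt;
&lt;br /&gt;
t_stat, p_value = stats.ttest_ind(spend_pa, spend_pb)&lt;br /&gt;
print(t_stat, p_value)  # reject the null hypothesis when the p-value is below 0.05&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;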
&lt;br /&gt;
If there are more than two conditions, an ANOVA replaces the t-test. One could of course just do a t-test between each pair of conditions, but with 5 conditions there are 10 such pairs. If we take 0.05 as the cut-off value for the probability, each test rejects the null hypothesis despite the absence of an effect in about 5% of the cases on average, so with 10 such tests there is a large chance that at least one of them is significant despite the absence of a real effect. The ANOVA first conducts an omnibus test over all the conditions, and then adjusts the cut-off values for p in post-hoc analyses of the individual differences. If there are specific predictions about which conditions should differ, one can instead use planned contrasts.&lt;br /&gt;
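&lt;br /&gt;
A minimal sketch of the omnibus test (in Python with SciPy; the group means and sizes are placeholders, and the Bonferroni-style follow-up mentioned in the comment is only one possible post-hoc strategy):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
from scipy import stats&lt;br /&gt;
&lt;br /&gt;
rng = np.random.default_rng(2)&lt;br /&gt;
groups = [rng.normal(loc=mu, scale=10.0, size=30) for mu in (50, 52, 55, 49, 58)]  # 5 conditions&lt;br /&gt;
&lt;br /&gt;
f_stat, p_omnibus = stats.f_oneway(*groups)  # omnibus test over all conditions&lt;br /&gt;
print(f_stat, p_omnibus)&lt;br /&gt;
# Only if the omnibus test is significant: pairwise t-tests whose p-values are compared&lt;br /&gt;
# against a corrected cut-off, e.g. 0.05 / 10 for the 10 pairs of 5 conditions (Bonferroni).&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;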
&lt;br /&gt;
== Covariates, confounders and interactions == &lt;br /&gt;
Measuring user characteristics can improve the power of an experiment by introducing them into the analysis as covariates. In the example above, one could measure the users' annual income, which is likely to be related to expenditures as well. Taking annual income into account reduces the residual variance of the expenditures, and thereby increases the precision of the estimated effect of PA versus PB.&lt;br /&gt;
&lt;br /&gt;
One may also test several aspects (e.g. P and Q) at the same time. In this case, a separate condition is created for each combination of P and Q. In the example above, one could also manipulate the length of the list of recommendations (e.g. 5 or 10). We would then have 4 conditions: PA-Q5, PB-Q5, PA-Q10 and PB-Q10. Again, participants should be assigned randomly, and about 20 participants are needed in each condition. If P and Q were not independently manipulated (e.g. if we only tested PA-Q5 versus PB-Q10), the effects of P and Q would be confounded: there would be no way to find out whether an effect on expenditures was caused by P or by Q.&lt;br /&gt;
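&lt;br /&gt;
A minimal sketch of fully crossed, balanced random assignment (plain Python; the condition labels and cell size are illustrative). Because P and Q are manipulated independently, their effects remain unconfounded:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import itertools&lt;br /&gt;
import random&lt;br /&gt;
&lt;br /&gt;
random.seed(3)&lt;br /&gt;
conditions = list(itertools.product(['PA', 'PB'], ['Q5', 'Q10']))  # 4 fully crossed cells&lt;br /&gt;
cells = conditions * 20   # 20 slots per cell, 80 participants in total&lt;br /&gt;
random.shuffle(cells)     # balanced random assignment: participant i gets cells[i]&lt;br /&gt;
print(cells[:5])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;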
&lt;br /&gt;
When there are multiple predictors and/or covariates, one can use ANCOVA or Multiple Linear Regression (MLR) to analyze the results. These two methods are essentially equivalent. Having multiple predictors and/or covariates, one can also test the interaction between them. For instance: algorithm PA may result in more expenditures when it gives only 5 recommendations, while algorithm PB may result in more expenditures when it gives 10 recommendations. The ANCOVA and MLR procedures provide options to specify and test such interactions.&lt;br /&gt;
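&lt;br /&gt;
A minimal sketch using the statsmodels formula interface (in Python; the data frame, variable names and simulated values are hypothetical). The algorithm-by-list-length interaction is specified with the * operator, and annual income enters as a covariate:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
import pandas as pd&lt;br /&gt;
import statsmodels.formula.api as smf&lt;br /&gt;
&lt;br /&gt;
rng = np.random.default_rng(4)&lt;br /&gt;
n = 160&lt;br /&gt;
df = pd.DataFrame({&lt;br /&gt;
    'algo': rng.choice(['PA', 'PB'], size=n),&lt;br /&gt;
    'n_recs': rng.choice([5, 10], size=n),&lt;br /&gt;
    'income': rng.gamma(shape=3.0, scale=15000.0, size=n),&lt;br /&gt;
    'spend': rng.gamma(shape=2.0, scale=25.0, size=n),&lt;br /&gt;
})&lt;br /&gt;
&lt;br /&gt;
# MLR with an interaction (algo * n_recs) and a covariate (income).&lt;br /&gt;
model = smf.ols('spend ~ algo * C(n_recs) + income', data=df).fit()&lt;br /&gt;
print(model.summary())&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;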
&lt;br /&gt;
Note that ANCOVA and MLR assume that the modeled outcome is an unrestricted variable with homogeneous variance. Our example already violates this assumption: expenditures cannot take a negative value. The problem of a restricted range can be solved by transforming the variable, for instance using a log or square-root transformation (this works for both outcome and predictor variables: we would use it for annual income as well). The problem of heterogeneous variance can be solved by using Poisson regression for counts/rates (e.g. the number of products bought) or logistic regression for binary data (e.g. whether the user returns to the site within a week, yes or no).&lt;br /&gt;
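&lt;br /&gt;
Minimal sketches of both alternatives (in Python with statsmodels; the data frame, outcome columns and simulated values are hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
import pandas as pd&lt;br /&gt;
import statsmodels.api as sm&lt;br /&gt;
import statsmodels.formula.api as smf&lt;br /&gt;
&lt;br /&gt;
rng = np.random.default_rng(5)&lt;br /&gt;
n = 160&lt;br /&gt;
df = pd.DataFrame({&lt;br /&gt;
    'algo': rng.choice(['PA', 'PB'], size=n),&lt;br /&gt;
    'n_recs': rng.choice([5, 10], size=n),&lt;br /&gt;
    'n_bought': rng.poisson(lam=2.0, size=n),  # count outcome: products bought&lt;br /&gt;
    'returned': rng.integers(0, 2, size=n),    # binary outcome: returned within a week&lt;br /&gt;
})&lt;br /&gt;
&lt;br /&gt;
# Poisson regression for the count outcome.&lt;br /&gt;
count_model = smf.glm('n_bought ~ algo * C(n_recs)', data=df, family=sm.families.Poisson()).fit()&lt;br /&gt;
# Logistic regression for the binary outcome.&lt;br /&gt;
return_model = smf.glm('returned ~ algo * C(n_recs)', data=df, family=sm.families.Binomial()).fit()&lt;br /&gt;
print(count_model.params, return_model.params)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;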
&lt;br /&gt;
== Within-subjects a.k.a. repeated measures experiments ==&lt;br /&gt;
A useful way to reduce the number of users needed for an experiment is to do a within-subjects experiment. In such an experiment, participants do not use one but all of the experimental systems, and measures are taken for each of these interactions. Analysis can now focus on differences within users (each participant serves as their own control) instead of on differences between user groups, which increases the power of the analysis.&lt;br /&gt;
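&lt;br /&gt;
A minimal sketch of the corresponding analysis (in Python with SciPy; the simulated per-participant expenditures are placeholders): with a within-subjects design each participant yields one measurement per condition, so a paired test on the per-participant differences is appropriate:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
from scipy import stats&lt;br /&gt;
&lt;br /&gt;
rng = np.random.default_rng(6)&lt;br /&gt;
n_participants = 30&lt;br /&gt;
spend_pa = rng.gamma(shape=2.0, scale=20.0, size=n_participants)&lt;br /&gt;
spend_pb = spend_pa + rng.normal(loc=3.0, scale=5.0, size=n_participants)  # same people, condition PB&lt;br /&gt;
&lt;br /&gt;
t_stat, p_value = stats.ttest_rel(spend_pa, spend_pb)  # paired (repeated measures) t-test&lt;br /&gt;
print(t_stat, p_value)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;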
&lt;br /&gt;
The problem with within-subjects experiments is that the order of the conditions may influence the outcome. Participants may be more enthusiastic the first time they use a system (novelty effect) or become bored after one or two interactions (user fatigue). When subjective measures are taken, users will inherently compare their interaction with the preceding interactions, and a comparison of B with A may be different from a comparison of A with B. A typical way to deal with this problem is to include all possible orders in the experiment (PA-&amp;gt;PB and PB-&amp;gt;PA) and randomly assign users to an order. Not all orders are needed; a Latin Square design in which each condition takes each position in the order once is often good enough. The effect of &amp;quot;position&amp;quot; can be used as a predictor in the analysis.&lt;br /&gt;
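&lt;br /&gt;
A minimal sketch of a cyclic Latin Square for counterbalancing order (plain Python; the condition names are illustrative): each condition appears in each position exactly once, and participants are assigned to the rows in rotation:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
conditions = ['PA', 'PB', 'PC', 'PD']&lt;br /&gt;
k = len(conditions)&lt;br /&gt;
&lt;br /&gt;
# Cyclic Latin Square: row i is the condition list rotated by i positions.&lt;br /&gt;
orders = [[conditions[(i + j) % k] for j in range(k)] for i in range(k)]&lt;br /&gt;
for row in orders:&lt;br /&gt;
    print(row)&lt;br /&gt;
&lt;br /&gt;
# Participant p receives presentation order orders[p % k]; the position index j can&lt;br /&gt;
# later be entered as a predictor in the analysis.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;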
&lt;br /&gt;
When evaluating usability or user experience with a within-subjects experiment, the effect of order can be so prominent that it overshadows all other effects. The order may also produce all kinds of unpredicted interaction effects. It is therefore advisable to use a standard between-subjects experiment wherever possible.&lt;br /&gt;
&lt;br /&gt;
== Mediators and path models ==&lt;br /&gt;
Often, not one but several outcome measures are taken. This allows experimenters to test the effect of aspect P on several outcomes, e.g. perceived recommendation quality (X) and expenditures (Y). However, X can in this case also be used as a covariate in the analysis of the effect of P on Y. If P causes X and if X causes Y, then X is said to be a mediator of the effect of P on Y. If, after controlling for X, there is no residual effect of P on Y, then X is said to fully mediate the effect of P on Y. Using mediation, one can build path models of effects, such as P-&amp;gt;X-&amp;gt;Y-&amp;gt;Z. Statistical software exists that can fit the regressions associated with path models simultaneously.&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Quantitative_user_experiments_or_field_trials&amp;diff=195</id>
		<title>Quantitative user experiments or field trials</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Quantitative_user_experiments_or_field_trials&amp;diff=195"/>
		<updated>2011-02-21T20:39:01Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Quantitative evaluation is summative evaluation ==&lt;br /&gt;
Quantitative user experiments and field trials are forms of [[summative evaluation]]: they try to find the effect of feature P on quality X, usually by comparing several versions of the system that differ only in terms of P. The difference between a field trial and an experiment is that in a field trial a real system is tested with its real users, while a user experiment often uses a prototype or a downgraded system. The field trial focuses primarily on features of the system, while an experiment can also investigate psychological phenomena in more detail, due to the tight control the experimenter has over the system setup.&lt;br /&gt;
&lt;br /&gt;
== Study participants ==&lt;br /&gt;
The official procedure to gather participants in a quantitative user experiment is to randomly select them from the target population (the potential users of the system). This is usually not a feasible approach, so instead a convenience sample is often taken: invitations are sent to participants or posted on a website, and recipients/readers are urged to participate in the study. When taking a convenience sample, one has to take care to prevent a self-selection bias: those who participate in the study may differ in their behaviors and attitudes from those who choose not to participate. Asking friends, family or coworkers is often not a good idea, because these people may have an intrinsic sympathy towards the experimenter. Demographic data can be gathered to get an indication of the match between the participants and the potential users of the system.&lt;br /&gt;
&lt;br /&gt;
Although it is not a problem to tell users that they will be evaluating a new recommender system, it is not a good idea to explain to participants the exact purpose (or worse: the expected findings) of the study, because participants are often too willing to please the experimenter and may therefore unconsciously behave as the experimenter expects. It is however very good practice to inform participants of the purpose (and results) of the study after the experiment has been completed.&lt;br /&gt;
&lt;br /&gt;
==Simple setup and evaluation ==&lt;br /&gt;
When testing multiple systems that differ only in aspect P (as is usually the case), participants are randomly assigned to one of the systems, called the experimental conditions. This randomization assures that the users in the different conditions are roughly equally distributed in their intrinsic characteristics (such as age, gender, and domain knowledge). The only difference between the systems, then, is aspect P.&lt;br /&gt;
&lt;br /&gt;
Example: An experimenter may predict that users of an eCommerce recommender system with the new algorithm &amp;quot;PB&amp;quot; spend more money than those using the old algorithm &amp;quot;PA&amp;quot;. In this case, participants are randomly assigned to each condition (PA and PB) and their total expenditures are measured. Afterwards, a t-test can be conducted to test the difference in expenditures between PA and PB. In order for such a test to have adequate power to detect a difference between PA and PB, it typically needs at least 20 (preferably 50) participants per condition. The t-test provides a p-value: the probability of finding a difference at least as large as the observed one if the null hypothesis (PA = PB) were true. We usually reject the null hypothesis when p&amp;lt;0.05.&lt;br /&gt;
&lt;br /&gt;
If there are more than two conditions, an ANOVA replaces the t-test. One could of course just do a t-test between each pair of conditions, but with 5 conditions there are 10 such pairs. If we take 0.05 as the cut-off value for the probability, each test rejects the null hypothesis despite the absence of an effect in about 5% of the cases on average, so with 10 such tests there is a large chance that at least one of them is significant despite the absence of a real effect. The ANOVA first conducts an omnibus test over all the conditions, and then adjusts the cut-off values for p in post-hoc analyses of the individual differences. If there are specific predictions about which conditions should differ, one can instead use planned contrasts.&lt;br /&gt;
&lt;br /&gt;
== Covariates, confounders and interactions == &lt;br /&gt;
Measuring user characteristics can improve the power of an experiment by introducing them into the analysis as covariates. In the example above, one could measure the users' annual income, which is likely to be related to expenditures as well. Taking annual income into account reduces the residual variance of the expenditures, and thereby increases the precision of the estimated effect of PA versus PB.&lt;br /&gt;
&lt;br /&gt;
One may also test several aspects (e.g. P and Q) at the same time. In this case, a separate condition is created for each combination of P and Q. In the example above, one could also manipulate the length of the list of recommendations (e.g. 5 or 10). We would then have 4 conditions: PA-Q5, PB-Q5, PA-Q10 and PB-Q10. Again, participants should be assigned randomly, and about 20 participants are needed in each condition. If P and Q were not independently manipulated (e.g. if we only tested PA-Q5 versus PB-Q10), the effects of P and Q would be confounded: there would be no way to find out whether an effect on expenditures was caused by P or by Q.&lt;br /&gt;
&lt;br /&gt;
When there are multiple predictors and/or covariates, one can use ANCOVA or Multiple Linear Regression (MLR) to analyze the results. These two methods are essentially equivalent. Having multiple predictors and/or covariates, one can also test the interaction between them. For instance: algorithm PA may result in more expenditures when it gives only 5 recommendations, while algorithm PB may result in more expenditures when it gives 10 recommendations. The ANCOVA and MLR procedures provide options to specify and test such interactions.&lt;br /&gt;
&lt;br /&gt;
Note that ANCOVA and MLR assume that the modeled outcome is an unrestricted variable with homogeneous variance. Our example already violates this assumption: expenditures cannot take a negative value. The problem of a restricted range can be solved by transforming the variable, for instance using a log or square-root transformation (this works for both outcome and predictor variables: we would use it for annual income as well). The problem of heterogeneous variance can be solved by using Poisson regression for counts/rates (e.g. the number of products bought) or logistic regression for binary data (e.g. whether the user returns to the site within a week, yes or no).&lt;br /&gt;
&lt;br /&gt;
== Within-subjects a.k.a. repeated measures experiments ==&lt;br /&gt;
A useful way to reduce the number of users needed for an experiment is to do a within-subjects experiment. In such an experiment, participants do not use one but all of the experimental systems, and measures are taken for each of these interactions. Analysis can now focus on differences within users (each participant serves as their own control) instead of on differences between user groups, which increases the power of the analysis.&lt;br /&gt;
&lt;br /&gt;
The problem with within-subjects experiments is that the order of the conditions may influence the outcome. Participants may be more enthusiastic the first time they use a system (novelty effect) or become bored after one or two interactions (user fatigue). When subjective measures are taken, users will inherently compare their interaction with the preceding interactions, and a comparison of B with A may be different from a comparison of A with B. A typical way to deal with this problem is to include all possible orders in the experiment (PA-&amp;gt;PB and PB-&amp;gt;PA) and randomly assign users to an order. Not all orders are needed; a Latin Square design in which each condition takes each position in the order once is often good enough. The effect of &amp;quot;position&amp;quot; can be used as a predictor in the analysis.&lt;br /&gt;
&lt;br /&gt;
When evaluating usability or user experience with a within-subjects experiment, the effect of order can be so prominent that it overshadows all other effects. The order may also produce all kinds of unpredicted interaction effects. It is therefore advisable to use a standard between-subjects experiment wherever possible.&lt;br /&gt;
&lt;br /&gt;
== Mediators and path models ==&lt;br /&gt;
Often, not one but several outcome measures are taken. This allows experimenters to test the effect of aspect P on several outcomes, e.g. perceived recommendation quality (X) and expenditures (Y). However, X can in this case also be used as a covariate in the analysis of the effect of P on Y. If P causes X and if X causes Y, then X is said to be a mediator of the effect of P on Y. If, after controlling for X, there is no residual effect of P on Y, then X is said to fully mediate the effect of P on Y. Using mediation, one can build path models of effects, such as P-&amp;gt;X-&amp;gt;Y-&amp;gt;Z. Statistical software exists that can fit the regressions associated with path models simultaneously.&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Quantitative_user_experiments_or_field_trials&amp;diff=194</id>
		<title>Quantitative user experiments or field trials</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Quantitative_user_experiments_or_field_trials&amp;diff=194"/>
		<updated>2011-02-21T20:38:40Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;== Quantitative evaluation is summative evaluation == Quantitative user experiments and field trials are forms of summative evaluation: they try to find the effect of feature...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Quantitative evaluation is summative evaluation ==&lt;br /&gt;
Quantitative user experiments and field trials are forms of [[summative evaluation]]: they try to find the effect of feature P on quality X, usually by comparing several versions of the system that differ only in terms of P. The difference between a field trial and an experiment is that in a field trial a real system is tested with its real users, while a user experiment often uses a prototype or a downgraded system. The field trial focuses primarily on features of the system, while an experiment can also investigate psychological phenomena in more detail, due to the tight control the experimenter has over the system setup.&lt;br /&gt;
&lt;br /&gt;
== Study participants ==&lt;br /&gt;
The official procedure to gather participants in a quantitative user experiment is to randomly select them from the target population (the potential users of the system). This is usually not a feasible approach, so instead a convenience sample is often taken: invitations are sent to participants or posted on a website, and recipients/readers are urged to participate in the study. When taking a convenience sample, one has to take care to prevent a self-selection bias: those who participate in the study may differ in their behaviors and attitudes from those who choose not to participate. Asking friends, family or coworkers is often not a good idea, because these people may have an intrinsic sympathy towards the experimenter. Demographic data can be gathered to get an indication of the match between the participants and the potential users of the system.&lt;br /&gt;
&lt;br /&gt;
Although it is not a problem to tell users that they will be evaluating a new recommender system, it is not a good idea to explain to participants the exact purpose (or worse: the expected findings) of the study, because participants are often too willing to please the experimenter and may therefore unconsciously behave as the experimenter expects. It is however very good practice to inform participants of the purpose (and results) of the study after the experiment has been completed.&lt;br /&gt;
&lt;br /&gt;
==Simple setup and evaluation ==&lt;br /&gt;
When testing multiple systems that differ only in aspect P (as is usually the case), participants are randomly assigned to one of the systems, called the experimental conditions. This randomization assures that the users in the different conditions are roughly equally distributed in their intrinsic characteristics (such as age, gender, and domain knowledge). The only difference between the systems, then, is aspect P.&lt;br /&gt;
&lt;br /&gt;
Example: An experimenter may predict that users of an eCommerce recommender system with the new algorithm &amp;quot;PB&amp;quot; spend more money than those using the old algorithm &amp;quot;PA&amp;quot;. In this case, participants are randomly assigned to each condition (PA and PB) and their total expenditures are measured. Afterwards, a t-test can be conducted to test the difference in expenditures between PA and PB. In order for such a test to have adequate power to detect a difference between PA and PB, it typically needs at least 20 (preferably 50) participants per condition. The t-test provides a p-value: the probability of finding a difference at least as large as the observed one if the null hypothesis (PA = PB) were true. We usually reject the null hypothesis when p&amp;lt;0.05.&lt;br /&gt;
&lt;br /&gt;
If there are more than two conditions, an ANOVA replaces the t-test. One could of course just do a t-test between each pair of conditions, but with 5 conditions there are 10 such pairs. If we take 0.05 as the cut-off value for the probability, each test rejects the null hypothesis despite the absence of an effect in about 5% of the cases on average, so with 10 such tests there is a large chance that at least one of them is significant despite the absence of a real effect. The ANOVA first conducts an omnibus test over all the conditions, and then adjusts the cut-off values for p in post-hoc analyses of the individual differences. If there are specific predictions about which conditions should differ, one can instead use planned contrasts.&lt;br /&gt;
&lt;br /&gt;
== Covariates, confounders and interactions == &lt;br /&gt;
Measuring user characteristics can improve the power of an experiment by introducing them into the analysis as covariates. In the example above, one could measure the users' annual income, which is likely to be related to expenditures as well. Taking annual income into account reduces the residual variance of the expenditures, and thereby increases the precision of the estimated effect of PA versus PB.&lt;br /&gt;
&lt;br /&gt;
One may also test several aspects (e.g. P and Q) at the same time. In this case, a separate condition is created for each combination of P and Q. In the example above, one could also manipulate the length of the list of recommendations (e.g. 5 or 10). We would then have 4 conditions: PA-Q5, PB-Q5, PA-Q10 and PB-Q10. Again, participants should be assigned randomly, and about 20 participants are needed in each condition. If P and Q were not independently manipulated (e.g. if we only tested PA-Q5 versus PB-Q10), the effects of P and Q would be confounded: there would be no way to find out whether an effect on expenditures was caused by P or by Q.&lt;br /&gt;
&lt;br /&gt;
When there are multiple predictors and/or covariates, one can use ANCOVA or Multiple Linear Regression (MLR) to analyze the results. These two methods are essentially equivalent. Having multiple predictors and/or covariates, one can also test the interaction between them. For instance: algorithm PA may result in more expenditures when it gives only 5 recommendations, while algorithm PB may result in more expenditures when it gives 10 recommendations. The ANCOVA and MLR procedures provide options to specify and test such interactions.&lt;br /&gt;
&lt;br /&gt;
Note that ANCOVA and MLR assume that the modeled outcome is an unrestricted variable with homogeneous variance. Our example already violates this assumption: expenditures cannot take a negative value. The problem of a restricted range can be solved by transforming the variable, for instance using a log or square-root transformation (this works for both outcome and predictor variables: we would use it for annual income as well). The problem of heterogeneous variance can be solved by using Poisson regression for counts/rates (e.g. the number of products bought) or logistic regression for binary data (e.g. whether the user returns to the site within a week, yes or no).&lt;br /&gt;
&lt;br /&gt;
== Within-subjects a.k.a. repeated measures experiment ==&lt;br /&gt;
A useful way to reduce the number of users needed for an experiment is to do a within-subjects experiment. In such an experiment, participants do not use one but all of the experimental systems, and measures are taken for each of these interactions. Analysis can now focus on differences within users (each participant serves as their own control) instead of on differences between user groups, which increases the power of the analysis.&lt;br /&gt;
&lt;br /&gt;
The problem with within-subjects experiments is that the order of the conditions may influence the outcome. Participants may be more enthusiastic the first time they use a system (novelty effect) or become bored after one or two interactions (user fatigue). When subjective measures are taken, users will inherently compare their interaction with the preceding interactions, and a comparison of B with A may be different from a comparison of A with B. A typical way to deal with this problem is to include all possible orders in the experiment (PA-&amp;gt;PB and PB-&amp;gt;PA) and randomly assign users to an order. Not all orders are needed; a Latin Square design in which each condition takes each position in the order once is often good enough. The effect of &amp;quot;position&amp;quot; can be used as a predictor in the analysis.&lt;br /&gt;
&lt;br /&gt;
When evaluating usability or user experience with a within-subjects experiment, the effect of order can be so prominent that it overshadows all other effects. The order may also produce all kinds of unpredicted interaction effects. It is therefore advisable to use a standard between-subjects experiment wherever possible.&lt;br /&gt;
&lt;br /&gt;
== Mediators and path models ==&lt;br /&gt;
Often, not one but several outcome measures are taken. This allows experimenters to test the effect of aspect P on several outcomes, e.g. perceived recommendation quality (X) and expenditures (Y). However, X can in this case also be used as a covariate in the analysis of the effect of P on Y. If P causes X and if X causes Y, then X is said to be a mediator of the effect of P on Y. If, after controlling for X, there is no residual effect of P on Y, then X is said to fully mediate the effect of P on Y. Using mediation, one can build path models of effects, such as P-&amp;gt;X-&amp;gt;Y-&amp;gt;Z. Statistical software exists that can fit the regressions associated with path models simultaneously.&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Summative_evaluation&amp;diff=193</id>
		<title>Summative evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Summative_evaluation&amp;diff=193"/>
		<updated>2011-02-21T19:11:20Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The goal of summative evaluation is to find out whether feature P causes quality X (regardless of the system that uses feature P). The usual summative approach is to test system A versus system B, where these systems differ only on feature P, and then to measure quality X to see if it differs between the two systems. Summative methods include A/B tests (field trials) and controlled experiments.&lt;br /&gt;
&lt;br /&gt;
Researchers planning to do a [[user-centric recommender system evaluation]] need to be aware of the [[trade-offs between formative and summative evaluation]].&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Summative_evaluation&amp;diff=192</id>
		<title>Summative evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Summative_evaluation&amp;diff=192"/>
		<updated>2011-02-21T19:10:42Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;The goal of summative evaluation is to find out whether feature P causes quality X (regardless of the system that uses feature P). The usual summative approach is to test system ...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The goal of summative evaluation is to find out whether feature P causes quality X (regardless of the system that uses feature P). The usual summative approach is to test system A versus system B, where these systems differ only on feature P, and then to measure quality X to see if it differs between the two systems. Summative methods include A/B tests (field trials) and controlled experiments.&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Trade-offs_between_formative_and_summative_evaluation&amp;diff=191</id>
		<title>Trade-offs between formative and summative evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Trade-offs_between_formative_and_summative_evaluation&amp;diff=191"/>
		<updated>2011-02-21T19:09:59Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;Formative evaluation is quicker and cheaper to conduct than summative evaluation, and the results are more straightforward. However, the method is less suitable for adaptive syst...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Formative evaluation is quicker and cheaper to conduct than summative evaluation, and the results are more straightforward. However, the method is less suitable for adaptive systems (including recommender systems), because it is hard to find out what exactly causes (problems with) the usability or user experience. In [[summative evaluation]] you need to define hypotheses beforehand, you can only focus on a few aspects at a time, and the analysis is more complex. You also need more test users to ensure adequate statistical power. On the other hand, summative evaluation makes it easier to test adaptive systems because you can single out the effect of specific features. Moreover, the results of summative evaluation are more generalizable and can be statistically validated.&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Formative_evaluation&amp;diff=190</id>
		<title>Formative evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Formative_evaluation&amp;diff=190"/>
		<updated>2011-02-21T19:08:44Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;In formative research, the goal is to improve a certain system A in terms of a certain quality X. The usual formative approach is to test system A qualitatively, focusing on quality X, looking for improvements on certain features P, Q and R that will increase quality X.&lt;br /&gt;
&lt;br /&gt;
Researchers planning to do a [[user-centric recommender system evaluation]] need to be aware of the [[trade-offs between formative and summative evaluation]].&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Formative_evaluation&amp;diff=189</id>
		<title>Formative evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Formative_evaluation&amp;diff=189"/>
		<updated>2011-02-21T19:06:33Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;In formative research, the goal is to improve a certain system A in terms of a certain quality X. The usual formative approach is to test system A qualitatively, focusing on quality X, looking for improvements on certain features P, Q and R that will increase quality X.&lt;br /&gt;
&lt;br /&gt;
[[Trade-offs]] exist between formative and [[summative evaluation]].&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Qualitative_user-studies&amp;diff=188</id>
		<title>Qualitative user-studies</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Qualitative_user-studies&amp;diff=188"/>
		<updated>2011-02-21T19:04:53Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Qualitative user studies typically rely on rich data from a small number of test participants. These types of studies are often used as [[formative evaluation]]. Results of qualitative user studies are typically not statistically validated, and the generalizability of the results is therefore limited. On the other hand, the depth of the evaluation makes them ideal for quick design iterations of prototype systems.&lt;br /&gt;
&lt;br /&gt;
Formative methods include [[think-aloud user studies]], [[heuristic evaluation]], and [[cognitive walkthrough]].&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Formative_evaluation&amp;diff=187</id>
		<title>Formative evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Formative_evaluation&amp;diff=187"/>
		<updated>2011-02-21T19:04:09Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;In formative research, the goal is to improve a certain system A in terms of a certain quality X. The usual formative approach is to test system A qualitatively, focusing on qual...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;In formative research, the goal is to improve a certain system A in terms of a certain quality X. The usual formative approach is to test system A qualitatively, focusing on quality X, looking for improvements on certain features P, Q and R that will increase quality X.&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=Qualitative_user-studies&amp;diff=186</id>
		<title>Qualitative user-studies</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=Qualitative_user-studies&amp;diff=186"/>
		<updated>2011-02-21T19:03:26Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;Typical for qualitative user studies is their reliance on rich data from few test participants. These types of studies are often used as formative evaluation. Results of qual...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Qualitative user studies typically rely on rich data from a small number of test participants. These types of studies are often used as [[formative evaluation]]. Results of qualitative user studies are typically not statistically validated, and the generalizability of the results is therefore limited. On the other hand, the depth of the evaluation makes them ideal for quick design iterations of prototype systems.&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=User-centric_recommender_system_evaluation&amp;diff=185</id>
		<title>User-centric recommender system evaluation</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=User-centric_recommender_system_evaluation&amp;diff=185"/>
		<updated>2011-02-21T18:58:01Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;Whereas originally the field of recommender systems heavily focused on offline evaluation, recently awareness has grown that the usability and user experience of recommen...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Whereas originally the field of recommender systems heavily focused on offline evaluation, recently awareness has grown that the [[usability]] and [[user experience]] of recommender systems should be tested in online evaluations with real users. User-centric evaluation methods can be broadly categorized into [[qualitative user-studies]] and [[quantitative user experiments or field trials]]. User studies typically combine [[objective evaluation measures]] with [[subjective evaluation measures]], often in the form of [[design critiques]], [[interviews]], and [[questionnaires]].&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=UCERSTI&amp;diff=184</id>
		<title>UCERSTI</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=UCERSTI&amp;diff=184"/>
		<updated>2011-02-21T18:52:57Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://ucersti.ieis.tue.nl/ Workshop on User-Centric Evaluation of Recommender Systems] and Their Interfaces was organized during the 2010 ACM Conference on Recommender Systems (RecSys2010). Presented papers can be found at [http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-612/ CEUR-WS (Vol 612)]. The workshop featured 7 papers and 2 keynotes addressing a wide array of research on [[user-centric recommender system evaluation]].&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=UCERSTI&amp;diff=183</id>
		<title>UCERSTI</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=UCERSTI&amp;diff=183"/>
		<updated>2011-02-21T18:51:43Z</updated>

		<summary type="html">&lt;p&gt;Usabart: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://ucersti.ieis.tue.nl/ Workshop on User-Centric Evaluation of Recommender Systems] and Their Interfaces was organized during the 2010 ACM Conference on Recommender Systems (RecSys2010). Presented papers can be found at [http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-612/ CEUR-WS (Vol 612)]. The workshop featured 7 papers and 2 keynotes addressing a wide array of research on [user-centric recommender system evaluation].&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
	<entry>
		<id>https://recsyswiki.com/index.php?title=UCERSTI&amp;diff=182</id>
		<title>UCERSTI</title>
		<link rel="alternate" type="text/html" href="https://recsyswiki.com/index.php?title=UCERSTI&amp;diff=182"/>
		<updated>2011-02-21T18:45:38Z</updated>

		<summary type="html">&lt;p&gt;Usabart: Created page with &amp;quot;The [http://ucersti.ieis.tue.nl/ Workshop on User-Centric Evaluation of Recommender Systems] and Their Interfaces was organized during the 2010 ACM Conference on Recommender Syst...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://ucersti.ieis.tue.nl/ Workshop on User-Centric Evaluation of Recommender Systems] and Their Interfaces was organized during the 2010 ACM Conference on Recommender Systems (RecSys2010). Presented papers can be found at [http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-612/ CEUR-WS (Vol 612)].&lt;/div&gt;</summary>
		<author><name>Usabart</name></author>
		
	</entry>
</feed>