Can evaluators be the bridge in the research-practice gap?

Researchers and practitioners agree that there is a gap between research (or theory) and practice. While the reasons for this gap are plentiful, they boil down to researchers and practitioners comprising two communities (Caplan, 1979) that have different languages, values, reward systems, and priorities. The two communities try to bridge the gap through a variety of methods, including producer-push models (e.g., knowledge transfer, knowledge translation, dissemination, applied research, interdisciplinary scholarship), user-pull models (e.g., evidence-based practice, practitioner inquiry, action research), and exchange models (e.g., research-practice partnerships and collaboratives, knowledge brokers, intermediaries). However, these methods typically focus on researchers or practitioners and do not consider other scholars who could fill this role.

As I will argue in the review paper for my dissertation, evaluators are in a prime position to bridge the gap between researchers and practitioners. Evaluation has been considered a transdiscipline in that it is an essential tool in all other academic disciplines (Scriven, 2008). Evaluators use social science (and other) research methodology and often have a specific area of content expertise, enabling them to bridge the gap to researchers. Furthermore, producing a useful evaluation often requires a close relationship with practitioners so that the evaluation communicates in their language, speaks to their values and priorities, and meets their needs, enabling evaluators to bridge the gap to practitioners as well. Evaluators can use their similarities with both researchers and practitioners to span the gap between the two as knowledge brokers or intermediaries (see figure).

However, while evaluators may bridge to researchers and to practitioners individually, they may not be working to bridge the gap between researchers and practitioners. In a field that still debates the paradigm wars (e.g., the “gold standard” evaluation, qualitative versus quantitative data), the role of evaluators (e.g., as advocates for programs), core competencies for evaluators, and the professionalization of the evaluation field, it is unclear to what extent evaluators see bridging the research-practice gap as part of their role and, if so, to what extent they are actually working to bridge this gap and how they are doing so.

Stay tuned as I continue blogging about the review paper for my dissertation (i.e., the first chapter of my dissertation). I would sincerely appreciate any and all comments and criticism you may have. They will only strengthen my research and hopefully aid in my ultimate goal of informing the field of evaluation and improving evaluation practice.

Evaluation is Not Applied Research

What is the difference between evaluation and research, especially applied research? For some, they are one and the same: evaluation and research use the same methods, write the same types of reports, and come to the same conclusions. Evaluation is often described as applied research. For instance, here are some recent quotes describing what evaluation is: “Evaluation is applied research that aims to assess the worth of a service” (Barker, Pistrang, & Elliott, 2016); “Program evaluation is applied research that asks practical questions and is performed in real-life situations” (Hackbarth & Gall, 2005); and the current editor of the American Journal of Evaluation has written that “evaluation is applied research” (Rallis, 2014). This is confusing for introductory evaluation students, particularly those coming from a research background or studying evaluation at a research institution.

Others claim the distinction between evaluation and (applied) research is too hard to define. I do not disagree with this point. The boundaries between evaluation and research are fuzzy in many regards. Take, for instance, evaluation methodology. Our designs and methods are largely derived from social science methodology. However, as Mathison (2008) notes in her article on the distinctions between evaluation and research, evaluation has gone much further in the types of designs and methods it uses, such as the most significant change technique, photovoice, cluster evaluation, evaluability assessment, and the success case method. Scriven and Davidson have begun discussing evaluation-specific methodology (i.e., the methods distinct to evaluation), including needs and values assessment, merit determination methods (e.g., rubrics), importance weighting methodologies, evaluative synthesis methodologies, and value-for-money analysis (Davidson, 2013). These methods show that, while evaluation indeed incorporates social science methodology, it also has unique methods of its own.
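To give a flavor of these evaluation-specific methods, here is a minimal sketch of a numerical weight-and-sum evaluative synthesis that combines rubric-based merit scores with importance weights. The dimensions, weights, and scores are hypothetical, and real syntheses are typically negotiated with stakeholders rather than hard-coded.

```python
# Minimal sketch of importance weighting plus evaluative synthesis.
# The dimensions, weights, and rubric scores below are hypothetical.

# Rubric-based merit scores on a 1 (poor) to 4 (excellent) scale
rubric_scores = {
    "reach": 3,      # how many intended beneficiaries are served
    "outcomes": 2,   # evidence of intended change
    "cost": 4,       # value for money
}

# Importance weights elicited from stakeholders (sum to 1)
weights = {
    "reach": 0.3,
    "outcomes": 0.5,
    "cost": 0.2,
}

# Numerical weight-and-sum synthesis (one of several synthesis approaches)
overall_merit = sum(rubric_scores[d] * weights[d] for d in rubric_scores)
print(f"Overall merit: {overall_merit:.2f} out of 4")  # 2.70 with these numbers
```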

This is well illustrated by the hourglass analogy provided by John LaVelle: the differences between research and evaluation are clear at the beginning and end of each process, but in the middle (methods and analysis) the two are quite similar. Still, evaluation differs from research in a multitude of ways, summarized in the following table. The table should be interpreted with a word of caution: it suggests clear delineations between research and evaluation, but as Mathison notes, many of the distinctions offered (e.g., evaluation particularizes while research generalizes) are not “singularly true for either evaluation or research” (Mathison, 2008, p. 189).

| Area of difference | Research | Evaluation |
| --- | --- | --- |
| Purpose | Seeks to generate new knowledge to inform the research base | Seeks to generate knowledge for a particular program or client |
| Who decides | Researchers | Stakeholders |
| What questions are asked | Researchers formulate their own hypotheses | Evaluators answer questions the program is concerned with |
| Value judgments | Research is value neutral | Evaluators provide a value judgment |
| Action setting | Basic research takes place in controlled environments | Evaluation takes place in an action setting where few things can be controlled |
| Utility | Research emphasizes the “production of knowledge and leaves its use to the natural processes of dissemination and application” (Weiss, 1997) | Evaluation is concerned with use from the beginning |
| Publication | Basic research is published in journals | Evaluation is rarely published; typically only stakeholders can view the reports |

I want to conclude by saying that if we are to call ourselves a transdiscipline or an alpha discipline, as Scriven would argue we are, then we should work hard to differentiate ourselves from other disciplines, particularly basic and applied research. This may be difficult, particularly between applied research and evaluation, but we need to make these differences as explicit as possible, partly to help evaluators entering the field understand them (a question asked repeatedly on EvalTalk since 1998; Mathison, 2008) and partly to separate ourselves from research (and research from evaluation).

Why aren’t evaluators adapting their evaluations to the developmental context?

Overall, my study found that evaluators are less likely to be participatory—both in the overall evaluation process and in data collection methods—when the program beneficiaries are children than when they are adults. Why is this the case?

One possibility is that the evaluators in my study were not well-versed in working with youth. However, half of the evaluators were members of the Youth Focused Evaluation TIG or the PreK-12 Educational Evaluation TIG, indicating they had at least some experience working with youth programs. Moreover, membership in these TIGs and self-reported developmental knowledge were largely unrelated to their evaluation practices.

Another possibility is that some other evaluator characteristic, such as education level, evaluation role (i.e., internal or external), or years of experience as an evaluator, could relate to developmentally appropriate practice. Again, there were few differences in evaluation practices across these characteristics.

Thus, the questions remain: which evaluators are more likely to have developmentally appropriate practice and what are the barriers to developmentally appropriate practice?

Some previous research suggests that experienced evaluators, even those experienced in working with youth, may need help in conducting developmentally appropriate evaluations. In a content analysis of youth program evaluations, Silvana Bialosiewicz (2013) found that few evaluations reported developmentally appropriate practices. That study was the impetus for the current study. A follow-up study that interviewed youth evaluators found many barriers to high-quality youth program evaluation practice (Bialosiewicz, 2015), including the cost and time required and clients’ misconceptions about good evaluation practice. Overall, this suggests that evaluators may need more training in developmentally appropriate practice or better resources for conducting developmentally appropriate youth program evaluations.

Next Steps

As with most research, I’m left with many more questions about developmentally appropriate evaluations than I was able to answer. I believe the results of the study point to a need for more research on youth participatory evaluation. However, I’m particularly interested in survey techniques with children and adolescents. I often see misunderstandings about survey methodology in general, and these are exacerbated when surveying children and adolescents. I am hoping to present at AEA 2017 on best practices in surveying children to help remedy this issue, but I would also like to study this topic further.

How evaluators adapt their evaluations to the developmental context: Evaluation methods

Knowledge about children is best obtained directly from youth using interviews, focus groups, and surveys. This is in stark contrast to the previously common methods of observation and ethnography, which were used primarily because researchers did not believe youth could provide reliable and valid data.[1]

In my study, I examined whether evaluators collected data about beneficiaries directly (i.e., interviews, focus groups, surveys) or indirectly (i.e., case studies, observations, archival data). If evaluators indicated they would collect data directly from participants, I also asked them about their survey- or interview-specific practices.

Overall, evaluators were more likely to collect data indirectly from beneficiaries when they were children and adolescents than when they were adults. For the tutees, evaluators were less likely to survey children or conduct focus groups with them and more likely to conduct observations. Interestingly, evaluators in the child condition were also more likely to survey and conduct focus groups with the tutors, as well as to collect archival data (as a reminder, the tutors in this condition are adolescents).
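For readers curious how such condition differences can be tested, here is a minimal sketch using a chi-square test of independence. The counts are invented for illustration only; they are not the data from my study.

```python
# Minimal sketch: testing whether choice of a data collection method
# (e.g., surveying tutees directly) differs across age conditions.
# The counts below are invented for illustration; they are not study data.
from scipy.stats import chi2_contingency

# Rows: child, adolescent, adult conditions
# Columns: [would survey tutees, would not survey tutees]
counts = [
    [20, 40],  # child condition
    [35, 25],  # adolescent condition
    [45, 15],  # adult condition
]

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
```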

The following are some of the interesting differences (or lack thereof) in survey- and interview-specific methodologies. Evaluators in the child condition were…

  • more likely to have program staff administer the survey and use oral administration and less likely to use online administration.
  • less likely to have the evaluation team conduct interviews and more likely to use interview specialists.
  • more likely to have shorter interviews, fewer participants in focus groups, and focus groups comprised of participants of similar ages.
  • equally likely to use 2-4 (36%), 5-7 (63%), 8-10 (2%), or 11+ (0%) response options in the survey.[2]
  • equally likely to test for internal consistency (62%), test-retest reliability (42%), face validity (70%), criterion validity (32%), construct validity (35%), use factor analysis techniques (52%), or test for moderators (35%).
  • equally likely to use unstructured (0%), semi-structured (92%), or structured (8%) interviews.[3]

[1] Punch, S. (2002). Research with children: The same or different from research with adults? Childhood, 9(3), 321–341.

[2] There were likely no differences due to a floor effect in the number of response options typically used. A future study could examine each number of response options between 2 and 8 individually, rather than clustered into categories, to avoid this floor effect.

[3] Evaluators overwhelmingly preferred semi-structured interviews regardless of the age of participants.

How evaluators adapt their evaluations to the developmental context: Evaluation design

What evaluation design is best? This debate over what constitutes credible evidence[1] has raged through the field of evaluation, with some arguing that RCTs are the “gold standard” and others questioning the superiority of the RCT.

This debate is somewhat meaningless when we understand that the evaluation design is chosen based on the evaluation questions. Evaluations examining outcomes or impact are perhaps best served by an experimental (i.e., RCT) or quasi-experimental design, whereas evaluations examining program needs or fidelity of implementation are better served by a descriptive (e.g., case study, observational) or correlational (e.g., cohort study, cross-sectional study) design.

In the context of youth programs, however, longitudinal designs may be particularly important. Longitudinal designs are critical for measuring and understanding development over time. They are especially critical when knowledge of long-term effects is needed, as these effects may not manifest until the end of the program or after services have ended.

In my study, evaluators did not change their evaluation designs based on the age of participants. I asked evaluators to rank their choice of evaluation design, and the majority chose quasi-experimental (37%), descriptive/correlational (23%), or experimental (15%) designs as their primary choice. Few evaluators chose a case study (8%) or ethnographic (4%) design. A further 13% of evaluators wrote in another design, with the majority indicating a mixed methods design.

I also asked evaluators how many waves of survey or interview data collection they would conduct across the three years of the evaluation. Of those who responded to the survey questions, 69% said they would do a baseline and multiple follow-up surveys, 28% said they would do a baseline and one follow-up, and only 3% said they would do only a baseline or post-test survey. Of those who responded to the interview questions, 93% said they would do multiple sets of interviews or focus groups and only 7% said they would do only one set. However, the lack of differences here is likely due to the three-year length of the simulated evaluation.

Be sure to check out the previous post on how the evaluation approach differed across age conditions. Stay tuned for more results from my study in terms of the evaluation methods, as well as a discussion explaining these results and next steps!

[1]  Donaldson, S. I., Christie, C. A., & Mark, M. M. (Eds.) (2009). What counts as credible evidence in applied research and evaluation practice? Los Angeles, CA: Sage.

How evaluators adapt their evaluations to the developmental context: Evaluation approach

As mentioned previously, in the context of youth programs a culturally appropriate evaluation requires a developmentally appropriate one. This means including youth, or at minimum those with knowledge of and experience working with youth, in the evaluation.

In my study, I asked practicing evaluators to describe the level of involvement of a wide range of stakeholders, including school administrators, teachers, parents, program staff, program designers, district personnel, funders, developmental consultants, math consultants, and the tutors and tutees of the program. In particular, I was interested in the levels of involvement of the consultants, the tutors, and the tutees across evaluators randomly assigned to the child, adolescent, or adult conditions.

Overall, evaluators were less likely to involve tutees in the child condition than the adolescent condition, and evaluators in both conditions were less likely to involve tutees than evaluators in the adult condition. Evaluators were also less likely to involve tutors in the child condition (as a reminder, the tutors in this condition are adolescents) than evaluators in the adult condition. There were no differences in use of consultants across the conditions.

One could argue that some evaluators have the knowledge and expertise required to conduct a culturally appropriate youth program evaluation. Thus, I also examined the extent to which their knowledge and expertise moderated these differences. Evaluators in the Youth Focused Evaluation TIG (a TIG focused on youth participatory evaluation) were more likely to involve beneficiaries than non-youth-evaluation TIG members, while members of the PreK-12 Educational Evaluation TIG were the least likely to involve beneficiaries. Furthermore, evaluators with more self-reported developmental expertise were less likely to involve beneficiaries.

These results suggest that evaluators are less likely to involve beneficiaries of the program when they are children and adolescents than when they are adults. Evaluators were exposed to the same exact program, with the only difference being the age of beneficiaries.

Stay tuned for more results from my study in terms of the evaluation design and evaluation methods, as well as a discussion explaining these results and next steps!

Comments Requested: College Access Journal Publication

Dr. Nazanin Zargarpour and I were accepted to present the attached paper at the American Educational Research Association’s (AERA) 2017 conference. We are very interested in publishing the paper and would love to get feedback from interested readers to help propel it forward.

Click here to download the paper: Zargarpour & Wanzer (2017). From college access to success. AERA Paper

Developmental Appropriateness as Cultural Competence in Evaluation

Children and adults differ by more than simply age; they differ in culture as well.1 This recognition can be hard for evaluators: because we have all passed through childhood, it is easy to believe we have the same or greater knowledge of children’s culture than they do. Furthermore, our “spatial proximity to children may lead us to believe that we are closer to them than we really are—only different in that (adults claim) children are still growing up (‘developing’) and are often wrong (‘lack understanding’).”2

This points to a need for cultural competence, which the American Evaluation Association (AEA) describes as “critical for the profession and for the greater good of society.”3 Culturally competent practice in evaluation includes:

  • Acknowledging the complexity of cultural identity
  • Recognizing the dynamics of status and power (e.g., the differential power between adults and children)
  • Recognizing and eliminating bias in language
  • Employing culturally (i.e., developmentally) appropriate methods

In particular, culturally competent evaluations require the inclusion of cultural expertise on the evaluation team.4 In the case of youth programs, this means including developmental expertise, which can involve developmental experts (e.g., psychologists, developmental scientists), though evaluators should also strive to include the youth themselves.

A youth participatory approach can reduce the harmful power imbalances between adult evaluators and youth participants,5 is more ethical for youth,6 and offers many benefits for children and adolescents, including knowledge about the evaluation process and improvements in self-esteem, decision-making, and problem-solving skills.7

However, a youth participatory approach can vary across a range of levels.8 At the lowest level, participants are simply included as a data source, which can itself vary between direct (i.e., surveys, interviews) and indirect (i.e., observations, archival data) data collection. Further up the participatory ladder is giving youth input on the evaluation process. The highest level of youth participation is youth actually leading the evaluation, much as they would in a traditional empowerment evaluation.
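As a rough illustration, the ladder can be thought of as an ordered set of levels. The level names and the cut-off for "substantive" participation below are my own shorthand, not a standard taxonomy.

```python
# A rough encoding of the youth participation ladder described above.
# Level names and the participatory cut-off are my own shorthand.
from enum import IntEnum

class YouthParticipation(IntEnum):
    INDIRECT_DATA_SOURCE = 1  # observations, archival data about youth
    DIRECT_DATA_SOURCE = 2    # surveys, interviews, focus groups with youth
    INPUT_ON_PROCESS = 3      # youth advise on questions, measures, interpretation
    YOUTH_LED = 4             # youth lead the evaluation (empowerment-style)

def is_substantively_participatory(level: YouthParticipation) -> bool:
    """Treat anything beyond serving as a data source as substantive participation."""
    return level >= YouthParticipation.INPUT_ON_PROCESS

print(is_substantively_participatory(YouthParticipation.DIRECT_DATA_SOURCE))  # False
print(is_substantively_participatory(YouthParticipation.YOUTH_LED))           # True
```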

Inclusion of youth, or at least adult developmental experts, can improve the likelihood of a culturally competent evaluation with respect to the first two practices listed above. However, evaluators must still make sure the evaluation design and methods are culturally, and therefore developmentally, appropriate. The next post will discuss how evaluators can promote cultural competence across the evaluation process in the context of youth programs.

How Can Evaluation Avoid Lemons?

I recently stumbled across a blog post by Dr. Simine Vazire, an associate professor in psychology at UC Davis, which discussed an economics article by Akerlof, “The market for ‘lemons’: Quality uncertainty and the market mechanism.” Here is what she wrote:

In this article, Akerlof employs the used car market to illustrate how a lack of transparency (which he calls “information asymmetry”) destroys markets.  when a seller knows a lot more about a product than buyers do, there is little incentive for the seller to sell good products, because she can pass off shoddy products as good ones, and buyers can’t tell the difference.  the buyer eventually figures out that he can’t tell the difference between good and bad products (“quality uncertainty”), but that the average product is shoddy (because the cars fall apart soon after they’re sold). therefore, buyers come to lose trust in the entire market, refuse to buy any products, and the market falls apart. (Vazire, 2017, “looking under the hood”)
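To make the adverse-selection dynamic concrete, here is a minimal simulation sketch of Akerlof's argument. It is my own illustration, with invented numbers, not anything from Akerlof's article or Vazire's post.

```python
# Minimal sketch of Akerlof's "market for lemons" dynamic: buyers can only
# judge average quality, so above-average sellers withdraw, dragging the
# average (and buyers' willingness to pay) down each round.
import random

random.seed(0)
true_quality = [random.uniform(0, 1) for _ in range(1000)]  # each seller's car

buyers_belief = 0.5  # buyers start by assuming average quality
for round_num in range(5):
    # Sellers whose cars are worth more than buyers will pay leave the market
    on_market = [q for q in true_quality if q <= buyers_belief]
    if not on_market:
        print(f"Round {round_num}: market collapses, no cars offered")
        break
    buyers_belief = sum(on_market) / len(on_market)  # buyers update to the new average
    print(f"Round {round_num}: {len(on_market)} cars offered, "
          f"average quality {buyers_belief:.2f}")
```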

This dynamic is very similar to the replication crisis in psychology, and I worry that evaluation may run into many of the same issues. This worry was also expressed by Scriven (2015).1 He writes, “Also depressing was the discovery that the great classic disciplines, although they thought they had a quality control system, in fact the procedure that everyone immediately put forward as performing that function–peer review–turned out to have been hardly ever studied for simple but essential virtues like reliability and validity…” (p. 18).

What peer review system does evaluation have? Scriven put forth meta-evaluation, but the practice is rarely, if ever, done. Scriven says:

This is because our real world practice is largely in the role of consultant, and consultants’ work does not normally undergo peer review. We need to tighten up the trashy way peer review is done in other disciplines and use serious meta-evaluation to fill the gap in our own emerging discipline with respect to that job that we say (and can prove) that peer review ought to be done in the other disciplines. (p. 19)

Scriven goes on to argue that evaluation has a duty to study how evaluation is conducted in other disciplines, leading evaluation to become the “alpha discipline.” But before we can consider evaluation the alpha discipline, we have to do a “full analysis of the pragmatics” of evaluation, meaning we need to define evaluation more clearly so that it is considered a key methodology of social science

“that must be mastered in order to do all applied (and some pure) social science. In that way, good evaluation research designs will be the exemplar for much of social science, instead of social science treating personnel or program evaluation as something they can do with their current resources, albeit conceding that there are some specialists in these sub-areas.” (p. 20)

Scriven’s solution of serious, publicly conducted meta-evaluation aligns with the solution promoted by Akerlof: transparency. I further argue that more serious regulations are needed to ensure this transparency, and one way to do this is through professionalization. Unfortunately, professionalization within the American Evaluation Association has met serious resistance, but work by our colleagues up north (the Canadian Evaluation Society) and work on standards and competencies within AEA are steps forward. I think professionalization will help evaluators more clearly define who evaluators are and what evaluation is so that the field can move forward as the alpha discipline Scriven describes.


Past Its Expiration Date: When Longitudinal Research Loses Relevance [GUEST POST]

Jennifer Hamilton is a Dr. Who fan and an applied statistician who loves to talk about evaluation methodology. She is fun at parties. Dr. Hamilton has been working in the field of evaluation since 1996, and has conducted evaluations with federal government programs; state, county, and city agencies and organizations; and foundations and nonprofits.

You can email Jennifer Hamilton at jenniferannehamilton@yahoo.com

My company (the Rockville Institute) was hired to conduct a five-year evaluation of a national program that helps schools in high-poverty neighborhoods improve the health of students and staff. Schools monitor their progress using the Centers for Disease Control and Prevention’s (CDC) School Health Index.

The program had recently developed an on-line model of support to supplement its traditional on-site support model but wasn’t sure whether to take it to scale. They wanted to base their decision on evaluation results, so we proposed a rigorous randomized study comparing the two types of support.

The problem was that, two years in, the program’s revenue was shrinking, and they had to start using the on-line support model because it was more cost-effective. They could not wait for the results of the evaluation to make their decision. In short, the program did not need us anymore.

We knew their decision was made, but we hoped that the study results could still be useful to other programs. We needed to make some changes so the study would be relevant to a broader audience. We had two groups: less and more intensive support. If we could expand this by adding a no-support arm and an even more intensive arm, the results could be relevant to all kinds of programs. So we developed a continuum of support intensity: no support (new arm), low support (the on-line model), moderate support (the on-site model), and a new high-intensity model of on-site support (new arm).

But where were we going to find these extra schools?   

We knew that schools implementing the program were only a small portion of the universe of schools completing the CDC instrument. The CDC could therefore provide outcome data for matched schools not participating in the program.


We also knew that another study of the program was being conducted and was using the same outcome measure as ours. In that study, support was provided in person by program managers with lower caseloads and more time on site than the moderate support group (M2) from the original design. The other research group could therefore provide outcome data for matched schools receiving a more intensive version of support.

But What About the Methodology?

The question is how to add these new groups while retaining the rigor of the original design. While our original schools were randomized into groups, the new schools can only be matched to the randomized pairs. So we are mixing a quasi-experimental design (QED) into an RCT. What does this mean, practically speaking? Well, we have to think about all the possible comparisons.
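As one illustration of how the non-randomized arms might be attached to the randomized schools, here is a minimal matching sketch using nearest-neighbor matching on baseline characteristics. The variable names, features, and data are hypothetical, not the actual procedure used in the study; in practice you would standardize the features or match on propensity scores.

```python
# Minimal sketch: matching non-randomized comparison schools to the original
# RCT schools on baseline characteristics. All data and variable names are
# hypothetical; a real analysis would standardize features or use propensity scores.
import pandas as pd
from sklearn.neighbors import NearestNeighbors

rct_schools = pd.DataFrame({
    "school_id": [1, 2, 3],
    "baseline_shi": [55.0, 62.0, 70.0],  # baseline School Health Index score
    "pct_poverty": [0.80, 0.65, 0.72],
})
candidate_schools = pd.DataFrame({
    "school_id": [101, 102, 103, 104],
    "baseline_shi": [54.0, 61.0, 69.0, 75.0],
    "pct_poverty": [0.78, 0.70, 0.71, 0.60],
})

features = ["baseline_shi", "pct_poverty"]
nn = NearestNeighbors(n_neighbors=1).fit(candidate_schools[features])
_, idx = nn.kneighbors(rct_schools[features])

matches = rct_schools.assign(
    matched_school=candidate_schools.iloc[idx.ravel()]["school_id"].values
)
print(matches[["school_id", "matched_school"]])
```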

The original L1/M2 comparison is unchanged and maintains the highest level of internal validity, because both groups of schools were randomly assigned. All of the other possible contrasts still have reasonable internal validity, although to a slightly lesser extent, because they now involve matched schools instead of randomly assigned schools.

Implications for Evaluators

This study illustrates a common danger of longitudinal designs: they just take too long for the policy world, where programs are typically in flux. But the funder supported efforts to expand the focus beyond the specific program to a design with broader applicability. This resulted in a hybrid design that still maintained sufficient rigor to respond to broad policy questions. Flexibility in the evaluation can still save an RCT, and this mixed QED-RCT design can help!