My introduction to GIS

Having wrestled with the open-source QGIS package a few weeks ago, during my first attempt at modelling Portus in Minecraft, I decided it couldn’t hurt to give myself the introduction to GIS I so sorely needed. By happy circumstance, Esri, developers of the ArcGIS packages, had just started a MOOC in conjunction with Udemy. So I signed up for that and, for the last couple of weeks, I’ve been catching up (I started four weeks late) and completing the course.

It made for a brilliant introduction to GIS (for GIS virgins like me but also, it seems from the comments, for more experienced users), taught (mostly) by Linda Beale, with introductions from David DiBiase. I noted with interest that the Udemy MOOC engine (of course not really MOOC software, as most of Udemy’s courses are paid for) incorporated a time-stamped comments feature a bit like Synote, the one my colleagues are developing, but not quite as capable.

David and Linda introduce the course while I play with the notes function

There were song titles to look out for, smuggled into Linda’s lectures, and quizzes that were the right level of challenging to help review your learning. The songs and some trick questions in the quizzes betrayed a mischievous sense of humor, which I enjoyed. Some students didn’t – upset, I guess, at spoiling a 100% record – but these were quizzes, not exams.

Each week included one or two case studies, wherein we got to use an online version of Esri’s ArcGIS to solve data analysis problems: where to locate a distribution centre, or monitor mountain lions, or build mixed-use accommodation, for example. These case studies were great fun… to begin with. But, as I caught up with my fellow students and we all started working on the ArcGIS servers on the same day, the software couldn’t cope, and timed out or returned errors on analysis. So in fact I haven’t done the three case studies, which I found very frustrating.

I’ve got a few weeks to go back and try them again when it’s not so busy, but I’ve spent the greater part of the last couple of months studying MOOCs and not getting on with my own work, so I was hoping to call it quits today. Next week, I’m going to experiment with Twine.

More maths

plotCorrFunPrefsAndLudic

Last time I finished with this matrix of scatter-plots, ordered by the magnitude of correlation. But what does it actually mean? Let’s take a step back and look at those derived variables. I ask R to describe the table of variables that I created previously, which includes the notional ludic.interest variable and the Hard, Serious, Easy and People fun preference variables. These are handily appended by R as additional columns on the end of the table of original data, so I ask R to describe just those columns:

> describe(newdata[90:94])

This gives me a little table describing the variables. It’s where the mean values I quoted last week came from. Looking at it again this week, it’s interesting to note the ranges of some of the scores, but the first thing I notice is that the standard deviation (SD) of the ludic.interest variable is noticeably lower than those of the fun preference variables. Those range between 15.31 for the Hard fun variable and 16.75 for the Serious fun variable, while the SD of ludic.interest is 11 (actually 0.11, but remember that the other fun variables are scored 0-100 and ludic.interest 0-1). The range of scores for ludic.interest is tighter too:

Variable         Range
ludic.interest   51
Hard             66
Easy             76
Serious          89
People           76
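
For the record, a minimal sketch of how those figures could be pulled out in R, assuming the derived variables really do sit in columns 90 to 94 of newdata and that describe() is the one from the psych package:

# The five derived variables (ludic.interest plus the four fun preferences)
prefs <- newdata[90:94]

# describe() from the psych package reports mean, SD, range and more
library(psych)
describe(prefs)

# Or compute the ranges directly: maximum minus minimum for each column
sapply(prefs, function(x) diff(range(x, na.rm = TRUE)))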

The Serious fun preference questions thus showed the most division among gamers. What’s particularly interesting is that the lowest score in that range is zero, so at least one respondent vehemently disagreed with all the statements associated with that preference. The same is true of the People fun variable.

That matrix at the top of the post suggests that, despite (or because of?) the wide range of the Serious fun variable, it’s the one that shows some correlation with all the other variables. Stronger correlation, in fact, than the People fun variable, which correlates poorly with all the other variables except Serious fun.

Let’s look at that in more detail. The Serious fun variable correlates most strongly with the Easy fun variable, with a correlation coefficient (r) of 0.52. Plot the two variables with a regression line and it looks like this:
plotSeriousEasy

Not a bad, shall we say “moderate”, relationship. For every point higher somebody scores on the Easy fun preference scale, they are likely to score 0.54 higher on the Serious fun scale. With a standard error of 0.09, the t value for this relationship is 5.9, and the corresponding p-value is very low at 0.00000006. So this appears to be a statistically significant relationship.

(You can see that respondent who disagreed with all the Serious fun statements in the bottom left; they weren’t that keen on Easy fun either, but at least scored 22 for that. Looking at the table of data, I find it’s the same respondent who also disagreed with all the People fun preferences, and scored 43.5 for Hard fun and 33 (0.33) for ludic.interest.)
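
For anyone following along, a minimal sketch of how a regression like that might be fitted and drawn in R; the column names Easy and Serious are illustrative, not necessarily what my table actually uses:

# Simple linear regression: Serious fun preference predicted by Easy fun preference
fit <- lm(Serious ~ Easy, data = newdata)
summary(fit)  # slope (~0.54), standard error, t value and p-value

# Scatter plot of the two variables with the fitted regression line
plot(newdata$Easy, newdata$Serious,
     xlab = "Easy fun preference", ylab = "Serious fun preference")
abline(fit)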

Let’s compare how that looks with the plot for the relationship between preferences for Hard fun and People fun, where the correlation coefficient is just 0.02:

plotHardPeople

Hardly any relationship at all then.

 

Gamer data: Fun preferences

After last week’s hair-pulling day of frustration, I’ve made a bit more progress. The survey contained seventeen questions which were based on the theory of four types of fun set out by Nicole Lazzaro. These were 101-point Likert scales, wherein the participant indicated their agreement with a statement using a slider with no scale and the slider “handle” position set randomly, to reduce systematic bias. Of course, these being Likert disagree/agree scales, I was still expecting clumping at one end or the other, despite my attempts to reduce that by making them 101-point scales. And so it proved, in many cases, as these histograms of the four questions I used as indicators of a preference for “Serious Fun” show.

Histx4SeriousFunIndicators

I never intended to do any correlations with the responses to the individual questions though. Instead my plan was to average out each individual’s responses to the indicator questions to create something more like a continuous variable which I could correlate with other responses. Doing that for the responses to the Serious Fun indicator questions, for example, turns the four clumpy histograms above into something a lot more like a “normal” curve.

histSeriousFun
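
That averaging step is a one-liner in R; here’s a minimal sketch, assuming the four Serious fun indicator questions sit in columns I’ve illustratively called serious1 to serious4:

# Average each respondent's answers to the four Serious fun indicator questions
# to get one, more continuous, preference score per person
newdata$Serious <- rowMeans(
  newdata[, c("serious1", "serious2", "serious3", "serious4")],
  na.rm = TRUE)

# The averaged variable looks much more like a normal curve
hist(newdata$Serious, breaks = 20,
     main = "Serious fun preference", xlab = "Mean of indicator questions")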

 

And the distributions of all four “Fun preferences” look like this (as curve plots this time, in case you were getting bored of histograms):

plotx4FunProfiles

You’ll note straight away that the “Hard Fun” curve is the one that most resembles a “normal” bell curve. Easy Fun has a distinct negative skew, and in fact all three of the others have a slight negative skew too. And there’s a distinct preference apparent in this sample for Hard and Easy Fun over Serious and People Fun. In fact, the most popular preference in this sample is for Easy Fun, where the mean stands at 70.8 and the median (in this most skewed of the four distributions) at 73.7. The mean of the Hard Fun distribution is 66.61; in third place is Serious Fun with a mean of 54.06, and trailing behind is People Fun with a mean of 42.22.
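
For reference, a rough sketch of how overlaid curves like these can be drawn in base R, again assuming the four preference columns are named Hard, Easy, Serious and People:

# Kernel density estimates for the four fun preference variables
dens <- lapply(newdata[, c("Hard", "Easy", "Serious", "People")],
               density, na.rm = TRUE)

# Overlay them on one plot, sizing the y axis to fit the tallest curve
plot(dens$Hard, xlim = c(0, 100),
     ylim = c(0, max(sapply(dens, function(d) max(d$y)))),
     main = "Fun preference profiles", xlab = "Preference score (0-100)")
lines(dens$Easy, lty = 2)
lines(dens$Serious, lty = 3)
lines(dens$People, lty = 4)
legend("topleft", legend = names(dens), lty = 1:4)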

I was a bit surprised that People Fun scores so poorly in this sample, but I guess I shouldn’t be because one of the questions I used to indicate a preference for People Fun was “I don’t actually like playing games all that much” which I don’t suppose is going to find much agreement among gamers after all.

Which raises the question: would People Fun preference correlate negatively with the Ludic Interest vector I created last week? But rather than look at that relationship on its own, let’s see how all the derived variables I’ve created relate to each other.

plotCorrFunPrefsAndLudic

So, People Fun correlates a bit with Serious Fun, but little else. Ludic Interest correlates less well with Hard Fun than I might have expected, though the Ludic Interest variable was admittedly an afterthought and the selection of games from which it was derived was by no means scientific. I might rethink that whole section next time. The Serious Fun vector correlates with the other variables more than I expected, and the little scatter plots look interesting, so next time I’ll investigate some of these relationships more deeply.

Bodiam data again

Yesterday, I said that I expected to see a strong negative correlation between “I didn’t learn very much new today” and “I learned about what Bodiam Castle was like in the past.” In fact, when I ran the correlation function in R, it came out at a rather miserly 0.33, much lower than I expected. So I asked R to draw me a scatterplot:

ScatterRegression(ghb$Didn.t.learn, ghb$Learned)

And there it is, some correlation, but not as much as I was expecting. (I added text labels to each data point, with row numbers on, as a quick and dirty way to see roughly where a single point represents more than one respondent.) I think this demonstrates two things. The first is that Likert scales can look awfully “categorical” when compared with true continuous numerical values. And the second is that I need a larger sample (if only to lessen the influence of outliers such as row 1, up in the top right-hand corner, which I fear may be my own inputting error on the first interview).
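
In case it’s useful, a minimal sketch of that check in R, using the column names from the figure caption (Didn.t.learn and Learned) and plain base graphics rather than whatever helper produced the plot above:

# Correlation between the two seven-point Likert items
cor(ghb$Didn.t.learn, ghb$Learned, use = "complete.obs")

# Scatter plot with a fitted regression line, labelling each point with its
# row number so overlapping respondents are easier to spot
plot(ghb$Didn.t.learn, ghb$Learned,
     xlab = "Didn't learn very much new", ylab = "Learned about the past")
abline(lm(Learned ~ Didn.t.learn, data = ghb))
text(ghb$Didn.t.learn, ghb$Learned, labels = rownames(ghb), pos = 3, cex = 0.7)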

So rather than faff around with individual pairings, I created a correlation matrix of all the seven-point Likert scale questions (there’s a sketch of the matrix code after the list). As well as the learning questions I mentioned in my last post, I used the Likert agreement scale for the following statements:

  • My sense of being in Bodiam Castle was stronger than my sense of being in the rest of the world
  • Bodiam Castle is an impressive sight
  • I was overwhelmed with the aesthetic/beauty aspect of Bodiam Castle
  • The visit had a real emotional impact on me
  • It was a great story
  • During my visit I remained aware of tasks and chores I have back at home/work
  • I enjoyed talking about Bodiam Castle with the others in my group
  • Bodiam Castle is beautiful
  • I wish I lived here when Bodiam Castle was at its prime, and
  • I enjoyed chatting with the staff and volunteers here
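
A minimal sketch of how that matrix can be built in R; the column names below are illustrative stand-ins for however the statements are actually labelled in my table:

# Gather the seven-point Likert columns (names here are placeholders)
likert <- ghb[, c("Didn.t.learn", "Learned", "Great.story", "Talking.group",
                  "Impressive", "Beautiful", "Aesthetic.beauty",
                  "Emotional.impact", "Sense.of.being", "Home.work")]

# Pairwise correlation matrix, rounded to two decimal places;
# "pairwise.complete.obs" copes with the odd missing answer
round(cor(likert, use = "pairwise.complete.obs"), 2)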

Looking through the results matrix, the strongest correlation that stands out (at 0.65) is between “It was a great story” and “I learned about what Bodiam Castle was like in the past.” Which is nice. But remember, correlation ≠ causation. Here, I wouldn’t even know where to start: did they admit to learning because the story was great, or was the story great because they learned about it? And of course neither distribution can be called “normal.” The “correlation” is helped by the skew in both distributions, of course.

Hist(ghb$learned$story1x2)
ScatterRegression(ghb$Great.story, ghb$Learned)

There’s also an interesting strong correlation (0.57) between “I enjoyed talking about Bodiam Castle with the others in my group” and “I learned about what Bodiam Castle was like in the past.” Though I’m not suggesting cause and effect here, I’d like to follow up on this.

Histx2+Scatter(ghb$Talking.group$Learning)

Similarly, there are correlations between the responses which agreed that Bodiam had a great story, and those who enjoyed chatting within their group as well as with staff.

What about the lowest in the matrix? Rather scarily, there seems to be zero correlation between the “Didn’t learn anything new” statement and emotional impact. I’ve already told you about my caveats over emotional impact as something you can measure this way anyway, but zero correlation (when rounded to two decimal places) sets alarm bells ringing about one of these arrays.

Histx2+Scatter(ghb$Did.nt.learn$Impact)

Another correlation from the matrix is between “My sense of being in Bodiam Castle was stronger than my sense of being in the rest of the world” and “During my visit I remained aware of tasks and chores I have back at home/work”, which I guess could/should be expected. It does raise an interesting question for the future, though. If I had to choose just one of these statements to include in a future survey, which would it be? Based on these histograms, I might choose the former, if only because it looks more “normal”:

Histx2(ghb$sense$home.work)

It’s also interesting that “Bodiam Castle is an impressive sight” correlates strongly with “Bodiam Castle is beautiful” (0.54) but less strongly with “I was overwhelmed with the aesthetic/beauty aspect of Bodiam Castle” (only 0.37). Those last two correlate strongly (0.55) with each other, of course.

Histx3+Scatterx3(ghb$aesthetics)

The “I wish I lived here when Bodiam Castle was at its prime” and “What I learned on the visit challenged what I thought I knew about medieval life” statements didn’t yield anything particularly interesting. I might drop them from the next survey. But what troubles me most, in an existential way, is the correlation between “I was overwhelmed with the aesthetic/beauty aspect of Bodiam Castle” and “The visit had a real emotional impact on me”.

ScatterRegression(ghb$aesthetic.beauty ~ ghb$Emotional.impact)

My whole career has been built around the idea that people want to know stuff, to learn things about places of significance. While it’s nice that aesthetics and emotions are closely bound, is there any space for the work I do?

Using R in anger

I’m expecting to be emailed the link to my final exam in the Coursera statistics course this weekend, and by way of revision I’m using my own data in R for the first time.

The first challenge is making sure my data is R “fit” – I entered it into Excel the first time, and though I made sure that even category data was entered numerically, I did some foolish things, the first of which was entering the data for the recommendation rating in categories rather than just putting the raw score in. So I’ve just gone back to my original collected data and re-inputted the raw numbers.

Then I needed to get the Excel data into R. I like this post on that subject, which starts off with “don’t put it into Excel in the first place.” It seems the quickest way is to [Ctrl]+[C] copy the relevant data in Excel, then in R enter > gh <- read.table("clipboard"). Let’s try that.

Warning message:
In read.table(“clipboard”) :
incomplete final line found by readTableHeader on ‘clipboard’

Oh dear, that doesn’t look good. What does the table look like in R?

V1 V2 V3
1 gh <- read.table(“clipboard”)

Hmmm, no. So then I use the not-lazy method, exporting the Excel sheet into a tab-delimited text file and then reading the table (remembering this time to add the argument header = T). Well, it seems to work, but the last column (my carefully inputted actual values for recommend) isn’t there. Why? I still don’t know, having tried again and again, adding extra blank buffer columns and reordering columns. All I get is:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 57 elements

So let’s try a third method, exporting the Excel sheet to a CSV (comma-separated values) file and using read.csv("gh.csv", header = T) rather than read.table.

Aha success! Though some (but not all) of the blank fields come up “NA” rather than empty. I’ll work out if I need to worry about that later.
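
For completeness, a minimal sketch of the import that finally worked, plus a quick look at where those NAs are:

# Read the exported CSV; empty cells come in as NA
gh <- read.csv("gh.csv", header = TRUE)

# Quick sanity checks: dimensions, column names and NAs per column
dim(gh)
names(gh)
colSums(is.na(gh))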

Lesson one learned. “Don’t put it into excel in the first place.”

But now that I have my data in R, I can at least have a look at it by using the describe() function (as long as I have loaded the package library(sm), which I have). The table that produces tells me that some of my category data is easily identifiable as such: the “Very enjoyable” five-category Likert question is full of NAs (and NaNs – I don’t know what that means). Same with the categories describing the groups that the respondents visited with, and the weather. That’s fine, I don’t need to worry about those yet. More importantly, the continuous numerical data has returned a number of descriptive statistics.

I asked a number of questions scored on a seven-point Likert scale, about how much respondents agreed with statements that: the experience added to their enjoyment of the visit, and added value to their visit; their choices changed the story; they felt an emotional impact; it was a great story; they learned about the history of the site; they were inspired to learn more; and they enjoyed listening as a group. I also asked them to score on a ten-point scale how likely they were to recommend the experience to friends and family.

You could argue that, being a Likert scale, this is categorical data, but I think there’s a strong case for arguing that it is continuous, especially if you collect the data with something like a QuickTap survey. That’s not the case with other Likert scale questions, like the one I asked, based on a National Trust survey question, about how much they enjoyed their visit. Those five options are clearly categorical.

The describe function shows a number of summary statistics for all the data I collected (there’s a sketch of the call after this list). For each question I asked, I can see (among other things):

  • the number of respondents;
  • the mean;
  • the standard deviation;
  • the median;
  • the range;
  • the skew;
  • the kurtosis; and,
  • the standard error.
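
A minimal sketch of that summary call, assuming describe() here is the one from the psych package (that’s where a describe() with skew and kurtosis lives; Hmisc has one too, but the sm package mentioned above doesn’t):

# Summary statistics (n, mean, sd, median, range, skew, kurtosis, se)
# for every column in the imported survey data
library(psych)
describe(gh)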

I got 39 responses for most questions; one person declined to answer the question on story, so I’ve only got 38 there. And only 34 answered the question on being a group (the other five came on their own).

Looking at the means and medians, you can see that some questions (adding to enjoyment; adding value; and learning) were answered very positively – the median is the top score, 7, the means range from 6.44 to 6.59, and nobody scored less than four. Make a histogram of these and you’ll see they are not “normal” curves. They are negatively skewed (“the skew is where there’s few”) and indeed, the describe function reports a skew of -1.72 for the “Added to my enjoyment” question.

Hist(gh$Added)
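
That histogram is one line of base R, assuming the column is called Added as in the caption:

# Histogram of the "Added to my enjoyment" scores, showing the negative skew
hist(gh$Added, breaks = 0:7, main = "Added to my enjoyment", xlab = "Score (1-7)")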

In contrast, the answers to the question on emotional impact are approaching something like “normal”:

Hist(gh$Emotion)

All of this might look a bit familiar, like the bar charts I got from Excel (with some wrestling to get them onto WordPress) in this post. I’ll admit I’m somewhat covering old ground here, but bear with me while I get my head around this new way of working before moving on. I have to say the ease of saving these histograms as JPEGs to upload to WordPress makes me never want to see another Excel sheet again. Doubtless I will, however. Let’s try another histogram, plotting responses to the 10-point Recommendation scale:

Hist(gh$Recommend)

Cool. Again, not “normal”, and so I know I shouldn’t be doing what I’m about to do, but I want to see if there’s any correlation between some of the responses to the 7-point Likert questions I asked and Recommendation. So, let’s try some regression!

ScatterRegression(gh$Recommend~gh$Emotion)

So, there is some correlation. (Remember, correlation does not equal causation.) This regression model of recommendation and reported emotional impact tells us that the predicted intercept (Recommendation with zero emotional impact) is 6.88, and the regression coefficient is about 0.49. R squared (the proportion of variance in recommendation explained by emotional impact) is about 30%, which seems reasonably chunky.
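
A minimal sketch of that model in R, assuming columns named Recommend and Emotion as in the caption:

# Simple linear regression: recommendation score predicted by emotional impact
fit <- lm(Recommend ~ Emotion, data = gh)
summary(fit)  # intercept (~6.88), slope (~0.49) and R-squared (~0.30)

# Scatter plot with the fitted line
plot(gh$Emotion, gh$Recommend,
     xlab = "Emotional impact (1-7)", ylab = "Recommendation (1-10)")
abline(fit)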

Let’s finish off with a Null Hypothesis Significance Test (NHST).

Pearson’s product-moment correlation tells us that t = 3.9748, degrees of freedom = 37, and the p-value = 0.000314. The p-value measures the probability of my obtaining these (or more extreme) data, given the assumption that there is no relationship between the variables; 0.05 is the arbitrary threshold conventionally used to mark the divide between retaining and rejecting the null hypothesis. As my p-value is well below 0.05, I can “reject the null”. Of course, I might have made a Type 1 error here.
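
The test itself is a single call in R, with the same assumed column names:

# Pearson's product-moment correlation: reports t, degrees of freedom,
# the p-value and a 95% confidence interval for r
cor.test(gh$Recommend, gh$Emotion)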

So what have I learned?

First of all, I’ve learned how long this takes. I’m only about a third of the way through what I learned on the course, and the email about my final exam is due tomorrow!

I’m also aware that I’ve “proved” a correlation between a variable with a “normal”(ish) distribution, and another with a very different distribution. My revision hasn’t got to what to do about non-normal distributions yet.

I’ve also learned that I need more data. When I’m making a scatterplot, more dots make me feel more confident in my findings. Though I’m reminded that NHSTs are biased by sample size, so that confidence can be false!

I also want more variance in my data. Likert scales on paper are limited, but in future I’m inclined to offer ten- or eleven-point scales where I can on paper, and, as I discovered with the data I collected at Bodiam, touch-based sliders offer an opportunity to create 100-point scales, which should discourage “clumpiness”.

And I’m still convinced I can be cleverer with the questions, to obtain more “normal” data distribution curves.