Science is looking closely at all sorts of things you probably don’t know about! — Lily Grosh
A friend shared this little nugget of wisdom from her daughter and I loved it. Sure, it’s not as complete a definition as Francis Bacon or Karl Popper might offer. Nevertheless, it captures an essential feature of science too easily overlooked. We say science deals with facts and proofs, we hear how science is right and just about everything else is wrong, and we get the impression science works exclusively with the known and the certain. While we have accumulated a body of knowledge through scientific investigation, the unknown plays a big part in what scientists do all day.
Scientists love the unknown; it gives them something to do. When they don’t find new questions to ask, they can be disappointed. The Large Hadron Collider let physicists down, for example. Yes, the Higgs boson discovery was genuinely a big deal. It provided important confirmation of a scientific theory; those confirmations distinguish science from just making stuff up. But without anything unexpected, there’s no need to come up with new explanations or figure out ways to confirm them. There’s no opportunity for creativity, and, like many of us, scientists are wired to create.
Here’s a tiny glimpse of the unknown from my day job. We collect healthcare data including patients’ date of birth. I was exploring the quality of that data to see how often birth dates are formatted or interpreted incorrectly. I made the chart below to help me. The year of the birth date runs horizontally, increasing from left to right. The day of the year, from 1 to 366, runs vertically from bottom to top. The number of patients with a given birth date is represented by the color; smaller numbers are at the red end of the spectrum and larger numbers are at the blue/purple end. We don’t expect the number of birth dates to vary over the course of the year, so any patterns in the vertical dimension might indicate a quality issue to investigate. I encourage you to play along with me; take a look at the chart and see what jumps out at you.
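For readers who want to try this on their own data, here is a minimal sketch of the counting step behind a chart like this one. It is not the code I used at work; the function name `birth_date_grid` and the sample dates are made up for illustration, and it assumes the birth dates have already been parsed into Python `date` objects.

```python
from collections import Counter
from datetime import date


def birth_date_grid(birth_dates):
    """Count records per (year, day-of-year) cell.

    The year would go on the horizontal axis, the day of the year
    (1 to 366) on the vertical axis, and the count sets the color.
    """
    counts = Counter()
    for d in birth_dates:
        day_of_year = d.timetuple().tm_yday  # 1..366
        counts[(d.year, day_of_year)] += 1
    return counts


# Hypothetical sample records, not real patient data.
sample = [
    date(1957, 3, 14),
    date(1957, 3, 14),
    date(1984, 12, 31),
    date(2000, 1, 1),
]
grid = birth_date_grid(sample)
```

Each key in `grid` is one cell of the chart, so handing the counts to any heatmap tool should reproduce the layout described above.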
Did you have a chance to look over the chart? Great. Anything stand out to you? Here are a few things I noticed. As you look left to right, the number of patients increases smoothly. We expected this, since there are more 99-year-olds than 100-year-olds, more 98-year-olds than 99-year-olds, etc. But yellow transitions to green more sharply; you can actually see the Baby Boom starting in the mid-1940s.
The left-to-right upward trend reverses in the mid to late 1990s. Population demographics may play a role, but we also need to remember the source of these birth dates. Young people also don’t go to the hospital as often as older people. Until you get to the very young people, those 3 and under; they do visit in pretty large numbers. You might also notice that I made this chart at the end of 2015, since there are almost no patients with birth dates in late 2015.
All of those observations are neat and confirm that our chart reveals expected and known patterns. They don’t tell us anything new yet. More relevant to my job are the colors along the very bottom of the chart, for the first day of the year. They may be hard to see, but the first day of the year always has the most births. That’s not a real fact about people; sometimes the hospital doesn’t get a full birth date and can only narrow it down to a year, so apparently they default to January 1st of that year. Actually, January 1st, 1900 and January 1st, 1901 are by far the most common birth dates we receive (you can’t tell from this chart because I had to leave them off to be able to see all the other patterns). Those are the defaults in the hospital systems when not even a birth year is collected.
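One way to put a number on that January 1st excess is to compare the count on the first day of each year with the average count on the other days of the same year. The sketch below is an illustration under assumptions, not our production check: the function name `january_first_excess` and the synthetic sample are hypothetical, and the only values taken from the discussion above are the two placeholder dates.

```python
from collections import Counter
from datetime import date, timedelta

# The two system defaults mentioned above; left out, as in the chart.
PLACEHOLDERS = {date(1900, 1, 1), date(1901, 1, 1)}


def january_first_excess(birth_dates):
    """Per year, ratio of the Jan 1 count to the mean count on other days."""
    per_year = Counter()
    jan1 = Counter()
    for d in birth_dates:
        if d in PLACEHOLDERS:
            continue
        per_year[d.year] += 1
        if d.month == 1 and d.day == 1:
            jan1[d.year] += 1

    excess = {}
    for year, total in per_year.items():
        # Number of days in the year other than Jan 1 (364 or 365).
        other_days = (date(year, 12, 31) - date(year, 1, 1)).days
        mean_other = (total - jan1[year]) / other_days
        excess[year] = jan1[year] / mean_other if mean_other else None
    return excess


# Synthetic example: one birth on every day of 1981, nine extra
# records defaulted to Jan 1, and fifty placeholder records.
every_day = [date(1981, 1, 1) + timedelta(days=i) for i in range(365)]
sample = every_day + [date(1981, 1, 1)] * 9 + [date(1900, 1, 1)] * 50
excess = january_first_excess(sample)
```

A ratio near 1 would mean January 1st looks like any other day; a ratio well above 1 suggests that partial dates are being defaulted.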
The two most surprising (to me) observations are left. First, do you see the diagonal lines? They start subtly in the late 1960s and by the early 2000s they are very pronounced. At first I thought I must have done something wrong when I made the chart; I never expected such a striking pattern. The only other explanation I could think of is that more people are born on certain days of the week than others. Turns out, that is actually how the world works, and people who study such things are well aware. Inductions, scheduled C-sections, and other features of modern childbirth have reduced weekend births relative to weekday births. This shows up as diagonal patterns because a year is 52 weeks plus 1 day (or 2 days in a leap year), so the day of the week relative to the day of the year shifts down 1 (or 2) lines each year. Even though I was rediscovering a well-known feature of reality, it was still exciting because it was unknown to me.
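The drift behind those diagonals is easy to check for yourself. This small sketch (the helper name `first_weekday_of_year` is mine, purely for illustration) finds where the first Saturday lands in each of several consecutive years.

```python
from datetime import date


def first_weekday_of_year(year, weekday=5):
    """Day of the year (1..7) of the first given weekday.

    Weekdays follow Python's convention: Monday is 0, Saturday is 5,
    Sunday is 6.
    """
    jan1 = date(year, 1, 1)
    offset = (weekday - jan1.weekday()) % 7
    return offset + 1


for year in range(2000, 2006):
    print(year, first_weekday_of_year(year))
```

After an ordinary year the first Saturday falls one day earlier in the next year, and after a leap year such as 2000 or 2004 it falls two days earlier, wrapping around within the first week. Every later Saturday in the year moves with it, which is why a weekend dip traces a diagonal across the chart.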
The other surprise is those two dark blue spots in the late 1950s within a large green region. They turned out to represent two individuals who visited hospitals nearly every day for years, causing their personal birth dates to show up disproportionately. While it is known that some people use the hospital frequently, we don’t always expect to detect them in anonymized data. In particular, one of the people appeared to visit multiple hospitals, a behavioral pattern that is hard to identify even with complete data. While not earth-shattering, that little discovery was genuinely novel and relevant to our objectives at work, which was very satisfying.
Hopefully you enjoyed that little exploration. Maybe it gave you a little taste of what it’s like to look closely at something you don’t know much about. I also hope it helps explain why highlighting what a scientist doesn’t know isn’t a way to score points. For example, last week we discussed how little we know about consciousness. While it might be tempting to say “Aha! See, science doesn’t know everything!” (tempting and also accurate), that observation won’t convince a scientist there’s a problem. On the contrary, you’re just pointing them in the direction of a blank canvas and handing them a paintbrush.