One of the greatest pleasures of home ownership is the opportunity to work in the garden. Gardening is fulfilling for several reasons. The accomplishment is satisfying and tangible, unlike a lot of office work. Gardening is great physical exercise, involving a range of low-impact and core-intensive body movements. Gardeners get time outdoors, bolstering vitamin D levels and exposure to fresh air. By handling soil, our bodies develop resistance to germs and bacteria, or so rumor has it. The work is solitary and meditative, improving mindfulness. The home-grown and fresh-picked produce tastes better and is higher in nutrients. Gardening is a kind of cure-all for wellbeing.
Soil cultivation is extremely similar to the practice of cleaning a data set.
My first round of soil cultivation was on a patch of land that previously had a garden shed sitting on top of it. There was no organic matter in the soil at all, just a dense patch of dust. The soil required a major intervention to become useful. The same is true of a new data set. It’s great to have new data, but I just know it’s going to require a lot of attention before I can use it properly. The data will lack clear labels, there will be columns or fields that are useless in some way, and some data points will need to be converted. Sometimes it simply arrives in the wrong format, such as on paper or in ASCII, or built around a different software environment. The certainty that I must put work into it in order to get something back turns this into “real” work.
With data I typically find that some fields are all wrong, and I need to track down or create “lookup” tables that convert the raw data into something that I know is accurate. I don’t like throwing out data. I prefer to keep the dirty data on the left-hand side of a spreadsheet and, to the right of a thick vertical line, create a modified column or field. The raw data is black, and the modified data has color, so I know what I’m working with. I also give the modified data more explicit labels, in succinct but plain English. Many dubious data fields are suddenly rendered “accurate” by a good label.
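For anyone who works in code rather than a spreadsheet, the same habit can be sketched in a few lines of Python. This is only an illustration, not a prescribed method. The field names (`st`, `state_name_clean`), the lookup table, and the sample rows are all made up for the example, and a real data set would need its own lookup values.

```python
# Hypothetical lookup table: raw codes as they arrive -> values I trust.
STATE_LOOKUP = {
    "calif": "California",
    "ca": "California",
    "n.y.": "New York",
    "ny": "New York",
}


def add_clean_column(rows, raw_field, clean_field, lookup):
    """Return new rows with the raw field kept and a cleaned field added."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy, so the raw data is never overwritten
        key = row[raw_field].strip().lower()
        # Unknown values become None rather than a guess, so they stand out.
        new_row[clean_field] = lookup.get(key)
        cleaned.append(new_row)
    return cleaned


# Made-up sample data with a terse, unhelpful raw label ("st").
raw_rows = [
    {"id": 1, "st": "Calif"},
    {"id": 2, "st": " NY "},
    {"id": 3, "st": "??"},
]

# The cleaned column gets a more explicit, plain-English label.
clean_rows = add_clean_column(raw_rows, "st", "state_name_clean", STATE_LOOKUP)
for row in clean_rows:
    print(row)
```

The raw `st` value stays in every row, much like the black column to the left of the vertical line, while the new field plays the part of the colored column on the right. Values the lookup table does not recognize are left empty instead of being guessed at, so they are easy to spot and track down later.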
As with my fully remediated soil, my fully amended data set means that I am ready to roll. I can work with a converted and color-coded batch of cultivated data, culled of garbage and meaningless fields, and turned into something useful.
I know that some people think of a garden as a place where plants grow. And some people think of data as something that is capable of producing analytic insights. In both cases, there is something deeply human about taking a mess – soil or data – and turning it into something more. We advance civilization one step at a time, one cubic foot at a time, one data point at a time. Sometimes we just need to break a sweat, get some sun, work our bodies, and build immunity.