2021-07-20
I was asked to produce a report recently and, as these things do, it required some data cleaning.
Fortunately, I was working with a colleague who was more comfortable than I was with sed, and with data manipulation on the command line generally.
What follows is a contrived example of some of the manipulations we made in order to clean our data. While it’s unlikely that this will ever be useful again for the exact situation I came across, each piece is a new tool that I can reach for in the future. And spending the time now to understand them is how I’ll remember to actually reach for them!
Let’s get cracking on it then!
First up - let’s describe some sample data. We’ll use comma-separated values, though the principles would apply as well to data delimited by tab (TSV) or pipes.
Our example.csv:
"pk","fk","hasAttribute","_count"
"69870","80576","true","12"
"142927","80776","true","6"
"98738","57138","false","4496"
"98828","57138","true","200"
Now that we have that in order, we can start doing some analysis. The questions we’ll ask:
We’ll cover the first one today and the others later (i.e., as I learn how to do them!).
Thinking about the transformations we’ll want to do here:
To focus in on a specific column, we can use cut:
$ cat example.csv |
cut -d, -f2
"fk"
"80576"
"80776"
"57138"
"57138"
From the manual pages for cut[1]:
-d, --delimiter=DELIM
use DELIM instead of TAB for field delimiter
-f, --fields=LIST
select only these fields; also print any line that
contains no delimiter character, unless the -s option is
specified
-d, then indicates that we are working with a comma-delimited file. -f2 says we are interested in the second list item. The list here is what's created by the delimiters. So, the list generated by the first row was "pk", "fk", "hasAttribute", and "_count".
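Since -f takes a LIST, cut can also pull several columns at once. Here's a quick sketch (recreating a couple of rows of the sample data inline so the snippet stands alone):

```shell
# Recreate two rows of the sample data so this runs standalone
printf '%s\n' \
  '"pk","fk","hasAttribute","_count"' \
  '"69870","80576","true","12"' > example.csv

# -f1,3 selects fields 1 and 3; ranges like -f2-4 also work
cut -d, -f1,3 example.csv
```

This prints the first and third fields of each row, quotes and all, still comma-separated.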
If we want to strip all of the quotation marks from the file (in this case because we want to be able to treat our keys as numbers), we can use sed:
sed 's/"//g' example.csv
pk,fk,hasAttribute,_count
69870,80576,true,12
142927,80776,true,6
98738,57138,false,4496
98828,57138,true,200
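As an aside, deleting a fixed set of characters is also exactly what tr -d does, so the same cleanup could plausibly be done without a regex. A sketch against the same sample data, recreated inline:

```shell
# Recreate two rows of the sample data so this runs standalone
printf '%s\n' \
  '"pk","fk","hasAttribute","_count"' \
  '"69870","80576","true","12"' > example.csv

# tr -d deletes every occurrence of the listed characters from stdin
tr -d '"' < example.csv
```

Both approaches remove every double quote; sed just happens to generalize to fancier substitutions.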
In our case, since we only care about the keys, we can remove the header. To do this, we can use grep's -v flag to invert the match:
% man grep
...
-v, --invert-match
Selected lines are those not matching any of the specified patterns.
Combining this with our cut from above:
$ cat example.csv |
cut -d, -f2 |
grep -v fk
"80576"
"80776"
"57138"
"57138"
Now that we have just the data from the one column we’re interested in, we can collapse the values into a single row.
To accomplish this, we can use a bit of a trick with xargs.
As a reminder (from the manual):
The xargs utility works by reading “space, tab, newline and end-of-file delimited strings from the standard input” and executing a utility with the strings as arguments.
% cat example.csv |
xargs echo
pk,fk,hasAttribute,_count 69870,80576,true,12 142927,80776,true,6 98738,57138,false,4496 98828,57138,true,200
Notice that the rows are space delimited now. We’ll come back to this in a second.
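That collapsing behavior is easy to see in isolation: xargs splits its input on whitespace, newlines included, and hands every token to a single invocation of echo:

```shell
# Three newline-separated tokens become one space-separated echo call
printf 'a\nb\nc\n' | xargs echo
```

The three input lines come back out as one line, "a b c", which is exactly what happened to our CSV rows above.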
We now have almost all of the pieces we’ll need to get our list of foreign keys:
% cat example.csv |
sed 's/"//g' |
cut -d, -f2 |
grep -v fk |
xargs echo |
sed 's/ /,/g'
80576,80776,57138,57138
Et voilà! The only thing we needed to do to get our comma-separated list was one more sed command, replacing the spaces that xargs introduced when it collapsed the echo output into a single row.
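For what it's worth, joining lines with a delimiter is also precisely the job of paste -s, so the xargs-plus-sed trick could likely be sketched as a single stage instead (same sample data, recreated inline):

```shell
# Recreate the full sample data so this runs standalone
printf '%s\n' \
  '"pk","fk","hasAttribute","_count"' \
  '"69870","80576","true","12"' \
  '"142927","80776","true","6"' \
  '"98738","57138","false","4496"' \
  '"98828","57138","true","200"' > example.csv

# paste -s joins all input lines into one; -d, puts a comma between them
sed 's/"//g' example.csv | cut -d, -f2 | grep -v fk | paste -sd, -
```

Either way, the output is the same single comma-separated row of foreign keys.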
[1]: cut manual pages