2021-07-20
I was asked to produce a report recently and, as these things do, it required some data cleaning.
Fortunately, I was working with a colleague who was more comfortable than I was with sed, and with data manipulation on the command line generally.
What follows is a contrived example of some of the manipulations we made in order to clean our data. While it’s unlikely that this will ever be useful again for the exact situation I came across, each piece is a new tool that I can reach for in the future. And spending the time now to understand them is how I’ll remember to actually reach for them!
Let’s get cracking on it then!
First up - let’s describe some sample data. We’ll use comma-separated values, though the principles would apply as well to data delimited by tab (TSV) or pipes.
Our example.csv:
"pk","fk","hasAttribute","_count"
"69870","80576","true","12"
"142927","80776","true","6"
"98738","57138","false","4496"
"98828","57138","true","200"
Now that we have that in order, we can start doing some analysis. The questions we’ll ask:
We’ll cover the first one today and the others later (i.e., as I learn how to do them!).
Thinking about the transformations we’ll want to do here:
To focus in on a specific column, we can use cut:
$ cat example.csv |
cut -d, -f2
"fk"
"80576"
"80776"
"57138"
"57138"
From the manual pages for cut[1]:
-d, --delimiter=DELIM
use DELIM instead of TAB for field delimiter
-f, --fields=LIST
select only these fields; also print any line that
contains no delimiter character, unless the -s option is
specified
-d, then indicates that we are working with a comma-delimited file. -f2 says we are interested in the second list item. The list here is what's created by the delimiters. So, the list generated by the first row was "pk", "fk", "hasAttribute", and "_count".
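Since -f takes a LIST, cut can also pull several columns at once. Here's a quick sketch (recreating a couple of rows of the sample data inline so the snippet stands alone):

```shell
# Recreate two rows of the sample data so this runs standalone
printf '%s\n' \
  '"pk","fk","hasAttribute","_count"' \
  '"69870","80576","true","12"' > example.csv

# -f1,3 selects fields 1 and 3; ranges like -f2-4 also work
cut -d, -f1,3 example.csv
```

This prints the first and third fields of each row, quotes and all, still comma-separated.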
If we want to strip all of the quotation marks from the file (in this case because we want to be able to treat our keys as numbers), we can use sed:
sed 's/"//g' example.csv
pk,fk,hasAttribute,_count
69870,80576,true,12
142927,80776,true,6
98738,57138,false,4496
98828,57138,true,200
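As an aside, deleting a fixed set of characters is also exactly what tr -d does, so the same cleanup could plausibly be done without a regex. A sketch against the same sample data, recreated inline:

```shell
# Recreate two rows of the sample data so this runs standalone
printf '%s\n' \
  '"pk","fk","hasAttribute","_count"' \
  '"69870","80576","true","12"' > example.csv

# tr -d deletes every occurrence of the listed characters from stdin
tr -d '"' < example.csv
```

Both approaches remove every double quote; sed just happens to generalize to fancier substitutions.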
In our case, since we only care about the keys, we can remove the header. To do this, we can use grep's -v flag to invert the match:
% man grep
...
-v, --invert-match
Selected lines are those not matching any of the specified patterns.
Combining this with our cut from above:
$ cat example.csv |
cut -d, -f2 |
grep -v fk
"80576"
"80776"
"57138"
"57138"
Now that we have just the data from the one column we’re interested in, we can collapse the values into a single row.
To accomplish this, we can use a bit of a trick with xargs.
As a reminder (from the manual):
The xargs utility works by reading “space, tab, newline and end-of-file delimited strings from the standard input” and executing a utility with the strings as arguments.
% cat example.csv |
xargs echo
pk,fk,hasAttribute,_count 69870,80576,true,12 142927,80776,true,6 98738,57138,false,4496 98828,57138,true,200
Notice that the rows are space delimited now. We’ll come back to this in a second.
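That collapsing behavior is easy to see in isolation: xargs splits its input on whitespace, newlines included, and hands every token to a single invocation of echo:

```shell
# Three newline-separated tokens become one space-separated echo call
printf 'a\nb\nc\n' | xargs echo
```

The three input lines come back out as one line, "a b c", which is exactly what happened to our CSV rows above.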
We now have almost all of the pieces we’ll need to get our list of foreign keys:
% cat example.csv |
sed 's/"//g' |
cut -d, -f2 |
grep -v fk |
xargs echo |
sed 's/ /,/g'
80576,80776,57138,57138
Et voilà! The only thing we needed to do to get our comma-separated list was one more sed command, replacing the spaces that xargs introduced when it collapsed the echo output into a single row.
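For what it's worth, joining lines with a delimiter is also precisely the job of paste -s, so the xargs-plus-sed trick could likely be sketched as a single stage instead (same sample data, recreated inline):

```shell
# Recreate the full sample data so this runs standalone
printf '%s\n' \
  '"pk","fk","hasAttribute","_count"' \
  '"69870","80576","true","12"' \
  '"142927","80776","true","6"' \
  '"98738","57138","false","4496"' \
  '"98828","57138","true","200"' > example.csv

# paste -s joins all input lines into one; -d, puts a comma between them
sed 's/"//g' example.csv | cut -d, -f2 | grep -v fk | paste -sd, -
```

Either way, the output is the same single comma-separated row of foreign keys.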
[1]: cut manual pages