September 12, 2024 | tags: bash, python
At my current job, I felt the need for more bash-related efficiency while dealing with large log files generated by PLCs. It is a high-risk industrial environment, with a lot of restrictions and bureaucracy around using FOSS tools. This has set a very interesting precedent for me: it is the first job where I have needed to be more Linux oriented. I consider myself a regular linuxer, with a not-so-broad knowledge of the toolset and a very shallow understanding of how to use it.
With that in mind, I took some time to read more about some of these tools and how they could help me deal with daily challenges. The most useful and powerful, so far, is grep. With its power for querying and finding things, it is the one I use most. It has lots of options and flexibility, and it is when dealing with large and multiple log files, or filtering for specific content, that using it with regular expressions comes into play.
It feels weird to say this since grep is an acronym for "Global Regular Expression Print", but I never felt the need to actually use regex in grep in my 15 years of Linux desktop usage. Even when I eventually needed to search for stuff in sysadmin-related tasks or package builds, locate would suffice. grep, however, combines well with a whole range of other tools such as sed, find and awk.
The truth is, along with the avoidance, I was never good with regex and I always managed to get by without it. Of course, here and there regex comes in handy, but it was never a daily necessity... until now.
Why REGEX?
Although we forget it sometimes, regex has a huge importance in how we build our programming tools. From IDEs to compilers, regex is how we deal with day-to-day discoverability in programming-language-related tasks. Its practical history starts with Ken Thompson, who needed a powerful way to look for content in a text editor through advanced pattern matching. The idea came from the algebraic notation created by the mathematician Stephen Kleene, which carries his name (Kleene algebra) and was inspired by the McCulloch-Pitts neural automata model.
And an automaton is basically how we should picture the way regex works.
So I'll share what I've been up to with it. For practice, I'll use GNU grep (v3.11) and the public domain book The Illustrated Key to The Tarot, hosted by Project Gutenberg. It gives us a search context if we want to explore particular info, and it has some mark-up in it, so we can search for some HTML and CSS related syntax if needed.
Intro
I managed to get my hands on Jeffrey E. F. Friedl's Mastering Regular Expressions and it's been a good book so far for what I wanted to learn about regex. It isn't too formal on the subject, to be honest. It feels like a conversation with the author, which is nice in its own way. One statement that caught my attention and helped me learn regex properly is that we can think of it as a programming language of its own, with syntax, "keywords", conditionals, repetitions and so forth.
I think this sets the proper expectation of the workload and effort needed to learn it. I subdivided it into bits so the learning path can be similar to learning any other programming language.
Syntax
Characters and Symbols
The atom of the regex universe is the character. Since we deal with words to build pattern matching with it, we need to break down how to look at the basic component itself.
The word 'banana' is composed of 3 letters a, 2 letters n and 1 letter b.
To look at it properly, though, we need to be more precise, since the sequence matters. It is
b followed by a followed by n followed by a followed by n followed by a
that determines the word banana that we want to find.
We could do worse with Banana. It can be interpreted as the same word in English, but it is not the same pattern written in a text file. It starts with B, i.e. uppercase b, which is a different character! So it is
B followed by a followed by n followed by a followed by n followed by a
which spells the word banana as well, but it is the pattern Banana, not the pattern banana.
We could do even worse with b@N4nA. It is not a valid English word, but we can still read it as 'banana' by forcing the meaning. It is indeed a different pattern.
So we need to be pedantic about characters and symbols while dealing with regex pattern matching.
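To see this pedantry in practice, here is a minimal sketch. The sample file and its three lines are made up for illustration and are not part of the book; -c (count matching lines) and -i (ignore case) are regular grep flags.

```shell
# Hypothetical sample file, created only for this illustration.
sample="$(mktemp)"
printf '%s\n' 'banana' 'Banana' 'b@N4nA' > "$sample"

# Each pattern matches only the line holding exactly that character sequence.
lower=$(grep -E -c 'banana' "$sample")
upper=$(grep -E -c 'Banana' "$sample")
leet=$(grep -E -c 'b@N4nA' "$sample")
echo "banana: $lower, Banana: $upper, b@N4nA: $leet"

# -i asks grep to ignore case, so 'banana' and 'Banana' both match,
# but 'b@N4nA' still does not: '@' and '4' are different characters.
nocase=$(grep -E -i -c 'banana' "$sample")
echo "banana with -i: $nocase"

rm -f "$sample"
```

Each exact pattern should be counted once, while the case-insensitive search should count two lines.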
grep
Also created by Ken Thompson, it is a piece of the search engine of the 'ed' text editor, extracted and put into a command line tool. Here we are going to use the GNU grep implementation, version 3.11, always passing the -E flag. From the man page:
Pattern Syntax
-E, --extended-regexp
Interpret PATTERNS as extended regular expressions (EREs, see below).
...
In GNU grep, basic and extended regular expressions are merely different notations for the same pattern-matching functionality.
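For the GNU grep engine it really is just a difference of notation: in the basic syntax (BRE) the metacharacters '?', '+', '{', '|', '(' and ')' only get their special meaning when preceded by a backslash, while in the extended syntax (ERE) they are special as written. It isn't the entire story elsewhere, though: in strict POSIX BRE, as implemented by some other greps, features such as alternation are not available at all. So we will simply ignore the basic syntax from now on.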
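The difference of notation can be sketched with a small experiment. It assumes GNU grep, as used throughout this post (other implementations may treat the backslashed form differently), and the sample lines are made up for illustration.

```shell
# Hypothetical sample file, created only for this illustration.
sample="$(mktemp)"
printf '%s\n' 'The Fool' 'the foolish' 'Magician' > "$sample"

# ERE (-E): '|' means alternation as written.
ere=$(grep -E -c 'Fool|Magician' "$sample")

# BRE (the default, GNU grep): the same alternation needs a backslash.
bre=$(grep -c 'Fool\|Magician' "$sample")

# BRE with an unescaped '|': it is a literal character, so no line matches.
# grep exits with status 1 when nothing matches, hence the '|| true'.
literal=$(grep -c 'Fool|Magician' "$sample" || true)

echo "ERE: $ere, BRE escaped: $bre, BRE unescaped: $literal"
rm -f "$sample"
```

The two first counts should agree, which is the reason for sticking to -E: fewer backslashes to keep track of.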
Hands on
Download The Illustrated Key to The Tarot and save it as illustrated_key_to_the_tarot.html somewhere, for instance in your downloads folder located at /home/your_user/downloads. This place will be referred to from now on as $WORKSPACE.
Let's look for the 'Fool' pattern in the text:
$ grep -E 'Fool' $WORKSPACE/illustrated_key_to_the_tarot.html
card, number nothing—<i>The Fool, Mate, or Unwise Man</i>. Court
arrangement of the cards has never transpired. The Fool carries
and Fool, Venus and the Star, Mars and the Chariot, Saturn and
the list of which is as follows: Fool, Emperor, Pope, Lovers,
zero card of the Fool is allocated, as it always is, to the place
but he only made bad worse by allocating the Fool to the place
<div class="figcenter"> <img src="images/i_082.jpg" alt="The Fool" width="234" height="400" />
of the office of Mystic Fool, as a part of his process in
Fool signifies the flesh, the sensitive life, and by a peculiar satire
regarding the Fool, which is the most speaking of all the
Let the reader compare them with symbols like the Fool, the
<p class="ind15"><i>Zero.</i> <i>The Fool.</i>—Folly, mania, extravagance, intoxication,
<p>Thus, the Fool may indicate the whole range of mental phases
separately and in combination the Magician, the Fool, the High
is vague—about <span class="smcap">B. C.</span> 300. The Fool represents the primordial
the Fool is its fermentation; and, in fine, the last card, or the
We can see that there is a lot to be done just to get some context of the word.
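A first, modest step towards that context could look like the sketch below. To keep it self-contained it does not read the book itself: it uses a tiny made-up file holding three lines copied from the outputs in this post. The -c and -n flags are standard grep options.

```shell
# Hypothetical excerpt, created only for this illustration.
sample="$(mktemp)"
printf '%s\n' \
  'arrangement of the cards has never transpired. The Fool carries' \
  'and Fool, Venus and the Star, Mars and the Chariot, Saturn and' \
  'the pains of fools.</p>' > "$sample"

# -c counts the matching lines instead of printing them.
count=$(grep -E -c 'Fool' "$sample")
echo "matching lines: $count"

# -n prefixes every matching line with its line number in the file,
# which tells us where to go and read around the match.
numbered=$(grep -E -n 'Fool' "$sample")
echo "$numbered"

rm -f "$sample"
```

On the real file, the same flags give a quick idea of how often the pattern shows up and where.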
Now, let's look for the 'fool' pattern in the text:
$ grep -E 'fool' $WORKSPACE/illustrated_key_to_the_tarot.html
(Nature) is foolishness with men does not create a presumption
that the foolishness of this world makes in any sense for Divine
<i>Reversed</i>: Failure, foolish designs. Another account speaks of
the pains of fools.</p>
not Egyptian at all. To be frank, these kinds of foolery may be
The pattern was found, but it might not be what we expected, since usually we want to find a word such as 'fool' occurring on its own.
This leads us to the need to find non-embedded patterns in files.
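One possible way to do that, shown here only as a quick sketch, is grep's -w flag, which keeps a match only when it is not surrounded by other word characters. The first two sample lines come from the output above; the third one is invented so that there is something to find.

```shell
# Hypothetical sample file, created only for this illustration.
sample="$(mktemp)"
printf '%s\n' \
  '(Nature) is foolishness with men does not create a presumption' \
  'the pains of fools.</p>' \
  'only a fool would say so' > "$sample"

# Plain pattern: 'fool' embedded in 'foolishness' and 'fools' also counts.
embedded=$(grep -E -c 'fool' "$sample")

# -w: only 'fool' standing as a whole word counts.
whole=$(grep -E -w -c 'fool' "$sample")

echo "plain: $embedded, with -w: $whole"
rm -f "$sample"
```

The plain search should count all three lines, while the -w search should keep only the invented one.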