Regular expressions in AWK programming: What, Why, and How

AWK is a pattern-matching language. It searches for a pattern in a file and, upon finding a match, performs the associated action on the input line. The pattern can be a fixed string or a variable pattern of text, and variable patterns are generally described with regular expressions. Hence, regular expressions form an important part of the AWK programming language.

Today we introduce regular expressions in AWK programming and get started with string-matching patterns and the basic constructs to use with AWK.

This article is an excerpt from a book written by Shiwang Kalkhanda, titled Learning AWK Programming.

What is a regular expression?

A regular expression, or regexpr, is a set of characters used to describe a pattern. A regular expression is generally used to match lines in a file that contain a particular pattern. Many Unix utilities, such as grep, sed, and awk, operate on plain text files line by line. Regular expressions search for a pattern on a single line in a file. A regular expression doesn’t search for a pattern that begins on one line and ends on another. Other programming languages may support this, notably Perl.

Why use regular expressions?

Most editors can perform search-and-replace operations: some can only search for patterns, others can also replace them, and others can also print the lines containing a pattern. A regular expression goes many steps beyond this simple search, replace, and print functionality, and hence it is more powerful and flexible. We can search for a word of a certain size, such as a word that has four letters or digits. We can search for a word that ends with a particular character, let’s say e. We can search for phone numbers, email IDs, and so on, and can also perform validation using regular expressions. They simplify complex pattern-matching tasks and hence form an important part of AWK programming.
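As a rough illustration of the two searches mentioned above (a word with exactly four letters, and a word ending in e), here is a minimal sketch. The sample sentence and the variable names are made up for this example and do not come from the book; the anchors, character classes, and the ~ operator it relies on are explained in the sections that follow.

```shell
# Hypothetical sample text, made up for this sketch.
text='Jane wrote code while Emily and Sam read the docs'

# Collect every word made of exactly four letters.
# Each [A-Za-z] matches one letter; ^ and $ pin the match to the whole word.
four=$(printf '%s\n' "$text" | awk '{
    for (i = 1; i <= NF; i++)
        if ($i ~ /^[A-Za-z][A-Za-z][A-Za-z][A-Za-z]$/) print $i
}')

# Collect every word that ends with the letter e.
ends_e=$(printf '%s\n' "$text" | awk '{
    for (i = 1; i <= NF; i++)
        if ($i ~ /e$/) print $i
}')

printf 'four letters:\n%s\n' "$four"
printf 'ends with e:\n%s\n' "$ends_e"
```

The loop walks over the fields of each line so that the regular expression is tested against one word at a time rather than against the whole line.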
Other regular expression variations also exist, notably those for Perl.

Using regular expressions with AWK

There are mainly two types of regular expressions in Linux:

- Basic regular expressions, which are used by vi, sed, grep, and so on
- Extended regular expressions, which are used by awk, nawk, gawk, and egrep

Here, we will refer to extended regular expressions as regular expressions in the context of AWK.

In AWK, regular expressions are enclosed in forward slashes, '/', (forming the AWK pattern) and match every input record whose text belongs to that set. The simplest regular expression is a string of letters, numbers, or both that matches itself. For example, here we use the ly regular expression to print all lines that contain the ly pattern. We just need to enclose the regular expression in forward slashes in AWK:

    $ awk '/ly/' emp.dat

The output on execution of this code is as follows:

    Billy Chabra 9911664321 [email protected] M lgs 1900
    Emily Kaur 8826175812 [email protected] F Ops 2100

In this example, the /ly/ pattern matches when the current input line contains the ly sub-string, either as ly itself or as part of a bigger word, such as Billy or Emily, and prints the corresponding line.

Regular expressions as string-matching patterns with AWK

Regular expressions are used as string-matching patterns with AWK in the following three ways. We use the '~' and '!~' match operators to perform regular expression comparisons:

/regexpr/: This matches when the current input line contains a sub-string matched by regexpr. It is the most basic regular expression, which matches itself as a string or sub-string. For example, /mail/ matches only when the current input line contains mail, either as a whole string or as a sub-string.
So, we will get lines with Gmail as well as Hotmail in the email ID field of the employee database as follows:

    $ awk '/mail/' emp.dat

The output on execution of this code is as follows:

    Jack Singh 9857532312 [email protected] M hr 2000
    Jane Kaur 9837432312 [email protected] F hr 1800
    Eva Chabra 8827232115 [email protected] F lgs 2100
    Ana Khanna 9856422312 [email protected] F Ops 2700
    Victor Sharma 8826567898 [email protected] M Ops 2500
    John Kapur 9911556789 [email protected] M hr 2200
    Sam khanna 8856345512 [email protected] F lgs 2300
    Emily Kaur 8826175812 [email protected] F Ops 2100
    Amy Sharma 9857536898 [email protected] F Ops 2500

In this example, we do not specify any expression to match against, so the regular expression is automatically matched against the whole line ($0). The command is therefore equivalent to the following:

    $ awk '$0 ~ /mail/' emp.dat

The output on execution of this code is as follows:

    Jack Singh 9857532312 [email protected] M hr 2000
    Jane Kaur 9837432312 [email protected] F hr 1800
    Eva Chabra 8827232115 [email protected] F lgs 2100
    Ana Khanna 9856422312 [email protected] F Ops 2700
    Victor Sharma 8826567898 [email protected] M Ops 2500
    John Kapur 9911556789 [email protected] M hr 2200
    Sam khanna 8856345512 [email protected] F lgs 2300
    Emily Kaur 8826175812 [email protected] F Ops 2100
    Amy Sharma 9857536898 [email protected] F Ops 2500

expression ~ /regexpr/: This matches if the string value of the expression contains a sub-string matched by regexpr. Generally, the left-hand operand of the matching operator is a field.
For example, in the following command, we print all the lines in which the value of the second field contains the Singh string:

    $ awk '$2 ~ /Singh/{ print }' emp.dat

We can also write the expression as follows:

    $ awk '{ if($2 ~ /Singh/) print}' emp.dat

The output on execution of the preceding code is as follows:

    Jack Singh 9857532312 [email protected] M hr 2000
    Hari Singh 8827255666 [email protected] M Ops 2350
    Ginny Singh 9857123466 [email protected] F hr 2250
    Vina Singh 8811776612 [email protected] F lgs 2300

expression !~ /regexpr/: This matches if the string value of the expression does not contain a sub-string matched by regexpr. Generally, this expression is also a field variable. For example, in the following command, we print all the lines that don’t contain the Singh sub-string in the second field:

    $ awk '$2 !~ /Singh/{ print }' emp.dat

The output on execution of the preceding code is as follows:

    Jane Kaur 9837432312 [email protected] F hr 1800
    Eva Chabra 8827232115 [email protected] F lgs 2100
    Amit Sharma 9911887766 [email protected] M lgs 2350
    Julie Kapur 8826234556 [email protected] F Ops 2500
    Ana Khanna 9856422312 [email protected] F Ops 2700
    Victor Sharma 8826567898 [email protected] M Ops 2500
    John Kapur 9911556789 [email protected] M hr 2200
    Billy Chabra 9911664321 [email protected] M lgs 1900
    Sam khanna 8856345512 [email protected] F lgs 2300
    Emily Kaur 8826175812 [email protected] F Ops 2100
    Amy Sharma 9857536898 [email protected] F Ops 2500

Any expression may be used in place of /regexpr/ in the context of ~ and !~. These matching expressions can also be used as conditions in if, while, for, and do statements.

Basic regular expression construct

Regular expressions are made up of two types of characters: normal text characters, called literals, and special characters, such as *, +, ?, and ., called metacharacters. There are times when you want to match a metacharacter as a literal character.
In such cases, we prefix that metacharacter with a backslash (\), which is called an escape sequence.

The basic regular expression construct can be summarized as follows. Here is the list of metacharacters, also known as special characters, that are used in building regular expressions:

    \    ^    $    .    [    ]    |    (    )    *    +    ?

The following table lists the remaining elements that are used in building a basic regular expression, apart from the metacharacters mentioned before:

    Literal                          A literal character (non-metacharacter), such as A, that matches itself.
    Escape sequence                  An escape sequence that matches a special symbol: for example, \t matches a tab.
    Quoted metacharacter (\)         A metacharacter prefixed with a backslash, such as \$, which matches the metacharacter literally.
    Anchor (^)                       Matches the beginning of a string.
    Anchor ($)                       Matches the end of a string.
    Dot (.)                          Matches any single character.
    Character classes [...]          A character class [ABC] matches any one of the A, B, or C characters. Character classes may include ranges as abbreviations, such as [A-Za-z], which matches any single letter.
    Complemented character classes   A complemented character class such as [^0-9] matches any character except a digit.

These operators combine regular expressions into larger ones:

    Alternation (|)                  A|B matches A or B.
    Concatenation                    AB matches A immediately followed by B.
    Closure (*)                      A* matches zero or more As.
    Positive closure (+)             A+ matches one or more As.
    Zero or one (?)                  A? matches the null string or A.
    Parentheses ()                   Used for grouping regular expressions and for back-referencing: a grouped expression (r) can later be referred to as \n, where n is a digit (in gawk, for example, in the replacement text of gensub()).

Do check out the book Learning AWK Programming to learn more about the intricacies of the AWK programming language for text processing.
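The three matching forms and several of the constructs summarized in the tables above can be tried out with a short script. The following is only a sketch under stated assumptions: the records are a made-up, simplified stand-in (name, surname, department, salary) and are not the book's emp.dat, and all variable names are hypothetical.

```shell
# Hypothetical records (name, surname, department, salary); not the book's emp.dat.
data='Jack Singh hr 2000
Jane Kaur hr 1800
Billy Chabra lgs 1900
Emily Kaur Ops 2100
Hari Singh Ops 2350'

# /regexpr/ : matched against the whole line, the same as $0 ~ /regexpr/.
whole=$(printf '%s\n' "$data" | awk '/ly/ { print $1 }')

# expression ~ /regexpr/ : matched against a single field.
singh=$(printf '%s\n' "$data" | awk '$2 ~ /Singh/ { print $1 }')

# expression !~ /regexpr/ : records whose field does not match.
not_singh=$(printf '%s\n' "$data" | awk '$2 !~ /Singh/ { print $1 }')

# Anchors: ^ ties the match to the start of the line, $ to the end.
starts_j=$(printf '%s\n' "$data" | awk '/^J/ { print $1 }')
ends_00=$(printf '%s\n' "$data" | awk '/00$/ { print $1 }')

# Grouping with alternation: the third field is exactly hr or lgs.
dept=$(printf '%s\n' "$data" | awk '$3 ~ /^(hr|lgs)$/ { print $1 }')

# Character class with positive closure: the fourth field holds only digits.
digits=$(printf '%s\n' "$data" | awk '$4 ~ /^[0-9]+$/ { n++ } END { print n }')

# Zero or one: the u is optional, so both spellings match.
optional=$(printf 'color\ncolour\n' | awk '/^colou?r$/ { n++ } END { print n }')

# Quoted metacharacter: \. matches a literal dot, while . matches any character.
literal=$(printf 'a.c\nabc\n' | awk '/a\.c/')
anychar=$(printf 'a.c\nabc\n' | awk '/a.c/ { n++ } END { print n }')

printf 'contains ly: %s\n' "$(printf '%s' "$whole" | tr '\n' ' ')"
printf 'surname Singh: %s\n' "$(printf '%s' "$singh" | tr '\n' ' ')"
printf 'surname not Singh: %s\n' "$(printf '%s' "$not_singh" | tr '\n' ' ')"
printf 'starts with J: %s\n' "$(printf '%s' "$starts_j" | tr '\n' ' ')"
printf 'ends with 00: %s\n' "$(printf '%s' "$ends_00" | tr '\n' ' ')"
printf 'hr or lgs: %s\n' "$(printf '%s' "$dept" | tr '\n' ' ')"
printf 'numeric salaries: %s\n' "$digits"
printf 'color or colour: %s\n' "$optional"
printf 'literal dot: %s, any character: %s lines\n' "$literal" "$anychar"
```

Feeding the data through printf keeps the sketch self-contained; with a real data file, the same awk programs can be run as in the commands shown earlier, with the file name as the last argument.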
