Sort and Uniq — How to Turn Noise Into Signal

From deduplication to frequency analysis, learn how to turn raw terminal output into actionable insight.

Raw output lies to you. A list of 3,000 IP addresses means nothing until you know which ones appear 500 times and which ones appear once. A credential dump is useless until you know which passwords are shared across accounts. Log files tell you nothing until events are ranked by frequency.

`sort` and `uniq` answer those questions. They do not find things and they do not extract things; they organize and count what you already have. In security work, that step is often where the actual insight lives.

This article covers both tools from scratch: every flag worth knowing and how they fit into real security pipelines.

## Two Tools, One Workflow

- `sort` takes lines of input and puts them in order: alphabetical, numerical, reverse, or by field. The ordering itself is often the goal, but `sort` also sets up `uniq` to work correctly.
- `uniq` removes or counts duplicate lines. It only works on adjacent duplicates, meaning consecutive lines that are identical. This is why `sort` almost always comes first.

They are separate tools that solve separate problems, but in practice they are almost always used together.

## Part One: sort

### What sort Does

`sort` reads lines from input and writes them back out in sorted order. The basic call:

```bash
sort filename
```

Or pass output from another command:

```bash
some_command | sort
```

By default, `sort` orders lines lexicographically, the same way a dictionary orders words. It compares character by character from left to right.

Input:

```
banana
apple
cherry
date
```

Output:

```
apple
banana
cherry
date
```

That is the default. The flags are where the real control comes in.

### The Core Flags

#### -n: Sort Numerically

Default sort is lexicographic, which means numbers sort as strings. `10` comes before `2` because `1` comes before `2` in character order.

```bash
sort numbers.txt
```

Input:

```
10
2
30
5
```

Lexicographic output:

```
10
2
30
5
```

That is wrong for numbers.
`-n` fixes it:

```bash
sort -n numbers.txt
```

Output:

```
2
5
10
30
```

**When to use it:** any time you are sorting counts, port numbers, sizes, UIDs, or anything numeric. If the values are numbers, always use `-n`.

#### -r: Reverse the Sort Order

Reverses whatever order `sort` would normally produce.

```bash
sort -r names.txt
```

Alphabetical becomes reverse alphabetical. Numeric ascending becomes descending.

```bash
sort -rn numbers.txt
```

Combines `-r` and `-n`: numeric sort, highest first. This is the pattern you will use constantly: rank by frequency, highest count at the top.

#### -u: Sort and Remove Duplicates

`-u` tells `sort` to output only unique lines: the first occurrence of each value, with duplicates discarded.

```bash
sort -u ips.txt
```

Sorts the list and removes any duplicate IP addresses in one step. This is a shortcut for `sort | uniq` when you only need deduplication and not counts.

#### -f: Case-Insensitive Sort

Treats uppercase and lowercase as equivalent when sorting.

```bash
sort -f names.txt
```

`Admin`, `admin`, and `ADMIN` sort to the same position. Without `-f`, uppercase letters sort before lowercase in plain ASCII order (the behavior of the C locale; other locales may order case differently), which can give counterintuitive results.

#### -k: Sort by a Specific Field

By default `sort` uses the entire line. `-k` tells it to sort by a specific column.

```bash
sort -k2 data.txt
```

Sorts by the second whitespace-separated field.

```bash
sort -k2 -n data.txt
```

Sorts by the second field, numerically.

The field syntax is worth knowing precisely. `-k2,2` means "start at field 2, end at field 2", so the sort uses only that field. `-k2` without an end position means "start at field 2, continue to end of line," which can produce unexpected results on lines with trailing fields. For a single-field sort, always use `-k n,n`.

#### -t: Define the Field Separator for -k

By default `sort` splits fields on whitespace. `-t` changes the separator so `-k` works on delimited data.

```bash
sort -t':' -k3,3n /etc/passwd
```

Split on `:`, sort by field 3 only (the UID), numerically. Shows you users ordered by UID from lowest to highest.
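As a small sketch of `-t` and `-k` working together, the following uses a handful of made-up passwd-style records (the usernames and UIDs are invented for illustration, not taken from a real `/etc/passwd`). It compares a numeric key against the same key compared as text, and it pins the text comparison to the C locale so the result does not depend on local settings.

```bash
# Hypothetical passwd-style records; names and UIDs are invented for illustration.
records='svc_backup:x:1002:1002::/home/svc_backup:/bin/bash
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000::/home/alice:/bin/bash
www-data:x:33:33::/var/www:/usr/sbin/nologin'

# Split on ':', sort on field 3 only (-k3,3), and compare it as a number (n).
by_uid=$(printf '%s\n' "$records" | sort -t':' -k3,3n | cut -d':' -f1,3)
printf 'numeric key:\n%s\n\n' "$by_uid"

# Same key without n: the UIDs are compared as strings, so 1000 lands before 33.
by_uid_text=$(printf '%s\n' "$records" | LC_ALL=C sort -t':' -k3,3 | cut -d':' -f1,3)
printf 'text key:\n%s\n' "$by_uid_text"
```

With the numeric key the order is root, daemon, www-data, alice, svc_backup. With the text key, www-data (UID 33) drops to the end because `3` sorts after `1`.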
#### -h: Human-Readable Sort

Sorts values that include size suffixes (`K`, `M`, `G`) correctly.

```bash
du -sh * | sort -h
```

Without `-h`, the comparison goes wrong in both directions. A plain sort puts `10M` before `9M` because `1` comes before `9`, and `sort -n` puts `2G` before `10M` because it reads only the leading number and ignores the suffix. With `-h`, `9M`, `10M`, and `2G` come out in true size order.

**When to use it:** sorting file sizes, disk usage output, anything with human-readable size units.

#### -R: Randomize Order

Orders lines by a random hash of their sort keys. The result looks shuffled, but identical lines hash to the same value and stay grouped together, so it is not a true shuffle. Use `shuf` when every line should be placed independently. Not often needed in analysis, but useful for sampling a large dataset or randomizing a wordlist.

```bash
sort -R wordlist.txt
```

### sort in Security Workflows

#### Sort a List of IPs Numerically

```bash
sort -t'.' -k1,1n -k2,2n -k3,3n -k4,4n ips.txt
```

Sorts IP addresses numerically by each octet. Each field is separated by `.` and sorted as a number. The result is a properly ordered IP list, not the lexicographic mess you get from a plain sort.

#### Sort Nmap Output by Port Number

```bash
grep "open" nmap.txt | cut -d'/' -f1 | sort -n
```

`grep` filters open ports. `cut` extracts the port number. `sort -n` orders them numerically, lowest to highest.

#### Sort Files by Size

```bash
ls -lh | sort -k5,5h
```

List files with human-readable sizes, then sort by the size column (field 5 only) using human-readable sort.

#### Order Discovered Paths Alphabetically

```bash
grep "Status: 200" gobuster.txt | cut -d' ' -f1 | sort
```

Clean alphabetical list of discovered paths. Easier to read and identify patterns than unsorted output.

## Part Two: uniq

### What uniq Does

`uniq` filters adjacent duplicate lines from input. It compares each line to the one immediately before it. If they are identical, the duplicate is removed or counted depending on the flags you use.

The critical point: `uniq` only acts on **adjacent duplicates**. Lines must be consecutive to be compared. This is why `sort` almost always comes first: it groups identical lines together so `uniq` can process them reliably.

```bash
sort data.txt | uniq
```

This is the standard pattern. `sort` groups duplicates together, `uniq` removes them.
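The adjacency rule is easiest to see side by side. This is a minimal sketch on made-up sample data (the fruit names are only placeholders), running `uniq` once on unsorted input and once after `sort`:

```bash
# Made-up sample data: "apple" appears twice, but not on adjacent lines.
fruits='apple
banana
apple
cherry
cherry'

# uniq alone only collapses the adjacent "cherry" pair; both "apple" lines survive.
unsorted_result=$(printf '%s\n' "$fruits" | uniq)

# Sorting first makes the duplicates adjacent, so uniq can collapse all of them.
sorted_result=$(printf '%s\n' "$fruits" | sort | uniq)

printf 'uniq alone:\n%s\n\nsort | uniq:\n%s\n' "$unsorted_result" "$sorted_result"
```

The first result still contains `apple` twice (four lines in total), while the second contains each fruit exactly once.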
### The Core Flags

#### No Flag: Remove Duplicates

```bash
uniq filename
```

Removes consecutive duplicate lines. Each unique line appears once.

Input (already sorted):

```
apple
apple
banana
cherry
cherry
cherry
```

Output:

```
apple
banana
cherry
```

#### -c: Count Occurrences

This is the flag you will use most. `-c` prepends a count to each line: how many times that line appeared in the input.

```bash
sort data.txt | uniq -c
```

Output:

```
2 apple
1 banana
3 cherry
```

The count is on the left, the value on the right. Combined with `sort -rn`, this gives you a frequency-ranked list, one of the most useful patterns in log analysis and output parsing.

```bash
sort data.txt | uniq -c | sort -rn
```

Output:

```
3 cherry
2 apple
1 banana
```

Highest count first. This three-command pipeline appears constantly in security work.

#### -d: Show Only Duplicate Lines

Prints only lines that appear more than once, one copy of each duplicate.

```bash
sort usernames.txt | uniq -d
```

Shows which usernames appear multiple times in the list. Useful for finding repeated entries in a data dump or identifying reused credentials.

#### -u: Show Only Unique Lines

The opposite of `-d`. Prints only lines that appear exactly once: entries with no duplicates anywhere in the sorted input.

```bash
sort usernames.txt | uniq -u
```

#### -i: Case-Insensitive Comparison

Treats lines as duplicates even if they differ only in case.

```bash
sort -f usernames.txt | uniq -i
```

`Admin`, `admin`, and `ADMIN` are treated as the same line. One copy survives.

#### -f n: Skip First n Fields

Ignores the first n whitespace-separated fields when comparing lines for duplicates.

```bash
uniq -f 1 data.txt
```

Compares lines starting from field 2 onward. Useful when lines have a timestamp or sequence number in field 1 that you want to ignore during deduplication.

#### -s n: Skip First n Characters

Ignores the first n characters of each line when comparing.

```bash
uniq -s 8 logfile.txt
```

Skips the first 8 characters (often a timestamp prefix) and compares the rest of each line.
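A short sketch of the two skip flags, using hypothetical log lines whose first field is an 8-character `HH:MM:SS` timestamp (the log format and messages are assumptions made up for this example):

```bash
# Hypothetical log lines: field 1 is a timestamp, the rest is the message.
log='10:00:01 login failed for admin
10:00:02 login failed for admin
10:00:03 login ok for alice
10:00:04 login failed for admin'

# Plain uniq changes nothing here, because every timestamp differs.
plain=$(printf '%s\n' "$log" | uniq)

# -f 1 ignores field 1 when comparing, so the first two lines collapse into one.
# The last line repeats the message but is not adjacent, so it is kept.
skip_field=$(printf '%s\n' "$log" | uniq -f 1)

# -s 8 skips the first 8 characters (the timestamp) and groups the same way here.
skip_chars=$(printf '%s\n' "$log" | uniq -s 8)

printf '%s\n' "$skip_field"
```

Both flags keep the first line of each run, so the surviving line carries the earliest timestamp of its group. The adjacency rule still applies: the repeated message at `10:00:04` is kept because a different line sits between it and the earlier run.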
### uniq in Security Workflows

#### Count and Rank IP Addresses in a Log

```bash
grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" access.log | sort | uniq -c | sort -rn
```

`grep` extracts every IP. `sort` groups identical IPs together. `uniq -c` counts each one. `sort -rn` ranks highest first. The output shows you exactly which IPs hit the server most: your top talkers, potential scanners, or brute-force sources at a glance.

#### Find Duplicate Usernames in a Dump

```bash
sort usernames.txt | uniq -d
```

Any username appearing more than once surfaces immediately.

#### Count Unique vs Total

```bash
sort ips.txt | uniq | wc -l
```

How many unique IPs are in this list? `wc -l` counts the lines after deduplication. Compare the result with `wc -l ips.txt` to see the total.

#### Frequency Analysis on HTTP Status Codes

```bash
cut -d' ' -f9 access.log | sort | uniq -c | sort -rn
```

In the Combined Log Format used by Apache and Nginx, the HTTP status code sits at field 9 when the line is space-delimited. Log formats vary, so verify your field position against a sample line before relying on the number. `sort` groups the codes. `uniq -c` counts each. `sort -rn` ranks by frequency.

Output:

```
8423 200
1204 404
341 403
89 500
12 301
```

The traffic distribution, the error rate, and whether something is hammering your 403s or 500s are all visible in seconds.

#### Find Unique Ports Across Multiple Scans

```bash
cat scan1.txt scan2.txt scan3.txt | grep "open" | cut -d'/' -f1 | sort -n | uniq
```

Combine output from multiple nmap runs, extract open ports, sort numerically, deduplicate. One clean list of every unique open port found across all scans.

#### Detect Password Reuse in a Credential Dump

```bash
cut -d':' -f2 creds.txt | sort | uniq -d
```

`cut` extracts the password field, assuming one `user:password` pair per line. `sort` groups identical passwords. `uniq -d` shows only duplicates: passwords used by more than one account. If passwords may contain `:`, use `-f2-` so the rest of the line is kept.

#### Rank User-Agent Strings from Web Logs

```bash
cut -d'"' -f6 access.log | sort | uniq -c | sort -rn | head -20
```

In Combined Log Format, splitting on double-quotes puts the User-Agent string at field 6.
This assumes the format has not been customized, so check a sample line if results look wrong. `sort` + `uniq -c` counts each distinct User-Agent. `sort -rn` ranks by frequency. `head -20` shows the top 20. Rare or unusual User-Agent strings near the bottom of the full ranking often indicate scanners, custom tooling, or automated clients worth investigating.

## Part Three: sort and uniq Together

The flags make sense individually. The power comes from chaining them.

### The Core Pipeline

```bash
sort | uniq -c | sort -rn
```

This three-step pipeline is the foundation of frequency analysis in the terminal. You will use it constantly.

- `sort`: group identical lines together
- `uniq -c`: count each group
- `sort -rn`: rank by count, highest first

Everything else is just feeding different data into this pipeline.

### Top Attacking IPs

```bash
grep "Failed password" /var/log/auth.log | grep -oP "(?<=from )\S+" | sort | uniq -c | sort -rn | head -10
```

Step by step:

- `grep` finds failed SSH login lines
- `grep -oP` extracts only the source IP using a lookbehind
- `sort` groups identical IPs together
- `uniq -c` counts each IP
- `sort -rn` ranks highest first
- `head -10` shows only the top 10

One pipeline. Immediate visibility into your top brute-force sources.

### Deduplicate a Wordlist

```bash
sort wordlist.txt | uniq > clean_wordlist.txt
```

Sorts the list, removes duplicates, and writes to a new file. Smaller and cleaner for the next tool.

### Compare Two Lists: What Is in One but Not the Other

```bash
sort list1.txt list2.txt | uniq -u
```

This works correctly only when each value appears exactly once in each file. When you combine both files and sort them, a value present in both lists appears twice, so `uniq -u` filters it out. A value present in only one list appears once, so `uniq -u` keeps it. The output is the symmetric difference: values found in either list but not both, without any indication of which list each one came from.

The caveat: if a value appears more than once within a single file, the count changes and the result is unreliable. For a clean symmetric difference on well-formed lists, this pattern works.
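To make the caveat concrete, here is a small sketch with two made-up host lists (the addresses are placeholders). It shows how an internal duplicate silently hides a value, and how deduplicating each list first restores the one-copy-per-list assumption. That pre-deduplication step is one possible workaround added for illustration, not something the `uniq -u` pattern does for you.

```bash
# Two made-up host lists; list_a contains 10.0.0.5 twice.
list_a='10.0.0.1
10.0.0.5
10.0.0.5'
list_b='10.0.0.1
10.0.0.9'

# Combined, 10.0.0.1 appears twice (once per list) and is dropped as expected.
# 10.0.0.5 also appears twice, but only because list_a repeats it, so it is
# dropped too, even though it is missing from list_b.
naive=$(printf '%s\n%s\n' "$list_a" "$list_b" | sort | uniq -u)

# Deduplicating each list before combining keeps 10.0.0.5 in the result.
fixed=$(
  {
    printf '%s\n' "$list_a" | sort -u
    printf '%s\n' "$list_b" | sort -u
  } | sort | uniq -u
)

printf 'naive:\n%s\n\ndeduplicated first:\n%s\n' "$naive" "$fixed"
```

The naive run reports only `10.0.0.9`. The deduplicated run reports both `10.0.0.5` and `10.0.0.9`.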
For anything with internal duplicates, or when you need to know which list a value belongs to, use `comm` instead. It is purpose-built for comparing sorted files.

```bash
comm -23 <(sort -u list1.txt) <(sort -u list2.txt)
```

`comm -23` prints only lines unique to the first file. `-13` gives lines unique to the second. `-12` gives lines in both. Sorting with `-u` removes internal duplicates before `comm` compares the files. The `<(...)` process substitution requires bash or a similar shell.

### Count Unique Values in a Specific Field

```bash
cut -d':' -f1 /etc/passwd | sort | uniq -c | sort -rn
```

Extracts usernames and counts each one. In `/etc/passwd` every username should appear once. If any show up with a count greater than 1, something is wrong.

### Rank Error Types in Application Logs

```bash
grep "ERROR" app.log | cut -d' ' -f5- | sort | uniq -c | sort -rn | head -15
```

`grep` filters to error lines. `cut` extracts everything from field 5 onward. Adjust the field number to match your log format, since application log structures vary. `sort` + `uniq -c` counts each distinct error message. `sort -rn` ranks by frequency. `head -15` shows the most common ones. Immediately separates systemic errors from one-off events.

## Quick Reference

### sort

| Flag | What It Does |
|----|----|
| `-n` | Sort numerically |
| `-r` | Reverse sort order |
| `-u` | Sort and remove duplicates |
| `-f` | Case-insensitive sort |
| `-k n` | Sort from field n to end of line |
| `-k n,n` | Sort by field n only |
| `-t 'x'` | Use x as field separator for `-k` |
| `-h` | Sort by human-readable size (K, M, G) |
| `-R` | Randomize order (identical lines stay grouped) |
| `-rn` | Numeric sort, highest first (common combo) |

### uniq

| Flag | What It Does |
|----|----|
| (no flag) | Remove consecutive duplicate lines |
| `-c` | Prefix each line with its occurrence count |
| `-d` | Show only lines that appear more than once |
| `-u` | Show only lines that appear exactly once |
| `-i` | Case-insensitive comparison |
| `-f n` | Skip first n fields when comparing |
| `-s n` | Skip first n characters when comparing |

### The Core Pipeline

```bash
sort input | uniq -c | sort -rn
```

Group → Count → Rank. Use this for any frequency analysis task.

## Closing

`sort` and `uniq` do not find vulnerabilities. They do not generate payloads.
What they do is take the raw output of every other tool you run and make it readable and actionable.

Frequency analysis is one of the most underrated skills in security work. Knowing which IP sent 3,000 requests in ten minutes, which password is shared across 40 accounts, which error type fires hundreds of times per hour: that is the signal buried in the noise. One pipeline pulls it out.

```bash
sort | uniq -c | sort -rn
```

That is the whole idea.

View original source — Hacker Noon ↗
