8 Data Manipulation
This guide covers essential commands for handling text data in files, including searching, sorting, and saving command outputs. These tools are invaluable for data analysis and scripting in Unix-like environments.
8.1 Basic Text Search
The grep command is a versatile tool for searching text within files:
Basic Search: Find lines containing a specific word in a file.
grep "myword" file.txt
Case Insensitive: Ignore case distinctions.
grep -i "myword" file.txt
Line Numbers: Show line numbers of matching lines.
grep -n "myword" file.txt
Inverse Search: Display lines that do not contain the specified word.
grep -v "myword" file.txt
Recursive Search: Search within a directory and all its subdirectories.
grep -r "myword" ./mydirectory/
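The options above can be combined. The following is a minimal sketch that assumes a throwaway sample file created in a temporary directory; the file names and their contents are hypothetical and exist only for illustration.

```bash
# Create a scratch directory with hypothetical sample data.
workdir=$(mktemp -d)
mkdir "$workdir/sub"
printf '%s\n' 'apple pie' 'Apple tart' 'banana bread' 'cherry apple' > "$workdir/file.txt"
printf '%s\n' 'apple crumble' > "$workdir/sub/notes.txt"

# -i and -n together: case-insensitive matches, each prefixed with its line number.
matches=$(grep -in "apple" "$workdir/file.txt")
echo "$matches"

# -i and -v together: lines that do not contain "apple" in any letter case.
others=$(grep -iv "apple" "$workdir/file.txt")
echo "$others"

# -r on a directory: every file below it is searched, and each match
# is prefixed with the path of the file it came from.
grep -r "apple" "$workdir"

# Clean up the scratch directory.
rm -r "$workdir"
```

Short options can be merged (`-in` is the same as `-i -n`), which keeps longer commands readable.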
8.2 Regular Expressions
Regular expressions enhance grep’s searching capabilities. Use the -E option for extended regex support:
.: Matches any single character.
^: Matches the start of a line.
$: Matches the end of a line.
[ ]: Matches any character within the brackets.
?: The preceding element is optional.
*: The preceding element can appear zero or more times.
+: The preceding element must appear one or more times.
|: Logical OR operator between expressions.
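To see how a few of these operators behave, here is a small sketch run against a hypothetical word list (the file name and the words are assumptions made for the demonstration).

```bash
# Hypothetical word list, one word per line.
workdir=$(mktemp -d)
printf '%s\n' 'color' 'colour' 'colouur' 'cat' 'cut' 'concat' > "$workdir/words.txt"

# ? makes the preceding "u" optional, so "color" and "colour" match.
optional=$(grep -E '^colou?r$' "$workdir/words.txt")
echo "$optional"

# + requires at least one "u", so "colour" and "colouur" match but "color" does not.
one_or_more=$(grep -E '^colou+r$' "$workdir/words.txt")
echo "$one_or_more"

# ^ anchors to the start of the line and [au] accepts either letter,
# so "cat" and "cut" match while "concat" does not.
bracket=$(grep -E '^c[au]t' "$workdir/words.txt")
echo "$bracket"

rm -r "$workdir"
```

Single quotes around the pattern keep the shell from interpreting characters such as `$`, `*`, and `|` before grep sees them.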
Example: Search for multiple words.
grep -E 'word1|word2|word3' myfile.txt
8.3 Sorting File Lines
Ascending Order: Default sorting.
sort myfile.txt
Descending Order: Reverse the sort order.
sort -r myfile.txt
Random Order: Shuffle lines.
sort -R myfile.txt
Numeric Sort: Compare lines by numeric value instead of character by character.
sort -n myfile.txt
Saving Output: Use -o to save the sorted result to a file.
sort -o sorted_file.txt myfile.txt
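The difference between the default sort and the numeric sort is easiest to see on numbers of different lengths. The sketch below assumes a hypothetical file of four numbers; `LC_ALL=C` is set on the first command only so that the character-by-character ordering is the same on every system.

```bash
# Hypothetical input: four numbers, one per line.
workdir=$(mktemp -d)
printf '%s\n' 10 9 100 2 > "$workdir/numbers.txt"

# Default sort compares text, so "100" comes before "2".
lexical=$(LC_ALL=C sort "$workdir/numbers.txt")
echo "$lexical"

# -n compares numeric values.
numeric=$(sort -n "$workdir/numbers.txt")
echo "$numeric"

# Options combine: -n with -r gives a descending numeric order.
descending=$(sort -nr "$workdir/numbers.txt")
echo "$descending"

# -o may name the input file itself, which sorts the file in place.
sort -n -o "$workdir/numbers.txt" "$workdir/numbers.txt"
in_place=$(cat "$workdir/numbers.txt")

rm -r "$workdir"
```

Writing back with `-o` is safer than `sort file > file`, because the shell truncates the target of `>` before sort has read it.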
8.4 Counting Text Elements
Basic Count: Displays line, word, and byte counts.
wc myfile.txt
Lines Only: Count the number of lines.
wc -l myfile.txt
Words Only: Count the number of words.
wc -w myfile.txt
Bytes Only: Count the number of bytes.
wc -c myfile.txt
Characters Only: Count the number of characters.
wc -m myfile.txt
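In scripts, the count is often needed as a bare number. One possible approach, sketched here with a hypothetical two-line file, is to feed the file through standard input so that wc prints no file name.

```bash
# Hypothetical two-line file.
workdir=$(mktemp -d)
printf 'one two\nthree\n' > "$workdir/myfile.txt"

# Reading from standard input (<) makes wc omit the file name from its output.
lines=$(wc -l < "$workdir/myfile.txt")
words=$(wc -w < "$workdir/myfile.txt")
bytes=$(wc -c < "$workdir/myfile.txt")

# Some wc implementations pad the number with spaces;
# arithmetic expansion $(( )) normalises it to a plain integer.
echo "lines=$((lines)) words=$((words)) bytes=$((bytes))"

rm -r "$workdir"
```

The byte count includes the newline character at the end of each line.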
8.5 Removing Duplicates with uniq
Basic Usage: Filter out adjacent duplicate lines.
uniq myfile.txt
Saving Output: Redirect the output to a new file.
uniq myfile.txt > result.txt
Count Occurrences: Prefix lines by their occurrence counts.
uniq -c myfile.txt
Show Duplicates Only: Display only the repeated lines.
uniq -d myfile.txt
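Because uniq only compares neighbouring lines, duplicates that are separated by other lines survive. A common remedy is to sort the input first. The sketch below illustrates this with a hypothetical list of fruit names.

```bash
# Hypothetical input with both adjacent and non-adjacent duplicates.
workdir=$(mktemp -d)
printf '%s\n' apple banana apple apple cherry banana > "$workdir/myfile.txt"

# uniq alone only collapses the two neighbouring "apple" lines.
adjacent_only=$(uniq "$workdir/myfile.txt")
echo "$adjacent_only"

# Sorting first places equal lines next to each other, so every duplicate is removed.
fully_unique=$(sort "$workdir/myfile.txt" | uniq)
echo "$fully_unique"

# With sorted input, -d lists each value that occurs more than once.
duplicates=$(sort "$workdir/myfile.txt" | uniq -d)
echo "$duplicates"

rm -r "$workdir"
```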
8.6 Extracting Columns with cut
For files with delimited columns, cut allows you to extract specific fields:
- Specify Delimiter: Use -d to define the column delimiter.
- Select Columns: -f selects the columns to extract.
Examples:
# Extract columns 1 to 3
cut -d ',' -f 1-3 myfile.txt
# Extract from column 3 onwards
cut -d ',' -f 3- myfile.txt
8.7 Redirection and Pipes
Standard Output to File (>): Create or overwrite a file with the command output.
grep "myword" myfile.txt > result.txt
Append to File (>>): Add the command output to the end of an existing file.
grep "myword" myfile.txt >> result.txt
Standard Error to File (2>): Redirect error messages to a file.
grep "myword" myfile.txt 2> error.log
Combine Output and Errors (2>&1): Direct both standard output and errors to the same file.
grep "myword" myfile.txt > result.txt 2>&1
Pipes (|): Use the output of one command as input to another.
grep "myword" myfile.txt | sort
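The operators above can be tried together on a small data set. This is a sketch under stated assumptions: the comma-separated file, its contents, and the deliberately missing file `missing.txt` are all hypothetical. The `|| true` suffix is there only because grep returns a non-zero exit status when a file cannot be read.

```bash
# Hypothetical comma-separated data: name,role.
workdir=$(mktemp -d)
printf '%s\n' 'alice,admin' 'bob,user' 'carol,admin' > "$workdir/myfile.txt"

# > creates or overwrites result.txt; >> then appends to it.
grep "admin" "$workdir/myfile.txt" > "$workdir/result.txt"
grep "user" "$workdir/myfile.txt" >> "$workdir/result.txt"
result=$(cat "$workdir/result.txt")
echo "$result"

# 2> captures only the error message produced for the missing file.
grep "admin" "$workdir/missing.txt" 2> "$workdir/error.log" || true
error_text=$(cat "$workdir/error.log")

# > file 2>&1 sends matches and the error message to the same file.
grep "admin" "$workdir/myfile.txt" "$workdir/missing.txt" > "$workdir/all.txt" 2>&1 || true
combined_lines=$(wc -l < "$workdir/all.txt")

# A pipeline: keep admin rows, extract the name column, sort in reverse order.
names=$(grep "admin" "$workdir/myfile.txt" | cut -d ',' -f 1 | sort -r)
echo "$names"

rm -r "$workdir"
```

The order of redirections matters: `2>&1` must come after `> file`, otherwise standard error is attached to the terminal before standard output is redirected.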
8.8 Viewing File Contents
To display the contents of a file directly in the terminal:
cat myfile.txt
This command prints the entire content of myfile.txt to the screen.
8.9 Interactive Terminal Input
For interactive input, especially useful for commands like sort, you can use the here document syntax:
sort -n << END
After executing this command, type the lines you wish to sort, one per line. Because of the -n option, they are compared as numbers. Once you are done, type END on a line by itself to mark the end of the input and perform the sort.
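The same syntax also works inside a script, where the input lines are written directly between the command and the closing marker. A minimal sketch, with three arbitrary numbers as assumed input:

```bash
# Everything between the first END and the closing END
# is passed to sort on its standard input.
sorted=$(sort -n << END
42
7
19
END
)
echo "$sorted"
```

The marker does not have to be END; any word works, as long as the closing line contains that word and nothing else.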
8.10 Conclusion
These commands form the foundation of text processing and data manipulation in Unix-like systems, enabling efficient analysis and transformation of data.