Bash Workshop

Servers and Workstations

"Normal" Computers

Everything you're usually working (or playing) with...

  • Desktop PCs
  • Notebooks
  • Assorted mobile devices

Wide range of capabilities

Typically 2-6 cores, 1-16G RAM, single disk, "cheaper" components

Workstations

  • Usually a desktop machine
  • High performance
  • Often designed for a special task (Bioinformatics, CAD, Video...)
  • Designed for 24/7 uptime

Typically 6-24 cores, 16-512G RAM, redundant storage...

Servers

  • Usually installed in a rack
  • Usually higher performance
  • Often designed for a special task (Computations, storage, web...)
  • Designed for 24/7 uptime
  • Accessed remotely

Typically 2-1000s of cores, 8-1000s G of RAM, (multi)-redundant storage up to PByte range

The Course Server

  • "bioinf1"
  • 2x10 cores = 40 threads
  • 200G RAM
  • 8T local storage
  • ca. 50T attached storage
  • Ubuntu Linux "Xenial Xerus"

Nomenclature confusion

The term "Server" can also refer to software!

"Webserver", e.g., can mean the physical machine housing the software that delivers web sites.

...or explicitly the software that does.

Unix and Linux

1970

Unix / Minix

  • Bell Labs, USA
  • Professional i.e. commercial operating systems (OS).
  • Intended for huge Mainframe computers, then.
  • Used, sold and improved by several companies later on (IBM, SGI, HP, Sun...).
  • Free version(s) from UC Berkeley (FreeBSD...)
  • FreeBSD was also a basis for Apple's OS X.

What's an OS, actually...?

"Operating system".

The software that is used to interact with computer hardware, excluding any applications.

E.g. Microsoft Windows, FreeBSD, Linux, Unix...

1980s

GNU/HURD

  • by Richard Stallman et al.
  • Free Software.
  • "GNU's Not Unix"
  • The OS ("Hurd") never made it to usability.
  • But many other GNU parts did.

The "GNU General Public License"

GPL: "Do what you want with it, but include the source code."

  • You can take GPL software, change it, distribute it, include it, sell it, base other software on it.
  • But anything that comes from it has to stay under the GPL.
  • You have to give away the GPL license text and any source code with your software in any case.
  • Widely adopted by the Linux community.

1991

Linux

  • Linus Torvalds begins work on a Unix-like OS as part of a study project.
  • Many volunteers contribute and continue to do so.
  • The Linux kernel evolves.
  • It's bundled with GNU and...
  • distributed under the GPL.

Linux

Why bother?

  • Availability of scientific software.
  • Easy access to information and documentation.
  • Extreme customizability is a philosophy with Linux. "It's all about choice."
  • Efficient working, easy adaptation.
  • No license costs.
  • More fun. ;-)

Linux

Why hate having to bother?

  • You still have to be much more into computers to get along well with Linux.
  • Constant development can mean bugs find their way into the open.
  • Some things will need some tweaking.
  • Newest hardware sometimes not yet supported.
  • Some "special software" isn't available...

Linux

Flavours

Linux distributions are created by people who take the Linux kernel and other necessary software sources and compile something from them that (more or less) works as a whole.

  • Debian
  • Ubuntu
  • Mint
  • Raspbian
  • Knoppix
  • RedHat
  • Scientific Linux
  • SuSE/Novell
  • CentOS
  • Fedora

Android...

Remote Access

Why?

  • Servers usually do not have a keyboard and screen attached.
  • Many users can have access at the same time.
  • Server rooms are noisy and not a good place to work.
  • You can work from anywhere.

Remote Access

Terminal access

To just be able to type commands on a server.

Desktop access

Working on a server with mouse, windows, graphical user interface.

The Address

We will work on:

bioinf1.cos.uni-heidelberg.de

SSH

Terminal access

On Windows, use PuTTY to log in to a server terminal.
Mac/Linux use their standard terminal applications.

Set up a session and optionally save it for later use.

Log in as kurs1(2,3-9)

When typing the password, nothing will be displayed, of course.

X2Go

Desktop access

Use the X2Go-Client (aka "die fette Robbe", German for "the fat seal") to log in graphically to "bioinf1".

Session configuration

Start a defined session.

A remote desktop.
(XFCE-Environment)

Shells

What's a shell?

User + apps
Shell
Kernel
Hardware

It separates you from the innards of the computer.

It accepts your commands and delivers their results.

Shells

(=command interpreters)

  • C-Shell (1978)
  • Bourne-Shell (1979)
  • Korn-Shell (1983)
  • Windows(-Explorer) (1985)
  • Bourne-again Shell (1989) = Bash

Shells

We'll use Bash.

It's the standard on Linux systems.

Bash

The prompt:


							kurs9@bioinf1:~$ _
						

You are user kurs9 on machine bioinf1, working in your home directory (~), with regular user rights ($).


							root@erysimum:/usr/bin# _
						

You are user root on machine erysimum, working in directory /usr/bin, with administrative rights (#).

The absolute dictator

The user "root" is the administrative user on a Linux system.

There's literally no restriction to what root may do on (to) the system.

Which can be quite dangerous, actually.

Never work as root permanently. Instead, prefix a single command with sudo if necessary and allowed.

Little helpers

History


						kurs9@bioinf1:~$ history
						    1  touch test
						    2  ls -lah .. 
						    3  echo "Juhu"
						    4  echo $PATH
						    5  history
						

Use the Arrow up/down keys to get back to command(line)s you already typed.

Use the history command to get a list.

Little helpers

"Tab completion"


						kurs9@bioinf1:~$ touch curry-sausage-with-ketchup
						kurs9@bioinf1:~$ touch cucumbers-are-green
						kurs9@bioinf1:~$ rm cu[TAB]
						cucumbers-are-green         curry-sausage-with-ketchup  
						kurs9@bioinf1:~$ rm cuc[TAB]umbers-are-green 
						

Use the Tab key to make Bash try to complete what you started typing.

Use this feature to avoid tedious typing and annoying typos.

touch creates an empty file.

rm deletes a file.

Little helpers

Cancel!

Use the key combination Ctrl-C (Strg-C on a German keyboard) to cancel things, e.g. if something keeps running indefinitely.

Working with the mouse

Don't.

Well, OK, there's copy/paste from the context menu.

Commands

Shell commands

Built-ins or... external programs... are often hard to distinguish.

Good that it doesn't matter for most practical purposes. :)
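If you are curious anyway: a minimal sketch of how to ask the shell what it would run for a given name. It uses the standard command -v, which is not introduced on the slides; the variable names are made up for illustration.

```shell
# command -v prints how the shell would resolve a name.
cd_location=$(command -v cd)   # a built-in: only the bare name comes back
ls_location=$(command -v ls)   # an external program: its full path comes back
                               # (in an interactive shell, ls may be an alias instead)
echo "cd -> $cd_location"
echo "ls -> $ls_location"
```

The exact path printed for ls depends on the system.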

Anatomy of a command


							cp -r --interactive sourcepath targetpath
						
  • cp - the command's name
  • -r - a (short) option
  • --interactive - a (long) option
  • sourcepath - a (positional) argument
  • targetpath - another argument
  • -o outfile - an option with a parameter (not part of the line above; offered by other commands)

...this is the "copy" command b.t.w.

Anatomy of a command

The line consists of space-separated "arguments"

The first one is the command itself.

"options" are arguments with a dash, often (optional) modifiers.

"options" can sometimes have "parameters".

"positional arguments" are typically mandatory and have to be specified in a certain order.
(Like "source" followed by "target".)
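A small sketch of an option that takes a parameter, using sort as the example command. The file names are hypothetical, and everything happens in a scratch directory so no real files are touched.

```shell
workdir=$(mktemp -d)          # scratch directory for the experiment
cd "$workdir"
printf 'pear\napple\nfig\n' > fruits.txt

# sort        : the command's name
# -r          : a short option (reverse order)
# -o FILE     : an option with a parameter (write the result to FILE)
# fruits.txt  : a positional argument (the input file)
sort -r -o sorted.txt fruits.txt

first_line=$(head -n 1 sorted.txt)   # -n 1 is again an option with a parameter
echo "$first_line"                   # prints: pear
```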

Comments

Everything in a line after "#" is ignored by Bash.


							# Hi. I'm a comment. 
							# I'm used for code documentation, mostly. 
						

Getting help

man

The manual reader.


							man man # yes, looks silly. Try it. 
						

Leave the help reader by pressing "q" (quit).

-h or --help

Many commands give a short help message when called with the "help" option.


							cp --help 
						

File system

Everything is a file

Linux file systems represent almost everything as files.

  • Special kinds of files:
    • Directories
    • Devices
    • System parts
    • Processes

Everything is a file

Mounting drives

A drive (hard disk, USB stick, CD...) has to be "mounted" into an empty directory
(a "mountpoint").

The driver in charge of that hardware then displays the content of the drive in that folder.

The file system tree


						kurs9@bioinf1:~$ tree -L 1 -d /
						/               # the root of the tree 
						├── bin         # executable programs
						├── boot        # the kernel et al.
						├── dev         # device representations
						├── etc         # configuration files
						├── home        # where the users live 
						├── home2
						├── lib         # program libraries
						├── lost+found  # orphaned files and stuff
						├── media       # where usb is mounted
						├── mnt         # where harddrives are mounted 
						├── opt         # optional software 
						├── proc        # process representations
						├── root        # the admin's home
						├── srv         # data to serve to others
						├── sys         # system "files" et al. 
						├── tmp         # temporary data
						├── usr         # "unix system resources"
						└── var         # variable data
						

Sightseeing

Try the following:


						kurs9@bioinf1:~$ ls /mnt
						

raw, work and lehre are data volumes shared on several servers.


						kurs9@bioinf1:~$ ls /mnt/lehre
						

Here's your sequencing data (for later).

Just for fun:


						kurs9@bioinf1:~$ ls /usr/bin
						

Paths

"Paths" are a way to pinpoint an object in the file system tree.

You can use a path as an argument to a command, e.g. if you want to do something with a file or folder.

Paths

Imagine the following situation:


						/
						├── bin
						├── boot
						├── etc
						└── home
						    ├── kurs1
						    ├── kurs2
						    ├── kurs3
						    ├── kurs4
						    ├── kurs5
						    │   ├── Bilder
						    │   ├── Dokumente
						    │   ├── Downloads
						    │   └── Musik
						    ├── kurs6
						    └── kurs9
						

The "address" of the folder Musik would be:

/home/kurs5/Musik

Paths

Absolute

Absolute paths are unambiguous for any object in the tree and from any position in the tree.

They always start with a /, the file system root.

Paths

Relative

Relative paths can be ambiguous.

They always start at the current position and lead to an object relative to it.

They never start with a /.

Paths

Examples


								/
								├── bin
								├── boot
								├── etc
								└── home
								    ├── kurs1
								    ├── kurs2
								    ├── kurs3
								    ├── kurs4
								    ├── kurs5
								    │   ├── Bilder
								    │   │    ├── JPG
								    │   │    └── SVG
								    │   ├── Dokumente
								    │   ├── Downloads
								    │   └── Musik
								    ├── kurs6
								    └── kurs9
								

To reach directory JPG:

  • "JPG"?
    Yes, from Bilder. No, from kurs5.
  • "Bilder/JPG"?
    Yes, from kurs5. No, from Bilder.
    No, from SVG.
  • "/home/kurs5/Bilder/JPG"?
    Yes, from anywhere.
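The three cases can be tried out safely in a scratch folder. This is only a sketch: the temporary directory (variable sandbox, a made-up name) stands in for "/" of the example tree.

```shell
sandbox=$(mktemp -d)                       # stand-in for "/" in the example
mkdir -p "$sandbox/home/kurs5/Bilder/JPG" "$sandbox/home/kurs5/Bilder/SVG"

cd "$sandbox/home/kurs5"
cd Bilder/JPG                              # relative path: works from kurs5
from_kurs5=$(basename "$PWD")

cd "$sandbox/home/kurs5/Bilder/SVG"
if cd Bilder/JPG 2>/dev/null; then         # same relative path, wrong starting point
    from_svg="found"
else
    from_svg="not found"
fi

cd "$sandbox/home/kurs5/Bilder/JPG"        # full path: works from anywhere
from_anywhere=$(basename "$PWD")

echo "from kurs5: $from_kurs5, from SVG: $from_svg, full path: $from_anywhere"
```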

Directories

Directories

A.k.a. "Folders"

A special kind of file system object that can "hold" or contain other objects.

The primary means in any office or file system to remain organized (and keep one's sanity).

Related commands

ls

"list folder's content"


						kurs9@bioinf1:~$ ls
						Bilder    curry-sausage-with-ketchup  Downloads  Öffentlich    Videos
						biodivc2  Dokumente                   Musik      Schreibtisch  Vorlagen
						

						kurs9@bioinf1:~$ ls -lh  # -lh = long, human-readable
						total 32K
						drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Bilder
						-rw-r--r-- 1 kurs9 kurs    0 Feb 21 16:10 curry-sausage-with-ketchup
						drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Dokumente
						drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Downloads
						drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Musik
						drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Öffentlich
						drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Schreibtisch
						drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Videos
						drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Vorlagen
						

ls -l output explanation


							drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Dokumente
						
  • d - it's a directory
  • drwx - owner may read, write, execute(=enter)
  • drwxr-x - owning group may read, enter but not write
  • drwxr-xr-x - all others, the same
  • 2 - number of hard links to the object
  • kurs9 - "kurs9" owns the folder
  • kurs9 kurs - group "kurs" owns the folder
  • 4,0K - It "takes up" 4 KByte
  • Feb 21 09:07 - last modification time
  • Dokumente - finally the name

Related commands

pwd

"Where am I? (print working dir)"


						kurs9@bioinf1:~$ ll # an alias for ls -lh 
						total 33k
						drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Bilder
						-rw-r--r-- 1 kurs9 kurs    0 Feb 21 16:10 curry-sausage-with-ketchup
						drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Dokumente
						drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Downloads
						drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Musik
						drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Öffentlich
						drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Schreibtisch
						drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Videos
						drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Vorlagen
						kurs9@bioinf1:~$ pwd
						/home/kurs9 # looks right... 
						

Related commands

cd

"Go into a different folder (change dir)."


						kurs9@bioinf1:~$ pwd
						/home/kurs9
						kurs9@bioinf1:~$ cd Dokumente/
						kurs9@bioinf1:~/Dokumente$ pwd 
						/home/kurs9/Dokumente
						kurs9@bioinf1:~/Dokumente$ cd # always to your home folder
						kurs9@bioinf1:~$ pwd
						/home/kurs9
						

Case sensitivity

Unix/Linux systems are case-sensitive!
(Windows isn't.)


						kurs9@bioinf1:~$ pwd
						/home/kurs9
						kurs9@bioinf1:~$ PWD
						PWD: command not found
						

						kurs9@bioinf1:~$ Pwd
						No command 'Pwd' found, did you mean:
						 Command 'xwd' from package 'x11-apps' (main)
						 Command 'gwd' from package 'geneweb' (universe)
						 Command 'pwd' from package 'coreutils' (main)
						Pwd: command not found
						

Related commands

mkdir

"Make a folder"


							kurs9@bioinf1:~$ cd Dokumente
							kurs9@bioinf1:~/Dokumente$ pwd
							/home/kurs9/Dokumente
							kurs9@bioinf1:~/Dokumente$ mkdir ordner1
							kurs9@bioinf1:~/Dokumente$ mkdir Ordner1/ordner2 # Oops! 
							mkdir: cannot create directory ‘Ordner1/ordner2’: No such file or directory
							kurs9@bioinf1:~/Dokumente$ mkdir ordner1/ordner2 
							kurs9@bioinf1:~/Dokumente$ ls
							ordner1
							kurs9@bioinf1:~/Dokumente$ tree 
							.
							└── ordner1
							    └── ordner2

							2 directories, 0 files
						

Use the option "-p" to create all necessary subfolders and suppress errors on existing folders.
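A minimal sketch of "-p" in action, run in a scratch directory (the folder names just continue the example):

```shell
workdir=$(mktemp -d)               # scratch directory for the experiment
cd "$workdir"
mkdir -p ordner1/ordner2/ordner3   # creates all three levels in one go
mkdir -p ordner1/ordner2           # already exists: no error, nothing changes
if [ -d ordner1/ordner2/ordner3 ]; then
    result="nested folders exist"
else
    result="something went wrong"
fi
echo "$result"
```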

Related commands

rmdir

"Remove an empty folder"


							kurs9@bioinf1:~$ cd 
							kurs9@bioinf1:~$ ls
							Bilder  Dokumente  Downloads  Musik  Öffentlich  Schreibtisch  Videos  Vorlagen
							# We're here for work, not fun! ;) 
							kurs9@bioinf1:~$ rmdir Musik
							kurs9@bioinf1:~$ ls
							Bilder  Dokumente  Downloads  Öffentlich  Schreibtisch  Videos  Vorlagen
						

Related commands

rmdir

"Remove an EMPTY folder"


							kurs9@bioinf1:~$ cd ~/Dokumente/
							kurs9@bioinf1:~/Dokumente$ ls
							ordner1
							kurs9@bioinf1:~/Dokumente$ tree 
							.
							└── ordner1
							    └── ordner2

							2 directories, 0 files
							kurs9@bioinf1:~/Dokumente$ rmdir ordner1
							rmdir: failed to remove 'ordner1': Directory not empty
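One way out, sketched in a scratch directory: remove the folders from the inside out, so that each one is empty when its turn comes.

```shell
workdir=$(mktemp -d)       # scratch directory for the experiment
cd "$workdir"
mkdir -p ordner1/ordner2
rmdir ordner1/ordner2      # innermost folder first: it is empty
rmdir ordner1              # now ordner1 is empty, too
# Shortcut with the same effect: rmdir -p ordner1/ordner2
remaining=$(ls | wc -l)    # number of entries left in the scratch directory
echo "entries left: $remaining"
```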
						

Related terms

..

"the parent folder" (one level up)


							kurs9@bioinf1:~$ pwd
							/home/kurs9
							kurs9@bioinf1:~$ cd ..
							kurs9@bioinf1:/home$ pwd
							/home
						

".." can be a perfectly valid part of a path.

Related terms

.

"the current folder (exactly here)"


							kurs9@bioinf1:~$ pwd
							/home/kurs9
							kurs9@bioinf1:~$ cd .
							kurs9@bioinf1:~$ pwd
							/home/kurs9 # nothing much happened...
						

"." can also be a perfectly valid part of a path.

What is "." good for?!

To find a program or command to run, Bash looks into defined places only.


							kurs9@bioinf1:~$ whereis ls
							ls: /bin/ls /usr/share/man/man1/ls.1.gz
							# /bin is one of those places.
						

To explicitly run an executable file stored in your current folder, you'd say:
./myprogram

Related terms

~

"My ~ is my castle!"


							kurs9@bioinf1:~$ cd .. 
							kurs9@bioinf1:/home$ pwd
							/home
							kurs9@bioinf1:/home$ cd ~/Dokumente/
							kurs9@bioinf1:~/Dokumente$ pwd
							/home/kurs9/Dokumente
						

"~" is just an abbreviation for your home directory, here /home/kurs9.

Text files

What is text?

Just plain, readable, printable characters, optionally organized in lines.

Letters, figures, punctuation, line breaks, tabs.

No fancy formatting.

ASCII and Unicode

ASCII

"American Standard Code for Information Interchange" (1963)


						 man ascii
					          2 3 4 5 6 7       30 40 50 60 70 80 90 100 110 120
					        -------------      ---------------------------------
					       0:   0 @ P ` p     0:    (  2  <  F  P  Z  d   n   x
					       1: ! 1 A Q a q     1:    )  3  =  G  Q  [  e   o   y
					       2: " 2 B R b r     2:    *  4  >  H  R  \  f   p   z
					       3: # 3 C S c s     3: !  +  5  ?  I  S  ]  g   q   {
					       4: $ 4 D T d t     4: "  ,  6  @  J  T  ^  h   r   |
					       5: % 5 E U e u     5: #  -  7  A  K  U  _  i   s   }
					       6: & 6 F V f v     6: $  .  8  B  L  V  `  j   t   ~
					       7: ' 7 G W g w     7: %  /  9  C  M  W  a  k   u  DEL
					       8: ( 8 H X h x     8: &  0  :  D  N  X  b  l   v
					       9: ) 9 I Y i y     9: '  1  ;  E  O  Y  c  m   w
					       A: * : J Z j z
					       B: + ; K [ k {
					       C: , < L \ l |
					       D: - = M ] m }
					       E: . > N ^ n ~
					       F: / ? O _ o DEL
						

One Byte - one character.

ASCII and Unicode

Unicode

Character coding for international languages.

From Chinese characters to Emojis. You name it.

One to several Bytes per character.
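A quick way to see the difference, as a sketch: count the bytes of single characters with wc -c. This assumes the text is encoded as UTF-8, the usual Unicode encoding on Linux; the variable names are made up.

```shell
# printf writes exactly the bytes of its argument; wc -c counts bytes.
ascii_bytes=$(printf 'a' | wc -c)    # plain ASCII letter: 1 byte
umlaut_bytes=$(printf 'ä' | wc -c)   # German umlaut: 2 bytes in UTF-8
euro_bytes=$(printf '€' | wc -c)     # Euro sign: 3 bytes in UTF-8
echo "a: $ascii_bytes, ä: $umlaut_bytes, €: $euro_bytes"
```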

A remark about file names

Using umlauts, other non-ASCII characters or even just spaces in file names is an invitation to trouble.

You can do it. But it's guaranteed to complicate things.


							kurs9@bioinf1:~/Dokumente$ touch some name
							# a space is what separates arguments... 
							kurs9@bioinf1:~/Dokumente$ ll
							total 4,1k
							-rw-r--r-- 1 kurs9 kurs    0 Feb 22 20:16 name
							-rw-r--r-- 1 kurs9 kurs    0 Feb 22 20:16 some
							# two files, not one called "some name". 
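If a name with a space is unavoidable, quoting or escaping keeps it together. A minimal sketch in a scratch directory:

```shell
workdir=$(mktemp -d)       # scratch directory for the experiment
cd "$workdir"
touch "some name"          # quotes keep the two words together: ONE file
touch other\ name          # a backslash before the space works as well
file_count=$(ls | wc -l)   # 2 files, not 4
echo "files created: $file_count"
rm "some name" other\ name # the same care is needed for every later command
```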
						

Examples for text files

  • Bioinformatics data files
  • Bash (and other) scripts
  • Simple text notes
  • Comma-separated value files (tabular data)

What isn't text...?

Anything that contains unprintable characters.

  • Images, Multimedia...
  • Compiled programs
  • Compressed archives
  • BAM alignment files
  • Word DOCs, Spreadsheets, databases...

Never open your bioinformatics files in MS Word!

Related commands

touch

Creates an empty file or changes the timestamp of an existing one.


							kurs9@bioinf1:~/Dokumente$ touch some_file
							kurs9@bioinf1:~/Dokumente$ ls
							some_file
							kurs9@bioinf1:~/Dokumente$ ll 
							total 0
							-rw-r--r-- 1 kurs9 kurs 0 Feb 22 20:28 some_file

						

Related commands

rm

Remove (delete) files


							kurs9@bioinf1:~/Dokumente$ ll
							total 4,1k
							drwxr-xr-x 2 kurs9 kurs 4,1k Feb 22 20:29 folder1
							-rw-r--r-- 1 kurs9 kurs    0 Feb 22 20:28 some_file
							kurs9@bioinf1:~/Dokumente$ rm some_file 
							kurs9@bioinf1:~/Dokumente$ rm folder1/
							rm: cannot remove 'folder1/': Is a directory
							kurs9@bioinf1:~/Dokumente$ rm -r folder1/
							kurs9@bioinf1:~/Dokumente$ ll
							total 0
							# Be VEERY careful using this one. rm -r is evil. 
						

Related commands

cat

"concatenate". Reads and prints out files, primarily.


							kurs9@bioinf1:~$ cat a_file
							Lorem Ipsum is simply dummy text of the printing 
							and typesetting industry. Lorem Ipsum has been the
							industry standard dummy text ever since the 1500s,
							when an unknown printer took a galley of type and
							scrambled it to make a type specimen book.
							It has survived not only five centuries, 
							but also the leap into electronic...
						

Related commands

cat

"concatenate". Reads and prints out files, primarily.
Can also be used to output more than one file. Thus the name.


							kurs9@bioinf1:~$ cat a_file another_file
							Lorem Ipsum is simply dummy text of the printing 
							and typesetting industry. Lorem Ipsum has been the
							industry standard dummy text ever since the 1500s,
							when an unknown printer took a galley of type and
							scrambled it to make a type specimen book.
							It has survived not only five centuries, 
							but also the leap into electronic...
							!!Now THIS is the next file. 
						

Related commands

head

Prints the first few lines of a file, without waiting for 5 GBytes of text to scroll by.


							kurs9@bioinf1:~$ head -1 a_file
							Lorem Ipsum is simply dummy text of the printing 
							# just 1 line. 10 is default. 
						

tail

Guess...


							kurs9@bioinf1:~$ tail -1 a_file
							but also the leap into electronic...
						

Related commands

ln

Creates a link to a file.


							kurs9@bioinf1:~$ touch a_file
							kurs9@bioinf1:~$ ln -s a_file a_link_to_a_file
							kurs9@bioinf1:~$ ll
							total 29k
							-rw-r--r-- 1 kurs9 kurs    0 Feb 23 00:43 a_file
							lrwxrwxrwx 1 kurs9 kurs    6 Feb 23 00:44 a_link_to_a_file -> a_file
							# -s means symbolic link... 
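A sketch of what a symbolic link does and does not do, run in a scratch directory. readlink is a standard tool not shown on the slides; the variable names are made up.

```shell
workdir=$(mktemp -d)                     # scratch directory for the experiment
cd "$workdir"
echo "hello" > a_file
ln -s a_file a_link_to_a_file

target=$(readlink a_link_to_a_file)      # where the link points to: a_file
content=$(cat a_link_to_a_file)          # reading the link reads the target

rm a_file                                # the link itself stays behind...
if [ -e a_link_to_a_file ]; then         # ...but now points to nothing
    dangling="no"
else
    dangling="yes"
fi
echo "target: $target, content: $content, dangling: $dangling"
```

A symbolic link is only a pointer to a name. Delete the target and the link is left dangling.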
						

Related commands

cp

Creates a copy of a file (or whole subtree).


							kurs9@bioinf1:~$ cp a_file b_file
							kurs9@bioinf1:~$ ls 
							a_file b_file
							kurs9@bioinf1:~$ cp a_file Dokumente
							kurs9@bioinf1:~$ ls Dokumente
							a_file 
							kurs9@bioinf1:~$ ls
							a_file b_file
							# use -r to copy directories (recursive)
						

Related commands

mv

Moves a file to another position in the file tree.


							kurs9@bioinf1:~$ touch x_file
							kurs9@bioinf1:~$ ls 
							a_file b_file x_file 
							kurs9@bioinf1:~$ mv x_file Bilder
							kurs9@bioinf1:~$ ls Bilder
							x_file
							kurs9@bioinf1:~$ ls 
							a_file b_file 
						

Related commands

Rename files

"Moving" a file without changing its location in the file tree is kind of synonymous to renaming it.


							kurs9@bioinf1:~$ ls 
							a_file b_file
							kurs9@bioinf1:~$ mv b_file z_file 
							kurs9@bioinf1:~$ ls 
							a_file z_file
						

Bash has no built-in "rename" command; mv does the job.

Shoot first, ask questions later.

...or never.

Be VEERY careful with cp, mv or any other write operation.

Linux/Bash presumes you know what you're doing. No questions asked.

Existing files are overwritten mercilessly!
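One possible safeguard, sketched with hypothetical file names in a scratch directory: test whether the target exists before moving anything. Alternatively, cp -i and mv -i ask for confirmation before overwriting.

```shell
workdir=$(mktemp -d)                  # scratch directory for the experiment
cd "$workdir"
echo "precious results" > results.txt
echo "new draft" > draft.txt

# [ -e FILE ] is true if FILE exists.
if [ -e results.txt ]; then
    status="refused: results.txt already exists"
else
    mv draft.txt results.txt
    status="moved"
fi
echo "$status"
kept=$(cat results.txt)               # still the old content
```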

IO Redirection

Data channels

stdout (1)

The standard output channel. Usually connected to your terminal window.

stdin (0)

The standard INput channel. Usually connected to your keyboard.

stderr (2)

The standard error channel. Also connected to your terminal, but separated from normal output.

Output redirection

>

Redirect anything that a command prints out to a file.


							kurs9@bioinf1:~$ ls > listing.txt
							kurs9@bioinf1:~$ cat listing.txt
							a_file
							a_link_to_a_file
							Bilder
							Dokumente
							...
						

Input redirection

<

Use content of a file as input for something.


							kurs9@bioinf1:~$ some_command < listing.txt
							# content of listing.txt is used as input
							# for program "some_command". 
						

Input redirection

implicitly


							kurs9@bioinf1:~$ cat > listing.txt
							bla, bla. 
							kurs9@bioinf1:~$ cat listing.txt
							bla, bla. 
							# This is a quick and dirty method to type
							# something into a file. No input filename given, 
							# so input for cat is implicitly connected to stdin. 
							# Output is redirected to file listing.txt. 
							# Ctrl-D (Strg-D) stops input and thus saves the file. 
						

Error redirection

2>

If a program writes error or system messages separately from its "normal" output, you can save those to a file, too.


							kurs9@bioinf1:~$ some_cmd 2> error.log
							# or even:
							kurs9@bioinf1:~$ some_cmd 2> error.log > output.txt
							# The cmd will create two different files. 
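To try this without a real program, here is a sketch with a small shell function standing in for some_cmd. The function body is made up for illustration; ">&2" and "2>&1" are standard redirections that do not appear on the slides.

```shell
workdir=$(mktemp -d)                   # scratch directory for the experiment
cd "$workdir"

# Hypothetical stand-in for some_cmd: one line to stdout, one to stderr.
some_cmd() {
    echo "normal result"
    echo "something went wrong" >&2    # >&2 sends this line to stderr
}

some_cmd > output.txt 2> error.log     # two channels, two files
some_cmd > combined.txt 2>&1           # 2>&1: stderr goes where stdout goes

out=$(cat output.txt)
err=$(cat error.log)
combined_lines=$(wc -l < combined.txt)
echo "stdout file: $out"
echo "stderr file: $err"
```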
						

Appending redirection

>>

If an output file already exists >> will append to it instead of overwriting it.


							kurs9@bioinf1:~$ some_cmd 2>> error.log # e.g.
							# The error log will continuously grow. 
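The difference between ">" and ">>" in a short sketch (log.txt is a hypothetical file in a scratch directory):

```shell
workdir=$(mktemp -d)                  # scratch directory for the experiment
cd "$workdir"
echo "first run"  > log.txt           # creates (or overwrites) the file
echo "second run" >> log.txt          # appends a line
echo "third run"  >> log.txt          # appends another one
appended_lines=$(wc -l < log.txt)     # 3 lines so far
echo "fresh start" > log.txt          # a single > wipes the old content
overwritten_lines=$(wc -l < log.txt)  # back to 1 line
echo "after appending: $appended_lines, after overwriting: $overwritten_lines"
```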
						

Pipes

"Piping" data streams

Imagine a pipe connection from the output of one program to the input side of another one.

Use | for that purpose.

On a German keyboard, find it on the "</>" key; use Alt Gr for the third keyboard level.

echo and tr

echo will just output a character string.

tr will "translate" one character to another in a data stream.


							kurs9@bioinf1:~$ echo xaxaxaxa
							xaxaxaxa
							kurs9@bioinf1:~$ echo xaxaxaxa | tr "x" "u" 
							uauauaua
							kurs9@bioinf1:~$ echo xaxaxaxa | tr -d "x" # delete
							aaaa
						

Example


							kurs9@bioinf1:~$ ls -1 
							a_file
							a_link_to_a_file
							Bilder
							Dokumente
							Downloads
							kurs9@bioinf1:~$ ls -1 | tr -d "\n" 
							a_filea_link_to_a_fileBilderDokumenteDownloads
							# "\n" is the "new line"-character.
							# delete it to get a single line. 
							# "\t" is a tab, b.t.w.
						

Windows specialties

While Unix, Linux and macOS use "\n" as the line ending, Windows insists on using two characters for the same purpose: "\r\n".

An anachronism dating back to the old teletype era (60s, 70s?).

Keep this in mind in case something goes "inexplicably" wrong.
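A common repair, sketched with tr: delete every "\r" from the data stream. The sample text is made up for illustration.

```shell
# printf creates two lines with Windows line endings ("\r\n").
windows_bytes=$(printf 'abc\r\ndef\r\n' | wc -c)             # 10 bytes
# tr -d "\r" removes the carriage returns, leaving Unix line endings.
unix_bytes=$(printf 'abc\r\ndef\r\n' | tr -d '\r' | wc -c)   # 8 bytes
echo "with \\r\\n: $windows_bytes bytes, with \\n only: $unix_bytes bytes"
```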

Working with files

File commands

less

A convenient way to output/explore text files.

  • q to exit program
  • / to search in the file
  • up/down keys scroll through file.
  • h help about key bindings etc.

File commands

nano

A little text editor.

  • Ctrl-X (Strg-X) - exit
  • Ctrl-O (Strg-O) - save file
  • Ctrl-K (Strg-K) - delete one line
  • See bottom lines for the Editor commands.

File commands

sed

"stream editor". Find/replace automatically on data streams.


							kurs9@bioinf1:~$ echo "Lorem ipsum" | sed 's/[eu]m/ax/g'
							Lorax ipsax
							# [eu] means e or u
							# the "g" means globally, all, not just first hit. 
							# "s" means "substitute". 
						

Regular expressions

  • . any character
  • [0-9] one character out of a group (here: any digit)
  • * zero or more repetitions of the preceding character
  • + one or more repetitions of the preceding character
  • {1,3} one to three (e.g.) repetitions of the preceding character
  • ^ beginning of a line
  • $ end of a line

A pattern description language implemented in several commands.

Regular expressions

Can you come up with a string that matches the following regular expression?

[CK]xy *[0-9][a-z]+A{2,5}B

Would Kxy3ffAB work? Why?
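To check your answer, candidates can be tested with grep. This is a sketch: the helper function check and the extra candidate strings are made up. Note that "+" and "{2,5}" only have their special meaning in grep's extended mode (-E).

```shell
pattern='[CK]xy *[0-9][a-z]+A{2,5}B'

# Hypothetical helper: reports whether a candidate string matches the pattern.
check() {
    # -E: extended regular expressions, -q: quiet, only the exit status counts
    if printf '%s\n' "$1" | grep -Eq "$pattern"; then
        echo "match"
    else
        echo "no match"
    fi
}

quiz=$(check "Kxy3ffAB")
variant=$(check "Kxy3ffAAB")
spaced=$(check "Cxy   7zAAAAAB")
echo "Kxy3ffAB:       $quiz"
echo "Kxy3ffAAB:      $variant"
echo "Cxy   7zAAAAAB: $spaced"
```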

File commands

grep

Filtering lines according to search patterns.


							kurs9@bioinf1:~$ ll | grep "^d" 
							drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Bilder
							drwxr-xr-x 2 kurs9 kurs 4,1k Feb 22 20:30 Dokumente
							drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Downloads
							drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Öffentlich
							drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Schreibtisch
							drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Videos
							drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Vorlagen
							# only print lines beginning with "d". 
							# "-v" inverts the match ("not containing...")
							# "-i" case-insensitive search 
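A sketch of the two options on a made-up list of names. It additionally uses -c, which makes grep print the number of matching lines instead of the lines themselves.

```shell
workdir=$(mktemp -d)       # scratch directory for the experiment
printf 'Bilder\nDokumente\ndownloads\na_file\n' > "$workdir/names.txt"

capital_d=$(grep -c '^D' "$workdir/names.txt")        # case-sensitive: Dokumente
any_d=$(grep -c -i '^d' "$workdir/names.txt")         # -i: Dokumente, downloads
not_d=$(grep -c -v -i '^d' "$workdir/names.txt")      # -v: Bilder, a_file
echo "D: $capital_d, d or D: $any_d, neither: $not_d"
```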
						

File commands

wc

"Word count". Count lines, characters...


							kurs9@bioinf1:~$ echo "abcde" | wc -c #characters
							6 # really? count again. Why? 
							kurs9@bioinf1:~$ ll | wc -l # lines
							13
							kurs9@bioinf1:~$ echo "abc def" | wc -w # words
							2
						

File commands

sort

Sort lines alphabetically, numerically (-n), in reverse (-r)...


							kurs9@bioinf1:~$ ls -1 | sort -r 
							Vorlagen
							Videos
							some_file
							Schreibtisch
							Öffentlich
							listing
							Downloads
							Dokumente
							Bilder
							a_link_to_a_file
							a_file
						

File commands

uniq

Eliminate duplicates and optionally count them (-c).


							kurs9@bioinf1:~$ echo "a,u,s,b,u,h,b"| tr "," "\n" | sort | uniq -c 
							      1 a
							      2 b
							      1 h
							      1 s
							      2 u
							# uniq only compares successive lines, hence the sort
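Why sorting first matters, as a minimal sketch with made-up input:

```shell
# Without sort, no two identical lines are neighbours: nothing is removed.
unsorted=$(printf 'u\nb\nu\nb\n' | uniq | wc -l)          # 4 lines remain
# After sort the duplicates are adjacent (b, b, u, u) and collapse.
sorted=$(printf 'u\nb\nu\nb\n' | sort | uniq | wc -l)     # 2 lines remain
echo "uniq alone: $unsorted lines, sort | uniq: $sorted lines"
```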
						

File commands

nl

Add line numbers.


							kurs9@bioinf1:~$ echo "pi,pa,po"| tr "," "\n" | nl
						     1	pi
						     2	pa
						     3	po
						

File commands

find

Find files in the file tree.


							find /usr -type f -name "ls"
							/usr/lib/klibc/bin/ls
							# search in /usr and subtrees
							# for f(iles)
							# with the name "ls"
						

Very versatile command. Can find anything in the file tree: names, patterns, sizes, access times...

Can also call commands on its findings with the -exec option.
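A sketch of a name pattern and -exec, using hypothetical files in a scratch directory:

```shell
workdir=$(mktemp -d)       # scratch directory for the experiment
mkdir -p "$workdir/run1" "$workdir/run2"
echo ">seq1" > "$workdir/run1/a.fasta"
echo ">seq2" > "$workdir/run2/b.fasta"
echo "notes" > "$workdir/run2/notes.txt"

# -name takes a wildcard pattern; the quotes pass it to find unchanged.
fasta_count=$(find "$workdir" -type f -name "*.fasta" | wc -l)

# -exec runs a command for every hit: {} is replaced by the path found,
# and \; marks the end of the command.
headers=$(find "$workdir" -type f -name "*.fasta" -exec head -n 1 {} \; | sort)

echo "fasta files: $fasta_count"
echo "$headers"
```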

File commands

cut

Dissect tabular data.


							kurs9@bioinf1:~$ cat > tabular
							abc;def
							ghi;jkl
							mno;pqr # now press Ctrl-D (Strg-D) to save. 
							kurs9@bioinf1:~$ cat tabular
							abc;def
							ghi;jkl
							mno;pqr
							kurs9@bioinf1:~$ cat tabular | cut -d ";" -f 2
							def
							jkl
							pqr
							# -d = delimiter, -f = field number
						

Variables

Little labelled boxes

Variables can:

  • Store data while a program is running.
  • Get changed programmatically.
  • Organize data.
  • Be used in calculations.

Variables

Assign and calculate


							kurs9@bioinf1:~$ mynumber=2
							kurs9@bioinf1:~$ echo $mynumber # note the "$" character
							2
							kurs9@bioinf1:~$ mynumber=$mynumber*2
							kurs9@bioinf1:~$ echo $mynumber
							2*2 # ?!
							kurs9@bioinf1:~$ let mynumber=$mynumber*2
							kurs9@bioinf1:~$ echo $mynumber
							8   # 2*2*2
						

Variable names preceded by "$" are replaced by the contents of the variable.
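An alternative to let, sketched here: arithmetic expansion with "$(( ))", which is standard in Bash but not shown on the slides.

```shell
mynumber=2
mynumber=$((mynumber * 2))      # arithmetic expansion: now 4
mynumber=$((mynumber + 3))      # now 7
text=$mynumber*2                # without $(( )) this is just the string "7*2"
echo "$mynumber"
echo "$text"
```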

Variables

Strings and concatenation


							kurs9@bioinf1:~$ a=xy
							kurs9@bioinf1:~$ echo $a
							xy
							kurs9@bioinf1:~$ echo $abc

							kurs9@bioinf1:~$ echo ${a}bc
							xybc
						

Strings and concatenation

Quotes


							kurs9@bioinf1:~$ b="xy" 
							kurs9@bioinf1:~$ echo "b holds $b" 
							b holds xy # variable is interpreted
							kurs9@bioinf1:~$ echo 'b holds $b'
							b holds $b # $b is taken literally! 
						

Single quotes are more "restrictive" than double quotes.

Not quite Variables

But something with a $:


							kurs9@bioinf1:~$ a=$(echo -n "abcabcabc" | tr -d "a" | wc -c)
							kurs9@bioinf1:~$ echo $a
							6 # why?
						

"$()" is evaluated to the result of the command(line) it contains.

"System variables"

Some variables are stored in the "environment" to make Bash work.


							kurs9@bioinf1:~$ echo $PATH
							/home/kurs9/bin:/home/kurs9/.local/bin:/usr/local/sbin:
							/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:
							/usr/local/games:/snap/bin
							# PATH holds the list of paths Bash searches for commands in. 
							# You could redefine it like this:
							kurs9@bioinf1:~$ export PATH=$PATH:/home/kurs9/folder1
							# Now there's an additional place to look for commands in. 
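A sketch of the effect, with a made-up tool called mytool in a temporary folder. It assumes no command of that name is installed already.

```shell
tooldir=$(mktemp -d)                 # hypothetical folder for your own tools
printf '#!/bin/bash\necho "hello from mytool"\n' > "$tooldir/mytool"
chmod a+x "$tooldir/mytool"          # allow the file to be executed

before=$(command -v mytool || echo "not found")   # not on PATH yet
PATH=$PATH:$tooldir                               # append the folder
export PATH
after=$(mytool)                                   # found without ./ or full path
echo "before: $before"
echo "after:  $after"
```

The change only lasts for the current shell session.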
						

Loops

Loops

What are they good for?

  • Doing repetitive tasks
  • Processing lists of items
  • Not having to wait for one task to finish before starting another.

Loops

for

The construct to use if you need to iterate over lists of items.


							kurs9@bioinf1:~$ for i in a b c d; # press Return here
							> do  echo "Do unbelievably complex bioinformatics task with sequence file ${i}.fasta";
							> done # multi-line command finished. 
							Do unbelievably complex bioinformatics task with sequence file a.fasta
							Do unbelievably complex bioinformatics task with sequence file b.fasta
							Do unbelievably complex bioinformatics task with sequence file c.fasta
							Do unbelievably complex bioinformatics task with sequence file d.fasta
						

Loops

for

Lists to iterate over can also be generated by commands...


							kurs9@bioinf1:~$ for j in $(seq 1 1000);do echo -n $j" ";done 
							1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
							25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
							52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 
							79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103  ...
						

"seq from to" generates sequences of numbers.
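
A loop can also do something with the numbers instead of just printing them. A minimal sketch, with `total` and `n` as made-up variable names:

```shell
# Sum the numbers 1..100 with a for loop over the output of seq.
total=0
for n in $(seq 1 100); do
    total=$((total + n))   # $(( )) does integer arithmetic
done
echo "$total"   # 5050
```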

Loops

for

Lists to iterate over can also be generated by "wildcards"...


							kurs9@bioinf1:~$ for k in D*;do echo -n $k" ";done
							Dokumente Downloads
							# everything that starts with a capital "D". 
						

"*" means "filename parts of arbitrary length and syntax". a*b matches every file with a name like "arghb" or "axb", OR just "ab"!
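
To try this without cluttering your home directory, here is a sketch that uses a throwaway directory. All file names are invented for the example.

```shell
# Create a temporary directory with a few empty files.
workdir=$(mktemp -d)
touch "$workdir/ab" "$workdir/axb" "$workdir/arghb" "$workdir/abc" "$workdir/xab"

# a*b matches names that start with "a" and end with "b".
# "*" may also match nothing at all, which is why "ab" is included.
# The ( ... ) subshell keeps the cd from affecting the current shell.
matches=$(cd "$workdir" && echo a*b)
echo "$matches"

rm -r "$workdir"
```

"abc" and "xab" are left out because they do not both start with "a" and end with "b".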

Loops

while

"While some condition evaluates to true, do something. Stop if it doesn't (anymore)."


							# remember "tabular"? 
							kurs9@bioinf1:~$ while read f; do echo $f";xyz";done < tabular 
							abc;def;xyz
							ghi;jkl;xyz
							mno;pqr;xyz
							# very often used in combination with read
						

read var reads one line from a data stream (standard input) and stores it in the variable var. It fails at the end of the input, which ends the loop.
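
read can also split a line into several variables. A sketch, assuming a semicolon-separated file like "tabular" (recreated here as a temporary file); the variable names are made up.

```shell
# Recreate a small semicolon-separated file in a temporary location.
tabfile=$(mktemp)
printf 'abc;def\nghi;jkl\nmno;pqr\n' > "$tabfile"

# IFS=';' makes read split each line at the semicolons;
# -r keeps backslashes in the data untouched.
second_column=""
while IFS=';' read -r first second; do
    echo "first=$first second=$second"
    second_column="$second_column$second "
done < "$tabfile"

rm "$tabfile"
```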

Scripting

A script

A script is like a little program.

It's just a text file holding a sequence of Bash commands that can be executed like a new command.

Use nano to save the following into the file myscript.


							#!/bin/bash
							while read f
							do
								echo $f";xyz"
							done < $1 
						

A script

To make it run we need to allow it to be executed.

Use chmod for this ("change mode").


							kurs9@bioinf1:~$ ll myscript 
							-rw-r--r-- 1 kurs9 kurs 54 Feb 23 13:45 myscript

							kurs9@bioinf1:~$ chmod a+x myscript # add "x" for "a"ll. 
							kurs9@bioinf1:~$ ll myscript
							-rwxr-xr-x 1 kurs9 kurs   54 Feb 23 13:45 myscript
						

A script

To run it we have to explicitly start it in our current directory.


							kurs9@bioinf1:~$ ./myscript 
							./myscript: line 5: $1: ambiguous redirect
							# "$1" is the first argument after the command name
							# we have to specify an input file

							kurs9@bioinf1:~$ ./myscript tabular 
							abc;def;xyz
							ghi;jkl;xyz
							mno;pqr;xyz
							# great. 
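
A slightly more careful variant of the same loop, as a sketch. It is wrapped in a function with the made-up name `append_xyz` so you can paste it straight into the terminal; inside a function "$1" is the first argument, just like in a script.

```shell
# Same loop as in myscript, but with a check for the missing argument.
append_xyz() {
    # Complain and stop if no input file was given.
    if [ "$#" -lt 1 ]; then
        echo "usage: append_xyz FILE" >&2
        return 1
    fi
    # Quoting "$1" keeps file names with spaces intact.
    while read -r line; do
        echo "$line;xyz"
    done < "$1"
}
```

Called without a file name it now prints a usage hint instead of the rather cryptic "ambiguous redirect".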
						

Packaging

Reducing redundancy

Files are often redundant. You could save "aaaaaaaaa" just as well in the form "9a", which takes up much less disk space.

This simple scheme is called run-length encoding. Real compression programs are more sophisticated, but the principle is the same: describe repeated data more briefly.

To compress data you can use e.g. "GNU Zip".
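
The "9a" idea from above can be tried with standard tools. This is only a toy sketch (the function name `rle` is made up, and it assumes input made of letters without spaces); it is not how gzip works internally.

```shell
# Toy run-length encoder: "aaaaaaaaabbb" becomes "9a3b".
# fold -w1 puts every character on its own line, uniq -c counts repeats,
# awk glues each count and character back together.
rle() {
    printf '%s' "$1" | fold -w1 | uniq -c | awk '{printf "%s%s", $1, $2} END {print ""}'
}

encoded=$(rle "aaaaaaaaabbb")
echo "$encoded"   # 9a3b
```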

Compression

gzip

Will compress a file and rename it to file.gz.

Can also work on data streams.


							kurs9@bioinf1:~$ ll tabular
							-rw-r--r-- 1 kurs9 kurs   24 Feb 23 12:49 tabular
							kurs9@bioinf1:~$ gzip tabular 
							kurs9@bioinf1:~$ ll tabular*
							-rw-r--r-- 1 kurs9 kurs   52 Feb 23 12:49 tabular.gz
							# doesn't seem to be very redundant... 
						

							kurs9@bioinf1:~$ for i in $(seq 1 10000);do echo -n "a";done | gzip > aaa.gz
							kurs9@bioinf1:~$ ll aaa.gz
							-rw-r--r-- 1 kurs9 kurs   46 Feb 23 14:01 aaa.gz
							# better: 46 bytes, not 10 kB
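
Since gzip works on data streams, you can compare how well different data compresses without creating any files. A sketch; the variable names are made up and the exact sizes will differ a little between systems.

```shell
# Compress two streams of 10000 bytes each and count the compressed bytes.
# The first stream is highly redundant (only "a"), the second is random data.
redundant_size=$(head -c 10000 /dev/zero | tr '\0' 'a' | gzip -c | wc -c)
random_size=$(head -c 10000 /dev/urandom | gzip -c | wc -c)
echo "redundant: $redundant_size bytes, random: $random_size bytes"
```

Random data has no redundancy, so gzip cannot shrink it; the result even gets slightly bigger than the input.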
						

Decompression

gunzip

Will unpack a gzipped file and rename it accordingly.


							kurs9@bioinf1:~$ gunzip aaa.gz
							kurs9@bioinf1:~$ ll
							total 62k
							-rw-r--r-- 1 kurs9 kurs  10k Feb 23 14:01 aaa
						

You can use zcat to "cat" gzipped files without unpacking them.


							kurs9@bioinf1:~$ gzip aaa
							kurs9@bioinf1:~$ zcat aaa.gz 
							aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa....
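
If you want to keep the original file, let gzip write to standard output. A sketch with made-up file names in a throwaway directory:

```shell
workdir=$(mktemp -d)
printf 'abc;def\nghi;jkl\n' > "$workdir/small.txt"

# -c writes to standard output, so small.txt itself is kept.
gzip -c "$workdir/small.txt" > "$workdir/small.txt.gz"

# gunzip -c prints the unpacked data (like zcat) without touching the .gz file.
line_count=$(gunzip -c < "$workdir/small.txt.gz" | wc -l)
echo "$line_count lines"

rm -r "$workdir"
```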
						

(Tape-) Archives

tar

Just for completeness' sake: tar can create archives of many files or even whole subtrees and optionally compress them with gzip.


							# pack folder1 (--remove-files deletes the originals after packing):
							tar --remove-files -czf arch.tar.gz folder1 
							# unpack: 
							tar -xzf arch.tar.gz 
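
A complete round trip to try out, as a sketch. Everything happens in a throwaway directory, and the file names are made up.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/folder1"
echo "some data" > "$workdir/folder1/file1.txt"

# c = create, z = gzip, f = archive file name; -C changes into a directory first.
tar -czf "$workdir/arch.tar.gz" -C "$workdir" folder1

# t = list the contents without unpacking.
listing=$(tar -tzf "$workdir/arch.tar.gz")
echo "$listing"

# x = extract, here into a separate directory.
mkdir "$workdir/unpacked"
tar -xzf "$workdir/arch.tar.gz" -C "$workdir/unpacked"
restored=$(cat "$workdir/unpacked/folder1/file1.txt")
echo "$restored"

rm -r "$workdir"
```

Listing with -t first is a good habit: you see what an archive will unpack before it does.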
						

File transfer

Email

You can email text data easily to anywhere.


							cat some_data.txt | mail -s "Look I got data" \
							-r test@cos.uni-heidelberg.de \
							mkiefer@cos.uni-heidelberg.de
						
  • -s a subject line
  • -r a "FROM" line
  • Destination address as last argument

SCP

"Secure copy"

cp between different computers, based on ssh.


							# one file from here to there
							scp datafile user1@somewhere.de:/to/this/folder

							# a whole subtree from here to there
							scp -r some/datafolder user1@somewhere.de:/to/this/other/folder/newname

							# one file from there to somewhere else
							scp user1@thismachine.de:/this/file user@thatmachine.de:/there/newname
						

A convenient graphical program for sftp (file transfer over ssh, like scp) on any OS is FileZilla.

FASTA and FASTQ

FASTA-Sequences

Very easy sequence format.

The name is derived from a very old bioinformatics software package, long forgotten. The format remains.

FASTA

FASTA looks like this:


							>sequence1. It is DNA. I extracted it from a plant. 
							tgactagcatgctactacagcgtagcatctagctacgactatctagcatcatc
							acgatgtgcggcgcggtaataatagcgctaggctcgtagcagcgagaagagg
							>sequence2. Whatever.
							tgctacatgcgcgcgcgcgcgcgccacgacatgggcgcgcgcgcgcgcgcga
							gatgctacgatcgtagcg
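
Because the format is so simple, standard tools are enough for a first look. A sketch with a tiny made-up FASTA file:

```shell
# Write a small example file to a temporary location.
fasta=$(mktemp)
printf '>sequence1 some DNA\ntgactagcat\ngctact\n>sequence2\ntgctac\n' > "$fasta"

# Every record starts with a ">" header line, so counting those counts sequences.
n_seqs=$(grep -c '^>' "$fasta")

# Total number of bases: drop header lines, remove line breaks, count characters.
n_bases=$(grep -v '^>' "$fasta" | tr -d '\n' | wc -c)
echo "$n_seqs sequences, $n_bases bases"

rm "$fasta"
```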
						

FASTQ-Sequence reads

The format for NGS raw data. This is what an Illumina sequencer will give you.

Similar to FASTA with an additional Quality line.

Numerical quality values are encoded according to the ASCII table. See FastQ on Wikipedia.
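
A sketch of the decoding, assuming the common "Phred+33" encoding (quality = ASCII code minus 33). Older Illumina data used a different offset, so check what your files use. The function name `phred` is made up.

```shell
# Decode one quality character to its Phred score (assumes Phred+33).
phred() {
    # printf '%d' with a leading quote yields the character's ASCII code.
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

phred D   # 35
phred I   # 40
```

A score of 40 means an estimated error probability of 1 in 10000 for that base, 30 means 1 in 1000.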

FASTQ


							@GGR-22:420:HN55KBCX2:2:1101:1233:2068 1:N:0
							CTAGTGTCACTTGATAACGAAACTCTTTGGCATGAAAGACTAGGTCACATAAATTTT
							AAGGACGTGGTGAGAGGTGTTCCTAAATTGGTTTTTAAAGAAAACATTATTTGTGGA
							AGAGCCCCCCATACGAACTTAACCCACGTAGGTACAAAACGGCCTTTATAATTATTG
							CGTCAAC
							+
							DDDDDIHHHHHHIH?CGHHCGHDDCFHGEHHIFHHIIHHHEHHHCC?FH?DHIEHHF
							HH1HHC/?CECC?EC1EF?D?FHHGHF?EGECC1CEHHFCHHIHIIHHHGHIIIIHI
							E1?G?G?HHDHHIHHH?E?HEHCFHHIGH.EEHCEHH..FC.BEAB?E....CF..B
							,BB?BBB
							@GGR-22:420:HN55KBCX2:2:1101:1173:2213 1:N:0
							GAGGTT...
						

One Illumina read with its quality annotation and the start of the next one (of like a billion). FastQ files are usually gzipped because of their extreme size.
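
Counting the reads in a gzipped FASTQ file works without unpacking it. A sketch, assuming the usual layout of exactly four lines per read (name, sequence, "+", quality); the tiny example file is made up.

```shell
# Create a small gzipped FASTQ file with two reads.
fastq_gz=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nDDDD\n' | gzip -c > "$fastq_gz"

# Unpack to a stream, count the lines, divide by four.
n_lines=$(gunzip -c < "$fastq_gz" | wc -l)
n_reads=$((n_lines / 4))
echo "$n_reads reads"

rm "$fastq_gz"
```

Counting lines that start with "@" is not reliable, because "@" is also a valid quality character and can start a quality line.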

SAM and BAM

SAM

"Sequence Alignment / Map"

  • Widely used sequence format for NGS data.
  • Result file of many assemblers.
  • Separated into header and data part
  • Usually holds alignment of reads to a reference.
  • Large!

SAM format


					@HD VN:1.6 SO:coordinate  # SO = sorting order
					@SQ SN:ref LN:45          # LN = length of reference, SN = ref. name
					r001   99 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
					r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
					r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
					r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
					r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
					r001  147 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1
						
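  • "@" marks header lines.
  • "r001": sequence (read) name
  • "99": bitwise flag (e.g. paired, reverse strand)
  • "ref": reference name
  • "7": leftmost base position on the reference
  • "30": mapping quality
  • "8M2I4M1D3M": "CIGAR" string (8 match, 2 insert, 4 match, 1 deletion, 3 match)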
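
The CIGAR string tells you how much of the reference a read covers. A sketch in plain shell; the function name `ref_span` is made up. In the SAM specification the operations M, D, N, = and X consume reference bases, while I, S, H and P do not.

```shell
# Length of the reference stretch covered by an alignment, from its CIGAR string.
ref_span() {
    span=0
    # sed puts a space after every "number + letter" pair: 8M 2I 4M 1D 3M
    for op in $(printf '%s\n' "$1" | sed 's/\([0-9]*[A-Z=]\)/\1 /g'); do
        length=${op%?}      # everything except the last character
        case "$op" in
            *M|*D|*N|*=|*X) span=$((span + length)) ;;
        esac
    done
    echo "$span"
}

ref_span 8M2I4M1D3M   # 16
```

Read r001 starts at position 7 and spans 16 reference bases, so it covers positions 7 to 22.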

BAM

  • Just like SAM, but compressed. ("binary sam")
  • Takes up way less space on your harddisk.
  • Many programs can work with it without decompressing.
  • (A little less) Large!

VCF

VCF

Variant Call Format

VCF


						##fileformat=VCFv4.0 #Mandatory!
						##fileDate=20090805  
						##source=myImputationProgramV3.1
						##reference=1000GenomesPilot-NCBI36
						##phasing=partial
						##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
						##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
						##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
						##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
						##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
						##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
						##FILTER=<ID=q10,Description="Quality below 10">
						##FILTER=<ID=s50,Description="Less than 50% of samples have data">
						##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
						##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
						##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
						##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
						#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
						20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
						20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
						20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
						20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
						20     1234567 microsat1 GTCT   G,GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3
					
  • "##" means "header line"
  • "##INFO" describes fields that are included in the data lines below
  • "##FILTER" describes possible filter rules applied to the data
  • "##FORMAT" describes genotype fields included in the data
  • "#" means "column heads"
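
The three kinds of lines are easy to separate with grep. A sketch with a tiny made-up VCF file (real VCF columns are separated by tabs):

```shell
# Write a minimal example file to a temporary location.
vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.0\n'
    printf '##source=example\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14\n'
    printf '20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11\n'
} > "$vcf"

n_meta=$(grep -c '^##' "$vcf")                       # header lines
columns=$(grep '^#CHROM' "$vcf" | tr '\t' ' ')       # the column heads
n_records=$(grep -vc '^#' "$vcf")                    # data lines
echo "$n_meta header lines, $n_records records"
echo "columns: $columns"

rm "$vcf"
```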

VCF data

  • "CHROM": chromosome (number)
  • "POS": position in bp on the reference
  • "ID": unique variant ID(s), e.g. DB identifiers
  • "REF": base(s) [ACGTN] on the reference
  • "ALT": alternate alleles, comma separated
  • "QUAL": Phred score for "ALT"
  • "FILTER": "PASS" or name of failed filter(s)
  • "INFO": values for the info fields defined in the header
  • "FORMAT": format for the genotype columns to the right
  • "NA00001": column name (sample) for genotype data
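
Since the columns are in a fixed order, awk can pick out what you need. A sketch with a tiny made-up VCF file; column 7 is FILTER, columns 4 and 5 are REF and ALT.

```shell
# Write a minimal example file to a temporary location (tab-separated).
vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.0\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14\n'
    printf '20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11\n'
} > "$vcf"

# Skip all lines starting with "#", keep records whose FILTER column is PASS,
# and print chromosome, position, REF and ALT.
passed=$(awk -F '\t' '!/^#/ && $7 == "PASS" {print $1 ":" $2 " " $4 ">" $5}' "$vcf")
echo "$passed"

rm "$vcf"
```

For real analyses dedicated tools are the safer choice, but for a quick look this is often enough.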

VCF genotypes


						##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
						##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
						##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
						##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
						

							GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
						
  • "0/0": unphased alleles
  • "0|0": phased alleles
  • "0|1": e.g. allele 1 is ref. and allele 2 is the first alt.
  • "1/2": e.g. one allele is alt. 1, the other is alt. 2
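
One sample entry can be taken apart along the colons, in the order given by the FORMAT column. A sketch using the second sample from the line above; the variable names are made up.

```shell
# One sample entry, laid out as GT:GQ:DP:HQ.
sample='1|0:48:8:51,51'

# cut -d: splits at the colons, -fN picks the N-th field.
gt=$(echo "$sample" | cut -d: -f1)
gq=$(echo "$sample" | cut -d: -f2)
dp=$(echo "$sample" | cut -d: -f3)
hq=$(echo "$sample" | cut -d: -f4)

# "|" in GT means phased, "/" means unphased.
case "$gt" in
    *'|'*) phase="phased" ;;
    *)     phase="unphased" ;;
esac
echo "GT=$gt ($phase), GQ=$gq, DP=$dp, HQ=$hq"
```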