Everything you're usually working (or playing) with...
Wide range of capabilities
Typically 2-6 cores, 1-16G RAM, single disk, "cheaper" components
Typically 6-24 cores, 16-512G RAM, redundant storage...
Typically 2-1000s of cores, 8-1000s G of RAM, (multi)-redundant storage up to PByte range
The term "Server" can also refer to software!
"Webserver", e.g., can mean the physical machine housing the software that delivers web sites.
...or explicitly the software that does.
"Operating system".
The software that is used to interact with computer hardware, excluding any applications.
E.g. Microsoft Windows, FreeBSD, Linux, Unix...
GPL: "Do what you want with it, but include the source code."
Linux distributions are created by people who take the Linux kernel and other necessary software sources and compile something from them that (more or less) works as a whole.
Android...
To just be able to type commands on a server.
Working on a server with mouse, windows, graphical user interface.
We will work on:
bioinf1.cos.uni-heidelberg.de
On Windows, use PuTTY to log in to a server terminal.
Mac/Linux use their standard terminal applications.
Set up a session and optionally save it for later use.
Log in as kurs1(2,3-9)
When typing the password, nothing will be displayed, of course.
Use the X2Go client (aka "die fette Robbe", i.e. "the fat seal") to log in graphically to "bioinf1".
Session configuration
Start a defined session.
A remote desktop.
(XFCE-Environment)
A proxy server that uses HTML5 to stream a terminal or desktop environment to a browser.
You log in to Guacamole to access the machines configured for your account.
You have to have access credentials to those machines in addition to the ones for the proxy interface.
It separates you from the innards of the computer.
It accepts your commands and delivers their results.
(=command interpreters)
We'll use Bash.
It's the standard on Linux systems.
kurs9@bioinf1:~$ _
You are user kurs9 on machine bioinf1, working in your home directory (~), with regular user rights ($).
root@erysimum:/usr/bin# _
You are user root on machine erysimum, working in directory /usr/bin, with administrative rights (#).
The user "root" is the administrative user on a Linux system.
There's literally no restriction to what root may do on (to) the system.
Which can be quite dangerous, actually.
You never work as root permanently but prefix a command with sudo if necessary and allowed.
kurs9@bioinf1:~$ history
1 touch test
2 ls -lah ..
3 echo "Juhu"
4 echo $PATH
5 history
Use the Arrow up/down keys to get back to command(line)s you already typed.
Use the history command to get a list.
kurs9@bioinf1:~$ touch curry-sausage-with-ketchup
kurs9@bioinf1:~$ touch cucumbers-are-green
kurs9@bioinf1:~$ rm cu[TAB]
cucumbers-are-green curry-sausage-with-ketchup
kurs9@bioinf1:~$ rm cuc[TAB]umbers-are-green
Use the Tab key to make Bash try to complete what you started typing.
Use this feature to avoid tedious typing and annoying typos.
touch creates an empty file.
rm deletes a file.
Use the key combination Ctrl-C (labelled Strg-C on German keyboards) to cancel things, e.g. if something is running indefinitely.
Don't.
Well, OK, there's copy/paste from the context menu.
Built-ins or... external programs... are often hard to distinguish.
Good that it doesn't matter for most practical purposes. :)
cp -r --interactive sourcepath targetpath
...this is the "copy" command b.t.w.
The line consists of space-separated "arguments"
The first one is the command itself.
"options" are arguments with a dash, often (optional) modifiers.
"options" can sometimes have "parameters".
"positional arguments" are typically mandatory and have to be specified in a certain order.
(Like "source" followed by "target".)
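As a rough illustration of these terms, the following sketch labels each part of two ordinary command lines. The scratch directory and the file name notes.txt are made up for the example.

```shell
# Work in a throwaway directory so nothing important is touched.
workdir=$(mktemp -d)
mkdir "$workdir/source"
printf 'one\ntwo\nthree\n' > "$workdir/source/notes.txt"

# cp = the command, -R = an option (copy recursively),
# source and target = positional arguments, in that order.
cp -R "$workdir/source" "$workdir/target"

# head = the command, -n = an option, 2 = the option's parameter,
# the file name = a positional argument.
first_two=$(head -n 2 "$workdir/target/notes.txt")
echo "$first_two"
```

Swapping the two positional arguments of cp would change what gets copied where, whereas the position of an option usually does not matter.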
Everything in a line after "#" is ignored by Bash.
# Hi. I'm a comment.
# I'm used for code documentation, mostly.
The manual reader.
man man # yes, looks silly. Try it.
Leave the help reader by pressing "q" (quit).
Many commands give a short help message when called with the "help" option.
cp --help
Linux file systems represent everything as files.
A drive (hard disks, USB sticks, CDs...) has to be "mounted" into an empty directory (a "mountpoint").
The driver in charge of that hardware then displays the content of the drive in that folder.
kurs9@bioinf1:~$ tree -L 1 -d /
/ # the root of the tree
├── bin # executable programs
├── boot # the kernel et al.
├── dev # device representations
├── etc # configuration files
├── home # where the users live
├── home2
├── lib # program libraries
├── lost+found # orphaned files and stuff
├── media # where usb is mounted
├── mnt # where harddrives are mounted
├── opt # optional software
├── proc # process representations
├── root # the admin's home
├── srv # data to serve to others
├── sys # system "files" et al.
├── tmp # temporary data
├── usr # "unix system resources"
└── var # variable data
Try the following:
kurs9@bioinf1:~$ ls /mnt
raw, work and lehre are data volumes shared on several servers.
kurs9@bioinf1:~$ ls /mnt/lehre
Here's your sequencing data (for later).
Just for fun:
kurs9@bioinf1:~$ ls /usr/bin
"Paths" are a way to pinpoint an object in the file system tree.
You can use a path as an argument to a command if you want to do something with a file or folder.
Imagine the following situation:
/
├── bin
├── boot
├── etc
└── home
├── kurs1
├── kurs2
├── kurs3
├── kurs4
├── kurs5
│ ├── Bilder
│ ├── Dokumente
│ ├── Downloads
│ └── Musik
├── kurs6
└── kurs9
The "address" of the folder Musik would be:
/home/kurs5/Musik
Absolute paths are unambiguous for any object in the tree and from any position in the tree.
They always start with a /, the file system root.
Relative paths can be ambiguous.
They always start at the current position and lead to an object relative to it.
They never start with a /.
/
├── bin
├── boot
├── etc
└── home
├── kurs1
├── kurs2
├── kurs3
├── kurs4
├── kurs5
│ ├── Bilder
│ │ ├── JPG
│ │ └── SVG
│ ├── Dokumente
│ ├── Downloads
│ └── Musik
├── kurs6
└── kurs9
To reach directory JPG:
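A minimal sketch of both ways, assuming the tree above is rebuilt inside a scratch directory (so the "absolute" paths here start at that scratch directory; on the real machine it would simply be /home/kurs5/Bilder/JPG):

```shell
# Rebuild the relevant part of the example tree in a scratch directory.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/home/kurs5/Bilder/JPG" "$sandbox/home/kurs5/Bilder/SVG"
mkdir -p "$sandbox/home/kurs5/Musik"

# Absolute path: starts with "/" and works from any position in the tree.
cd "$sandbox/home/kurs5/Bilder/JPG"
via_absolute=$(pwd -P)

# Relative path: starting in Musik, go one level up, then down into Bilder/JPG.
cd "$sandbox/home/kurs5/Musik"
cd ../Bilder/JPG
via_relative=$(pwd -P)

echo "$via_absolute"
echo "$via_relative"
```

Both routes end in the same folder. The relative one only works because we started in Musik.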
A special kind of file system object that can "hold" or contain other objects.
The primary means in any office or file system to remain organized (and keep one's sanity).
"list folder's content"
kurs9@bioinf1:~$ ls
Bilder curry-sausage-with-ketchup Downloads Öffentlich Videos
biodivc2 Dokumente Musik Schreibtisch Vorlagen
kurs9@bioinf1:~$ ls -lh # -lh = long, human-readable
total 32K
drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Bilder
-rw-r--r-- 1 kurs9 kurs 0 Feb 21 16:10 curry-sausage-with-ketchup
drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Dokumente
drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Downloads
drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Musik
drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Öffentlich
drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Schreibtisch
drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Videos
drwxr-xr-x 2 kurs9 kurs 4,0K Feb 21 09:07 Vorlagen
"Where am I? (print working dir)"
kurs9@bioinf1:~$ ll # an alias for ls -lh
total 33k
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Bilder
-rw-r--r-- 1 kurs9 kurs 0 Feb 21 16:10 curry-sausage-with-ketchup
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Dokumente
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Downloads
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Musik
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Öffentlich
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Schreibtisch
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Videos
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Vorlagen
kurs9@bioinf1:~$ pwd
/home/kurs9 # looks right...
"Go into a different folder (change dir)."
kurs9@bioinf1:~$ pwd
/home/kurs9
kurs9@bioinf1:~$ cd Dokumente/
kurs9@bioinf1:~/Dokumente$ pwd
/home/kurs9/Dokumente
kurs9@bioinf1:~/Dokumente$ cd # always to your home folder
kurs9@bioinf1:~$ pwd
/home/kurs9
Unix/Linux systems are case-sensitive!
(Windows isn't.)
kurs9@bioinf1:~$ pwd
/home/kurs9
kurs9@bioinf1:~$ PWD
PWD: command not found
kurs9@bioinf1:~$ Pwd
No command 'Pwd' found, did you mean:
Command 'xwd' from package 'x11-apps' (main)
Command 'gwd' from package 'geneweb' (universe)
Command 'pwd' from package 'coreutils' (main)
Pwd: command not found
"Make a folder"
kurs9@bioinf1:~$ cd Dokumente
kurs9@bioinf1:~/Dokumente$ pwd
/home/kurs9/Dokumente
kurs9@bioinf1:~/Dokumente$ mkdir ordner1
kurs9@bioinf1:~/Dokumente$ mkdir Ordner1/ordner2 # Oops!
mkdir: cannot create directory ‘Ordner1/ordner2’: No such file or directory
kurs9@bioinf1:~/Dokumente$ mkdir ordner1/ordner2
kurs9@bioinf1:~/Dokumente$ ls
ordner1
kurs9@bioinf1:~/Dokumente$ tree
.
└── ordner1
└── ordner2
2 directories, 0 files
Use the option "-p" to create all necessary subfolders and suppress errors on existing folders.
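A small sketch of what "-p" does, run in a scratch directory (the folder name ordner3 is made up):

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# Creates ordner1, ordner1/ordner2 and ordner1/ordner2/ordner3 in one go.
mkdir -p ordner1/ordner2/ordner3

# Running it again is not an error: existing folders are silently accepted.
mkdir -p ordner1/ordner2/ordner3
second_status=$?
echo "exit status of the second call: $second_status"
```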
"Remove an empty folder"
kurs9@bioinf1:~$ cd
kurs9@bioinf1:~$ ls
Bilder Dokumente Downloads Musik Öffentlich Schreibtisch Videos Vorlagen
# We're here for work, not fun! ;)
kurs9@bioinf1:~$ rmdir Musik
kurs9@bioinf1:~$ ls
Bilder Dokumente Downloads Öffentlich Schreibtisch Videos Vorlagen
"Remove an EMPTY folder"
kurs9@bioinf1:~$ cd ~/Dokumente/
kurs9@bioinf1:~/Dokumente$ ls
ordner1
kurs9@bioinf1:~/Dokumente$ tree
.
└── ordner1
└── ordner2
2 directories, 0 files
kurs9@bioinf1:~/Dokumente$ rmdir ordner1
rmdir: failed to remove 'ordner1': Directory not empty
"the parent folder" (one level up)
kurs9@bioinf1:~$ pwd
/home/kurs9
kurs9@bioinf1:~$ cd ..
kurs9@bioinf1:/home$ pwd
/home
".." can be a perfectly valid part of a path.
"the current folder (exactly here)"
kurs9@bioinf1:~$ pwd
/home/kurs9
kurs9@bioinf1:~$ cd .
kurs9@bioinf1:~$ pwd
/home/kurs9 # nothing much happened...
"." can also be a perfectly valid part of a path.
To find a program or command to run, Bash looks into defined places only.
kurs9@bioinf1:~$ whereis ls
ls: /bin/ls /usr/share/man/man1/ls.1.gz
# /bin is one of those places.
To explicitly run an executable file stored in your current folder, you'd say:
./myprogram
"My ~ is my castle!"
kurs9@bioinf1:~$ cd ..
kurs9@bioinf1:/home$ pwd
/home
kurs9@bioinf1:/home$ cd ~/Dokumente/
kurs9@bioinf1:~/Dokumente$ pwd
/home/kurs9/Dokumente
"~" is just an abbreviation for /home/kurs9.
Just plain, readable, printable characters, optionally organized in lines.
Letters, figures, punctuation, line breaks, tabs.
No fancy formatting.
"American Standard Code for Information Interchange" (1963).
man ascii
   2 3 4 5 6 7        30 40 50 60 70 80 90 100 110 120
 -------------       ---------------------------------
0:   0 @ P ` p     0:    (  2  <  F  P  Z  d   n   x
1: ! 1 A Q a q     1:    )  3  =  G  Q  [  e   o   y
2: " 2 B R b r     2:    *  4  >  H  R  \  f   p   z
3: # 3 C S c s     3: !  +  5  ?  I  S  ]  g   q   {
4: $ 4 D T d t     4: "  ,  6  @  J  T  ^  h   r   |
5: % 5 E U e u     5: #  -  7  A  K  U  _  i   s   }
6: & 6 F V f v     6: $  .  8  B  L  V  `  j   t   ~
7: ' 7 G W g w     7: %  /  9  C  M  W  a  k   u   DEL
8: ( 8 H X h x     8: &  0  :  D  N  X  b  l   v
9: ) 9 I Y i y     9: '  1  ;  E  O  Y  c  m   w
A: * : J Z j z
B: + ; K [ k {
C: , < L \ l |
D: - = M ] m }
E: . > N ^ n ~
F: / ? O _ o DEL
One Byte - one character.
Character coding for international languages.
From Chinese characters to Emojis. You name it.
One to several Bytes per character.
Using Umlauts, Unicode or just Spaces in filenames is an invitation to trouble.
You can do it. But it's guaranteed to complicate things.
kurs9@bioinf1:~/Dokumente$ touch some name
# a space is what separates arguments...
kurs9@bioinf1:~/Dokumente$ ll
total 4,1k
-rw-r--r-- 1 kurs9 kurs 0 Feb 22 20:16 name
-rw-r--r-- 1 kurs9 kurs 0 Feb 22 20:16 some
# two files, not one called "some name".
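If a name with a space is unavoidable, quoting or escaping keeps it together as one argument. A short sketch in a scratch directory (file names are made up):

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

touch some name        # two arguments: creates "some" and "name"
touch "some name"      # one quoted argument: a single file called "some name"
touch other\ name      # a backslash protects the space, too

file_count=$(ls -1 | wc -l)
echo "$file_count"
```

Every later command that touches such a file needs the same quoting, which is why underscores are the less painful choice.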
Anything that contains unprintable characters.
Never open your bioinformatics files in MS Word!
Creates an empty file or changes the timestamp of an existing one.
kurs9@bioinf1:~/Dokumente$ touch some_file
kurs9@bioinf1:~/Dokumente$ ls
some_file
kurs9@bioinf1:~/Dokumente$ ll
total 0
-rw-r--r-- 1 kurs9 kurs 0 Feb 22 20:28 some_file
Remove (delete) files
kurs9@bioinf1:~/Dokumente$ ll
total 4,1k
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 22 20:29 folder1
-rw-r--r-- 1 kurs9 kurs 0 Feb 22 20:28 some_file
kurs9@bioinf1:~/Dokumente$ rm some_file
kurs9@bioinf1:~/Dokumente$ rm folder1/
rm: cannot remove 'folder1/': Is a directory
kurs9@bioinf1:~/Dokumente$ rm -r folder1/
kurs9@bioinf1:~/Dokumente$ ll
total 0
# Be VEERY careful using this one. rm -r is evil.
"concatenate". Reads and prints out files, primarily.
kurs9@bioinf1:~$ cat a_file
Lorem Ipsum is simply dummy text of the printing
and typesetting industry. Lorem Ipsum has been the
industry standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and
scrambled it to make a type specimen book.
It has survived not only five centuries,
but also the leap into electronic...
"concatenate". Reads and prints out files, primarily.
Can also be used to output more than one file. Thus the name.
kurs9@bioinf1:~$ cat a_file another_file
Lorem Ipsum is simply dummy text of the printing
and typesetting industry. Lorem Ipsum has been the
industry standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and
scrambled it to make a type specimen book.
It has survived not only five centuries,
but also the leap into electronic...
!!Now THIS is the next file.
Prints out the first couple of lines of a file, without waiting for 5 Gbytes of text to scroll through.
kurs9@bioinf1:~$ head -1 a_file
Lorem Ipsum is simply dummy text of the printing
# just 1 line. 10 is default.
Guess...
kurs9@bioinf1:~$ tail -1 a_file
but also the leap into electronic...
Creates a link to a file.
kurs9@bioinf1:~$ touch a_file
kurs9@bioinf1:~$ ln -s a_file a_link_to_a_file
kurs9@bioinf1:~$ ll
total 29k
-rw-r--r-- 1 kurs9 kurs 0 Feb 23 00:43 a_file
lrwxrwxrwx 1 kurs9 kurs 6 Feb 23 00:44 a_link_to_a_file -> a_file
# -s means symbolic link...
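A symbolic link only stores the path of its target. The sketch below (scratch directory, same file names as above) shows what that means when the target disappears:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
echo "original content" > a_file
ln -s a_file a_link_to_a_file

# readlink prints where the link points to; cat follows the link.
link_target=$(readlink a_link_to_a_file)
content_via_link=$(cat a_link_to_a_file)

# Remove the target: the link itself stays, but now points to nothing.
rm a_file
if [ -L a_link_to_a_file ] && [ ! -e a_link_to_a_file ]; then
    link_state="dangling"
else
    link_state="intact"
fi
echo "$link_target / $content_via_link / $link_state"
```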
Creates a copy of a file (or whole subtree).
kurs9@bioinf1:~$ cp a_file b_file
kurs9@bioinf1:~$ ls
a_file b_file
kurs9@bioinf1:~$ cp a_file Dokumente
kurs9@bioinf1:~$ ls Dokumente
a_file
kurs9@bioinf1:~$ ls
a_file b_file
# use -r to copy directories (recursive)
Moves a file to another position in the file tree.
kurs9@bioinf1:~$ touch x_file
kurs9@bioinf1:~$ ls
a_file b_file x_file
kurs9@bioinf1:~$ mv x_file Bilder
kurs9@bioinf1:~$ ls Bilder
x_file
kurs9@bioinf1:~$ ls
a_file b_file
"Moving" a file without changing its location in the file tree is kind of synonymous to renaming it.
kurs9@bioinf1:~$ ls
a_file b_file
kurs9@bioinf1:~$ mv b_file z_file
kurs9@bioinf1:~$ ls
a_file z_file
No explicit "rename" command in Bash.
...or never.
Be VEERY careful with cp, mv or any other write operation.
Linux/Bash presumes you know what you're doing. No questions asked.
Existing files are overwritten mercilessly!
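One possible safety net, sketched with made-up file names: check whether the target exists before writing to it. (Interactively, the --interactive option of cp shown earlier asks before overwriting.)

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
echo "precious results" > results.txt
echo "new data" > new.txt

# Only copy if the target does not exist yet.
if [ -e results.txt ]; then
    echo "results.txt already exists, not overwriting it"
else
    cp new.txt results.txt
fi

kept=$(cat results.txt)
echo "$kept"
```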
The standard output channel. Usually connected to your terminal window.
The standard INput channel. Usually connected to your keyboard.
The standard error channel. Also connected to your terminal, but separated from normal output.
Redirect anything that a command prints out to a file.
kurs9@bioinf1:~$ ls > listing.txt
kurs9@bioinf1:~$ cat listing.txt
a_file
a_link_to_a_file
Bilder
Dokumente
...
Use content of a file as input for something.
kurs9@bioinf1:~$ some_command < listing.txt
# content of listing.txt is used as input
# for program "some_command".
kurs9@bioinf1:~$ cat > listing.txt
bla, bla.
kurs9@bioinf1:~$ cat listing.txt
bla, bla.
# This is a quick and dirty method to type
# something into a file. No input filename given,
# so input for cat is implicitly connected to stdin.
# Output is redirected to file listing.txt.
# Ctrl-D (Strg-D on German keyboards) stops input and thus saves the file.
If a program explicitly outputs error or system messages besides any "normal" output, you can save that to a file, too.
kurs9@bioinf1:~$ some_cmd 2> error.log
# or even:
kurs9@bioinf1:~$ some_cmd 2> error.log > output.txt
# The cmd will create two different files.
If an output file already exists >> will append to it instead of overwriting it.
kurs9@bioinf1:~$ some_cmd 2>> error.log # e.g.
# The error log will continuously grow.
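To try the different redirections without a real program, the sketch below defines a stand-in for some_cmd as a shell function (purely hypothetical) that writes one line to stdout and one to stderr:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# Hypothetical stand-in for "some_cmd": one normal line, one error line.
some_cmd() {
    echo "normal output"
    echo "something went wrong" >&2
}

# First run: create (or overwrite) both files.
some_cmd > output.txt 2> error.log
# Second run: append to both files.
some_cmd >> output.txt 2>> error.log

output_lines=$(wc -l < output.txt)
error_lines=$(wc -l < error.log)
echo "output.txt: $output_lines lines, error.log: $error_lines lines"
```

With ">" instead of ">>" in the second run, each file would hold only one line.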
Imagine a pipe connection from the output of one program to the input side of another one.
Use | for that purpose.
On a German keyboard, find it on the ">/<"-key. Use Alt Gr for the third keyboard level.
echo will just output a character string.
tr will "translate" one character to another in a data stream.
kurs9@bioinf1:~$ echo xaxaxaxa
xaxaxaxa
kurs9@bioinf1:~$ echo xaxaxaxa | tr "x" "u"
uauauaua
kurs9@bioinf1:~$ echo xaxaxaxa | tr -d "x" # delete
aaaa
kurs9@bioinf1:~$ ls -1
a_file
a_link_to_a_file
Bilder
Dokumente
Downloads
kurs9@bioinf1:~$ ls -1 | tr -d "\n"
a_filea_link_to_a_fileBilderDokumenteDownloads
# "\n" is the "new line"-character.
# delete it to get a single line.
# "\t" is a tab, b.t.w.
While Unix/Linux/Mac uses "\n" as line endings, Windows insists on using two characters for the same purpose: "\r\n".
An anachronism dating back to the old teletype era. (60s, 70s?)
Keep this in mind in case something goes "inexplicably" wrong.
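A minimal sketch of how such a file could be repaired with tr, using temporary files with made-up content:

```shell
windows_file=$(mktemp)
unix_file=$(mktemp)

# Two lines with Windows line endings ("\r\n").
printf 'abc\r\ndef\r\n' > "$windows_file"

# Delete every carriage return ("\r"); the "\n" stays.
tr -d '\r' < "$windows_file" > "$unix_file"

bytes_before=$(wc -c < "$windows_file")
bytes_after=$(wc -c < "$unix_file")
echo "before: $bytes_before bytes, after: $bytes_after bytes"
```

The two bytes that vanish are exactly the two "\r" characters.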
A convenient way to output/explore text files.
A little text editor.
"stream editor". Find/replace automatically on data streams.
kurs9@bioinf1:~$ echo "Lorem ipsum" | sed s/"[eu]m"/"ax"/g
Lorax ipsax
# [eu] means e or u
# the "g" means globally, all, not just first hit.
# "s" means "substitute".
A pattern description language implemented in several commands.
Can you come up with a string that matches the following regular expression?
[CK]xy *[0-9][a-z]+A{2,5}B
Would Kxy3ffAB work? Why?
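One way to check candidates is to feed them to grep with extended regular expressions (-E). In this sketch, -x demands that the whole line matches and -q suppresses output; the second and third candidate strings are made up:

```shell
pattern='[CK]xy *[0-9][a-z]+A{2,5}B'
matches=""

for candidate in "Kxy3ffAB" "Kxy3ffAAB" "Cxy  7zAAAAAB"; do
    if printf '%s\n' "$candidate" | grep -E -x -q "$pattern"; then
        echo "match:    $candidate"
        matches="$matches$candidate;"
    else
        echo "no match: $candidate"
    fi
done
```

Kxy3ffAB fails because A{2,5} asks for at least two consecutive "A"s, and it has only one.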
Filtering lines according to search patterns.
kurs9@bioinf1:~$ ll | grep "^d"
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Bilder
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 22 20:30 Dokumente
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Downloads
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Öffentlich
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Schreibtisch
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Videos
drwxr-xr-x 2 kurs9 kurs 4,1k Feb 21 09:07 Vorlagen
# only print lines beginning with "d".
# "-v" inverts ("not containing...")
# "-i" non-case-sensitive search
"Word count". Count lines, characters...
kurs9@bioinf1:~$ echo "abcde" | wc -c #characters
6 # really? count again. Why?
kurs9@bioinf1:~$ ll | wc -l # lines
13
kurs9@bioinf1:~$ echo "abc def" | wc -w # words
2
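About that 6: echo appends a newline character, and wc -c counts it. A quick comparison, using printf to print the string without a trailing newline:

```shell
# echo adds "\n" at the end, so wc sees 5 letters + 1 newline.
with_newline=$(echo "abcde" | wc -c)

# printf '%s' prints the string only, nothing appended.
without_newline=$(printf '%s' "abcde" | wc -c)

echo "with newline: $with_newline, without: $without_newline"
```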
Sort lines alphabetically, numerically (-n), reverse (-r)...
kurs9@bioinf1:~$ ls -1 | sort -r
Vorlagen
Videos
some_file
Schreibtisch
Öffentlich
listing
Downloads
Dokumente
Bilder
a_link_to_a_file
a_file
Eliminate duplicates and optionally count them (-c).
kurs9@bioinf1:~$ echo "a,u,s,b,u,h,b"| tr "," "\n" | sort | uniq -c
1 a
2 b
1 h
1 s
2 u
# only in successive lines
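That is why sort comes first in the pipe. A small sketch of the difference, using the same letters as above:

```shell
# Without sort: no two equal letters are neighbours, so nothing is removed.
unsorted_count=$(printf 'a\nu\ns\nb\nu\nh\nb\n' | uniq | wc -l)

# With sort: duplicates end up next to each other and collapse.
sorted_count=$(printf 'a\nu\ns\nb\nu\nh\nb\n' | sort | uniq | wc -l)

echo "without sort: $unsorted_count lines, with sort: $sorted_count lines"
```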
Add line numbers.
kurs9@bioinf1:~$ echo "pi,pa,po"| tr "," "\n" | nl
1 pi
2 pa
3 po
Find files in the file tree.
find /usr -type f -name "ls"
/usr/lib/klibc/bin/ls
# search in /usr and subtrees
# for f(iles)
# with the name "ls"
Very versatile command. Can find anything in the file tree: names, patterns, sizes, access times...
Can also call commands on its findings with the -exec option.
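A hedged sketch of a name pattern plus -exec, on a made-up folder structure in a scratch directory:

```shell
sandbox=$(mktemp -d)
mkdir -p "$sandbox/reads/run1"
touch "$sandbox/reads/a.fasta" "$sandbox/reads/run1/b.fasta" "$sandbox/reads/notes.txt"

# Find all regular files whose name ends in .fasta, in all subfolders.
fasta_count=$(find "$sandbox" -type f -name "*.fasta" | wc -l)
echo "$fasta_count"

# Run a command on each hit: {} is replaced by the file found,
# \; marks the end of the command to execute.
find "$sandbox" -type f -name "*.fasta" -exec wc -c {} \;
```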
Dissect tabular data.
kurs9@bioinf1:~$ cat > tabular
abc;def
ghi;jkl
mno;pqr # now press Ctrl-D (Strg-D) to save.
kurs9@bioinf1:~$ cat tabular
abc;def
ghi;jkl
mno;pqr
kurs9@bioinf1:~$ cat tabular | cut -d ";" -f 2
def
jkl
pqr
# -d = delimiter, -f = field number
Variables can:
kurs9@bioinf1:~$ mynumber=2
kurs9@bioinf1:~$ echo $mynumber # note the "$" character
2
kurs9@bioinf1:~$ mynumber=$mynumber*2
kurs9@bioinf1:~$ echo $mynumber
2*2 # ?!
kurs9@bioinf1:~$ let mynumber=$mynumber*2
kurs9@bioinf1:~$ echo $mynumber
8 # 2*2*2
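An alternative to let is arithmetic expansion with "$(( ))", which calculates instead of gluing strings together. A minimal sketch:

```shell
mynumber=2
# Inside $(( )), variable names are evaluated as numbers.
mynumber=$((mynumber * 2))
mynumber=$((mynumber * 2))
echo "$mynumber"
```

Bash arithmetic works on integers only; for decimals you would need another tool.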
Variable names preceded by "$" are replaced by the contents of the variable.
kurs9@bioinf1:~$ a=xy
kurs9@bioinf1:~$ echo $a
xy
kurs9@bioinf1:~$ echo $abc
kurs9@bioinf1:~$ echo ${a}bc
xybc
kurs9@bioinf1:~$ b="xy"
kurs9@bioinf1:~$ echo "b holds $b"
b holds xy # variable is interpreted
kurs9@bioinf1:~$ echo 'b holds $b'
b holds $b # $b is taken literally!
Single quotes are more "restrictive" than double quotes.
kurs9@bioinf1:~$ a=$(echo -n "abcabcabc" | tr -d "a" | wc -c)
kurs9@bioinf1:~$ echo $a
6 # why?
"$()" is evaluated to the result of the command(line) it contains.
Some variables are stored in the "environment" to make Bash work.
kurs9@bioinf1:~$ echo $PATH
/home/kurs9/bin:/home/kurs9/.local/bin:/usr/local/sbin:
/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:
/usr/local/games:/snap/bin
# PATH holds the list of paths the Bash searches for commands in.
# You could redefine it like this:
kurs9@bioinf1:~$ export PATH=$PATH:/home/kurs9/folder1
# Now there's an additional place to look for commands in.
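To see the effect, the sketch below puts a tiny script into a scratch directory, appends that directory to PATH and then calls the script by name only. The name kurs_demo_tool is made up for the example.

```shell
tooldir=$(mktemp -d)

# A two-line script that just prints a greeting.
printf '#!/bin/bash\necho "hello from kurs_demo_tool"\n' > "$tooldir/kurs_demo_tool"
chmod u+x "$tooldir/kurs_demo_tool"

# Append the directory to the search path.
export PATH="$PATH:$tooldir"

# Now the script is found without typing its path.
tool_output=$(kurs_demo_tool)
tool_location=$(command -v kurs_demo_tool)
echo "$tool_output (found in $tool_location)"
```

The change only lasts for the current shell session; a new login starts with the original PATH again.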
The construct to use if you need to iterate over lists of items.
kurs9@bioinf1:~$ for i in a b c d; # press Return here
> do echo "Do unbelievably complex bioinformatics task with sequence file ${i}.fasta";
> done # multi-line command finished.
Do unbelievably complex bioinformatics task with sequence file a.fasta
Do unbelievably complex bioinformatics task with sequence file b.fasta
Do unbelievably complex bioinformatics task with sequence file c.fasta
Do unbelievably complex bioinformatics task with sequence file d.fasta
Lists to iterate over can also be generated by commands...
kurs9@bioinf1:~$ for j in $(seq 1 1000);do echo -n $j" ";done
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 ...
"seq from to" generates sequences of numbers.
Lists to iterate over can also be generated by "wildcards"...
kurs9@bioinf1:~$ for k in D*;do echo -n $k" ";done
Dokumente Downloads
# everything that starts with a capital "D".
"*" means "filename parts of arbitrary length and syntax". a*b lists every file with a name like "arghb" or "axb", OR just "ab"!
"While some condition evaluates to true, do something. Stop if it doesn't (anymore)."
# remember "tabular"?
kurs9@bioinf1:~$ while read f; do echo $f";xyz";done < tabular
abc;def;xyz
ghi;jkl;xyz
mno;pqr;xyz
# very often used in combination with read
read var reads an element from a datastream and stores it in a variable.
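read can also split a line into several variables. In this sketch the separator is set to ";" via IFS just for the read command, and -r keeps backslashes literal; the variable names first and second are arbitrary:

```shell
tabular=$(mktemp)
printf 'abc;def\nghi;jkl\nmno;pqr\n' > "$tabular"

# Each line is split at ";" into two variables, then printed in swapped order.
swapped=$(while IFS=";" read -r first second; do
    echo "$second;$first"
done < "$tabular")

echo "$swapped"
```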
A script is like a little program.
It's just a textfile holding a sequence of Bash commands that can be executed, like a new command.
Use nano to save the following into the file myscript.
#!/bin/bash
while read f
do
echo $f";xyz"
done < $1
To make it run we need to allow it to be executed.
Use chmod for this ("change mode").
kurs9@bioinf1:~$ ll myscript
-rw-r--r-- 1 kurs9 kurs 54 Feb 23 13:45 myscript
kurs9@bioinf1:~$ chmod a+x myscript # add "x" for "a"ll.
kurs9@bioinf1:~$ ll myscript
-rwxr-xr-x 1 kurs9 kurs 54 Feb 23 13:45 myscript
To run it we have to explicitly start it in our current directory.
kurs9@bioinf1:~$ ./myscript
./myscript: line 5: $1: ambiguous redirect
# "$1" is the first argument after the command name
# we have to specify an input file
kurs9@bioinf1:~$ ./myscript tabular
abc;def;xyz
ghi;jkl;xyz
mno;pqr;xyz
# great.
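A slightly more defensive variant of the script could check its arguments first and print a usage hint instead of the cryptic redirect error. This is only a sketch of one possible approach, written into a scratch directory:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# Write the script; the quoted 'EOF' keeps $1, $0 and $f from being expanded now.
cat > myscript <<'EOF'
#!/bin/bash
# Refuse to run without exactly one input file.
if [ $# -ne 1 ]; then
    echo "usage: $0 inputfile" >&2
    exit 1
fi
while read -r f
do
    echo "$f;xyz"
done < "$1"
EOF
chmod a+x myscript

printf 'abc;def\n' > tabular
script_output=$(./myscript tabular)
echo "$script_output"

# Without an argument the script stops with exit status 1.
no_arg_status=0
./myscript 2> /dev/null || no_arg_status=$?
echo "exit status without argument: $no_arg_status"
```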
Files are often redundant. You could save "aaaaaaaaa" just as well in the form "9a", which takes up much less disk space.
Real compression tools are a little more sophisticated than that, but in principle this is called run-length encoding.
To compress data you can use e.g. "GNU Zip".
Will compress a file and rename it to file.gz.
Can also work on data streams.
kurs9@bioinf1:~$ ll tabular
-rw-r--r-- 1 kurs9 kurs 24 Feb 23 12:49 tabular
kurs9@bioinf1:~$ gzip tabular
kurs9@bioinf1:~$ ll tabular*
-rw-r--r-- 1 kurs9 kurs 52 Feb 23 12:49 tabular.gz
# doesn't seem to be very redundant...
for i in $(seq 1 10000);do echo -n "a";done | gzip > aaa.gz
kurs9@bioinf1:~$ ll aaa.gz
-rw-r--r-- 1 kurs9 kurs 46 Feb 23 14:01 aaa.gz
# better: 46 bytes, not 10 kB
Will unpack a gzipped file and rename it accordingly.
kurs9@bioinf1:~$ gunzip aaa.gz
kurs9@bioinf1:~$ ll
total 62k
-rw-r--r-- 1 kurs9 kurs 10k Feb 23 14:01 aaa
You can use zcat to "cat" gzipped files without unpacking them.
kurs9@bioinf1:~$ gzip aaa
kurs9@bioinf1:~$ zcat aaa.gz
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa....
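If you want to keep the original next to the compressed copy, gzip can write to stdout with -c. The sketch below assumes a scratch directory and uses "gzip -dc", which decompresses to stdout much like zcat:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# 10000 times the letter "a" (zero bytes translated to "a").
head -c 10000 /dev/zero | tr '\0' 'a' > aaa

# -c writes the compressed data to stdout, so "aaa" itself is kept.
gzip -c aaa > aaa.gz

original_size=$(wc -c < aaa)
compressed_size=$(wc -c < aaa.gz)
# Decompress to stdout and count: nothing got lost.
restored_size=$(gzip -dc aaa.gz | wc -c)

echo "original: $original_size, compressed: $compressed_size, restored: $restored_size"
```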
Just for completeness' sake: tar can create archives of many files or even whole subtrees and optionally zip them.
# pack folder1:
tar --remove-files -czf arch.tar.gz folder1
# unpack:
tar -xzf arch.tar.gz
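A complete round trip might look like the following sketch (scratch directory, made-up content). It leaves out --remove-files so the original folder survives, lists the archive with -t and unpacks into a separate folder with -C:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir folder1
echo "some data" > folder1/data.txt

# c = create, z = gzip, f = archive file name
tar -czf arch.tar.gz folder1

# t = list the content without unpacking
tar -tzf arch.tar.gz

# x = extract, -C = unpack into this (existing) folder
mkdir unpacked
tar -xzf arch.tar.gz -C unpacked

restored=$(cat unpacked/folder1/data.txt)
echo "$restored"
```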
You can email text data easily to anywhere.
cat some_data.txt | mail -s "Look I got data" \
    -r test@cos.uni-heidelberg.de \
    mkiefer@cos.uni-heidelberg.de
cp between different computers, based on ssh.
# one file from here to there
scp datafile user1@somewhere.de:/to/this/folder
# a whole subtree from here to there
scp -r some/datafolder user1@somewhere.de:/to/this/other/folder/newname
# one file from there to somewhere else
scp user1@thismachine.de:/this/file user@thatmachine.de:/there/newname
A convenient program to use scp and sftp on any OS is Filezilla.
Very easy sequence format.
The name is derived from a very old bioinformatics software package, long forgotten. The format remains.
FASTA looks like this:
>sequence1. It is DNA. I extracted it from a plant.
tgactagcatgctactacagcgtagcatctagctacgactatctagcatcatc
acgatgtgcggcgcggtaataatagcgctaggctcgtagcagcgagaagagg
>sequence2. Whatever.
tgctacatgcgcgcgcgcgcgcgccacgacatgggcgcgcgcgcgcgcgcga
gatgctacgatcgtagcg
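Because every record starts with ">", the tools from the previous sections are enough for a first look at a FASTA file. A sketch with a tiny made-up file:

```shell
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>sequence1
tgactagc
acgatg
>sequence2
tgctac
EOF

# Number of sequences = number of header lines.
sequence_count=$(grep -c "^>" "$fasta")

# Number of bases = all characters in non-header lines, newlines removed.
total_bases=$(grep -v "^>" "$fasta" | tr -d '\n' | wc -c)

echo "$sequence_count sequences, $total_bases bases"
```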
The format for NGS raw data. This is what an Illumina sequencer will give you.
Similar to FASTA with an additional Quality line.
Numerical quality values are encoded according to the ASCII table. See FastQ on Wikipedia.
@GGR-22:420:HN55KBCX2:2:1101:1233:2068 1:N:0
CTAGTGTCACTTGATAACGAAACTCTTTGGCATGAAAGACTAGGTCACATAAATTTT
AAGGACGTGGTGAGAGGTGTTCCTAAATTGGTTTTTAAAGAAAACATTATTTGTGGA
AGAGCCCCCCATACGAACTTAACCCACGTAGGTACAAAACGGCCTTTATAATTATTG
CGTCAAC
+
DDDDDIHHHHHHIH?CGHHCGHDDCFHGEHHIFHHIIHHHEHHHCC?FH?DHIEHHF
HH1HHC/?CECC?EC1EF?D?FHHGHF?EGECC1CEHHFCHHIHIIHHHGHIIIIHI
E1?G?G?HHDHHIHHH?E?HEHCFHHIGH.EEHCEHH..FC.BEAB?E....CF..B
,BB?BBB
@GGR-22:420:HN55KBCX2:2:1101:1173:2213 1:N:0
GAGGTT...
One Illumina read with its quality annotation and the start of the next one (of like a billion). FastQ files are usually gzipped because of their extreme size.
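To give an idea of the encoding: assuming the common "Phred+33" variant (an assumption, older Illumina pipelines used a different offset), the quality value is the ASCII code of the character minus 33. A sketch for the first quality character above:

```shell
quality_char="D"

# printf '%d' with a leading single quote prints the character's ASCII code.
ascii_code=$(printf '%d' "'$quality_char")

# Phred+33: subtract the offset to get the numerical quality value.
phred_score=$((ascii_code - 33))

echo "'$quality_char' = ASCII $ascii_code = quality $phred_score"
```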
"Sequence Alignment / Map"
@HD VN:1.6 SO:coordinate # SO = sorting order
@SQ SN:ref LN:45 # LN = length of reference, SN = ref. name
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
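SAM files are tab-separated text, so grep and cut work here as well. The sketch below rebuilds the header and the first two alignment lines from above with explicit tabs, and pulls out read name (column 1), position (column 4) and CIGAR string (column 6):

```shell
sam=$(mktemp)
printf '@HD\tVN:1.6\tSO:coordinate\n' > "$sam"
printf '@SQ\tSN:ref\tLN:45\n' >> "$sam"
printf 'r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\n' >> "$sam"
printf 'r002\t0\tref\t9\t30\t3S6M1P1I4M\t*\t0\t0\tAAAAGATAAGGATA\t*\n' >> "$sam"

# Header lines start with "@"; drop them, then keep columns 1, 4 and 6.
# Tab is the default delimiter of cut.
alignments=$(grep -v "^@" "$sam" | cut -f 1,4,6)
echo "$alignments"
```

For real work on large alignment files dedicated tools are the usual choice; this only shows that the format is plain text.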
##fileformat=VCFv4.0 #Mandatory!
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
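The FORMAT column tells you the order of the colon-separated values in each sample column. As a sketch, the genotype quality (GQ) of the second sample above could be picked out like this:

```shell
format="GT:GQ:DP:HQ"
sample="1|0:48:8:51,51"

# Find the position of "GQ" in the FORMAT string:
# one key per line, grep -n prefixes the line number.
gq_index=$(printf '%s\n' "$format" | tr ':' '\n' | grep -n -x "GQ" | cut -d ":" -f 1)

# Use that position to cut the matching value out of the sample column.
genotype_quality=$(printf '%s\n' "$sample" | cut -d ":" -f "$gq_index")
genotype=$(printf '%s\n' "$sample" | cut -d ":" -f 1)

echo "genotype $genotype, genotype quality $genotype_quality"
```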