This document explains the basics of working with tabular data, and shows how to add, remove data from a table.
Populate table with data:
from tap import *
# create table with two columns, x and y both of float type
tab=Tab(['x', 'y'], 'ff')
for x in range(1000):
tab.add_row([x, x**2])
# create a plot
plt=tab.plot('x', 'y', save='x-vs-y.png')
Iterating over table items:
# load table from file
tab=load(...)
# iterate over all rows
for row in tab.rows:
# print complete row
print row
for f in tab.foo:
print f
# iterate over all rows of selected columns
for foo, bar in tab.zip('foo','bar'):
print foo, bar
Tables can either be initialized with information from memory, or populated with data from load, e.g. CSV files or a pickled dump of a previously constructed table. An empty table can be easily constructed as follows
tab=Tab()
If you want to add columns directly when creating the table, column names and column types can be specified as follows
tab=Tab(['nameX','nameY','nameZ'], 'sfb')
this will create three columns called nameX, nameY and nameZ of type string, float and bool, respectively. When the second argument is omitted, the columns will all have a string type. There will be no data in the table and thus, the table will not contain any rows.
If you want to add data to the table in addition, use the following:
tab=Tab(['nameX','nameY','nameZ'],
'sfb',
nameX=['a','b','c'],
nameY=[0.1, 1.2, 3.414],
nameZ=[True, False, False])
If values for one column are omitted, they will be filled with NA, but if values are specified, all values must be specified (i.e. same number of values per column).
tab = Tab(['name, age'], 'string,float')
Table columns have a specific type, e.g. string, float etc. Each cell in a column must either be of that type or set to not available (None). As a result, a float column can’t contain string values. The following column types exist:
long name | abbreviation |
---|---|
string | s |
float | f |
int | i |
bool | b |
When adding new data to the table, values are automatically coerced (forced) to the column type. When coercing fails, a ValueError is thrown.
The column types can be specified when initialing a new table. For convenience, several different formats are supported, which allow to specify the column types as strings, or list using long, or abbreviated forms. The following 5 examples initialise an empty table with a string, float, int and bool column each.
# abbreviated, compact form
tab = Tab(['x', 'y', 'z', 'u'], 'sfib')
# abbreviated, separated by coma
tab = Tab(['x', 'y', 'z', 'u'], 's, f, i, b')
# extended separated by coma
tab = Tab(['x', 'y', 'z', 'u'], 'string, float, int, bool')
# list abbreviated
tab = Tab(['x', 'y', 'z', 'u'], ['s', 'f', 'i', 'b'])
# list extended
tab = Tab(['x', 'y', 'z', 'u'], ['string', 'float', 'int', 'bool'])
For the lazy, the table supports guessing the column type from data when initialising a new table. The detection of column types tries to convert each value to a particular type, e.g. int. When the type conversion is not successful for any value, the column type is set to string. As a special case, when the data arrays are empty, the array types are set to string.
# initialises a table with an bool and int column
t = Tab(['x','y'], x='True False False'.split(), y='1 NA 3'.split())
print t.col_types # bool int
The following methods allow to add and remove data in a row and column-wise manner.
Add a row to the table.
data may either be a dictionary or a list-like object:
- If data is a dictionary the keys in the dictionary must match the column names. Columns not found in the dict will be initialized to None. If the dict contains list-like objects, multiple rows will be added, if the number of items in all list-like objects is the same, otherwise a ValueError is raised.
- If data is a list-like object, the row is initialized from the values in data. The number of items in data must match the number of columns in the table. A ValuerError is raised otherwise. The values are added in the order specified in the list, thus, the order of the data must match the columns.
If overwrite is not None and set to an existing column name, the specified column in the table is searched for the first occurrence of a value matching the value of the column with the same name in the dictionary. If a matching value is found, the row is overwritten with the dictionary. If no matching row is found, a new row is appended to the table.
Parameters: |
|
---|---|
Raises : | ValueError if list-like object is used and number of items does not match number of columns in table. |
Raises : | ValueError if dict is used and multiple rows are added but the number of data items is different for different columns. |
Example: add multiple data rows to a subset of columns using a dictionary
# create table with three float columns
tab = Tab(['x','y','z'], 'fff')
# add rows from dict
data = {'x': [1.2, 1.6], 'z': [1.6, 5.3]}
tab.add_row(data)
print tab
'''
will produce the table
==== ==== ====
x y z
==== ==== ====
1.20 NA 1.60
1.60 NA 5.30
==== ==== ====
'''
# overwrite the row with x=1.2 and add row with x=1.9
data = {'x': [1.2, 1.9], 'z': [7.9, 3.5]}
tab.add_row(data, overwrite='x')
print tab
'''
will produce the table
==== ==== ====
x y z
==== ==== ====
1.20 NA 7.90
1.60 NA 5.30
1.90 NA 3.50
==== ==== ====
'''
Add a column to the right of the table.
Parameters: |
|
---|
Example:
tab=Tab(['x'], 'f', x=range(5))
tab.add_col('even', 'bool', itertools.cycle([True, False]))
print tab
'''
will produce the table
==== ====
x even
==== ====
0 True
1 False
2 True
3 False
4 True
==== ====
'''
If data is a constant instead of an iterable object, it’s value will be written into each row:
tab=Tab(['x'], 'f', x=range(5))
tab.add_col('num', 'i', 1)
print tab
'''
will produce the table
==== ====
x num
==== ====
0 1
1 1
2 1
3 1
4 1
==== ====
'''
As a special case, if there are no previous rows, and data is not None, rows are added for every item in data.
Remove column with the given name from the table
Parameters: | col (str) – name of column to remove |
---|
Rename column old_name to new_name.
Parameters: |
|
---|---|
Raises : | ValueError when old_name is not a valid column |
Append each row of tab to the current table. The data is appended based on the column names, thus the order of the table columns is not relevant, only the header names.
If there is a column in tab that is not present in the current table, it is added to the current table and filled with None for all the rows present in the current table.
If the type of any column in tab is not the same as in the current table a TypeError is raised.
If overwrite is not None and set to an existing column name, the specified column in the table is searched for the first occurrence of a value matching the value of the column with the same name in the dictionary. If a matching value is found, the row is overwritten with the dictionary. If no matching row is found, a new row is appended to the table.
Returns a new table containing the data from both tables. The rows are combined based on the common values in the column(s) by. The option ‘by’ can be a list of column names. When this is the case, merging is based on multiple columns. For example, the two tables below
x | y |
---|---|
1 | 10 |
2 | 15 |
3 | 20 |
x | u |
---|---|
1 | 100 |
3 | 200 |
4 | 400 |
when merged by column x, it produces the following output:
x | y | u |
---|---|---|
1 | 10 | 100 |
2 | 15 | None |
3 | 20 | 200 |
4 | None | 400 |
Convert the table into a string representation.
The output format can be modified for int and float type columns by specifying a formatting string for the parameters ‘float_format’ and ‘int_format’.
The option ‘rows’ specify the range of rows to be printed. The parameter must be a type that supports indexing (e.g. a list) containing the start and end row index, e.g. [start_row_idx, end_row_idx].
Parameters: |
|
---|
Performs an in-place sort of the table, based on column by.
Parameters: |
|
---|
Checks if a table is empty.
If no column name is specified, the whole table is checked for being empty, whereas if a column name is specified, only this column is checked.
By default, all NAN (or None) values are ignored, and thus, a table containing only NAN values is considered as empty. By specifying the option ignore_nan=False, NAN values are counted as ‘normal’ values.
Extract a list of all unique values from one column
Parameters: |
|
---|
Checks if the column with a given name is present in the table.
The data in a table can be iterated row and column-wise.
Allows to conveniently iterate over a selection of columns, e.g.
tab=Tab.load('...')
for col1, col2 in tab.zip('col1', 'col2'):
print col1, col2
is a shortcut for
tab=Tab.load('...')
for col1, col2 in zip(tab['col1'], tab['col2']):
print col1, col2
Same as zip(), but only returns rows where none of the values is None.
Returns a filtered table only containing rows matching all the predicates in kwargs and args For example,
tab.filter(town='Basel')
will return all the rows where the value of the column “town” is equal to “Basel”. Several predicates may be combined, i.e.
tab.filter(town='Basel', male=True)
will return the rows with “town” equal to “Basel” and “male” equal to true. args are unary callables returning true if the row should be included in the result and false if not.
Returns a list of column names matching the regex
Parameters: | regex (str) – regex pattern |
---|---|
Returns: | list of column names (str) |