Data, and Data Types


Information about the world

When we look at the world, we can see things like buildings, trees, rivers, and rocks. We can watch birds flying in the sky, fish swimming in a lake, cars and buses driving past on a road. We can hear the sounds of flowing water, branches rustling in the wind, waves rolling pebbles on a beach. We can smell the salty air near the sea, or the smoky air near a fire. We can feel the warmth or cold in the air or the water.

We can appreciate these things, enjoy them - or not, in some cases. Depending on what we see, hear, smell, or feel, we can make choices, take decisions, or take action. That’s because everything we can see, hear, smell, or feel is information about the world. It’s data. How we feel about it, or what we do about it - that’s because our brains are taking that data, perhaps combining it with other information that we already know, and processing it.

We can see a bus driving down the road, and decide that now is really not a good time to start crossing the road. We can hear the sound of raindrops hitting windows, and put a coat on before opening the door to go outside. We can smell burning food, and rush to the cooker to pull a pot off the heat [1]. We can see a ball flying towards us, calculate where and when it’s going to be close to us, and put out a hand to catch it.

All of this, we often take for granted, but it’s really astonishing just how powerful our brains are, and how much information they can process so quickly.

That includes information about what’s happening in our environment. We can identify different kinds of rocks, and even use small changes to figure out the geological history of an area - even over hundreds of millions of years: that’s how the geological timescale was first constructed back in the 1800s. We can see a cloudless sky, feel the warmth in the air, and remember smelling smoke in the air from a forest on a previous similar day. We can identify species of animals and plants, hear their calls, see signs of disease. We can write these observations down and track them over days, weeks, months, years, and longer: to figure out what weather conditions make wildfires more likely, and where; to record how populations of plants and animals are affected as towns and cities grow, or as farming practices change; to notice that we’re having to plant crops a couple of weeks later these days, that October isn’t really as cold as it used to be, or that we’re having more storms per year than we did in the past. With just our own senses, we can discover an astonishing amount about the natural world - and we have.

[1] I’m better at geography than cooking.

Sensors

However astonishing it might be though, our senses have limits. We can see red, green, and blue light, anything in between, or any combination of them; but we can’t see infrared or X-rays, or any objects smaller than about 0.1 mm, and we can’t say exactly how tall a tree or building is. We can’t hear sounds below 20 Hz or above 20 kHz. We can’t smell how much carbon dioxide is in the air - and even things we can smell, like smoke, only give us a very rough estimate of how much we’re smelling. We can’t even feel if something is wet (really! When we feel something and think it’s wet, we’re actually just judging that based on texture and temperature, not the wetness itself).

We’re also limited by where we are. We can only see what’s in front of us, as far as the horizon; closer, if something’s in the way. We can hear from all directions, but not very far. We might be only just able to make out what a friend 180 m away is saying, but no further - and not even that far if it’s noisy around us. We can’t smell something if the wind is blowing it away from us. We can only feel what we can physically touch. And that’s assuming all of our senses are working in perfect condition.

But we’re not limited by our own senses. We can feel the cold, hear the rain, and see the branches swaying in the wind - and use a thermometer to measure the temperature, a gauge to measure the rainfall, and an anemometer and vane to measure the wind speed and direction. We can add a barometer to measure the air pressure and a laser rangefinder to measure the height of the tree - and do the same elsewhere to see how these all change over distance. We can measure the carbon dioxide, sulphur dioxide, and particulate pollution in the air; track how water level rises and falls in rivers after certain amounts of rainfall in different places; use infrared cameras and more to see what our eyes can’t, and put them on satellites to see more at once than we ever possibly could.

Advances in technology and computing over the past couple of decades have brought us to a point where the level of information about the natural world we can collect as data makes the word “astonishing” seem just… too ordinary. Sensors for recording various kinds of data can now be mass produced at costs low enough to put them wherever they might be useful. Individual sensors for measurements like temperature, air pressure, and humidity can cost just pennies. They need to be combined with electronics to power and read them, but even so, instruments which were once only available to professional organisations with significant budgets can now be easily obtained by almost everyone. Want your own weather station? You can buy a solar powered one that will wirelessly upload all its data to a free online dashboard for less than €100. A seismograph to monitor earthquakes can cost as little as €160. Even where the costs of equipment are still only within the budgets of large organisations - satellites, for example - high-speed internet means that anyone and everyone can have the same access to their data as the professionals.

Geospatial data

Crucially, for information about the environment, this has included advances in determining where things are. For thousands of years, people have been making maps, but the launch of the GPS satellite network starting in 1978 fundamentally changed everything - and the falling costs have long since made it available to everyone.

Now, almost all of us have smartphones in our pockets which can locate where we are on the surface of the Earth to within 5 metres - and show us a satellite image and map of what’s around us. Professional equipment can get that down to just one centimetre or even better - and that’s only ‘professional’ equipment because most people don’t need it, not because it’s too expensive.

Because it’s now so cheap and easy to record a location, we don’t just have data about the natural world: we have geospatial data. It’s not just that a thermometer can measure the temperature; we can also now easily record precisely where and when that temperature was recorded, and map that with similar readings in many different places over long periods of time - along with whatever other data we want.

We’re not really limited anymore by the technology. We’re pretty much at a point where almost anyone can collect or find geospatial data for almost anything they want to. To put it simply, in terms of what we might want to know about the natural world, it’s not about what geospatial information we’re able to record anymore - it’s about what geospatial information we want to record.

Processing

Just like our senses, our brains have limits too, and this amount of data is just far too much for our brains to process. Spot the patterns in pressure, temperature, and humidity changes? Maybe, but if you have one reading per second, and several years of data? From multiple locations? It’s just too much.

That’s where computers can come in - and what modern computers can do is almost overwhelming. The computer I’m writing this on can do something in the range of 76,728,000,000 calculations per second. Whatever data can be collected by sensors or people, computers can analyse faster and in much more detail than any person ever could.

Just like with data collection, it’s also not really about what’s possible to do - sure, there’s some limits there, but those limits are now so far away that it’s really about what you want to do.

For all of you taking this module, there will be different aspects of Geography that you’re more or less interested in. Some might have more interest in whether particular places will be at risk of flooding in the future, while others might have more interest in looking at biodiversity of places, or how buildings and roads might be changing temperature patterns through the urban heat island effect, or whether existing bus routes are enough to serve the population living in particular places. I’m sure some of you will be thinking “I’m not interested in any of those”, but will have completely different aspects of Geography you’re interested in. That’s fine, and that’s not meant to be a complete list.

So it would be a bit of a challenge to cover everything in the remaining weeks of the module, to put it mildly - so, rather than trying to teach you how to do everything, my main aim for this part of the module is to show you what is possible.

Geospatial data analysis means using computers to process geolocated data in order to better understand what is happening in the world. There’s two aspects to this.

The first aspect is that computers need to be told what to do - which we call programming. This can be done by writing instructions in a form the computer can understand - which we’d call a programming language. Of course, a lot of the time, the instructions have already been written out and packaged by other people into a computer programme - also known as software, or as an app. To read this, you had to at least open a browser and click on a link. That means you used a basic set of instructions (which control how all the various components of the computer work together) called an operating system. You used this to turn on the computer, log in, and run a second set of instructions for how to communicate with other computers over the internet, and how to display the data received - that’s your web browser, maybe Chrome or Firefox or Edge. One of those bits of data received was a link to this page, and the programme instructions told the computer to get the data for this page from where it’s stored, and how to show that on your screen - the size of the text, the colour of the background, and much more.

For analysis of geospatial data, we can write instructions ourselves in a programming language, such as Python, Julia, Rust, or R. Or, we can use instructions previously written by other people with software such as QGIS or ArcGIS. There’s even instructions which form web pages you can use in a browser, like Felt, Earth Engine, and of course Colab - which is what you’re using now. These are your tools. Which tool you need or want to use depends on what you want to do, and how you want to do it. For the main exercise for this module, you can choose to use QGIS or ArcGIS Pro or Python, depending on whichever you prefer.

The second aspect is that the data needs to be in a form which the computer can understand and use. This generally means giving the data to the computer in a file or format which matches a format already in its instructions - just like how Word understands how to open and display a written document, while Excel knows how to open and display a spreadsheet.

However, it’s also about the different things that computers can do with different kinds of data.

In order to understand how best to do that, it’s important to first understand how different information about the real world can be represented as different types of data - and how computers handle data of different types. So, that’s what this Notebook will explore.

This Notebook, as all the notebooks in this module, uses the Python programming language, but much the same will apply to other programming languages and software.

1. Computers handle text and numbers differently.

With numbers, if we try to add 2 + 2, we get 4.

2 + 2

4

We know what numbers are, of course, we take that for granted. But how does a computer know? A keyboard has a bunch of different keys, each with a different squiggle on it. The computer doesn’t see the keyboard, even - when you press a letter or number key, it sends a particular message to the computer to display the relevant squiggle on the screen.

It’s basic enough for software to have instructions for the computer on which of these signals or squiggles represent numbers, and which represent letters or symbols - but at the end of the day, they’re all just squiggles. The proper term would be characters. We have a set of characters that we use to count - the numbers - and a set of characters we use to read and write - the letters.

But it’s not completely that simple. If we want to write what we’ve counted, we can use the number characters; but we can also use letters to spell out the numbers. Sometimes, letter characters can even be used as parts of numbers. Other times, we’ll use the number characters as if they’re letters or symbols, rather than for counting.

This means that there’s times when the number characters can be treated as letters.

'2' + '2'

'22'

This isn’t a mathematical error - it’s no different than if we tried:

"two" + "two"

'twotwo'

A series of text characters like this is referred to as a string in computing. Strings will be common enough in geospatial data, particularly for labels - for example placenames, species names of plants or animals, and categories of roads or paths.

For string data, Python “adds” them by simply putting them together, as you see above. Hence '2' + '2' gives '22': it’s just putting the characters together.

You might be thinking “OK, but it’s easy enough to remember to type a number without putting quotes around it, since I don’t usually do that”.

Sure, if you’re typing it out, that’s easy.

But remember that when you’re trying to process environmental data, you’re not usually typing the data - you’re usually bringing in a dataset generated by sensors, or from other sources, and that data isn’t always perfect. It’s important to make sure that if you want the data to be able to be processed as numbers, it’s stored in a form which will be recognised as numbers; and if you want the data to be able to be processed as a string of text characters, it’s stored in a form which can be recognised as a string of text characters.
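
As a minimal sketch of what that means in practice, here are two values that arrived as text, and the difference converting them makes. The names raw_readings, as_text, as_numbers, and label are just ones I’ve made up for this illustration, and the values are invented.

```python
# Values as they might arrive from a text file: every entry is a string.
raw_readings = ["2", "2"]  # hypothetical example values

# Adding the strings joins the characters together.
as_text = raw_readings[0] + raw_readings[1]

# Converting each string to an integer first gives ordinary arithmetic.
as_numbers = int(raw_readings[0]) + int(raw_readings[1])

print(as_text)     # 22
print(as_numbers)  # 4

# str() converts the other way, from a number to text.
label = "Reading: " + str(as_numbers)
print(label)       # Reading: 4
```

int() and str() are built into Python; int() will raise an error if the text can’t be read as a whole number, which is often exactly the warning you want.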

Here’s a quick example using the pandas library (we’ll cover what I mean by a library, and what pandas is, a bit later; first, we need to import it to use it).

import pandas

If we import data from a spreadsheet or some other data source which is all numbers, that’s fine.

pandas.Series([1, 2, 3, 4])

0    1
1    2
2    3
3    4
dtype: int64

dtype is ‘data type’. The int part of int64 means the data is being treated as an integer - a whole number, without any decimal places, like 1, 2, 3, and 4. That’s the correct data type for this set of values.

Integers are also a common data type in geospatial data. For example, you’d use integers for anything you’d want to count, like the number of floors in a building, the number of people living in an area, or the number of bicycles passing a particular point over a certain time interval. You would also use integers for anything which can be put in an ordered sequence, for example the stream order of rivers. Some sensors will also produce certain values as integers, for example if they’re recording parts per million concentrations or the like where decimal places wouldn’t be meaningful given the accuracy of the sensor.

But, say the sensor wasn’t able to get a reading once, and so one of the values has an error.

pandas.Series([1, 2, 'NA', 4])

0     1
1     2
2    NA
3     4
dtype: object

Now, dtype: object is telling you that the data is being treated as objects. In this case, it means it knows there’s different kinds of data there - some numbers and some text. For large datasets, one of the problems this creates is that the data type for each entry has to be stored separately, which takes up a lot more memory.

More importantly, if you wanted to do some processing on the data, it would work in some cases, but not all cases. For example, if you wanted to take an average of a couple of readings - well, that’s going to produce an error.

pandas.Series([1, 2, 'NA', 4]).mean()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 pandas.Series([1, 2, 'NA', 4]).mean()

File ~/.pyenv/versions/3.11.6/envs/GY4006/lib/python3.11/site-packages/pandas/core/series.py:6549, in Series.mean(self, axis, skipna, numeric_only, **kwargs)
   6541 @doc(make_doc("mean", ndim=1))
   6542 def mean(
   6543     self,
   (...)
   6547     **kwargs,
   6548 ):
-> 6549     return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)

File ~/.pyenv/versions/3.11.6/envs/GY4006/lib/python3.11/site-packages/pandas/core/generic.py:12420, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  12413 def mean(
  12414     self,
  12415     axis: Axis | None = 0,
   (...)
  12418     **kwargs,
  12419 ) -> Series | float:
> 12420     return self._stat_function(
  12421         "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  12422     )

File ~/.pyenv/versions/3.11.6/envs/GY4006/lib/python3.11/site-packages/pandas/core/generic.py:12377, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  12373 nv.validate_func(name, (), kwargs)
  12375 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 12377 return self._reduce(
  12378     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  12379 )

File ~/.pyenv/versions/3.11.6/envs/GY4006/lib/python3.11/site-packages/pandas/core/series.py:6457, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   6452     # GH#47500 - change to TypeError to match other methods
   6453     raise TypeError(
   6454         f"Series.{name} does not allow {kwd_name}={numeric_only} "
   6455         "with non-numeric dtypes."
   6456     )
-> 6457 return op(delegate, skipna=skipna, **kwds)

File ~/.pyenv/versions/3.11.6/envs/GY4006/lib/python3.11/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    145         result = alt(values, axis=axis, skipna=skipna, **kwds)
    146 else:
--> 147     result = alt(values, axis=axis, skipna=skipna, **kwds)
    149 return result

File ~/.pyenv/versions/3.11.6/envs/GY4006/lib/python3.11/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    401 if datetimelike and mask is None:
    402     mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:
    407     result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File ~/.pyenv/versions/3.11.6/envs/GY4006/lib/python3.11/site-packages/pandas/core/nanops.py:719, in nanmean(values, axis, skipna, mask)
    716     dtype_count = dtype
    718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
--> 719 the_sum = values.sum(axis, dtype=dtype_sum)
    720 the_sum = _ensure_numeric(the_sum)
    722 if axis is not None and getattr(the_sum, "ndim", False):

File ~/.pyenv/versions/3.11.6/envs/GY4006/lib/python3.11/site-packages/numpy/_core/_methods.py:52, in _sum(a, axis, dtype, out, keepdims, initial, where)
     50 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     51          initial=_NoValue, where=True):
---> 52     return umr_sum(a, axis, dtype, out, keepdims, initial, where)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

TypeError: unsupported operand type(s) for +: 'int' and 'str' means it can’t calculate an average, because it can’t add the numbers and the text. We can check how Python sees any single value by asking for its type:

type("2")

str

str means string, the data type for text characters. As a number, we would get a different data type:

type(2)

int

int is the integer data type.
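
Going back to the Series with the 'NA' entry: one possible way to deal with it, sketched here rather than offered as the only approach, is to ask pandas to convert everything it can to numbers and mark the rest as missing. pandas.to_numeric with errors="coerce" does this; the names readings, cleaned, and mean_value are my own for the example.

```python
import pandas

# The same readings as above, with 'NA' marking a failed measurement.
readings = pandas.Series([1, 2, 'NA', 4])

# errors="coerce" turns anything that can't be read as a number into NaN
# ("not a number"), which pandas treats as a missing value.
cleaned = pandas.to_numeric(readings, errors="coerce")

print(cleaned.dtype)         # float64
mean_value = cleaned.mean()  # missing values are skipped by default
print(mean_value)
```

Note that the result is float64 rather than int64, because NaN is itself a floating point value. That type is covered in the next section.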

Now, here we come to our second distinction:

2. Not all numbers are created equal

Integers are whole numbers. These are stored in computer memory differently than numbers with decimal places.

type(2.0)

float

float is a floating point number, a number which does have a decimal point and digits after it. A float isn’t automatically bigger than an integer (a float64 and an int64 both use 64 bits per value), but whole numbers can often be stored in a smaller integer type, and they are stored exactly, while floats are approximations, as we’ll see below. So if our numbers don’t have anything after the decimal point, it’s usually better to store and work with them as integers. That’s not an issue when we only have a couple of numbers, but if we’re working with huge datasets, it can make a sizeable difference. You wouldn’t want to store the population of every small area of the Census of Population as a floating point number - counts are better as integers, and having them as floats gains you nothing.

pandas.Series([1, 2.0, 3, 4])

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64

Even having just one floating point value here means that all the values are stored as floating point values. Be careful with this - just one stray decimal place changes the type of every value in the dataset, and with it how those values are stored and compared. Floats aren’t a bad data type though; it’s just about using the right type for particular data.
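
If you want to check how much memory a particular type actually uses, pandas can tell you. This is only a rough sketch with made-up data (the names counts, as_int64, as_float64, and as_int16 are mine): what matters most for size is the number of bits per value, so the saving comes from picking a smaller type where the values allow it.

```python
import pandas

# One thousand whole numbers, stored three different ways.
counts = pandas.Series(range(1000))  # hypothetical count data

as_int64 = counts.astype("int64")
as_float64 = counts.astype("float64")
as_int16 = counts.astype("int16")    # safe here: all values fit in 16 bits

# nbytes is the memory used by the values themselves (excluding the index).
print(as_int64.nbytes)    # 8000
print(as_float64.nbytes)  # 8000
print(as_int16.nbytes)    # 2000
```

Only switch to a smaller type when you’re sure every value fits; the bits and bytes behind these type names are explained further down.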

Like integers, floats are a very common type for geospatial data. Almost anything that is measured will be stored as a float - distances, speeds, and temperatures, to name but a few.

There is a bit more about floats below, but for now let’s move on.

3. Some numbers represent particular things

When does 15+15 not equal 30?

When it’s February.

A quick example, using the datetime module which tells Python how to handle dates and times. First we need to import it:

import datetime

Now let’s set a date of 15 February 2024, by giving it numbers in the order Year, Month, Day:

datetime.date(2024,2,15)

datetime.date(2024, 2, 15)

Now, let’s add 15 days to that, by adding a timedelta (a period of time) of 15 days:

datetime.date(2024,2,15) + datetime.timedelta(days=15)

datetime.date(2024, 3, 1)

2024, 3, 1 is, of course, the 1st of March 2024. Obviously, it doesn’t produce a date of 30th February - because the datetime module knows that there’s no such date. Dates, and times as well, work differently to normal numbers.
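
To see that the datetime module really is following the calendar, here is a small sketch comparing a leap year with a normal year, plus two other things you might plausibly want to do with dates. The variable names (leap, non_leap, gap, parsed) are my own; everything else is from Python’s standard datetime module.

```python
import datetime

# 2024 is a leap year, so February has 29 days; 2023 is not.
leap = datetime.date(2024, 2, 15) + datetime.timedelta(days=15)
non_leap = datetime.date(2023, 2, 15) + datetime.timedelta(days=15)

print(leap)      # 2024-03-01
print(non_leap)  # 2023-03-02

# Subtracting one date from another gives a timedelta.
gap = datetime.date(2024, 3, 1) - datetime.date(2024, 2, 15)
print(gap.days)  # 15

# Dates often arrive as text; fromisoformat reads the YYYY-MM-DD form.
parsed = datetime.date.fromisoformat("2024-02-15")
print(parsed == datetime.date(2024, 2, 15))  # True
```

The same 15 days lands on a different date depending on the year, which is exactly the kind of thing you don’t want to work out by hand.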

Dates and times can be quite awkward - as you’ll probably have noticed if you’ve used Excel much. Which brings up another issue - not all languages and software handle dates and times in the same way. Excel has one way, Python has another, and there’s also some particular Python modules which have their own alternative formats. It’s just something to be mindful of.

Dates and times are one example, but not the only example, of cases where numbers don’t come in isolation. A particularly relevant example in our case is the coordinates we use to record locations. For that, we need to consider other data types.

4. Tuples, lists, and dictionaries

A single number as a coordinate is not very useful without the other coordinate. We’ll get more into coordinate systems later, but for cases like these, Python has a data type called a tuple.

Tuples are groups of multiple values. In Python, they’re denoted by round brackets.

mytuple = (5, 8)
type(mytuple)

tuple

Those two numbers are now stored together. Tuples are good for coordinates, which of course need at least two numbers (e.g. latitude and longitude, or north and east). Keeping them together as one data object just keeps things simpler.

Tuples can have more than two values as well - I only use two as the example, because that’s what you’d have for coordinates.

An important aspect of the tuple data type is that tuples are unchangeable. If we try to change one of those numbers, it won’t work.

mytuple[1] = 6

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 mytuple[1] = 6

TypeError: 'tuple' object does not support item assignment

Because they’re unchangeable, you can’t accidentally modify them, which does come in handy at times.

We’ll come back to coordinates later. But sometimes you’ll need a group of multiple values which you can change, and for that we have lists.

mylist = [5, 8]
type(mylist)

list

If we try to change a value in a list like we failed to do for the tuple, this time it will work.

mylist[1] = 9
mylist

[5, 9]

Again, lists can have more than two values - and neither tuples nor lists are limited to integers. You can have tuples to store first names, middle names, and surnames; you can even create a list of people coming to an event, where the data in the list is the tuples containing first names, middle names, and surnames for everyone.

In geospatial data, you might have a list for all the species of plants observed in a particular area, or the type of vehicles allowed on a particular road.
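
Putting the two together, a list of tuples is one simple way to hold several locations. This is only an illustrative sketch: the coordinates are approximate values I’ve typed in for two Irish cities, given as (latitude, longitude), and the names dublin, cork, and places are my own.

```python
# Approximate (latitude, longitude) pairs in decimal degrees,
# typed in by hand for illustration only.
dublin = (53.35, -6.26)
cork = (51.90, -8.47)

# A tuple can be unpacked into separate names.
latitude, longitude = dublin

# A list of tuples: the list can grow, while each coordinate pair stays fixed.
places = [dublin]
places.append(cork)

print(latitude, longitude)
print(len(places))   # 2
print(places[1][0])  # 51.9
```

places[1][0] reads as “the first value of the second item”: Python counts positions from 0.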

And sometimes you’ll want to store a group of multiple values with references to what the numbers represent - key:value pairs. For that, Python has dictionaries.

mydictionary = {"Temperature": 15, "Pressure": 1016, "Humidity": 83}
mydictionary

{'Temperature': 15, 'Pressure': 1016, 'Humidity': 83}

Dictionaries allow you to reference a value by its key:

mydictionary["Temperature"]

15

You’ll get used to these different data types if and when you need them, but it’s at least good to be aware of these different kinds of data from the start.
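
As a small sketch of how these types can combine, here is a dictionary holding one set of readings, with a tuple for the location. The values are made up, and the names reading and wind are mine.

```python
# One set of sensor readings; all values here are made up for illustration.
reading = {"Temperature": 15, "Pressure": 1016, "Humidity": 83}

# New key:value pairs can be added, and existing values changed.
reading["Location"] = (53.35, -6.26)
reading["Temperature"] = 16

# Loop over every key and its value.
for key, value in reading.items():
    print(key, value)

# .get() returns a default instead of raising an error for a missing key.
wind = reading.get("Wind speed", None)
print(wind)  # None
```

Unlike tuples, dictionaries can be changed after they’re created, so the same caution about accidental modification applies to them as to lists.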

5. Booleans

One last type it’s worth being aware of. Sometimes you don’t need text or numbers, you just need to indicate a straight choice. Let’s demonstrate by returning to our first example:

2 == 2.0

True

This checks if the integer 2 is equal to the floating point number 2.0. Of course it is, so this returns True.

2 == "2"

False

This checks if the integer 2 is equal to the text character 2. Because they’re different types of data, this isn’t true. One is a number, the other is essentially a squiggle that just happens to be identical to the squiggle for the number. Since it’s not true, this returns False.

The True and False here are what we refer to as Boolean values. These are commonly stored as 1 for True, and 0 for False. Data can be stored in this format, for example, from checkboxes. In geospatial data, you might use booleans to indicate if something is present or absent; for example, 1 if a parking space is an e-car charging point, 0 if not.
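
Because True and False behave as 1 and 0 in Python, Booleans are handy for counting. A minimal sketch, using an invented list of parking spaces (is_charging_point, n_charging, and share are names I’ve made up):

```python
# Hypothetical survey: is each parking space an e-car charging point?
is_charging_point = [True, False, False, True, True]

# In arithmetic, True counts as 1 and False as 0,
# so summing the list counts the True values.
n_charging = sum(is_charging_point)
share = n_charging / len(is_charging_point)

print(type(True))   # <class 'bool'>
print(True == 1)    # True
print(n_charging)   # 3
print(share)        # 0.6
```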

6. The more complicated bits and bytes

Numbers in a computer are not stored as the actual number. Computers store all information in short term memory or on a hard disk where each spot can have one of two values - just like a Boolean. Usually, these are considered as 1 or 0. These are binary numbers.

(In Python, we can write numbers in binary if we prefix them with 0b, which I’ll use in the next couple of examples.)

A binary number with one digit can give us only two values.

0b0

0

0b1

1

So far, so obvious - binary 0 is 0, and binary 1 is 1. But counting up to 1 won’t get us very far.

If we put two of these binary characters together, using two digits, we can represent up to 3.

0b00

0

0b01

1

0b10

2

0b11

3

Binary 00 is 0, binary 01 is 1, binary 10 is 2, and binary 11 is 3.

Hence the old computer science joke, “there are 10 types of people in the world: those who understand binary, and those who don’t.”

Counting up to 3 isn’t much better than counting up to 1, but we can just keep adding more characters to the binary if we want bigger numbers.

If we go up to eight digits, that’s enough to get us up to:

0b11111111

255

This is a bit oversimplified but: inside a computer, those single points in memory which can represent either 0 or 1 can be grouped together like this to store larger numbers.

One point in memory is a bit. A group of eight of these points or bits is a byte. Because 8 binary digits can count up to 255, that means that a group of eight memory points or bits forming a byte can store number values between 0-255.

If you want negative numbers, then you can use one of the eight bits to represent the + or - sign, and the remaining 7 bits for the number.

-0b1111111

-127

Some early computers had 8-bit processors - meaning they could handle 8 bits, or one byte, at a time: 0 to 255, or roughly -127 to +127 (in practice -128 to +127, because of the way computers actually encode the sign). This didn’t mean they couldn’t process larger numbers; parts of the memory could be combined to store larger values. But it was a limit, and has left a lasting impact in the definition of many common file types, particularly for image files.
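
Here is a short sketch of the same ideas in code. The sign-bit description above is a simplification: the scheme computers normally use for negative integers is called two’s complement, and it gives a single byte the range -128 to +127. Python’s built-in int.from_bytes can read a byte either way; the variable names are my own.

```python
# Each extra bit doubles the number of values that can be represented.
for bits in (1, 2, 8):
    print(bits, "bits:", 2 ** bits, "values, largest is", 2 ** bits - 1)

# Converting between binary text and integers.
print(int("11111111", 2))  # 255
print(bin(255))            # 0b11111111

# Reading the same byte as an unsigned or a signed number.
# signed=True uses two's complement, the usual scheme on modern computers.
all_ones = bytes([0b11111111])
unsigned_value = int.from_bytes(all_ones, "big", signed=False)
signed_value = int.from_bytes(all_ones, "big", signed=True)
lowest_signed = int.from_bytes(bytes([0b10000000]), "big", signed=True)

print(unsigned_value)  # 255
print(signed_value)    # -1
print(lowest_signed)   # -128
```

The same eight bits can mean 255 or -1 depending on how they’re read, which is one more reason the data type matters.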

For example, GIF images use one byte per pixel, and so can only show 256 different colours in an image. JPG images, and most PNG images, typically use three bytes per pixel - one byte each for red, green, and blue, for a total of 24 bits per pixel - allowing 16,777,216 different colours, which is more than our eyes can distinguish.

From the 1990s, when home computers became common, most had 32-bit processors.

0b11111111111111111111111111111111

4294967295

That’s a much larger number, and allows a lot more, but one limit on that was still the references to the points in memory. 32-bit computers could only reference that many bytes of memory. Counting from 0 up to 4,294,967,295 gives 4,294,967,296 bytes, which is 4 gigabytes - so 32-bit computers could only use 4 GB of RAM.

As computers became more powerful, more memory was needed, and now, most computers (like the one you’re using) are generally 64-bit.

0b1111111111111111111111111111111111111111111111111111111111111111

18446744073709551615

That will probably last a while. 18,446,744,073,709,551,615 bytes is 18.4 exabytes, and nobody needs 18.4 exabytes of RAM in their computer. That would be enough to hold the entire sum of human knowledge produced from the dawn of civilisation to the end of the 20th century in memory at the same time. We’re not going to be moving to 128-bit computers any time soon.

The broader point here is that numbers are usually stored by computers as 8-bit, 16-bit, 32-bit, or 64-bit. We saw an example in part 1 above, our pandas series:

pandas.Series([1, 2, 3, 4])

0    1
1    2
2    3
3    4
dtype: int64

When we looked at this first, I said only that the int part of int64 indicated that the data was stored as integers. The 64 part of int64 means that these numbers are being stored as 64-bit values: the computer has allocated 64 bits of memory to hold each integer. Used purely for counting upwards from zero, 64 bits would allow numbers up to 18,446,744,073,709,551,615; because int64 also has to allow for negative numbers, its actual range is from -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807. That’ll do for most cases, I suspect!
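
Those limits can be worked out directly. This is a sketch assuming the two’s complement scheme normally used for signed integer types such as int8, int16, int32, and int64; signed_range is a helper I’ve written for the illustration, not something built into Python.

```python
# Range of a signed integer stored in a given number of bits,
# assuming the two's complement scheme used by types such as int64.
def signed_range(bits):
    lowest = -(2 ** (bits - 1))
    highest = 2 ** (bits - 1) - 1
    return lowest, highest

for bits in (8, 16, 32, 64):
    print("int" + str(bits), signed_range(bits))

# Largest value if all 64 bits are used for counting upwards from zero.
print(2 ** 64 - 1)  # 18446744073709551615
```

Python’s own integers aren’t limited like this (they grow as needed), but the fixed-size types used by pandas and by most file formats are.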

You might be wondering at this point why I’m getting quite so far into the nitty gritty details here. Part of my reason is simply so that when you see something like int32 or int64, or float32 or float64, you won’t be left wondering what that means.

But there’s another quirk to this as well, and it’s about the float32s and float64s. You might have noticed that I only used integers in the examples for this section so far. How can we represent a floating point number in binary?

bin(0.1)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[34], line 1
----> 1 bin(0.1)

TypeError: 'float' object cannot be interpreted as an integer

You can’t, not directly.

How computers handle this is to approximate floating point numbers as binary fractions and exponents. Exactly how this works is beyond the scope here - you can look up floating point arithmetic if you want to know more - but what you do need to know is, this means that what you see is not always what you get.

Take the number 0.1. Seems simple? It’s short, not complicated, it shouldn’t be a problem, right?

0.1

0.1

Except, it turns out that 0.1 actually can’t be represented accurately in this format. It might look fine there, but if we show more decimal places, we can see that 0.1 isn’t actually 0.1.

"{:.24f}".format(0.1)

'0.100000000000000005551115'

This just tells Python to show 0.1 to 24 decimal places - and while you’d expect every digit after the 1 to be a zero, they aren’t.
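
If you’re curious what value is actually being stored, Python’s standard decimal and fractions modules can show it exactly. This is just a sketch for the curious; stored is a name I’ve made up.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(0.1) shows the exact value of the float nearest to 0.1.
print(Decimal(0.1))

# The same value as a fraction: the denominator is a power of two,
# because the number is stored as a binary fraction.
stored = Fraction(0.1)
print(stored)
print(stored.denominator == 2 ** 55)  # True
print(stored == Fraction(1, 10))      # False
```

One tenth simply can’t be written as a whole number divided by a power of two, in the same way that one third can’t be written exactly with a finite number of decimal places.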

This can produce some very unexpected results:

0.1*3 == 0.3

False

0.1 multiplied by 3 should equal 0.3 - but because of the approximation, three times the value stored by a computer to represent 0.1 is not equal to the value stored by the computer to represent 0.3.

"{:.24f}".format(0.1*3)

'0.300000000000000044408921'

"{:.24f}".format(0.3)

'0.299999999999999988897770'

"{:.24f}".format(0.1*3 - 0.3)

'0.000000000000000055511151'

This isn’t a bug in Python; it’s simply a result of how computers store floating point numbers, and you’ll see the same in almost any software on any computer.

In most cases, there’s nothing you need to do about this - but you absolutely should be aware of it, and mindful of it, so that you don’t get any unexpected results - or at least so that you understand it when you do.
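
Where a comparison between floats does matter, one common approach (a sketch, not the only option) is to check whether two values are close enough rather than exactly equal. math.isclose is part of Python’s standard library; the six decimal places in the rounding example are an arbitrary choice of mine.

```python
import math

print(0.1 * 3 == 0.3)              # False
print(math.isclose(0.1 * 3, 0.3))  # True

# Rounding both sides to a sensible number of decimal places also works.
print(round(0.1 * 3, 6) == round(0.3, 6))  # True
```

How close is “close enough” depends on your data: there’s no point comparing temperatures to fifteen decimal places if the sensor is only accurate to one.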

7. Hexadecimal

Aside from decimal and binary, there’s one other number system commonly used by computers, and that’s hexadecimal, which is a base 16 number system. ‘Base’ in this context means ‘how many different symbols are used to represent numbers’.

Binary is base 2, using two symbols: 0 or 1.

Our normal decimal numbers are base 10, with 10 symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.

Hex, as base 16, needs more symbols for numbers than we have number symbols. Right at the top of this notebook, I said that sometimes letter characters can be used as parts of numbers - and this is what I had in mind when I wrote that. Hex uses the letters a, b, c, d, e, and f to represent additional numbers: a = 10, b = 11, c = 12, d = 13, e = 14, and f = 15.

(In Python, the prefix 0x is used to represent hex numbers).

0xf == 15

True

You’ll see this most commonly in HTML code for websites, where colours are represented by hex values for RGB (red, green, and blue). These are commonly written as a # followed by 6 digits, e.g. #000000 is black, #ffffff is white, #00ff00 is green, and #f8ed62 would be a shade of yellow.

You might have realised why (but don’t worry if you haven’t, this stuff is not very intuitive):

0xff

255

Two hex characters can represent 256 different values - just the same as eight bits, or one byte. For this reason, bytes are sometimes represented by two hex characters.

That also means hex might come up in other contexts, so best to be aware of it. Some environmental sensors I work with, for example, output their raw values as hex bytes, which means the code I had to write to process the values had to convert that to integers. It’s generally an example of “when you have to deal with it, you figure out how”, but it’s good to be aware of it in case it comes up.
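
As a sketch of what that kind of conversion looks like, here is a small function that splits an HTML colour into its red, green, and blue values. hex_to_rgb is a name I’ve made up for the example; int(text, 16) and hex() are built into Python.

```python
def hex_to_rgb(colour):
    """Convert a colour like '#f8ed62' to a (red, green, blue) tuple."""
    digits = colour.lstrip("#")
    # Each pair of hex characters is one byte, read as a base 16 number.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

print(hex_to_rgb("#f8ed62"))  # (248, 237, 98)
print(hex_to_rgb("#00ff00"))  # (0, 255, 0)
print(hex(255))               # 0xff
print(256 ** 3)               # 16777216
```

The last line is where the 16,777,216 colours mentioned earlier comes from: 256 possible values for each of the three bytes.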

8. Summary

You won’t usually have to deal with the hex bytes or binary bits, but you will have to deal with data which comes in different kinds. You should now have a handle on the concept that computers store and treat data in different ways - text characters (strings, str), integers (whole numbers, int), decimals (floating point numbers, float), dates and times, Booleans (True or False, bool), and combinations like tuples, lists, and dictionaries.

It’s not essential to memorise all of this right at the start. Through the internet, we have access to virtually the entirety of human knowledge through the smartphones in our pockets; we can look up the details when we need to. The important part is knowing the general concept.
