Mimimising memory usage of an array-like data structure

Discussion:

Alan Burlison

12 years ago

I have an immutable, sorted list of around 1 million pairs of 32-bit
integers that I'm searching with a binary chop - the usage is simply
mapping one integer value to another. I've used a Vector of 2-tuples,
which takes (as far as I can tell) around 30-32Mb of memory, the
theoretical lower limit being around 8Mb. Performance is fine but I'm
wondering what Scala data types would have the smallest memory
overhead, e.g. Array versus Vector etc and are there any tricks
(@specialized?) that I can pull to reduce the required memory.

--
Alan Burlison
--
--
You received this message because you are subscribed to the Google Groups "scala-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-user+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
For more options, visit https://groups.google.com/groups/opt_out.

Jason Zaugg

12 years ago

Permalink

Post by Alan Burlison
I have an immutable, sorted list of around 1 million pairs of 32-bit
integers that I'm searching with a binary chop - the usage is simply
mapping one integer value to another. I've used a Vector of 2-tuples,
which takes (as far as I can tell) around 30-32Mb of memory, the
theoretical lower limit being around 8Mb. Performance is fine but I'm
wondering what Scala data types would have the smallest memory
overhead, e.g. Array versus Vector etc and are there any tricks

All standard library collections (with the exception of Java Arrays), box
primitive values.

Memory wise, the smallest would be `Tuple2[Array[Int], Array[Int]]`.

We can dream of a JVM with "Arrays 2.0" [1] where this sort of hoisting
wouldn't be necessary.

-jason

[1]
http://www.oracle.com/technetwork/java/javase/community/jvmls2012-1840099.html

--
You received this message because you are subscribed to the Google Groups "scala-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-user+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
For more options, visit https://groups.google.com/groups/opt_out.

Simon Ochsenreither

12 years ago

Permalink

Hi Alan,

basically what Jason said!

Depending on the JVM implementation, you could also have a look whether a
single Array[Int] with a Key-Value-Key-Value-... improves performance in
your use-case (but keep in mind it makes it easier to make mistakes with
that encoding).

One should take Arrays 2.0 with a grain of salt. It's proposed by John
Rose, so empiric evidence says that it probably won't go anywhere.

A bit off-topic, but having an efficient representation of structs is
exactly what I'm currently trying to implement in Avian.

Bye,

Simon

Erik Osheim

12 years ago

Permalink

Post by Jason Zaugg
All standard library collections (with the exception of Java Arrays), box
primitive values.
Memory wise, the smallest would be `Tuple2[Array[Int], Array[Int]]`.

You have a few other options: you can pack two Ints into a Long:

object II {
def apply(x: Int, y: Int) = new II((x.toLong << 32L) | y.toLong)
}

class II(val long: Long) extends AnyVal {
def _1: Int = ((long >>> 32L) & 0xffffffffL).toInt
def _2: Int = (long & 0xffffffffL).toInt
}

This would allow you to just use an Array[Long] instead of a tuple,
and gives you better locality when accessing both members of the pair.

If allocating one contiguous array skeeves you out, you can also
compromise And roll your own Vector with branching factor 1024 (or
whatever you want). The implementation of that is too big for this
email, but a simple version tailored for exactly 1024^2 elements
looks like this:

// starts with all elements set to 0L, corresponding to (0, 0)
object V1024 {
def empty = new V1024(Array.fill(1024)(new Array[Long](1024)))
}

// get or set elements
class V1024(arrs: Array[Array[Long]]) {
def get(i: Int): II =
new II(arrs((i >>> 10) & 1023)(i & 1023))
def set(i: Int, pair: II): Unit =
arrs((i >>> 10) & 1023)(i & 1023) = pair.long
def set(i: Int, x: Int, y: Int): Unit =
arrs((i >>> 10) & 1023)(i & 1023) = II(x, y).long
}

Hope this helps!

-- Erik

Johannes Rudolph

12 years ago

Permalink

Post by Erik Osheim
This would allow you to just use an Array[Long] instead of a tuple,
and gives you better locality when accessing both members of the pair.
If allocating one contiguous array skeeves you out, you can also
compromise And roll your own Vector with branching factor 1024 (or
whatever you want). The implementation of that is too big for this
email, but a simple version tailored for exactly 1024^2 elements

Another possibility is to write out those ints into a file and use
FileChannel.map().asIntBuffer to serve those values directly from virtual
memory with the file serving as a second-level cache managed by the OS. For
8MB it's probably not worthwhile but I used a similar technique to access
GB-sized CSV files this way.
--
Johannes

-----------------------------------------------
Johannes Rudolph
http://virtual-void.net

Alan Burlison

12 years ago

Permalink

Post by Johannes Rudolph
Another possibility is to write out those ints into a file and use
FileChannel.map().asIntBuffer to serve those values directly from virtual
memory with the file serving as a second-level cache managed by the OS. For
8MB it's probably not worthwhile but I used a similar technique to access
GB-sized CSV files this way.

Nice idea :-)

sreque

12 years ago

Permalink

There's no need to write the data to a file. See
http://mechanical-sympathy.blogspot.co.uk/2012/10/compact-off-heap-structurestuples-in.html

...