Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define Multidimensional Array Type #18

Open
pparke opened this issue Apr 3, 2024 · 10 comments
Open

Define Multidimensional Array Type #18

pparke opened this issue Apr 3, 2024 · 10 comments

Comments

@pparke
Copy link
Contributor

pparke commented Apr 3, 2024

It seems that most of the anys are due to passing around arrays of various shapes. I think we can narrow down the type to either 1 or 2 dimensional arrays of numbers. Something like MNArray = number[] | number[][] might do. Are there any situations where arrays would be more deeply nested?

I'm curious if the arrays also need to hold class instances, there are several places where there is a check like typeof a === "object" but it mostly looks like this is being used to check if a is an array? If it is a check for an array it would be better to use Array.isArray(a).

I could make a PR for those changes if my assumptions turn out to be correct. Just let me know.

@eduardoleao052
Copy link
Owner

Good points. So, in order:

  • The MDArrays can go deeper than 2 layers. In theory, they are infinitely "deep". @medic-code made a suggestion to create a recursive data Type, which is already implemented in utils.ts and defined in src/types/utils.types.ts.
  • As far as typeof a === "object" goes, you are 100% correct. a instanceof Array or Array.isArray(a) are the correct way to do it. These were just mistakes, and I've tried to correct some already (I'll probably correct the rest soon, just a simple CTRL+f can find every case).

Thanks for pointing it out!

@medic-code
Copy link
Contributor

  1. Yep - sounds like we should have this as a type not just used in utils (maybe worth redefining the file name), I had a sense this type may be more broadly used.
  2. Agree on `Array.isArray(a), the narrower type guard seems sensible.

@mgcrea
Copy link

mgcrea commented Apr 3, 2024

Would be best to support an arbitrary number of dimensions using generics, a bit like what @tensorflow/tfjs-core does with a generic tensor + aliases.

/** @doclink Tensor */
export type Scalar = Tensor<Rank.R0>;
/** @doclink Tensor */
export type Tensor1D = Tensor<Rank.R1>;
/** @doclink Tensor */
export type Tensor2D = Tensor<Rank.R2>;

Another useful Typescript helper type I use often when doing ml in js is the Tuple type:

That lets you properly cast tuples with an arbitrary dimension, eg. Tuple<number, 4> => [number, number, number, number] vs. a less useful number[].

export type Tuple<T, N extends number> = N extends N
  ? number extends N
    ? T[]
    : _TupleOf<T, N, []>
  : never;

type _TupleOf<T, N extends number, R extends unknown[]> = R["length"] extends N
  ? R
  : _TupleOf<T, N, [T, ...R]>;

Hope it's not off-topic! Pretty excited to see where this goes!

Will be happy to test/bench this against @u4/opencv4nodejs or @tensorflow/tfjs-node that I both currently use on a project.

@eduardoleao052
Copy link
Owner

@mgcrea That's true, Tuples are a really useful data type in ML. Thanks for pointing it out! About the arbitrary dimension number, it could also work, but for normal ML and NLP applications, I believe we would need to go to at least Rank 4.

@pparke
Copy link
Contributor Author

pparke commented Apr 4, 2024

@mgcrea @eduardoleao052 I made a first attempt at adding those types in this PR #21 there are still a few type errors left that require some careful thought. The added if statements could be replaced with assureArray if it is possible to update that so it doesn't return null but I don't know if some of the library logic is dependant on it returning null when null is passed in.

@eduardoleao052
Copy link
Owner

I'll take a look and let you know, but pretty sure it's not necessary. Thanks again!

@kgryte
Copy link

kgryte commented Apr 5, 2024

Rather than defining your own multi-dimensional array type, one option you could consider is directly using, and building on, existing primitives already available in other libraries. E.g., @stdlib/ndarray. In principle, you could subclass and append various operators/methods as you do now, but you'd also allow for interoperation with other libraries in the ecosystem (e.g., all of @stdlib/stdlib). At least for CPU-based computation, that would allow you to leverage (when evaluated in eager mode) stdlib's ndarray kernels both in JavaScript and, in theory, C, through stdlib's native bindings.

@eduardoleao052
Copy link
Owner

@kgryte Thanks for the tip! Really appreciate it. I'm just initially against adding bindings and necessary dependencies, especially as the library is currently completely independent and "from scratch". However, it is still possible to draw inspiration from the implementations on ndarray. Even though the applications are different and a lot of changes must be made, There is still much value in learning from their implementation. I'll surely take a look. Thanks again for the tip!

@kgryte
Copy link

kgryte commented Apr 7, 2024

@eduardoleao052 Fair enough. If there isn't the goal of ecosystem interoperation, then creating a custom type for use in a walled garden is certainly sensible. This stated, if interoperation is a goal, then conforming to existing data structure interfaces is easier to do at the beginning, rather than much farther down the line when significant refactorings would then be required. I should note that standardization and interoperation is the principle goal of the Consortium for Python Data API Standards, of which PyTorch is part. And there, we've intentionally standardized a minimal array type, similar to what is present in @stdlib/stdlib, in order to reduce ecosystem fragmentation.

@eduardoleao052
Copy link
Owner

@kgryte that makes sense. I think interoperation could bring immense benefits to the library and its usability. I'll study in the next few days how I could do that in a way that works with the other libraries in the ecosystem!
Thanks for the tip!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants