Just wrapped up the haxe conference in Seattle – lots of fun.
There was not much to report for hxcpp, so I did a bit of a retrospective.
Here are the slides – not too much detail, so you will need to check out the video for the full story.
New Technology, Old Programmer
Just wrapped up the haxe conference in Seattle – lots of fun.
There was not much to report for hxcpp, so I did a bit of a retrospective.
Here are the slides – not too much detail, so you will need to check out the video for the full story.
You may have noticed that this site has moved from gamehaxe.com to hughsando.com. The main reason for this is the reflect a broader set of interests – not just games and haxes, but also robotics, machine learning, AI and whatever other tech that may be floating around. The new site also matches my github handle, where you can find me at https://github.com/hughsando.
I’m still using haxe/hxcpp/nme – in fact, more so recently – just not in a gaming context, so development will continue on these fantastic projects.
I’m looking forward to a slight change in direction and hope the next ten years hosted on this site will be as good as the last ten on the previous site.
Wow, 10 years. We have seen a lot in the last decade. Rise of mobile, fall of flash and the rise of html5. And Hxcpp and Nme still manage to survive and adapt.
Nme just had a new release, 6.0.58. And it has seen some big changes. This release completes the move away from shipping static libraries, and instead ships Acadnme hosts for the desktop platforms. This allow you to test quickly with the cppia target without needing a c++ compile. This target now includes just-in-time compiling and runs at a decent speed. The windows target now uses the “angle” library by default for improved performance though a DirectX emulated OpenglES layer.
The latest release contains the reemergence of the javascript browser target. The approach here is quite different from the original ‘jeash’ – it uses almost exactly the same code base as the c++ target by compiling the core Nme library to a asm.js library, and then calling these compiled Nme functions from normal javascript. This javascript is then bundled with the assets into a single “.nme” file for uploading – you can also add cppia byte code to allow the same .nme file to be run anywhere – desktop, mobile, browser.
The .nme file is kept very small by splitting most of the nme classes and asm.js into separate files, which can be shared and cached between apps.
I have revived the original 1000 Ogre demo. The original was in flash, but is now in html5. The preloader is a bit dodgy, but I will work on that.
You can also check out the ubiquitous BunnyMark demo running under the new system.
You can try your own apps with the latest nme and the:
nme jsprime
command.
The cppia and jsprime targets are still quite new, but they both are looking super promising. Let me know over at gitter.im how you go!
I have released my WWX 2016 Hxcpp Slides.
You can find the source code on github.
CFFI (‘C Foreign Function Interface’) is a way of linking native code into a non-native VM, in this case Hxcpp.
From the very beginning, Hxcpp supported a CFFI that was very similar to the neko CFFI. This was no accident, because it allowed me to make good use the neko library code.
The actual header file for the Hxcpp CFFI is quite different from the neko one – mainly because the neko one assumes a particular memory layout for the ‘value’ type, while the hxcpp one treats the ‘value’ type as opaque. This allows code written against the Hxcpp header file to be used by neko and Hxcpp, while the neko header restricts the implementation to neko.
All function arguments and return values in the original version of CFFI are of type “value”. The ‘value’ type is implemented with a pointer. So to pass, say, an Int or String to a CFFI function, it must be “boxed” by allocating a wrapper object. This allows anything to be passed, but incurs an overhead of allocation, and must be extracted via a function like “val_int”.
To solve this problem, “CFFI Prime” was written last year (“prime” because the macro “DEFINE_PRIM” (short for “define primitive”) became “DEFINE_PRIME” (short for “define awesomeness”) ).
As far as writing the c++ interface code, this simply offers syntax sugar to convert the ‘value’ types to the appropriate c++ types using the power of c++ typing. eg, compare the classic:
value sum(value a, value b) { if( !val_is_int(a) || !val_is_int(b) ) return val_null; return alloc_int(val_int(a) + val_int(b)); } DEFINE_PRIM(sum,2); // function sum with 2 arguments
with:
int addInts(int a, int b) { return a+b; } DEFINE_PRIME2(addInts);
Both implementations create a function that can be dynamically loaded and called by neko or Hxcpp, but the second one us just easier to write. If you want to use a “complex” type (one that is not int, float, double, bool) then you still use “value” and the associated value functions as before. String has its own special type, HxString, in CFFI Prime, and “const char *” can be used.
The Prime code has a second advantage – it exposes the actual function pointer (in this case “addInts”) which can be loaded into one of Hxcpp’s native “cpp.Function” objects. This can then be called directly from Hxcpp giving super-fast and no-boxing native calls. You still need to use the “value” type to pass complex objects, since this decouples the implementation of these objects from the API for accessing them. This is how the same native code can be used by multiple run times.
The function pointers are exposed by exporting two symbols for each “DEFINE_PRIM” define – one just for hxcpp, and one for general use. To load the function, you use a java-style signature in haxe, which gets converted by a haxe macro to perform two operations: 1, to type-check the signature against the signature generated by the “DEFINE_PRIME” macro, and 2, to correctly type the cpp.Function pointer in haxe code, so the correct code can be output, and the haxe compiler can check your arguments in haxe.
static var add = Loader.load("addInts", "iii" ); ... assertTrue(add!=null); assertEquals(7, add.call(2,5));
If you try to pass a non-integer to the addInts function, the haxe compiler will complain. If the signature of addInts does not match, then you will get a run-time error when you load the dll. Notice also the “.call” to call the function. The need for this may be eliminated by haxe’s pseudo-operator-overloading-via-abstract.
The Loader macro is actually not that complicated, and it is possible to write your own version that uses an alternative to the java-style signature. Using the magic of haxe, it should also be able to infer then signature of the “addInts” from the “add” function.
Features from the unit tests should be safe to use in your own code.
Let me start by qualifying the title with the obligatory phrase in some situations. TL;DR – it was probably too slow to start with.
The performance story starts, like all good performance stories do, with a benchmark. I was looking for a benchmark to measure the allocation system, but also mix in a little calculation, in as few lines as possible – so I settled on the Mandelbrot program. But with a little “deoptimization” to use classes for the complex arithmetic rather than inlining the components. On a whim, I also decided to store the resulting image as an array of RGB classes, rather that an array of Int, which turned out to also have significant implications.
You can see the code in the haxe repo. You may note there are 2 versions controlled by a define, a class based one and an anonymous object (typedef) based one. The typedef option is for future optimization attempts. Also, the performance figures I will be talking about have the “SIZE” parameter increased to 40, to allow more accurate timing. The results are predictable, if not beautiful.
This benchmark is all about allocation and garbage collection (GC). The allocation part calls the object “new” routine to find some space on one of the memory blocks, and the collection part must scan the allocation space to reclaim object space that is no longer used. These phases have quite different optimization characteristics – the allocation part gets called literally millions of times, and therefore micro-optimizations at the code level may be appropriate, while the collection code get run less frequently, but must do quite a lot of work, so algorithmic optimizations might be more appropriate.
The initial results with hxcpp 3.2.102 were really quite bad. Hxcpp 3.2.102 takes approximately 35 seconds on a windows 32 build. Compare this to the node/js solution of 6.1 second. It should be noted that this test is measuring mostly allocations, and code base is small, so node can quite efficiently inline the code and use its highly-tuned single-thread allocation code. But still, it should be possible to improve things.
Before even looking at the low-level code, one thing that has a big effect on speed is the amount of free space available. The Gc allocates into its free memory, and when it’s full, it reclaims/collects the unused portions. The speed of the reclaiming depends mostly on the the amount of “live” memory after the collection. By increasing the amount of free memory, the time between collections increases, while keeping the collection time roughly the same, thereby improving the overall speed. Hxcpp uses a heuristic to determine the free space it should use depending on the “live” size at last collection. There was a fixed ratio in 3.2.103, but since then I have exposed the settings and increased the minimum working memory to 20M on desktop and 8M on mobile. This alone gives a nice improvement.
After creating a benchmark, the next step in optimization is to run a profiler. And the profiler did indeed show that most of the time was spent in the allocation code. It did not appear to be in one particular part of the code, but instead spread over the whole routine. This implies that I really need to get the line count down, and to that end I decided to simplify some the auxiliary data structures at the expense of memory. The hxcpp Gc uses conservative marking on the stack. That is, it scans the memory and looks for things that look like pointers to Gc objects. It then needs to know with 100% certainty that the thing pointed to is (or was recently) an actual object, otherwise the marking process could trash memory. The means that some extra information needs to be used to track this. Initially, I used what I thought was a clever trick to reuse the byte I was already using to indicate whether a line was marked as a head for a mini linked-list of live-objects. This worked and did not require extra memory, but the maintenance of the list complicated the allocation routine. I switched to using a extra bitmap, which uses 1 bit per 4 bytes of Gc memory to indicate whether an object starts at the particular location. This is 3% extra memory, but it greatly simplifies the allocation logic, allowing for a neater implementation.
The next point of simplification was pre-calculating the “gaps” in the Gc memory blocks. Initially, this was done in the allocation code, but by doing it in advance, it simplifies allocation code that is getting run millions of times. Theoretically, there are a similar number of operations going on here, but practically the separation makes the code much neater and therefore faster. Later, I moved this calculation to a separate thread so it can be done in parallel with the normal code.
I also rearranged the object header to allow the marking code to test whether an object needs marking with a single bit test, and therefore avoid an expensive marking function call in some circumstances. The header also includes a neater row count which makes the actual row marking faster.
These changes sped the allocations up considerably, but the profiler still showed most of the time running the allocation code. The main issue then became the function call depth. Modern c++ compilers can do a pretty good job at inlining code, but they are not magic, so some extra help is required. First a quick review of what happens when you call “new Complex” in haxe. Hxcpp generates a standard c++ “new” call, which calls into hx::Object “operator new” which is where the Gc memory allocation routine is called. C++ then initialises the object by inserting the vtable pointer and hxcpp then calls “__construct” on the object – which is the “new” function defined in haxe. It is split like this to allow the haxe code to call “super” in ways that c++ does not allow with its native constructors. The “operator new” code takes a parameter defining whether the resulting object contains pointers to other object – this allows the marking code to make some optimizations. In the original code, this then called an “InternalNew” function which checked the object size and delegated to either a thread-local allocator, or a large allocator. This allows independent threads to allocate at full speed without needing locks. The problem with this arrangement is that several function calls are required per allocation. The solution here was it inline quite a lot of code into the “operator new” function, and also to observe that haxe objects always allocate from the local allocator, so some some checks can be avoided. Inlining means that some of the guts of the allocator need to be exposed to the haxe code. However, this is kept under control by keeping the complex case (when there is not enough room) in the separate file, and only inlining the “easy” case, which is also helped by the simplifications made earlier. You can see the code here.
There are still a few things that could be done. There are some checks that could be avoided in single threaded app. Another idea I had was to move some of the housekeeping (setting of the header and start bits) to a separate thread to be done later. There are still 2 function calls going on here – the “operator new” (which gets inlined) and the “__construct” call (which does not). It should be possible to inline the construct call if it is simple enough.
These were the changes to the allocation code. As mentioned earlier, there were some improvements to the marking code. Additional things were also done in the collector. The reclaimed memory is cleared in a background thread. This is actually is quite a nice improvement because up to 7% of the time was spent doing the clear. One thing this particular benchmark highlighted was the marking of a single large array (the image array) actually got slower when using multi-threaded marking, due to a slight overhead. I’m not sure how applicable this is to general code, however there was a solution. The multi-threaded array marking code now splits large array marking up if there are idle marking threads.
So with all these improvements, did it beat the node/js speed? Just about – it was a close thing. One more change was really required – compile a 64 bit target instead of a 32 bit target. Then finally the time came in at 5.6 seconds, less than 1/6 the original time. Node is doing a pretty awesome job here and I’m sure it is only going to get better so it will take some work to keep up. If I change the benchmark to use member functions instead of static-inline functions, then hxcpp becomes significantly faster than node, which is nice to know. It should also be noted that I’m cheating a bit here by using multiple threads to get the job done. But I’m not above a bit of cheating.
This brings me to the Java target – 2 seconds. Wow. I’m not actually sure if it is doing any allocations here at all – maybe it has completely inlined the code. Or maybe it just really really good. That sets a seriously high benchmark
Is there still room for improvement? I think there is – particularly, a generational Gc would reduce the marking time to practically nothing in this case, but it requires a bit of effort on the haxe compiler side. Inlining the constructor for simple classes (like the “Complex” class in this code) should also be a nice win.
It has been an interesting experience getting the code faster. I think that it highlights one key thing – if you want the hxcpp code to run faster, then create a simple benchmark.
It is possible to use the hxcpp build tool to compile linux 32/64 binaries from a Mac Box. “Why would you want to do this?” I hear you ask. Well, I use this so I can run all my builds from a mac-mini, with windows running in VirtualBox. I know there are other ways, but having a single little box doing all the work is pretty easy to manage.
To do this, you just need to setup the HXCPP_XLINUX variables. There are a number of ways you can do this, but the ~/.hxcpp_config.xml file is probably the cleanest place.
Having installed the cross-compilers from http://crossgcc.rts-software.org/doku.php?id=compiling_for_linux, is it just a matter of pointing to the executables using some xml entries:
<set name="HXCPP_XLINUX64_CXX" value="/usr/local/gcc-4.8.1-for-linux64/bin/x86_64-pc-linux-g++" />
<set name="HXCPP_XLINUX64_STRIP" value="/usr/local/gcc-4.8.1-for-linux64/bin/x86_64-pc-linux-strip" />
<set name="HXCPP_XLINUX64_RANLIB" value="/usr/local/gcc-4.8.1-for-linux64/bin/x86_64-pc-linux-ranlib" />
<set name="HXCPP_XLINUX64_AR" value="/usr/local/gcc-4.8.1-for-linux64/bin/x86_64-pc-linux-ar" />
<set name="HXCPP_XLINUX32_CXX" value="/usr/local/gcc-4.8.1-for-linux32/bin/i586-pc-linux-g++" />
<set name="HXCPP_XLINUX32_STRIP" value="/usr/local/gcc-4.8.1-for-linux32/bin/i586-pc-linux-strip" />
<set name="HXCPP_XLINUX32_RANLIB" value="/usr/local/gcc-4.8.1-for-linux32/bin/i586-pc-linux-ranlib" />
<set name="HXCPP_XLINUX32_AR" value="/usr/local/gcc-4.8.1-for-linux32/bin/i586-pc-linux-ar" />
Then you can build from a “build.n” (such as in $HXCPP/project) file using:
neko build.n linux
As seen at the wwx2015, there are now many great frameworks you can use with haxe. I would like to outline what I believe are the core values of NME to help you, the developer, choose a framework for future developments, or entice you into trying out NME with your existing project.
For existing haxe users, NME is super easy to get started with. First, install the latest release with “haxelib install nme”, then create an alias for running the nme tool with “haxelib run nme setup”. Now you are setup with the “nme” command. If you do not want to create the alias, you can simply use “haxelib run nme” in wherever the “nme” command is used in the following discussion.
You can run the demos without needing to actually clone any demo code, but you probably want to start in a clean directory since the intermediate and application files will be created in a subdirectory of the working directory. Use “nme demo” for a list of NME demos. This list includes the first few letters in square brackets – this shows the shortcut you can use run the demo. eg, “nme demo ca” will run the “Camera” demo, with the default target “neko”.
NME is still largely compatible with Openfl projects, including flixel projects. And, it can make guesses about demos from other haxelibs, eg: “nme demo flixel” will list the flixel demos (assuming you have them installed), and “nme demo flixel:mode” will run the demo with neko, or “nme demo flixel:neon cpp” will compile + run a demo with hxcpp.
If you have an existing Openfl project, you can simple try nme by running “nme” or “nme cpp” in your project directory.
If you want to start a simple test from scratch, you do not even need a project file. You can start with a single haxe file, like you might with flash. eg, start by typing in Test.hx:
class Test extends nme.display.Sprite { public function new() { super(); graphics.beginFill(0xff0000); graphics.drawCircle(100,100,100); } }
then just run “nme”.
One of the big innovations for this year is the cppia target and the Acadnme runtime.
This scripting target has the advantages of the neko target – being lighting fast to compile – combined with the versatility of the cpp target – that is, run on almost any device.
Acadnme comes precompiled for desktops targets (currently, Windows and Mac) in the acadnme haxelib (haxelib install acadnme), and for Android via the Google Play store. Once installed, you can test your program with cppia with something like “nme demo flixel:neon cppia”. This particular demo has some trouble with neko due to Floats initializing to null, whereas the cppia script follow the hxcpp quirks extremely closely and initialize these variables (more usefully) to zero.
To test on Android you need to be on the same wireless network. First, install and run the app and take note of the ip address displayed on the startup page. Then provide this to the nme tool as the “deploy” variable, like: “nme demo Stage cppia deploy=192.168.1.21”. Your app obviously needs to be “Android ready” for this to work well, that is, ready for touch input and screen scaling.
You can create your own cppia target to use instead of acadnme, with “nme set cppiaHost /path/to/your/host”. Have a look at the opensource acadnme project to get an idea of what might be involved.
I understand that there is not a one-size-fits-all approach for app frameworks. Other projects place higher values on things such as console targets, modularity of multiple libraries, re-architecting to get things 100% right and either avoiding or embracing the flash Api. NME’s core values are stability, innovation and pragmatism. If you also value these things, then we would love to have you on the team. If you prefer another framework, you can still have NME installed to use for testing and comparison, or for quick scripting via Acadnme.
It is very easy try out NME to see if it suits you needs. See how you like it, and tweet me @gamehaxe.
Here are the slides from my wwx2015 talk.
You can compile them yourself with:
haxelib install nme haxelib install gm2d haxelib install acadnme haxelib run nme demo gm2d:11 cppia
On linux, you may want to compile for neko or cpp, instead of cppia.
Just got back from the haxe conference, wwx2014, after having a great time. Plenty of fantastic people to talk to.
The talk consists of some slides, with a demo at the end. To run the demo, you need to run the binary version on either mac or windows, but the slides can be viewed with flash.
Or, you can build it yourself for you preferred platform and explore the source code with:
haxelib install nme haxelib install gm2d haxelib run nme demo gm2d 10-wwx2014 cpp
Edit – must mention that you might need an dev version of haxe to build the demo.