Long Laser Project - Position Drift in X?

Well, it’s been a while since I’ve run into a serious bug. After a few days of testing there is a repeatable drift in X to the left (X-). A total of 4 prints, with something different each time, have identical results.

The summary of this is I’m running out of ideas and don’t know what could cause this, and therefore I am having trouble correcting it. This only occurs after some time, and doing 5 tests takes 30 hours. I need some way to speed up this testing.

This topo path is not predictable, nothing like bottom to top or anything, but the dark lines are cut before the lighter ones. They do not line up on top of each other in some areas, and do in others.

Total burn time for this is around 6 hours, and is around 1M lines of code.

Very obvious at the right edge, with a 1mm shift.
image

I have tested gcode based in G91 and also G90, and the results are identical. I have tested 2 firmware versions and there’s no change.

Here are a couple of highlights of the actual burn versus the rendering in ncviewer.
Thick: Line 65,000
Thin: Line 238,000
image
image

Here’s an example of one that’s fine.
Thick: Line 21,000
Thin: 525,000
image
image

Another one that’s off:
Thick: Line 7,500
Thin: Line 600,000
image
image

If anyone is interested in looking into this here’s the gcode.
mt hood-layout-sm-absolute.zip (3.9 MB)

I’m quite positive this is not a floating point rounding issue, as I did all of the floating point math in excel using both single and doubles and graphed the positional error for each line of gcode as it’s executed. This math was done on the G91 relative based gcode. Visually, the G90 and G91 results looked the same, and I don’t think this error explains anything:
image
image image
The X and Y components are roughly similar, which also doesn’t match with the result of primarily X shift and not Y.
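For anyone who wants to repeat the rounding check without a spreadsheet, here is a minimal Python sketch of the same idea: accumulate the relative (G91) moves with every intermediate result rounded to single precision, and compare against an accurately rounded double-precision sum. The move list is made up for illustration and is not taken from the attached file, and the firmware's own arithmetic may differ from this emulation.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def drift_single_vs_double(moves):
    """Accumulate relative moves in emulated float32 and compare with an accurate sum.

    Returns (reference_mm, float32_mm, error_mm).
    """
    pos32 = 0.0
    for m in moves:
        # Round both the operand and the running sum to float32 after every add.
        pos32 = to_f32(pos32 + to_f32(m))
    reference = math.fsum(moves)  # accurately rounded double-precision sum
    return reference, pos32, pos32 - reference


# Hypothetical data: about 1M small relative X moves that should net to zero.
moves = [0.013, -0.007, 0.021, -0.027] * 250_000
reference, pos32, error = drift_single_vs_double(moves)
print(f"reference {reference:.9f} mm, float32 {pos32:.9f} mm, error {error:.3e} mm")
```

If the printed error is orders of magnitude below the 1mm shift seen at the right edge, accumulated rounding on its own does not explain the drift.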

So, I can’t really think of a mechanism that would cause this. Any ideas would be appreciated.

It might be skipping steps. As I’ve mentioned elsewhere, I suspect the interrupt handling in the controller. That code isn’t well-written, certainly by what SM added, but also in the Marlin base, which wasn’t really written with an expansion of functionality in mind. One kind of failure is just not to send a step. Another would be to send a step late but the one after “on time”, meaning too soon after the first, causing it to miss.

It might also be skipping steps because of a hardware inadequacy. The Y-axis is redundant, so it’s likely more resilient to that, since one axis can assist the other on a marginally-positioned pulse. The X-axis is not redundant, so it could be more sensitive. It might also be a failure to brake because of inertia. As I recall, there are acceleration/deceleration profiles in the code; these might be too aggressive for the hardware. If it’s this, there would definitely be an X vs. Y difference.

Or, I should add, it could be some of each. Marginally-timed step commands and undersized motors to counteract inertia could cause this.

Except it’s not fine, it’s just not as bad as the others. The small heavy line at a peak is shifted to the left, and at least one of the light lines is shifted to the left as well.

If I were sufficiently motivated–I’m not, but you might be–there are a few things I can think of.

  • From what I’ve seen, the individual shapes are intact, and it’s their positioning with respect to each other that’s shifting. I’m assuming the G-code travels along one line at a time. Those motions will be slower than the traversals between lines. Fast travel makes a failure more likely. To verify this, you would process the image one line at a time. One check is to validate that the shape is intact, that it’s not skipping steps within a single line. A second check is to calculate an offset for each line. Certainly the X-axis is failing, but I’d write the image analysis code to compute a Y-axis offset as well.

  • A cheaper version of this, in the class of workarounds, is just to lower the speed of traversal on G0 commands. Then retry and see if it improves. Dropping it down a great deal might yield a good image, in which case you can use a bisection search to investigate where in this parameter space the problem arises.

  • Consider a data logger on the control cables. The enable, step, and direction pins could be buffered out and captured as really long waveforms. With some processing, these could be validated against the source. This would distinguish a great deal between firmware/controller problems and flaky hardware.

  • There are inexpensive DRO sensors out there that use the same sensor as inexpensive digital calipers. The main difference is that their signals go out over a cable to a display rather than being integrated with a display. The idea would be to take the stepper signals and validate them against DRO data. This could be done without logging with an Arduino or the like, or it could all be captured and analyzed later.
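The cheaper workaround in the list above (lowering the G0 traversal speed and retrying) can be applied to an existing file with a few lines of Python. This is only a sketch under assumptions: it assumes the rapid speed is set by an `F` word on the `G0` lines themselves, and it only lowers `F` values that are already there. It does not add an `F` word to `G0` lines that lack one, because the feed rate is normally modal and that could change the speed of the following moves. The function name and the sample lines are hypothetical.

```python
import re

# Matches a line whose command is G0 or G00 (but not G01, G02, ...).
RAPID = re.compile(r"\s*G0?0(?!\d)", re.IGNORECASE)
# Matches a feed-rate word such as F3000 or F1200.5.
FEED = re.compile(r"F(\d+(?:\.\d+)?)", re.IGNORECASE)


def cap_rapid_feed(lines, max_feed):
    """Return a copy of the G-code lines with F words on G0 lines capped at max_feed."""
    out = []
    for line in lines:
        # Leave anything after ';' alone so comments are never rewritten.
        code, sep, comment = line.partition(";")
        if RAPID.match(code):
            code = FEED.sub(
                lambda m: f"F{min(float(m.group(1)), max_feed):g}", code
            )
        out.append(code + sep + comment)
    return out


# Hypothetical sample, not taken from the attached file.
sample = [
    "G0 X10 Y5 F3000",
    "G1 X20 F600 ; burn",
    "G00 X0 F1200",
    "G0 X5 F500",
]
for line in cap_rapid_feed(sample, 900):
    print(line)
```

For the bisection search, rerun the job with a low cap first. If that comes out clean, pick a cap halfway between the last clean speed and the last bad one and repeat.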

I appreciate the response and ideas. This is something I need to work through and solve because this has other ramifications for CNC projects I have in mind. I also don’t know at this point whether this is only my hardware or widespread.

As previously mentioned, I’m interested in optimizing testing, and doing more of a ‘binary search’ to reduce total testing time. Here’s a starter list of potential sources of error I think are worth testing in groups of changes. Exactly how to test these to minimize hidden variables is another thought process I haven’t gotten to yet.

Hardware:
Missed step pulses
Skipped steps due to fast travel speed
Skipped steps due to acceleration

Software:
Internal position tracking (in floating point)
Step count tracking (integer)
Gcode parsing

I like coming up with test cases that are as simple as possible but still exhibit the behavior. I’m not sure this is related to the length of the file, or even the time. I recently completed a 20hour project very similar to this: about 4M lines of gcode to do grayscale engraving, followed at the end by a cutout pass - it was exact. I’m not sure what a simplified test case for this would be. I’d prefer to not have to run 30hour tests multiple times.

Sounds similar to my layer shifting issues.

Support is unsure about the source and is “doing some testing” but insist my modules are not broken.

It just seems pretty much like they skip sometimes. Perhaps it is firmware related, at least i hope so…

Consider a laser test pattern with two rulers, one near the +X side and the other on the -X side. Instead of doing it sensibly, one ruler at a time, burn one line on each ruler in alternation. That will mean traversing 250 mm or so each time. Burn the tick marks in the reverse direction of traverse to maximize the effect of changing direction. Manually set the G0 speed in G-code so that the traversal is fast in one direction and slow on the other. (I don’t know if this needs firmware modifications; they wouldn’t be hard). Most of the execution time should be spent traversing. If the problem is indeed skipped steps because of interrupt or mechanical problems, the ruler should drift only in one direction.
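A rough sketch of a generator for this two-ruler pattern, in Python. Everything here is an assumption to be adjusted: the coordinates, the feed rates, and especially the laser on/off commands (`M3 S255` and `M5` are placeholders, so check what your firmware and toolhead expect). It is also one interpretation of the idea: each tick is a short X stroke burned in the direction opposite to the traverse that preceded it, and successive ticks are stacked in Y so any X offset shows up as a ragged edge.

```python
def ruler_test_gcode(n_ticks=50, x_left=20.0, x_right=270.0, y0=20.0,
                     pitch=1.0, tick_len=5.0, fast=3000, slow=900, burn=300,
                     laser_on="M3 S255", laser_off="M5"):
    """Build a list of G-code lines that alternates ticks between two rulers.

    All defaults are hypothetical values for illustration only.
    """
    g = ["G90", "G21", laser_off]  # absolute coordinates, millimetres, laser off
    for i in range(n_ticks):
        y = y0 + i * pitch
        # Fast traverse in +X to the right ruler, then burn the tick in -X.
        g.append(f"G0 X{x_right:.3f} Y{y:.3f} F{fast}")
        g += [laser_on, f"G1 X{x_right - tick_len:.3f} F{burn}", laser_off]
        # Slow traverse in -X to the left ruler, then burn the tick in +X.
        g.append(f"G0 X{x_left:.3f} Y{y:.3f} F{slow}")
        g += [laser_on, f"G1 X{x_left + tick_len:.3f} F{burn}", laser_off]
    return g


if __name__ == "__main__":
    print("\n".join(ruler_test_gcode(n_ticks=2)))
```

With the defaults each traverse is about 245 mm, so most of the run time is spent traversing. Swapping `fast` and `slow` flips which direction gets the aggressive move, which should help tell a direction-dependent problem from a speed-dependent one.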

I’ve done quite a few really long laser projects. Between 20 & 25 hours.
Also done quite a few long cnc projects - 12-15 hour tool paths with multiple bits.
Most of the time when clearing I have stock to leave set at 2mm for the first 1/4" bit and then 1mm and .5 as the bits get smaller.
I haven’t seen any problem with drift. The only exception to this is when I accidentally ran my bit into the workpiece when setting z and I didn’t return to home. As long as I’ve homed I’ve had excellent repeatability. That’s even when I repeat the last pass with a .1 lower offset just to clean things up.
If it was doing this on mine I’d definitely see it but I haven’t.

-S

Over the months I’ve been reading this forum I’ve seen various examples of G-code based tests. They all seem to be manually created handcrafted once-off, rather than parametric models that can be iterated.

In my day job much of what I do is massaging data and generating text processing tools to do the massaging.

If the kinds of iterations can be described in words, I suspect that I could knock up a few simple scripts that could generate G-code that iterates across various parameters.

If this helps the calibration and troubleshooting process, I’m happy to contribute.

Note that I cannot (yet) test them, since my SM is still enroute.

o

I haven’t done laser projects yet, but have been playing a lot with the CNC recently (a wall-mounted tablet holder that fits like a glove). Complete processing time was about 10 hours, including some cleaning passes up to 3000mm/s (I don’t like sanding ;)), so it does seem quite accurate. Similar to the experience of @sdj544.
(I might have another problem, but I won’t drag that into this thread)

This is a very long shot but, after looking at your gcode I noticed:

  • You have the backlash compensation enabled (M425)
  • It’s much bigger on the X-axis (0.11 vs 0.02)
  • I can’t really tell what the drift would be on the Y-axis because it’s cut up to the top and bottom

So, could this be the backlash compensation going wrong? Have you tried doing one of these jobs without it? As said, a long shot, more of an attempt to exclude as many variables as possible.

It doesn’t sound like a long shot at all. Defects in code tend to hide in places that aren’t used as much as others; backlash compensation could easily be one of these. One kind of failure could be a failure of atomicity in keeping track of compensating movements, either recording a movement then getting interrupted and not doing it, or vice versa doing the movement and not recording it, so doing it again. I did a brief review of the code; all the backlash compensation is in the Marlin code base; I didn’t see any of the Snapmaker code touch it. Most of it’s in planner.cpp.

One thing worth noting. The Marlin code base has an odd option to do backlash compensation over a number of steps rather than a single one. The effect is to smooth the surface of a filament-printed object. It’s not what you’d want for a laser print, though. It looks as though it’s turned off, but if it’s not, it might cause some line shifting and a bit of distortion that would stretch or compress pieces of a line near a direction change. M425 reports the backlash correction parameters.

Well, you never know, I assume @brent113 will come back with some test results :slight_smile:

I did find this issue as well: https://github.com/MarlinFirmware/Marlin/issues/19478
It didn’t come to a conclusion but seems to suggest that it’s skipping steps at high travel speeds. So the moves at 3000mm/min in the gcode could be causing this.
But they stopped replying to that issue.

It does “feel” more like a software bug than a hardware bug.

For my sanity I took a few days and haven’t done anything. I’ll start another test with the previous gcode, but without backlash compensation and slower fast travel speed (900mm/min), and report back.

3000mm/min is well in the machine’s wheelhouse, a very common 3dp speed. That better not be the cause.

I always use 3000 on mine without issue.
I’ve used laser @2500 for working speed in vector also.

-S

That way you are taking two variables out of the equation at once. My first suspect is the actual backlash compensation implementation. It also looks like that code got some changes that I am not sure have gotten into the SM fork.
(didn’t check at all)

I don’t think it’s the speed per se, more the backlash implementation code going wrong at the higher speeds.

But then again, let’s just see what your test results deliver. Maybe it has nothing to do with it.

I did try the laser myself for the first time yesterday. But that will be the last time for a while too. Need to get my air outtake to a better level first, :joy:

Yes, I prefer to group variables into tests. It’s faster, generally, to binary search through a group, than linear searching. In this case, not a difference with only 2, but it’s my preference.
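As a sketch of why grouping helps: if exactly one suspect is responsible, and neutralising a group that contains it gives a clean run, then halving the suspect list each time needs far fewer long runs than testing one suspect per run. The Python below only illustrates the bookkeeping. `run_is_clean` is a hypothetical stand-in for "run the job with these suspects neutralised and inspect the result", and the culprit used in the demo is picked arbitrarily.

```python
def find_culprit(suspects, run_is_clean):
    """Bisect a list of suspects, assuming exactly one of them causes the fault.

    run_is_clean(group) must return True when the job comes out aligned
    with every suspect in `group` neutralised.
    Returns (culprit, number_of_runs).
    """
    candidates = list(suspects)
    runs = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        runs += 1
        if run_is_clean(half):
            candidates = half                    # culprit was in the neutralised half
        else:
            candidates = candidates[len(half):]  # culprit is in the other half
    return candidates[0], runs


suspects = [
    "missed step pulses",
    "fast travel speed",
    "acceleration",
    "float position tracking",
    "step count tracking",
    "gcode parsing",
]
# Simulated machine for the demo only; the culprit here is arbitrary.
hidden = "step count tracking"
culprit, runs = find_culprit(suspects, lambda group: hidden in group)
print(culprit, runs)
```

With six suspects this takes at most three runs instead of up to five. The single-culprit assumption matters: if two causes interact, a clean half does not clear the other half.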

Looks like the issue has been isolated in the latest test. Thanks everyone for the suggestions, next steps are to bring the machine back up to the latest firmware and isolate and patch.
image

My next theory is that the backlash compensation must be an integer multiple of the magic number, which for this machine I believe is 0.04mm? More testing required…

Glad to hear you’re making progress!

Do you know how to derive how many full steps there are on this machine? I’ve seen conflicting information (or maybe it’s not conflicting, but I don’t understand).

400 steps/mm in the firmware.
2mm pitch leadscrew??
The machine seems to not respond to steps smaller than 0.02mm.
In a thread I saw, someone said the driver is microstepping at 1/16.
Some other work identified that the pins controlling the driver’s microstepping config were actually set for full stepping?

Not sure how to combine all this into a number. What’s the correct multiple for backlash compensation to not step in partial steps?

If 400 steps/mm is correct (it should be), then that would mean each step is .0025mm. If the machine isn’t responding below ~.02mm, is that implying 1/8 microstepping, and thus the leadscrew is actually 4mm pitch? So minimum motion should be roughly 0.02mm?

Several things I don’t understand about this. If each step is 0.0025mm, why does the machine not respond to 0.01mm motion commands every time - appears to move every other time. Is this a Marlin feature where small movements are ignored? Could that be the source of the issue?

I see there is a minimum segment time in Marlin - that probably isn’t related here, but might be related to another issue I’m having…

This is new for me too, so I’m just writing out my thoughts. When you’re referring to the pitch, do you mean the lead of the screw? But I would assume they are the same in this case. (What is the difference between pitch and lead when referring to a screw?). I don’t think that number actually matters for this reverse calculation (other than a sanity check)

I came to that number as well. But I’m not sure I’m following with the rest. So I’m writing it out to make sure we’re saying the same :slight_smile:

I assume the steps/mm are the microsteps and not full steps; correct?
the travel of one step is 0.0025mm as mentioned above.
1/8 microstepping thus means 0.02mm travelled per full step
with a standard stepper motor that does 1.8 degrees per step, we thus come to a lead on the screw of 4mm (the pitch could still be 2 if it has two starts, don’t know for sure if that’s the case).

I was browsing through the forum trying to find some pictures. Looking at this one: Linear Module defective and FW 1.8.0.0 Feedback - #6 by Edwin and trying to guess without a lot of reference (the only reference I’m guessing from is the PCBs, which I assume have a more or less standard thickness of +/- 1.8mm)
Based on that picture, it does seem to have a lead larger than the 2mm pitch. So a 4mm lead and 2mm pitch could actually be the case. (At this point I don’t have a compelling reason to open one of the modules to check :slight_smile: )

So we seem to come to the same numbers independently, but I don’t understand that statement? Can it only do full steps for some reason? because:

I would assume it can make movements that small?

Probably there are hardware limitations I’m not aware of (I’m a software guy, not a mechanical/electronics engineer)

In conclusion, I guess I’m mostly rephrasing the same questions you have :sweat_smile: But based on that picture it does seem plausible that the lead screw has a lead/pitch ratio of 2.

A full step is 0.04mm. A single step (@1/16 microstepping) is 0.0025mm. It is a 1.8° NEMA 17 stepper motor with an integrated leadscrew (pressed in and epoxied to the rotor) of 2mm pitch, 4 start, which has a lead of 8mm. Add to that the low quality leadscrew, which has a mechanical variation in lead over its entire length, combined with axial play in the stepper motor itself, and you get a hobby level machine.

Now if you home the X, or Y for that matter, during your job, you don’t know which microstep triggered the inaccurate mechanical limit switch. This means you could be a couple of microsteps out of sync with the previous motion before the axis was homed. You would have to go to closed loop control to mitigate some of the inherent faults in the machine and maybe change out the leadscrews and nuts to a quality 2mm pitch single start Misumi brand screw.

tl;dr a full step is 0.04 mm and it is a hobby level machine with inaccurate components.
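The numbers in this exchange can be cross-checked with a few lines of arithmetic. A 1.8° motor has 360 / 1.8 = 200 full steps per revolution, so with 400 (micro)steps/mm in the firmware the implied screw lead is 200 × microsteps / 400. The sketch below (the function name is just for illustration) evaluates both microstepping guesses from the thread.

```python
def screw_geometry(steps_per_mm, microsteps, full_steps_per_rev=200):
    """Return (microstep_mm, full_step_mm, lead_mm) implied by the settings.

    full_steps_per_rev=200 corresponds to a 1.8 degree stepper motor.
    """
    microstep_mm = 1 / steps_per_mm
    full_step_mm = microsteps / steps_per_mm
    lead_mm = full_steps_per_rev * microsteps / steps_per_mm
    return microstep_mm, full_step_mm, lead_mm


for microsteps in (8, 16):
    micro, full, lead = screw_geometry(400, microsteps)
    print(f"1/{microsteps}: microstep {micro} mm, full step {full} mm, lead {lead} mm")
```

At 1/8 microstepping this gives a 0.02mm full step and a 4mm lead. At 1/16 it gives a 0.04mm full step and an 8mm lead, which matches a 2mm pitch, 4 start screw.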

Yea, I guess I understand that, but not how that can cause backlash compensation to lose steps.

0.11mm would be 44 steps, which is an integer multiple of the microstep. I’ll change it to 0.12 just so it’s also an integer multiple of the full step, see if that does anything.
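The step counts above are easy to verify exactly. A small sketch, assuming 400 steps/mm and 1/16 microstepping as discussed in this thread, using exact fractions so that decimal values like 0.11 do not pick up binary rounding:

```python
from fractions import Fraction


def backlash_in_steps(backlash_mm, steps_per_mm=400, microsteps=16):
    """Return (microsteps, full_steps) for a backlash distance given as a decimal string."""
    distance = Fraction(backlash_mm)        # exact, e.g. Fraction("0.11") == 11/100
    micro = distance * steps_per_mm         # number of microsteps
    full = micro / microsteps               # number of full steps
    return micro, full


for value in ("0.11", "0.12", "0.02"):
    micro, full = backlash_in_steps(value)
    print(f"{value} mm = {micro} microsteps = {full} full steps")
```

So 0.11mm is 44 microsteps but 11/4 of a full step, 0.12mm is 48 microsteps and exactly 3 full steps, and 0.02mm is 8 microsteps, or half a full step.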

EDIT: 0.12 as the backlash still exhibited the error. So the problem is solely backlash, and not speed or anything else. Also integer multiples of 0.04 do not matter. Retrying with 0.02 backlash just for a test.