The easiest way I can see is a 4096 x 1-bit memory; clocked at 25MHz, that allows for up to about 163us of delay.
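As a quick sanity check on that figure, the maximum delay is just the memory depth times the sample clock period. The constant names below are my own, chosen for illustration:

```python
# Maximum delay of a 1-bit-wide delay line sampled at 25 MHz.
CLOCK_HZ = 25_000_000   # sample clock from the description
DEPTH = 4096            # number of 1-bit memory locations

max_delay_us = DEPTH / CLOCK_HZ * 1e6
print(f"{max_delay_us:.2f} us")  # prints "163.84 us"
```

Each memory location holds one 40ns sample, so the read pointer can only be placed with 40ns resolution.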
A free-running write counter in the 25MHz domain is used to write the reference signal to the memory.
A read address that trails the write counter by an adjustable offset is used to read the memory, and the value read is sent to the output.
A phase comparator compares the input and the output, and its result is used to adjust the offset.
    on a clock_25MHz
        ram[wr_addr] <= ref_input;
        ref_output <= ref_input;
        # delay has 6 fractional bits: 64 steps move the read address by one sample
        tracking_output <= ram[wr_addr - delay/64];
        # Rising edge of ref clock
        if ref_output = '0' and ref_input = '1' then
            if tracking_output = '1' then
                # we are leading, so increase the delay
                delay++;
            else
                # we are lagging, so decrease the delay
                delay--;
            end if;
        end if;
        wr_addr++;
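
To get a feel for how the loop behaves, here is a rough behavioural model of the pseudocode above in Python. It is a sketch under assumptions, not a hardware description: the function name, the 100-sample reference period, the starting delay of 80 samples and the clamping of `delay` to the memory range are all my own choices, and the reference is assumed to be a clean 50% duty square wave. All registers update from their previous values, as they would on a clock edge.

```python
def simulate_delay_lock(period=100, start_delay=80, cycles=200_000,
                        depth=4096, frac=64):
    """Behavioural model of the RAM delay line with a bang-bang phase comparator.

    Returns the final value of `delay` (fixed point, `frac` steps per sample).
    """
    ram = [0] * depth
    wr_addr = 0
    delay = start_delay * frac
    ref_output = 0        # ref_input registered by one clock
    tracking_output = 0   # delayed copy read from the memory

    for t in range(cycles):
        # Hypothetical reference: 50% duty square wave, `period` samples long.
        ref_input = 1 if (t % period) < period // 2 else 0

        # Read with the current delay, before this cycle's write lands.
        read_value = ram[(wr_addr - delay // frac) % depth]

        # On a rising edge of the reference, nudge the delay by one step.
        if ref_output == 0 and ref_input == 1:
            delay += 1 if tracking_output == 1 else -1
            # Assumption: keep the delay inside the memory (not in the pseudocode).
            delay = max(0, min(delay, (depth - 1) * frac))

        # Register updates for this clock edge.
        ram[wr_addr] = ref_input
        ref_output = ref_input
        tracking_output = read_value
        wr_addr = (wr_addr + 1) % depth

    return delay


print(simulate_delay_lock() / 64)  # settles close to one reference period
```

With these parameters the delay creeps up from 80 samples and then dithers around one full reference period. The delay only moves by 1/64 of a sample per reference edge, so the divide by 64 acts as a crude loop filter.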
Note that this will only work if both of the slow clocks are derived from the same reference. If they are not, the phase between them drifts continuously, and eventually you will need to skip or insert an extra cycle to stay in sync.
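To put a rough number on "eventually": the phase drifts by one full period of the slow clock after one period divided by the fractional frequency offset. The 10kHz slow clock and 50ppm offset below are made-up values purely for illustration, not figures from the question:

```python
# Time for two independent clocks to drift apart by one full cycle.
SLOW_CLOCK_HZ = 10_000   # hypothetical slow clock frequency
OFFSET_PPM = 50          # hypothetical frequency mismatch between references

period_s = 1 / SLOW_CLOCK_HZ
seconds_per_slip = period_s / (OFFSET_PPM * 1e-6)
print(f"one cycle slip roughly every {seconds_per_slip:.1f} s")
```

With those numbers a cycle has to be skipped or inserted about every two seconds, so the tracking loop would need logic to wrap the delay by one period.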
You could code it a lot more subtly than this, but as a zeroth approximation it should be workable.