TSAN detected wrong use of the old deque-based internal queue. To avoid
unwanted/undetected mis-use the patch uses the thread-safe block_queue
data structure instead.
on highly loaded systems it can happen that the get_metrics() is called
twice within a few houndred milliseconds. Logging a warning in this
case isn't needed, so reduce to info.
on the other hand, 100ms might be to convervative. Patch also
lowers the smallest interval to 10ms
* Protect PHY SR signal management in a class
* Protect intra_freq_meas vector
* Protect cell and srate shared variables in thread-safe classes
* srsue,srsenb: include TSAN options header
* Protect ue_rnti_t and rnti scheduling windows behind thread-safe classes
* Protect access to state variable in sync_state
* Protect access to metrics configuration
* Protect access to is_pending_sr
* Protect access to UE prach worker
* Protect UE mux
* Avoid unlocking mutex twice
* Fix data races in RF/ZMQ
* Fix data races in intra_measure and PHY
* Fix minor data races in MAC
* Make TSAN default behaviour to not halt on error
* Fix blocking in intra cell measurement
* Address comments
Co-authored-by: Andre Puschmann <andre@softwareradiosystems.com>
Introduce a new macro to catch UHD exceptions and log them directly instead of storing an error string, similar to what errno does.
Remove usrp logging helpers that depend on the now removed member since all calls potentially log the error directly.
the EPS bearer manager was only informed when a single DRB
was removed but not when entering idle which requires to
remove all bearers.
This cause the service request to fail.
RFCI has detected this assert failing in the log_backend_test. I have not been able to reproduce this locally but my theory is the following one:
one of the unit tests does the following:
backend.start();
backend.stop();
the internal running_flag member could be set to true and then to false by the main thread before the worker thread calls do_work(). If this happens
the assert will be triggered, which is wrong and too conservative, so remove the assert.
this is a rather large commit that is hard to split because
it touches quite a few components.
It's a preparation patch for adding NR split bearers in the next
step.
We realized that managing RLC and PDCP bearers for both NR and LTE
in the same entity doesn't work. This is because we use the LCID
as a key for all accesses. With NR dual connectivity however we
can have the same LCID active at the same time for both LTE and NR
carriers.
The patch solves that by creating a dedicated NR instance for RLC/PDCP
in the stack. But then the question arises for UL traffic on, e.g. LCID 4
what PDCP instance the GW should use for pushing SDUs. It doesnt' know
that. And in fact it doesn't need to. It just needs to know EPS
bearer IDs. So the next change was to remove the knowledge of what
LCIDs are from the GW. Make is agnostic and only work on EPS bearer IDs.
The handling and mapping between EPS bearer IDs and LCIDs for LTE
or NR (mainly PDCP for pushing data) is done in the Stack because
it has access to both.
The NAS also has a EPS bearer map but only knows about default and
dedicated bearers. It doesn't know on which logical channels they
are transmitted.
this patch mainly modernizes the bearer creation to use smart pointers.
that allows to simplify the error handling.
ue_stack is changed to match new interface. This commit compiles
but doesn't work.
when a lost PDU is detected a warning will be logged. In theory
this could be info as well but a warning may help to detect issues
in tests. The same event causes multiple other warnings to be logged,
which is very spammy. The patch reduces the log level for
those messages to info.
the patch is a re-implementation of the customer-specific optimization
we did in order to reduce the time the RLC holds the Tx mutex when
processing an incoming status PDU.
The patch makes sure to never operate on a raw mutex but instead
uses the deadlock-avoiding RAII lock.
before processing incoming status PDUs we should be checking
if the ACK_SN falls within our current Tx window. If not the PDU
will be dropped.
Without the check we were incorrectly processing the status PDU
and because the sequence number wrap around wasn't working
correctly if ACK_SN is smaller than vt_a we were corrupting
our Tx window.
the test verifies that the ACK_SN of a status PDU falls inside the
rx_window of the receiver. If not, than the RLC state has been
corrupted and the status PDU is likely invalid.
we had it returning int but had a bug in using the return value properly,
i.e. handling when -1 was returned in RLC TM.
Thinking about it more, it doesn't make sense to have a negative return
value here anyway. Either the RLC can return a PDU or not. If it can't the
returned lenght is zero.
when a small grant is provided it might not be possible to fit a full status
PDU. This is currently detected while packing the PDU.
In order to avoid sending potentiall contradicting status info to the sending
entity, the fix makes sure to only transmit a small PDU acking what really
has been received so far.
This might not be optimal in terms for retx but will not corrupt any
state.
it turned out that a certain order of events can lead to
a RLC transmitter stalling because even though unacknowledged PDUs
are queued, none of them was actually considered for retx.
This can happen if a pollRetxTimer expires for a SN that, meanwhile,
has already been acknowledged. The positive lead to the deletion of
the SN from the Tx window.
The fix makes sure that when a retx for a unexisting SN is requested,
the sender will consider the next unacknowledged SN instead.
general-purpose method for lower layers to signal protocol
failures to higher layers, i.e. RRC.
In the current case, implement a direct release of the UE (enb) or
a reestablishment (UE).
TSAN doesn't work well then threads are created with attributes
thar require root rights but the process is run as normal user.
this patch avoid the thread attributes in this case. TSAN isn't going
to be used for production builds.
although the manual test with Amarisoft eNB worked fine it seems
the delay is still needed in the default case. Over 50% of the
tests failed in the nightly with:
[zmq] Error: tx time is 0.067 ms in the past (138240 < 139776)
[zmq] Error: tx time is 1.100 ms in the past (184320 < 209664)
While this usleep() should increase the pass likelihood it
still doesn't guarantee error-free runs, so we might need
to revisit it again as some stage.
the thread workers need access to their current state to exit properly
when they are set to state STOP. However, since the state is kept in
a std::vector for all workers, it seems more appropiate to add a per-thread
running variable rather then mutexing the entire vector.
- the consumer does multi-staged waiting:
1. spins first across all queues in a RR fashion
2. each queue access is done with a try_lock.
3. if the try_lock fails, it increases the number of spins needed
2. if no queue had data, the consumer sleeps for 100 usec.
- no differentiation between queues, in terms of notification features
it seems that different SNs should be retransmitted depending
on the actual situation. In case of pollRetx timer expiration,
vt_s - 1 should actually be resent.
This patch prepares the function to accept different SNs but
leaves it to send vt_a by default. The RLC AM test would need
to changed as well to not fail.
- Replace shared_variable members in log_channel class in favor of atomics.
- Remove the small string optimization in srslog now that we dont allocate anymore.
- Trim some critical sections in srslog.
also make sure we don't assign LCIDs beyond the possible
number.
possible fix for https://github.com/srsran/srsRAN/issues/658
Co-authored-by: herlesupreeth <herlesupreeth@gmail.com>
Co-authored-by: Francisco <francisco.paisana@softwareradiosystems.com>
- extend sched_dl_cqi interface to allow getting tti when cqi was last updated
- extend sched_dl_cqi to quickly get average cqi across the whole bandwidth