Some recent benchmarks have indicated possible lock contention on the
leaf-level rcu_node locks. This commit therefore limits the number of
CPUs per leaf-level rcu_node structure to 16, in other words, there
can be at most 16 rcu_data structures fanning into a given rcu_node
structure. Prior to this, the limit was 32 on 32-bit systems and 64 on
64-bit systems.
Note that the fanout of non-leaf rcu_node structures is unchanged. The
organization of accesses to the rcu_node tree is such that references
to non-leaf rcu_node structures are much less frequent than to the
leaf structures.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Use the CPU's bit in rnp->qsmask to determine whether or not the CPU
should try to report a quiescent state. Handle overflow in the check
for rdp->gpnum having fallen behind.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When a CPU that was in an extended quiescent state wakes
up and catches up with grace periods that remote CPUs
completed on its behalf, we update the completed field
but not the gpnum that keeps a stale value of a backward
grace period ID.
Later, note_new_gpnum() will interpret the shift between
the local CPU and the node grace period ID as some new grace
period to handle and will then start to hunt quiescent state.
But if every grace periods have already been completed, this
interpretation becomes broken. And we'll be stuck in clusters
of spurious softirqs because rcu_report_qs_rdp() will make
this broken state run into infinite loop.
The solution, as suggested by Lai Jiangshan, is to ensure that
the gpnum and completed fields are well synchronized when we catch
up with completed grace periods on their behalf by other cpus.
This way we won't start noting spurious new grace periods.
Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When a CPU is idle and others CPUs handled its extended
quiescent state to complete grace periods on its behalf,
it will catch up with completed grace periods numbers
when it wakes up.
But at this point there might be no more grace period to
complete, but still the woken CPU always keeps its stale
qs_pending value and will then continue to chase quiescent
states even if its not needed anymore.
This results in clusters of spurious softirqs until a new
real grace period is started. Because if we continue to
chase quiescent states but we have completed every grace
periods, rcu_report_qs_rdp() is puzzled and makes that
state run into infinite loops.
As suggested by Lai Jiangshan, just reset qs_pending if
someone completed every grace periods on our behalf.
Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The fix in commit #6a0cc49 requires more than three concurrent instances
of synchronize_sched_expedited() before batching is possible. This
patch uses a ticket-counter-like approach that is also not unrelated to
Lai Jiangshan's Ring RCU to allow sharing of expedited grace periods even
when there are only two concurrent instances of synchronize_sched_expedited().
This commit builds on Tejun's original posting, which may be found at
http://lkml.org/lkml/2010/11/9/204, adding memory barriers, avoiding
overflow of signed integers (other than via atomic_t), and fixing the
detection of batching.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The synchronize_srcu_expedited() function is currently quick if there
are no active readers, but will delay a full jiffy if there are any.
If these readers leave their SRCU read-side critical sections quickly,
this is way too long to wait. So this commit first waits ten microseconds,
and only then falls back to jiffy-at-a-time waiting.
Reported-by: Avi Kivity <avi@redhat.com>
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The new (early 2010) implementation of synchronize_sched_expedited() uses
try_stop_cpu() to force a context switch on every CPU. It also permits
concurrent calls to synchronize_sched_expedited() to share a single call
to try_stop_cpu() through use of an atomically incremented
synchronize_sched_expedited_count variable. Unfortunately, this is
subject to failure as follows:
o Task A invokes synchronize_sched_expedited(), try_stop_cpus()
succeeds, but Task A is preempted before getting to the atomic
increment of synchronize_sched_expedited_count.
o Task B also invokes synchronize_sched_expedited(), with exactly
the same outcome as Task A.
o Task C also invokes synchronize_sched_expedited(), again with
exactly the same outcome as Tasks A and B.
o Task D also invokes synchronize_sched_expedited(), but only
gets as far as acquiring the mutex within try_stop_cpus()
before being preempted, interrupted, or otherwise delayed.
o Task E also invokes synchronize_sched_expedited(), but only
gets to the snapshotting of synchronize_sched_expedited_count.
o Tasks A, B, and C all increment synchronize_sched_expedited_count.
o Task E fails to get the mutex, so checks the new value
of synchronize_sched_expedited_count. It finds that the
value has increased, so (wrongly) assumes that its work
has been done, returning despite there having been no
expedited grace period since it began.
The solution is to have the lowest-numbered CPU atomically increment
the synchronize_sched_expedited_count variable within the
synchronize_sched_expedited_cpu_stop() function, which is under
the protection of the mutex acquired by try_stop_cpus(). However, this
also requires that piggybacking tasks wait for three rather than two
instances of try_stop_cpu(), because we cannot control the order in
which the per-CPU callback function occur.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Lai's RCU-callback immediate-adoption patch changes the RCU tracing
output, so update tracing.txt. Also update a few comments to clarify
the synchronization design.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When we handle the CPU_DYING notifier, the whole system is stopped except
for the current CPU. We therefore need no synchronization with the other
CPUs. This allows us to move any orphaned RCU callbacks directly to the
list of any online CPU without needing to run them through the global
orphan lists. These global orphan lists can therefore be dispensed with.
This commit makes thes changes, though currently victimizes CPU 0 @@@.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The first version of synchronize_sched_expedited() used the migration
code in the scheduler, and was therefore implemented in kernel/sched.c.
However, the more recent version of this code no longer uses the
migration code, so this commit moves it to the main RCU source files.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The TREE_RCU tracing had obsolete rcuclassic_trace_init() and
rcuclassic_trace_cleanup() function names. This commit brings them
up to date: rcutree_trace_init() and rcutree_trace_cleanup(),
respectively.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU priority boosting's tracing did not distinguish between ongoing
boosting and completion of boosting. This commit therefore adds this
capability.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Add tracing for the tiny RCU implementations, including statistics on
boosting in the case of TINY_PREEMPT_RCU and RCU_BOOST.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Add priority boosting, but only for TINY_PREEMPT_RCU. This is enabled
by the default-off RCU_BOOST kernel parameter. The priority to which to
boost preempted RCU readers is controlled by the RCU_BOOST_PRIO kernel
parameter (defaulting to real-time priority 1) and the time to wait
before boosting the readers blocking a given grace period is controlled
by the RCU_BOOST_DELAY kernel parameter (defaulting to 500 milliseconds).
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If RCU priority boosting is to be meaningful, callback invocation must
be boosted in addition to preempted RCU readers. Otherwise, in presence
of CPU real-time threads, the grace period ends, but the callbacks don't
get invoked. If the callbacks don't get invoked, the associated memory
doesn't get freed, so the system is still subject to OOM.
But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
moves the callback invocations to a kthread, which can be boosted easily.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Add an optional test to force long-term preemption of RCU read-side
critical sections, controlled by new test_boost, test_boost_interval,
and test_boost_duration module parameters. This is to be used to
test RCU priority boosting.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This addresses the following RCU lockdep splat:
[0.051203] CPU0: AMD QEMU Virtual CPU version 0.12.4 stepping 03
[0.052999] lockdep: fixing up alternatives.
[0.054105]
[0.054106] ===================================================
[0.054999] [ INFO: suspicious rcu_dereference_check() usage. ]
[0.054999] ---------------------------------------------------
[0.054999] kernel/sched.c:616 invoked rcu_dereference_check() without protection!
[0.054999]
[0.054999] other info that might help us debug this:
[0.054999]
[0.054999]
[0.054999] rcu_scheduler_active = 1, debug_locks = 1
[0.054999] 3 locks held by swapper/1:
[0.054999] #0: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff814be933>] cpu_up+0x42/0x6a
[0.054999] #1: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff810400d8>] cpu_hotplug_begin+0x2a/0x51
[0.054999] #2: (&rq->lock){-.-...}, at: [<ffffffff814be2f7>] init_idle+0x2f/0x113
[0.054999]
[0.054999] stack backtrace:
[0.054999] Pid: 1, comm: swapper Not tainted 2.6.35 #1
[0.054999] Call Trace:
[0.054999] [<ffffffff81068054>] lockdep_rcu_dereference+0x9b/0xa3
[0.054999] [<ffffffff810325c3>] task_group+0x7b/0x8a
[0.054999] [<ffffffff810325e5>] set_task_rq+0x13/0x40
[0.054999] [<ffffffff814be39a>] init_idle+0xd2/0x113
[0.054999] [<ffffffff814be78a>] fork_idle+0xb8/0xc7
[0.054999] [<ffffffff81068717>] ? mark_held_locks+0x4d/0x6b
[0.054999] [<ffffffff814bcebd>] do_fork_idle+0x17/0x2b
[0.054999] [<ffffffff814bc89b>] native_cpu_up+0x1c1/0x724
[0.054999] [<ffffffff814bcea6>] ? do_fork_idle+0x0/0x2b
[0.054999] [<ffffffff814be876>] _cpu_up+0xac/0x127
[0.054999] [<ffffffff814be946>] cpu_up+0x55/0x6a
[0.054999] [<ffffffff81ab562a>] kernel_init+0xe1/0x1ff
[0.054999] [<ffffffff81003854>] kernel_thread_helper+0x4/0x10
[0.054999] [<ffffffff814c353c>] ? restore_args+0x0/0x30
[0.054999] [<ffffffff81ab5549>] ? kernel_init+0x0/0x1ff
[0.054999] [<ffffffff81003850>] ? kernel_thread_helper+0x0/0x10
[0.056074] Booting Node 0, Processors #1lockdep: fixing up alternatives.
[0.130045] #2lockdep: fixing up alternatives.
[0.203089] #3 Ok.
[0.275286] Brought up 4 CPUs
[0.276005] Total of 4 processors activated (16017.17 BogoMIPS).
The cgroup_subsys_state structures referenced by idle tasks are never
freed, because the idle tasks should be part of the root cgroup,
which is not removable.
The problem is that while we do in-fact hold rq->lock, the newly spawned
idle thread's cpu is not yet set to the correct cpu so the lockdep check
in task_group():
lockdep_is_held(&task_rq(p)->lock)
will fail.
But this is a chicken and egg problem. Setting the CPU's runqueue requires
that the CPU's runqueue already be set. ;-)
So insert an RCU read-side critical section to avoid the complaint.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Using ACCESS_ONCE() to observe the jiffies_stall/rnp->qsmask value
due to the caller didn't hold the root_rcu/rnp node's lock. Although
use without ACCESS_ONCE() is safe due to the value loaded being used
but once, the ACCESS_ONCE() is a good documentation aid -- the variables
are being loaded without the services of a lock.
Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>
CC: Dipankar Sarma <dipankar@in.ibm.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ===================================================
> [ INFO: suspicious rcu_dereference_check() usage. ]
> ---------------------------------------------------
> /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection!
>
> other info that might help us debug this:
>
> rcu_scheduler_active = 1, debug_locks = 1
> 1 lock held by ifup/23517:
> #0: (&rq->lock){-.-.-.}, at: [<c042f782>] task_fork_fair+0x3b/0x108
>
> stack backtrace:
> Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5
> Call Trace:
> [<c075e219>] ? printk+0xf/0x16
> [<c0455842>] lockdep_rcu_dereference+0x74/0x7d
> [<c0426854>] task_group+0x6d/0x79
> [<c042686e>] set_task_rq+0xe/0x57
> [<c042f79e>] task_fork_fair+0x57/0x108
> [<c042e965>] sched_fork+0x82/0xf9
> [<c04334b3>] copy_process+0x569/0xe8e
> [<c0433ef0>] do_fork+0x118/0x262
> [<c076302f>] ? do_page_fault+0x16a/0x2cf
> [<c044b80c>] ? up_read+0x16/0x2a
> [<c04085ae>] sys_clone+0x1b/0x20
> [<c04030a5>] ptregs_clone+0x15/0x30
> [<c0402f1c>] ? sysenter_do_call+0x12/0x38
Here a newly created task is having its runqueue assigned. The new task
is not yet on the tasklist, so cannot go away. This is therefore a false
positive, suppress with an RCU read-side critical section.
Reported-by: Ben Greear <greearb@candelatech.com
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Ben Greear <greearb@candelatech.com
As suggested by Linus, push the irqs_disabled() down to the
rcu_read_lock_bh_held() level so that all callers get the benefit
of the correct check.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
With all the recent module loading cleanups, we've minimized the code
that sits under module_mutex, fixing various deadlocks and making it
possible to do most of the module loading in parallel.
However, that whole conversion totally missed the rather obscure code
that adds a new module to the list for BUG() handling. That code was
doubly obscure because (a) the code itself lives in lib/bugs.c (for
dubious reasons) and (b) it gets called from the architecture-specific
"module_finalize()" rather than from generic code.
Calling it from arch-specific code makes no sense what-so-ever to begin
with, and is now actively wrong since that code isn't protected by the
module loading lock any more.
So this commit moves the "module_bug_{finalize,cleanup}()" calls away
from the arch-specific code, and into the generic code - and in the
process protects it with the module_mutex so that the list operations
are now safe.
Future fixups:
- move the module list handling code into kernel/module.c where it
belongs.
- get rid of 'module_bug_list' and just use the regular list of modules
(called 'modules' - imagine that) that we already create and maintain
for other reasons.
Reported-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Adrian Bunk <bunk@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kfifo_dma family of functions use sg_mark_end() on the last element in
their scatterlist. This forces use of a fresh scatterlist for each DMA
operation, which makes recycling a single scatterlist impossible.
Change the behavior of the kfifo_dma functions to match the usage of the
dma_map_sg function. This means that users must respect the returned
nents value. The sample code is updated to reflect the change.
This bug is trivial to cause: call kfifo_dma_in_prepare() such that it
prepares a scatterlist with a single entry comprising the whole fifo.
This is the case when you map the entirety of a newly created empty fifo.
This causes the setup_sgl() function to mark the first scatterlist entry
as the end of the chain, no matter what comes after it.
Afterwards, add and remove some data from the fifo such that another call
to kfifo_dma_in_prepare() will create two scatterlist entries. It returns
nents=2. However, due to the previous sg_mark_end() call, sg_is_last()
will now return true for the first scatterlist element. This causes the
sample code to print a single scatterlist element when it should print
two.
By removing the call to sg_mark_end(), we make the API as similar as
possible to the DMA mapping API. All users are required to respect the
returned nents.
Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Cc: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current tracing data is not sufficient to deduce the average time
that a callback spends waiting for a grace period to end. Add three
per-CPU counters recording the number of callbacks invoked (ci), the
number of callbacks orphaned (co), and the number of callbacks adopted
(ca). Given the existing callback queue length (ql), the average wait
time in absence of CPU hotplug operations is ql/ci. The units of wait
time will be in terms of the duration over which ci was measured.
In the presence of CPU hotplug operations, there is room for argument,
but ql/(ci-co+ca) won't steer you too far wrong.
Also fixes a typo called out by Lucas De Marchi <lucas.de.marchi@gmail.com>.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The CONFIG_DEBUG_LOCK_ALLOC ifdef isn't necessary at this point, because it is
checked in an outer ifdef level already and has no effect here.
Signed-off-by: Christian Dietrich <qy03fugy@stud.informatik.uni-erlangen.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The below bug in fork led to the rmap walk finding the parent huge-pmd
twice instead of just once, because the anon_vma_chain objects of the
child vma still point to the vma->vm_mm of the parent.
The patch fixes it by making the rmap walk accurate during fork. It's not
a big deal normally but it worth being accurate considering the cost is
the same.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix nohz balance kick
sched: Fix user time incorrectly accounted as system time on 32-bit
There's a situation where the nohz balancer will try to wake itself:
cpu-x is idle which is also ilb_cpu
got a scheduler tick during idle
and the nohz_kick_needed() in trigger_load_balance() checks for
rq_x->nr_running which might not be zero (because of someone waking a
task on this rq etc) and this leads to the situation of the cpu-x
sending a kick to itself.
And this can cause a lockup.
Avoid this by not marking ourself eligible for kicking.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1284400941.2684.19.camel@sbsiddha-MOBL3.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
compat_alloc_user_space() expects the caller to independently call
access_ok() to verify the returned area. A missing call could
introduce problems on some architectures.
This patch incorporates the access_ok() check into
compat_alloc_user_space() and also adds a sanity check on the length.
The existing compat_alloc_user_space() implementations are renamed
arch_compat_alloc_user_space() and are used as part of the
implementation of the new global function.
This patch assumes NULL will cause __get_user()/__put_user() to either
fail or access userspace on all architectures. This should be
followed by checking the return value of compat_access_user_space()
for NULL in the callers, at which time the access_ok() in the callers
can also be removed.
Reported-by: Ben Hawkes <hawkes@sota.gen.nz>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James Bottomley <jejb@parisc-linux.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: <stable@kernel.org>
Mathieu reported bad latencies with make -j10 kind of kbuild
workloads - which is mostly caused by us scheduling with a
too coarse granularity.
Reduce the minimum granularity some more, to make sure we
can meet the latency target.
I got the following results (make -j10 kbuild load, average of 3
runs):
vanilla:
maximum latency: 38278.9 µs
average latency: 7730.1 µs
patched:
maximum latency: 22702.1 µs
average latency: 6684.8 µs
Mathieu also measured it:
|
| * wakeup-latency.c (SIGEV_THREAD) with make -j10
|
| - Mainline 2.6.35.2 kernel
|
| maximum latency: 45762.1 µs
| average latency: 7348.6 µs
|
| - With only Peter's smaller min_gran (shown below):
|
| maximum latency: 29100.6 µs
| average latency: 6684.1 µs
|
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <AANLkTi=8m4g01wZPacySoF7U0PevTNVgJoZZrHiUD-pN@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is a problem in hibernate_preallocate_memory() that it calls
preallocate_image_memory() with an argument that may be greater than
the total number of available non-highmem memory pages. If that's
the case, the OOM condition is guaranteed to trigger, which in turn
can cause significant slowdown to occur during hibernation.
To avoid that, make preallocate_image_memory() adjust its argument
before calling preallocate_image_pages(), so that the total number of
saveable non-highem pages left is not less than the minimum size of
a hibernation image. Change hibernate_preallocate_memory() to try to
allocate from highmem if the number of pages allocated by
preallocate_image_memory() is too low.
Modify free_unnecessary_pages() to take all possible memory
allocation patterns into account.
Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-by: M. Vefa Bicakci <bicave@superonline.com>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, tsc: Fix a preemption leak in restore_sched_clock_state()
sched: Move sched_avg_update() to update_cpu_load()
Correct some pr_debug() misuse and add a stronger parameter check to
pm_qos_write() for the ASCII hex value case. Thanks to Dan Carpenter
for pointing out the problem!
Signed-off-by: mark gross <markgross@thegnar.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Just got my 6 way machine to a state where cpu 0 is in an
endless loop within __smp_call_function_single.
All other cpus are idle.
The call trace on cpu 0 looks like this:
__smp_call_function_single
scheduler_tick
update_process_times
tick_sched_timer
__run_hrtimer
hrtimer_interrupt
clock_comparator_work
do_extint
ext_int_handler
----> timer irq
cpu_idle
__smp_call_function_single() got called from nohz_balancer_kick()
(inlined) with the remote cpu being 1, wait being 0 and the per
cpu variable remote_sched_softirq_cb (call_single_data) of the
current cpu (0).
Then it loops forever when it tries to grab the lock of the
call_single_data, since it is already locked and enqueued on cpu 0.
My theory how this could have happened: for some reason the
scheduler decided to call __smp_call_function_single() on it's own
cpu, and sends an IPI to itself. The interrupt stays pending
since IRQs are disabled. If then the hypervisor schedules the
cpu away it might happen that upon rescheduling both the IPI and
the timer IRQ are pending. If then interrupts are enabled again
it depends which one gets scheduled first.
If the timer interrupt gets delivered first we end up with the
local deadlock as seen in the calltrace above.
Let's make __smp_call_function_single() check if the target cpu is
the current cpu and execute the function immediately just like
smp_call_function_single does. That should prevent at least the
scenario described here.
It might also be that the scheduler is not supposed to call
__smp_call_function_single with the remote cpu being the current
cpu, but that is a different issue.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jens Axboe <jaxboe@fusionio.com>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20100910114729.GB2827@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: t_start: reset FTRACE_ITER_HASH in case of seek/pread
perf symbols: Fix multiple initialization of symbol system
perf: Fix CPU hotplug
perf, trace: Fix module leak
tracing/kprobe: Fix handling of C-unlike argument names
tracing/kprobes: Fix handling of argument names
perf probe: Fix handling of arguments names
perf probe: Fix return probe support
tracing/kprobe: Fix a memory leak in error case
tracing: Do not allow llseek to set_ftrace_filter
Be sure to avoid entering t_show() with FTRACE_ITER_HASH set without
having properly started the iterator to iterate the hash. This case is
degenerate and, as discovered by Robert Swiecki, can cause t_hash_show()
to misuse a pointer. This causes a NULL ptr deref with possible security
implications. Tracked as CVE-2010-3079.
Cc: Robert Swiecki <swiecki@google.com>
Cc: Eugene Teo <eugene@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Please revert 2.6.36-rc commit d2997b1042
"hibernation: freeze swap at hibernation". It complicated matters by
adding a second swap allocation path, just for hibernation; without in any
way fixing the issue that it was intended to address - page reclaim after
fixing the hibernation image might free swap from a page already imaged as
swapcache, letting its swap be reallocated to store a different page of
the image: resulting in data corruption if the imaged page were freed as
clean then swapped back in. Pages freed to si->swap_map were still in
danger of being reallocated by the alternative allocation path.
I guess it inadvertently fixed slow SSD swap allocation for hibernation,
as reported by Nigel Cunningham: by missing out the discards that occur on
the usual swap allocation path; but that was unintentional, and needs a
separate fix.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ondrej Zary <linux@rainbow-software.org>
Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gid_t is a unsigned int. If group_info contains a gid greater than
MAX_INT, groups_search() function may look on the wrong side of the search
tree.
This solves some unfair "permission denied" problems.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add cgroup_attach_task_all()
The existing cgroup_attach_task_current_cg() API is called by a thread to
attach another thread to all of its cgroups; this is unsuitable for cases
where a privileged task wants to attach itself to the cgroups of a less
privileged one, since the call must be made from the context of the target
task.
This patch adds a more generic cgroup_attach_task_all() API that allows
both the source task and to-be-moved task to be specified.
cgroup_attach_task_current_cg() becomes a specialization of the more
generic new function.
[menage@google.com: rewrote changelog]
[akpm@linux-foundation.org: address reviewer comments]
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Ben Blum <bblum@google.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The gcov-kernel infrastructure expects that each object file is loaded
only once. This may not be true, e.g. when loading multiple kernel
modules which are linked to the same object file. As a result, loading
such kernel modules will result in incorrect gcov results while unloading
will cause a null-pointer dereference.
This patch fixes these problems by changing the gcov-kernel infrastructure
so that multiple profiling data sets can be associated with one debugfs
entry. It applies to 2.6.36-rc1.
Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Reported-by: Werner Spies <werner.spies@thalesgroup.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently sched_avg_update() (which updates rt_avg stats in the rq)
is getting called from scale_rt_power() (in the load balance context)
which doesn't take rq->lock.
Fix it by moving the sched_avg_update() to more appropriate
update_cpu_load() where the CFS load gets updated as well.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3>
Signed-off-by: Ingo Molnar <mingo@elte.hu>